1 The Retry Guide

This guide is intended for programmers who are familiar with Racket but new to working with retryers. It describes the high-level concepts behind the retry library and walks through examples and use cases. For a complete description of the retry API, see The Retry Reference.

1.1 Intro to Retryers

Sometimes, code fails. And sometimes, those failures are unpredictable - they might happen again, they might not. A classic example is attempting to open an internet connection - there’s no way for code to know whether it will succeed or fail due to network issues and machine failures. In a world of such uncertainty, we must learn to handle temporary failure gracefully.

One way to handle temporary failures is by retrying. There are many different ways to retry operations, but most include some combination of waiting for some amount of time, logging information about the failure, and limiting the maximum number of retry attempts. We can define a strategy for retrying by creating a retryer. Retryers are made with the retryer procedure by combining two other procedures - one that defines when a failing operation should be retried, and one that defines how to handle the failure. Then, we can call a possibly-failing procedure with retries by using call/retry.

First, let's define a helper called make-flaky-procedure that returns a thunk (a procedure accepting no arguments, like those created by thunk) that fails with an exception the first few times it's called. This will help us explore strategies for retrying calls to flaky code.

Examples:

> (struct exn:fail:flaky exn:fail () #:transparent)
> (define (make-flaky-procedure #:num-failures num-failures)
    (define num-calls (box 0))
    (thunk
     (when (< (unbox num-calls) num-failures)
       (set-box! num-calls (add1 (unbox num-calls)))
       (raise (exn:fail:flaky "not good enough!" (current-continuation-marks))))
     'success))

The flaky procedures returned by make-flaky-procedure store the number of times they’ve been called in a box, and increment that number before throwing an exn:fail:flaky when they’re called:

Examples:

> (define example-flaky-proc (make-flaky-procedure #:num-failures 3))
> (example-flaky-proc)
not good enough!

Once our flaky procedure is called for the fourth time, it starts returning 'success:

Examples:

> (example-flaky-proc)
not good enough!
> (example-flaky-proc)
not good enough!
> (example-flaky-proc)
'success

Now that we have a way to construct flaky code, let's create a retryer and use call/retry to automatically call flaky procedures multiple times:

Examples:

> (define my-retryer
    (retryer #:should-retry? (λ (raised num-previous-retries)
                               #t)
             #:handle (λ (raised num-previous-retries)
                        (printf "Failed attempt ~a, message: ~a\n"
                                (add1 num-previous-retries)
                                (exn-message raised)))))
> (call/retry my-retryer (make-flaky-procedure #:num-failures 3))
Failed attempt 1, message: not good enough!
Failed attempt 2, message: not good enough!
Failed attempt 3, message: not good enough!
'success

A retryer is constructed from two procedures. Both are called during the extent of call/retry when the procedure given to call/retry fails, and both are given the thrown exception and the number of previous retries. The first, the #:should-retry? argument, is called to determine if this retryer wants to retry the failure. Our retryer ignores both arguments and always returns true, indicating that call/retry should keep retrying no matter what happens.

In a real world application, retrying forever can be dangerous and should be considered carefully. At the very least, unlimited retries should be restricted to a very targeted subtype of exn rather than all possible values that could be raised.

The second, the #:handle argument, is called before retrying to perform some arbitrary side-effect to handle the thrown exception. Our retryer prints the message along with how many attempts we’ve made.
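To make the roles of the two procedures concrete, here is a minimal sketch of a retry loop in plain Racket. It is an assumption about how such a loop could be written, not the retry library's implementation; simple-retryer and call/retry-sketch are hypothetical names used only for illustration.

```racket
#lang racket/base

;; Hypothetical stand-in for a retryer: just the two procedures described above.
(struct simple-retryer (should-retry? handle))

;; Sketch of a retry loop (an assumption, not the retry library's code).
(define (call/retry-sketch r thunk)
  (let loop ([num-previous-retries 0])
    ;; The body and the handler each return a thunk, which is invoked only
    ;; after with-handlers has exited, so every attempt starts fresh.
    ((with-handlers
         ([(λ (raised)
             ((simple-retryer-should-retry? r) raised num-previous-retries))
           (λ (raised)
             ((simple-retryer-handle r) raised num-previous-retries)
             (λ () (loop (add1 num-previous-retries))))])
       (let ([result (thunk)])
         (λ () result))))))

;; Usage: fail twice, then succeed, retrying at most five times.
(define attempts 0)
(call/retry-sketch
 (simple-retryer (λ (raised n) (< n 5))
                 (λ (raised n) (printf "retry ~a\n" (add1 n))))
 (λ ()
   (set! attempts (add1 attempts))
   (if (< attempts 3) (raise 'not-yet) 'success)))
```

If the should-retry? procedure returns #f, the handler clause does not match and the raised value propagates to the caller as usual.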

Rather than writing retryers by hand, however, this library encourages constructing them through composition. The next section, Predefined Retryers, describes the built-in retryers provided by retry, each of which implements one simple, targeted piece of functionality. The section after that, Building Complex Retryers, shows how to compose simple retryers into more complex ones that perform multiple operations.

1.2 Predefined Retryers

While retryers can perform arbitrarily complex operations, most applications only need to perform a few kinds of tasks when retrying. The retry library provides procedures out of the box to create retryers for these tasks. Included are retryers for printing messages, waiting for certain amounts of time, and limiting the number of retries.

1.2.1 Printing Retryers

The print-exn-retryer procedure takes a function for converting exception messages and the number of retries into a string and constructs a retryer. That retryer uses the given function to print a message with displayln:

Examples:

> (define (retry-failure-to-string msg num-previous-retries)
    (format "Failed attempt ~a, message: ~a" (add1 num-previous-retries) msg))
> (define printing-retryer (print-exn-retryer retry-failure-to-string))
> (call/retry printing-retryer
              (make-flaky-procedure #:num-failures 3))
Failed attempt 1, message: not good enough!
Failed attempt 2, message: not good enough!
Failed attempt 3, message: not good enough!
'success

Note that only exceptions have messages, but raised values might not necessarily be exceptions. Thus, a retryer created with print-exn-retryer only handles exceptions - any other raised values are raised normally instead of retried.

1.2.2 Limiting Retryers

The limit-retryer procedure is relatively simple. Given a maximum number of times to retry, the returned retryer only handles thrown values if fewer than that many retries have occurred:

Examples:

> (define at-most-four-retryer (limit-retryer 4))
> (call/retry at-most-four-retryer (make-flaky-procedure #:num-failures 5))
not good enough!
> (call/retry at-most-four-retryer (make-flaky-procedure #:num-failures 3))
'success
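The limiting rule can be sketched as a plain should-retry? predicate. This is only an illustration of the "fewer than that many retries" condition; limit-should-retry? is a hypothetical helper, not part of the retry library.

```racket
#lang racket/base

;; Hypothetical helper: builds a predicate that allows a retry only while
;; fewer than max-retries retries have already happened.
(define ((limit-should-retry? max-retries) raised num-previous-retries)
  (< num-previous-retries max-retries))

(define at-most-four? (limit-should-retry? 4))

;; Retries numbered 0 through 3 are allowed; the fifth failure is not retried.
(for/list ([n (in-range 6)])
  (at-most-four? 'any-raised-value n))
;; => '(#t #t #t #t #f #f)
```

This matches the example above: a procedure that fails five times exhausts the four permitted retries, while one that fails three times succeeds.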

1.2.3 Sleeping Retryers

Often temporary failures are due to factors completely outside our control, where the only recourse is to wait some amount of time for the issue to fix itself. For these cases, we can construct retryers that call sleep to pause between retries. This library uses Gregor period structures to determine how long to sleep for, in contrast to the fractional seconds used when calling sleep directly. The periods used must satisfy time-period?.

The simplest of the sleeping retryers is sleep-const-retryer. Retryers returned by this procedure sleep for a constant amount of time (determined by a period given to sleep-const-retryer) between retries. In the following examples we have altered the behavior of sleep slightly: instead of actually pausing execution, the number of seconds to sleep for is printed and no sleeping occurs. If you evaluate these expressions in a normal Racket REPL, expect long execution times with no output.

Example:

> (call/retry (sleep-const-retryer (minutes 3))
              (make-flaky-procedure #:num-failures 3))
Sleeping for 180 seconds...
Sleeping for 180 seconds...
Sleeping for 180 seconds...
'success

A more general form of retrying with pauses is available via sleep-retryer. Unlike sleep-const-retryer, sleep-retryer accepts a procedure that is expected to map the number of previous retries to a time-period? value. This allows sleeping retryers to vary how long they pause between retries. Let's use it to build a retryer that sleeps for a linearly increasing number of minutes:

Examples:

> (define (sleep-amount num-previous-retries)
    (minutes (* 5 (add1 num-previous-retries))))
> (call/retry (sleep-retryer sleep-amount)
              (make-flaky-procedure #:num-failures 3))
Sleeping for 300 seconds...
Sleeping for 600 seconds...
Sleeping for 900 seconds...
'success

A very common pattern when attempting to open network connections is to sleep between failures with exponential backoff. This means waiting for an exponentially increasing amount of time between failures. Because of its ubiquity, the retry library provides sleep-exponential-retryer for retrying with exponential backoff:

Examples:

> (define exponential-retryer (sleep-exponential-retryer (seconds 10)))
> (call/retry exponential-retryer (make-flaky-procedure #:num-failures 3))
Sleeping for 10 seconds...
Sleeping for 20 seconds...
Sleeping for 40 seconds...
'success
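The delays printed above are consistent with doubling the base period after every failure. A minimal sketch of that arithmetic, using plain seconds instead of Gregor periods, might look like this; exponential-delay-seconds is a hypothetical helper, and the doubling factor is inferred from the output rather than taken from the library's source.

```racket
#lang racket/base

;; Hypothetical helper: seconds to sleep before the next retry, assuming the
;; delay doubles with each previous retry.
(define (exponential-delay-seconds base-seconds num-previous-retries)
  (* base-seconds (expt 2 num-previous-retries)))

(for/list ([n (in-range 3)])
  (exponential-delay-seconds 10 n))
;; => '(10 20 40)
```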

Additionally, three more procedures are provided: sleep-retryer/random, sleep-const-retryer/random, and sleep-exponential-retryer/random. These procedures produce retryers that sleep for random amounts of time, up to a maximum of what their counterpart retryers would sleep for. This is useful for adding jitter: variation in a group of retrying agents that causes them to fall out of synchronization (see the thundering herd problem). Now, let's consider what happens if we take sleep-const-retryer and add some randomness.

Example:

> (call/retry (sleep-const-retryer/random (seconds 30))
              (make-flaky-procedure #:num-failures 3))
Sleeping for 6 seconds...
Sleeping for 13 seconds...
Sleeping for 8 seconds...
'success

The unit of the returned time-period? affects the randomness. Returning a minutes period causes a random whole number of minutes to be chosen as the sleep amount, while returning the equivalent milliseconds period chooses a random number of milliseconds. Observe the difference between using (minutes 5) and (seconds 300):

Example:

> (call/retry (sleep-const-retryer/random (minutes 5))
              (make-flaky-procedure #:num-failures 3))
Sleeping for 180 seconds...
Sleeping for 180 seconds...
Sleeping for 180 seconds...
'success

Example:

> (call/retry (sleep-const-retryer/random (seconds 300))
              (make-flaky-procedure #:num-failures 3))
Sleeping for 209 seconds...
Sleeping for 45 seconds...
Sleeping for 247 seconds...
'success
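The effect of the unit on granularity can be sketched with plain numbers. The helper below is hypothetical and works in seconds rather than Gregor periods; whether the library's upper bound is inclusive is an assumption made here for illustration.

```racket
#lang racket/base

;; Hypothetical helper: pick a random whole number of units between 0 and
;; max-units, then convert it to seconds. The inclusive upper bound is an
;; assumption, not something confirmed by the retry documentation.
(define (random-sleep-seconds max-units seconds-per-unit)
  (* seconds-per-unit (random (add1 max-units))))

(random-sleep-seconds 5 60)   ; always a multiple of 60, at most 300
(random-sleep-seconds 300 1)  ; any whole number of seconds from 0 to 300
```

With a minutes period there are only six possible sleep amounts, so retrying agents are more likely to collide than with the equivalent seconds period.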

1.3 Building Complex Retryers

Each of the built-in retryers we’ve seen so far in Predefined Retryers has a single purpose. However, in the real world we often wish to combine many of these behaviors. We may wish for a retryer that retries at most three times, sleeps an increasing amount of time between retries, and prints information about failures to the console. Instead of reaching for our own custom implementations as soon as the going gets tough, we can combine our existing simple retryers to declaratively construct complex retryers.

1.3.1 Retryer Composition

The simplest means of combining retryers is retryer-compose. This procedure takes any number of retryers and returns one composed retryer, which calls each given retryer to determine if and how to handle failures. Recall our printing-retryer from Printing Retryers; we can add sleeping with retryer-compose:

Examples:

> (define composed-retryer
    (retryer-compose (sleep-const-retryer (seconds 10)) printing-retryer))
> (call/retry composed-retryer (make-flaky-procedure #:num-failures 3))
Failed attempt 1, message: not good enough!
Sleeping for 10 seconds...
Failed attempt 2, message: not good enough!
Sleeping for 10 seconds...
Failed attempt 3, message: not good enough!
Sleeping for 10 seconds...
'success

Note that although printing-retryer is the second argument to retryer-compose, it is called first when a failure is handled. This mimics the behavior of function composition with compose; retryers are called in right-to-left order.
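The right-to-left ordering can be sketched in plain Racket. In the sketch below, simple-retryer and compose-sketch are hypothetical names, and requiring every component to agree before retrying is an assumption about how composition decides whether to retry; it is not taken from the library's source.

```racket
#lang racket/base

;; Hypothetical stand-in for a retryer: a predicate and a handler.
(struct simple-retryer (should-retry? handle))

;; Sketch of composition: retry only if every component agrees (an
;; assumption), and run the handlers from the last argument to the first.
(define (compose-sketch . retryers)
  (simple-retryer
   (λ (raised n)
     (for/and ([r (in-list retryers)])
       ((simple-retryer-should-retry? r) raised n)))
   (λ (raised n)
     (for ([r (in-list (reverse retryers))])
       ((simple-retryer-handle r) raised n)))))

;; Usage: the second retryer's handler runs before the first one's.
(define (announcing-retryer name)
  (simple-retryer (λ (raised n) #t)
                  (λ (raised n) (printf "~a handler ran\n" name))))

((simple-retryer-handle
  (compose-sketch (announcing-retryer "sleep") (announcing-retryer "print")))
 'some-failure 0)
```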

1.3.2 Cyclic Retryers

In previous sections, we discussed sleep-exponential-retryer and the utility of creating retryers that sleep increasing amounts of time between retries. This can be troublesome if the retries are unbounded; eventually the pauses between retries could reach days or weeks. Using cycle-retryer, we can extend retryers with cyclic behavior so that the number of retries appears to "reset".

Example:

> (call/retry (cycle-retryer (sleep-exponential-retryer (seconds 10)) 4)
              (make-flaky-procedure #:num-failures 10))
Sleeping for 10 seconds...
Sleeping for 20 seconds...
Sleeping for 40 seconds...
Sleeping for 80 seconds...
Sleeping for 10 seconds...
Sleeping for 20 seconds...
Sleeping for 40 seconds...
Sleeping for 80 seconds...
Sleeping for 10 seconds...
Sleeping for 20 seconds...
'success
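One way to picture the "reset" is that the wrapped retryer sees the retry count modulo the cycle length. The sketch below reproduces the delays printed above under that assumption; cycled-retry-count is a hypothetical helper, not a procedure from the retry library.

```racket
#lang racket/base

;; Hypothetical helper: the retry count as the wrapped retryer would see it,
;; assuming cycling is implemented by wrapping the count around.
(define (cycled-retry-count num-previous-retries cycle-length)
  (modulo num-previous-retries cycle-length))

;; Exponential backoff from 10 seconds with a cycle length of 4.
(for/list ([n (in-range 10)])
  (* 10 (expt 2 (cycled-retry-count n 4))))
;; => '(10 20 40 80 10 20 40 80 10 20)
```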

1.3.3 In Depth Example: Cyclic Exponential Backoff with Jittering

Putting together all the concepts we’ve learned so far, let’s consider how a web server might reliably handle the failure of a database it retrieves information from. It would be most unfortunate if a small temporary network issue caused our website to permanently fail. There’s a host of other constraints as well:

A small hiccup may be resolved in seconds, so we shouldn’t be waiting for minutes or hours after our first retry.
Bandwidth is expensive. If the database is down for hours, retrying every few seconds can be costly and lead to network congestion.
Reconnection should occur quickly when the database comes back online.
There may be hundreds or thousands of instances of our website trying to connect. If they all make requests in synchronization, the spiky network load can cause failures and overloads.
We should only retry when faced with network errors. Permission and configuration errors are far less likely to resolve themselves, and by retrying forever we mask their existence.

To address these constraints, we’ll combine four main elements in our retry strategy:

Adding a quick test of the raised exception to verify we’re dealing with a network issue instead of some other more-permanent problem.
Exponential backoff with sleep-exponential-retryer. This lets us reconnect quickly in the event of small hiccups, but we won’t make unnecessary requests when faced with a large outage.
Cycling the backoff with cycle-retryer. Using plain exponential backoff can result in large waits to reconnect when the database comes back online. If the time between retries doubles, then a one-hour outage in the database could trigger a two-hour outage in our website: reconnection is attempted just before the database comes back online and fails with the next retry not occurring for another hour. Cycling sets an upper bound on how long we wait between retries.
Adding jitter with sleep-const-retryer/random. When a database outage first occurs, our websites will attempt to reconnect (mostly) in sync with each other. By adding jitter they will gradually fall out of synchronization, spreading the network load around as we reach the larger periods between retries with our exponential backoff.

By combining these elements (along with print-exn-retryer), we end up with a retryer looking something like this. Note that we’re assuming an exn:fail:flaky is raised instead of exn:fail:network; this lets us demonstrate our retryer with make-flaky-procedure.

Examples:

> (define (database-retry-message exn-msg num-previous-retries)
    (format "Failed database connection attempt ~a, message: ~a"
            (add1 num-previous-retries)
            exn-msg))
> (define database-retryer
    (retryer-compose (cycle-retryer (sleep-exponential-retryer (seconds 1)) 8)
                     (sleep-const-retryer/random (seconds 5))
                     (print-exn-retryer database-retry-message)
                     (retryer #:should-retry? (λ (r n) (exn:fail:flaky? r)))))
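As a quick check on the upper bound that cycling provides, the longest pause this retryer can take between two attempts can be worked out from its parameters. The arithmetic below assumes the backoff doubles on every retry and that the random jitter never exceeds its 5-second maximum; the variable names are illustrative only.

```racket
#lang racket/base

;; Parameters copied from database-retryer above.
(define base-seconds 1)
(define cycle-length 8)
(define max-jitter-seconds 5)

;; Longest backoff pause within one cycle: 1 * 2^(8 - 1) = 128 seconds.
(define max-backoff-seconds
  (* base-seconds (expt 2 (sub1 cycle-length))))

;; Worst-case total pause before the next attempt.
(+ max-backoff-seconds max-jitter-seconds)
;; => 133
```

So even during a long outage, reconnection is attempted at least every couple of minutes or so, rather than after ever-growing waits.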

Let’s test out this retryer with different outage scenarios, simulated with make-flaky-procedure.

Examples:

> (call/retry database-retryer (make-flaky-procedure #:num-failures 3))
Failed database connection attempt 1, message: not good enough!
Sleeping for 3 seconds...
Sleeping for 1 seconds...
Failed database connection attempt 2, message: not good enough!
Sleeping for 4 seconds...
Sleeping for 2 seconds...
Failed database connection attempt 3, message: not good enough!
Sleeping for 4 seconds...
Sleeping for 4 seconds...
'success
> (call/retry database-retryer (make-flaky-procedure #:num-failures 10))
Failed database connection attempt 1, message: not good enough!
Sleeping for 4 seconds...
Sleeping for 1 seconds...
Failed database connection attempt 2, message: not good enough!
Sleeping for 0 seconds...
Sleeping for 2 seconds...
Failed database connection attempt 3, message: not good enough!
Sleeping for 2 seconds...
Sleeping for 4 seconds...
Failed database connection attempt 4, message: not good enough!
Sleeping for 0 seconds...
Sleeping for 8 seconds...
Failed database connection attempt 5, message: not good enough!
Sleeping for 3 seconds...
Sleeping for 16 seconds...
Failed database connection attempt 6, message: not good enough!
Sleeping for 3 seconds...
Sleeping for 32 seconds...
Failed database connection attempt 7, message: not good enough!
Sleeping for 0 seconds...
Sleeping for 64 seconds...
Failed database connection attempt 8, message: not good enough!
Sleeping for 1 seconds...
Sleeping for 128 seconds...
Failed database connection attempt 9, message: not good enough!
Sleeping for 2 seconds...
Sleeping for 1 seconds...
Failed database connection attempt 10, message: not good enough!
Sleeping for 2 seconds...
Sleeping for 2 seconds...
'success
