Sync Failure Handling

In the last article we discussed how we sync our changes with the server. Ideally, those HTTP requests would always work - if only we could walk the rosy path of best-case scenarios! Unfortunately, when it comes to network requests, all sorts of madness can occur. It’s best to bake madness-handling systems into your code instead of assuming everything will be alright.

Our attitude was to consider successful uploads to be the exceptional case; that way, we would make sure to handle every possible way the upload could go wrong.

What absolutely terrified me was the possibility of the entire system falling apart. If I’ve got a queue of data and the queue gets permanently blocked, the end result is a frustrated user. My number one priority was making sure that, even in the worst of times, the user isn’t completely screwed.

Here are a few aspects of our failure handling worth highlighting.

To Retry or Not To Retry

Sometimes the source of a failed upload is nothing more than a spotty network connection. Other times it’s because the Trello server is having issues. Occasionally, it’s because we wrote a bug into the code (this happened frequently while we developed offline). In the worst-case scenario, it’s because you’re trying to do something logically impossible (e.g. changing the name of a card that someone else deleted before you could sync).

All these network errors can be sorted into two categories: temporary and permanent. Most of our categorization comes from a list of HTTP status codes that we consider “temporary” issues; everything else is a permanent failure.
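
To make the split concrete, here is a minimal sketch of that kind of categorization. The specific status codes are an assumption chosen for illustration (the actual list isn't given here), as is the decision to treat "no response at all" as temporary; classify_failure and FailureKind are hypothetical names.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TEMPORARY = "temporary"
    PERMANENT = "permanent"


# Assumed allowlist of "try again later" codes; the real list may differ.
TEMPORARY_STATUS_CODES = frozenset({408, 429, 500, 502, 503, 504})


def classify_failure(status_code: Optional[int]) -> FailureKind:
    """Classify a failed upload by its HTTP status code.

    status_code is None when no response arrived at all (for example,
    the connection dropped). Treating that as temporary is an assumption.
    """
    if status_code is None or status_code in TEMPORARY_STATUS_CODES:
        return FailureKind.TEMPORARY
    # Anything not explicitly listed as temporary is permanent.
    return FailureKind.PERMANENT
```

Using an allowlist of temporary codes means an unexpected status defaults to permanent, which matches the "everything else" rule above.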

Permanent errors are, in some ways, easy to handle: we drop the delta and move on. This may result in a cascade of future errors, but there’s not much we can do about it.

Temporary errors are more interesting. We want to retry uploading the delta, but obviously we want to wait a bit first, both to avoid whatever temporary problem came up and to avoid slamming the server with requests. To that end, we use exponential backoff on our sync service when a temporary error comes up.

We also cap the number of times a single delta can be retried, in case we missed a failure that’s actually permanent rather than temporary.
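
As a rough sketch of how backoff and a retry cap fit together: the constants below (five attempts, a one-second base delay, a sixty-second ceiling) are made-up values, upload is a hypothetical callable that reports "ok", "temporary", or "permanent", and dropping the delta once the cap is hit is an assumption about what "capped" means in practice.

```python
import time

MAX_ATTEMPTS = 5           # assumed cap on attempts per delta
BASE_DELAY_SECONDS = 1.0   # assumed delay before the first retry
MAX_DELAY_SECONDS = 60.0   # assumed ceiling so waits don't grow forever


def backoff_delay(attempt: int) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    return min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * 2 ** (attempt - 1))


def upload_with_retries(upload, sleep=time.sleep) -> str:
    """Try to upload one delta; return "synced" or "dropped".

    `upload` is a hypothetical callable returning "ok", "temporary",
    or "permanent". `sleep` is injectable so the loop can be tested
    without actually waiting.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        result = upload()
        if result == "ok":
            return "synced"
        if result == "permanent":
            return "dropped"  # permanent errors are never retried
        if attempt < MAX_ATTEMPTS:
            sleep(backoff_delay(attempt))  # wait 1s, 2s, 4s, ...
    # Cap reached: treat the "temporary" error as permanent after all.
    return "dropped"
```

A production version would likely add random jitter to each delay so that many clients don't retry in lockstep; that is left out here to keep the sketch short.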

Idempotence

One of the most insidious temporary errors you’ll see happens like this: the client sends a request, the server receives it, but then the client never receives the server’s response (due to bad network conditions). From the client’s side, it looks like the request failed; but in reality, it succeeded!

Imagine we send out the request to create a card and it fails in the way outlined above. The client decides to retry the request, so it asks the server to create a card again… and now we’ve got two cards instead of one!

The solution to this problem is idempotence, an important concept when working with any network request that might be retried. In this context, idempotence means that the same request can be received by the server multiple times, but the server only takes action the first time. Thus, there’s no accidental duplication of actions.

For our idempotence solution, we went with client-generated UUIDs that act as keys identifying unique requests. If the server receives the same key twice, it ignores the later request and instead returns the data the client should’ve received the first time.
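
The mechanics can be sketched with a toy in-memory server. Everything here (CardServer, create_card, the shape of a card) is hypothetical and stands in for the real API; the point is only that the server remembers the response for each key and replays it instead of acting twice.

```python
import uuid


class CardServer:
    """Toy stand-in for a server that honors idempotency keys."""

    def __init__(self):
        self.cards = []
        self._responses_by_key = {}

    def create_card(self, idempotency_key: str, name: str) -> dict:
        # A key we've seen before means this is a retry: replay the
        # original response and take no further action.
        if idempotency_key in self._responses_by_key:
            return self._responses_by_key[idempotency_key]
        card = {"id": len(self.cards) + 1, "name": name}
        self.cards.append(card)
        self._responses_by_key[idempotency_key] = card
        return card


server = CardServer()
key = str(uuid.uuid4())  # generated once by the client, reused on retry

first = server.create_card(key, "Buy milk")  # response lost in transit
retry = server.create_card(key, "Buy milk")  # client retries with same key
```

After both calls the server holds a single card, and the retry receives the same data the first response carried. The key has to be generated when the change is made, not when the request is sent; otherwise each retry would look like a brand-new request.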

Conflict Resolution

When we first started working on offline, we thought conflict resolution (that is, when two users edit the same field) would be our number one problem. In the end, it ranked very low on our list of concerns. Just getting data to sync was orders of magnitude harder than conflict resolution.

Once we started tackling it, an idea came up that initially sounded absurd: do we really need fancy conflict resolution at all, or could we just do last-writer wins?

It turns out that in most cases, last-writer wins is good enough. Most of the data on Trello isn’t edited concurrently. Plus, this system is far simpler for users to understand - we don’t have to teach them how diff works just because they dared to edit a field.
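
Last-writer wins is about as small as a merge strategy gets. In the sketch below, a delta is assumed to be a dictionary of changed fields, and "last" is assumed to mean the order in which the server applies deltas; apply_delta is a hypothetical helper.

```python
def apply_delta(card: dict, delta: dict) -> dict:
    """Return a copy of `card` with the delta's fields overwritten."""
    return {**card, **delta}


card = {"name": "Draft", "due": None}

# Alice's edit is applied first, then Bob's.
card = apply_delta(card, {"name": "Alice's title", "due": "2017-06-01"})
card = apply_delta(card, {"name": "Bob's title"})
```

Bob's name overwrites Alice's because it was applied last, while Alice's due date survives because Bob never touched that field. Only edits to the same field actually collide.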

We know this system doesn’t work well for long-form fields like descriptions. We’re monitoring that situation using analytics that detect how often fields conflict (and by how much they differ). We may come up with a solution for it eventually. Even then, we may not force users to do diffs, instead supporting something like history so the user can recover lost data.

Reverting Data

What happens when we truly fail to upload a change? How do we revert the change in the database so the user doesn’t think it’s good to go?

Our initial solution was lazy but works fairly well: just let future GETs from the server blow away local changes to the database. That way, the server is still the source of truth about which changes made it and which didn’t.

The downside of this approach is that it can also blow away legitimate changes that haven’t been synced yet. There are two ways around that. First, always try to upload changes before downloading new information. Second, replay changes made offline back onto the database whenever we update the data from GET requests.
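
The second workaround, replaying offline changes, might look something like this. The data shapes are assumptions: server data is a mapping from card ID to fields, and each pending delta is a (card ID, changed fields) pair kept in the order the user made the changes. refresh_from_server is a hypothetical name.

```python
def refresh_from_server(server_cards: dict, pending_deltas: list) -> dict:
    """Build the new local state from a fresh GET plus unsynced changes.

    server_cards: card_id -> dict of fields, as returned by the server.
    pending_deltas: (card_id, changed_fields) pairs not yet uploaded,
        in the order they were made.
    """
    # Start from the server's version; this wipes out any local change
    # that failed to upload and was dropped from the queue.
    local = {card_id: dict(fields) for card_id, fields in server_cards.items()}

    # Re-apply changes still waiting to sync so they aren't lost.
    for card_id, changes in pending_deltas:
        if card_id in local:  # assumption: skip cards the server no longer has
            local[card_id].update(changes)
    return local
```

Because failed deltas are no longer in the pending list, they simply don't get replayed, and the server's copy wins. That is what makes the refresh double as a revert.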

As time goes on, we’re probably going to shift away from relying on GET requests toward having better knowledge of how to revert changes. But as a quick solution, it works well enough.


This article was originally posted on the Trello engineering blog and has been reproduced here for posterity.