When WorkManager Stops Working

When using WorkManager, be wary of the implications of using unique work with the APPEND policy.

Unique work with APPEND is implemented (under the hood) via a chain of work. For example, if you request unique work two times in a row, it's represented as a chain like unique work -> unique work.
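For reference, here is roughly what "requesting unique work with APPEND" looks like in code. This is a minimal sketch: the class name UploadScheduler and the unique name "upload" are made up for illustration, and the worker class is whatever ListenableWorker your app already defines.

```java
import android.content.Context;

import androidx.work.ExistingWorkPolicy;
import androidx.work.ListenableWorker;
import androidx.work.OneTimeWorkRequest;
import androidx.work.WorkManager;

// Hypothetical helper; the names here are illustrative, not from any library.
public final class UploadScheduler {
    // Every request enqueued under this name joins the same chain.
    static final String UNIQUE_NAME = "upload";

    private UploadScheduler() {}

    public static void enqueue(Context context,
                               Class<? extends ListenableWorker> workerClass) {
        OneTimeWorkRequest request =
                new OneTimeWorkRequest.Builder(workerClass).build();

        // APPEND: if work with this name already exists, the new request
        // is chained after it instead of replacing it or being dropped.
        WorkManager.getInstance(context)
                .enqueueUniqueWork(UNIQUE_NAME, ExistingWorkPolicy.APPEND, request);
    }
}
```

Calling enqueue() twice in a row produces the two-link chain described above.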

One aspect of work chains is that they automatically propagate failure to dependent work. For example, if you have a chain A -> B, and work A fails, then WorkManager automatically marks B as FAILED as well (and if A is canceled, B is canceled too).

Here's where the unexpected behavior arises: Because appending unique work is implemented as a work chain, if an appended unique work request fails (or is canceled), then all the unique work of that same name stops operating.

For example, suppose you've enqueued unique work twice. It'll start in this state:

unique work (ENQUEUED) -> unique work (BLOCKED)

If that first work runs but fails, then you end up here:

unique work (FAILED) -> unique work (FAILED)

Even though you queued it twice, because the first time failed, the second time automatically fails as well.

It gets even worse: imagine now that you append a third instance of the unique work. It'll be set to FAILED before it even begins, because again, it's in a failed dependency chain:

unique work (FAILED) -> unique work (FAILED) -> unique work (FAILED)
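To make the propagation concrete, here is a toy model of the behavior described above in plain Java. It is emphatically not WorkManager's implementation: the class UniqueWorkChain, its State enum, and its methods are all invented for illustration, and the model only captures the state transitions shown in the diagrams.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an APPEND chain. NOT WorkManager's real implementation. */
class UniqueWorkChain {
    enum State { ENQUEUED, BLOCKED, SUCCEEDED, FAILED }

    private final List<State> states = new ArrayList<>();

    /** Adds one request to the end of the chain, mimicking APPEND. */
    void append() {
        if (states.isEmpty()) {
            states.add(State.ENQUEUED);
            return;
        }
        State parent = states.get(states.size() - 1);
        switch (parent) {
            case FAILED:
                // The parent already failed, so the new work fails before it begins.
                states.add(State.FAILED);
                break;
            case SUCCEEDED:
                states.add(State.ENQUEUED);
                break;
            default:
                // The parent has not finished yet, so the new work must wait.
                states.add(State.BLOCKED);
        }
    }

    /** Finishes the first runnable (ENQUEUED) request, if there is one. */
    void finishNext(boolean success) {
        int i = states.indexOf(State.ENQUEUED);
        if (i < 0) {
            return;
        }
        if (success) {
            states.set(i, State.SUCCEEDED);
            if (i + 1 < states.size()) {
                states.set(i + 1, State.ENQUEUED); // unblock the next link
            }
        } else {
            // A failure takes down every request after it in the chain.
            for (int j = i; j < states.size(); j++) {
                states.set(j, State.FAILED);
            }
        }
    }

    /** Returns a copy of the current chain states, oldest first. */
    List<State> states() {
        return new ArrayList<>(states);
    }
}
```

Calling append() twice, then finishNext(false), then append() again walks through exactly the three diagrams above and ends with every link in the FAILED state.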

What does one do about this?

Theoretically, WorkManager will periodically clean out its history and (in doing so) get rid of the failed unique work that was blocking the chain. However, I don't want to wait for WorkManager to clean itself to fix the problem; I want reliable work.

As such, I've got two mitigation strategies:

  1. Never allow unique work to fail. That means that error states should be represented by returning Result.success() with output data that indicates failure, rather than returning Result.failure(). It also means wrapping all unique work in one large try-catch, because otherwise an unexpected exception can put the work into the FAILED state.

  2. Check for logjams and unblock using WorkManager.pruneWork(). Not ideal since it has ripple effects on work beyond your unique work, but there's no way to specifically prune one chain.
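Both strategies might look something like the sketch below. Treat it as a starting point under stated assumptions rather than a definitive implementation: the class name NeverFailingWorker, the output keys "failed" and "error", the placeholder doRealWork(), and the helper unblockIfJammed() are all hypothetical names of mine. The WorkManager calls themselves (Result.success(Data), getWorkInfosForUniqueWork(), pruneWork()) are real API.

```java
import android.content.Context;

import androidx.annotation.NonNull;
import androidx.work.Data;
import androidx.work.WorkInfo;
import androidx.work.WorkManager;
import androidx.work.Worker;
import androidx.work.WorkerParameters;

import java.util.List;

public class NeverFailingWorker extends Worker {
    // Hypothetical output keys; pick whatever suits your app.
    public static final String KEY_FAILED = "failed";
    public static final String KEY_ERROR = "error";

    public NeverFailingWorker(@NonNull Context context,
                              @NonNull WorkerParameters params) {
        super(context, params);
    }

    // Strategy 1: report errors through output data, never through failure.
    @NonNull
    @Override
    public Result doWork() {
        try {
            doRealWork();
            return Result.success();
        } catch (Exception e) {
            // Still SUCCEEDED as far as the chain is concerned,
            // so later appended work is not dragged down with it.
            Data output = new Data.Builder()
                    .putBoolean(KEY_FAILED, true)
                    .putString(KEY_ERROR, String.valueOf(e.getMessage()))
                    .build();
            return Result.success(output);
        }
    }

    // Placeholder for the app's actual work.
    private void doRealWork() throws Exception {
    }

    // Strategy 2: detect a jammed chain and prune finished work.
    // Call this off the main thread, because get() blocks.
    public static void unblockIfJammed(Context context, String uniqueName)
            throws Exception {
        WorkManager workManager = WorkManager.getInstance(context);
        List<WorkInfo> infos =
                workManager.getWorkInfosForUniqueWork(uniqueName).get();
        for (WorkInfo info : infos) {
            WorkInfo.State state = info.getState();
            if (state == WorkInfo.State.FAILED
                    || state == WorkInfo.State.CANCELLED) {
                // Prunes ALL eligible finished work, not just this chain.
                // It runs asynchronously and returns an Operation.
                workManager.pruneWork();
                return;
            }
        }
    }
}
```

One way to use this is to call unblockIfJammed() on a background thread before appending new unique work. Since pruneWork() completes asynchronously, you may want to wait on the returned Operation before enqueueing, so the new request doesn't land on the old failed chain.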

Both of these solutions feel like hacks, so I'm not proud of them, but a blocked work queue is real bad news when the work is vital to your app.