
Saturday, July 29, 2017

On eventually consistent file listing


Introduction 


Cheap S3 storage comes with an unusual cost: correctness. One of the key problems when working with highly partitioned data stored in S3 is eventual consistency in file listing. In this post we will discuss what exactly this problem is and how we can think about mitigating its impact.

File listing has been a very common operation since the invention of files. Given a directory or path, it returns the list of files under that path. Tons of code written over file systems depends on the correctness of this operation. Unfortunately, this assumption breaks when the listing is done on S3, or for that matter any blob store.

Deconstructing file listing 


One way to think about eventual consistency of file listing is to argue that we get a wrong answer. This is correct to an extent, but not powerful enough to let us do something about it. To do something about it, we need to dig a bit deeper and understand the nature of this wrongness. I find it useful to characterise it in the following form:
  • Ghost files 
  • Conceived files 
Let's try to understand what they mean. Ghost files are files which are returned by the listing operation but have actually been deleted from the file system. This is a very common cause of job failures in Spark and Hive. They are called ghosts for obvious reasons. Conceived files, on the other hand, are files which actually exist but were not returned by the listing API. In the happy (immediately unhappy) path, eventual consistency causes jobs to fail, because further operations on ghost files keep failing irrespective of the number of retries. In the unhappy (short-term happy) path, we silently lose data because of conceived files: they are simply missing from the final result, which leads to incorrect answers.

Given these two constructs, we can say that a listing operation goes wrong either because of ghost files (files which shouldn't be present in the listing but are) or because of conceived files (files which should be present in the listing but are not). We can now build separate solutions for detecting these two file classes and dealing with their consequences.
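
To make these constructs concrete, here is a toy sketch in Python (the file names are hypothetical) that expresses both classes as set differences between two listing snapshots of the same path:

    # Two snapshots of the same path, taken at t1 and t2 (t2 > t1).
    listing_t1 = {"part-0000", "part-0001", "part-0002"}   # snapshot A
    listing_t2 = {"part-0001", "part-0002", "part-0003"}   # snapshot B

    ghost_files = listing_t1 - listing_t2       # listed at t1, gone by t2
    conceived_files = listing_t2 - listing_t1   # existed, but hidden at t1

    print(ghost_files)       # {'part-0000'}
    print(conceived_files)   # {'part-0003'}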

Dealing with Ghost files


Ghost files are files which shouldn't have appeared in the listing API to start with. Once they show up, they cause different problems depending on what operations we perform on them; the most common symptom is a subsequent file-not-found error. One way to detect them is to do a fresh listing operation and take a set difference.
Let A be the result of a listing operation at time t1,
and B be the result of a listing operation at time t2,
where t2 > t1.
The set A - B, i.e. the files which are in A but not in B, is essentially the list of currently known ghost files. Once they are detected, we can deal with them in some form. One simple way is to ignore the failures caused by ghost files, because we know they are bound to fail. The other option is to remove them from our task queue, because we know they are not part of the final solution set. We might need to iterate multiple times (say, till a fixed point) to find all the ghost files.
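
As a minimal sketch of this detection step, assuming boto3 and a hypothetical bucket and prefix (the list_keys helper just collects every key visible under the prefix):

    import boto3

    def list_keys(bucket, prefix):
        # Collect every key currently visible under the prefix,
        # using the standard list_objects_v2 paginator.
        s3 = boto3.client("s3")
        keys = set()
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                keys.add(obj["Key"])
        return keys

    # Hypothetical bucket and prefix; the two calls happen some time apart.
    snapshot_a = list_keys("my-bucket", "warehouse/events/")  # listing at t1
    snapshot_b = list_keys("my-bucket", "warehouse/events/")  # listing at t2 > t1

    ghost_files = snapshot_a - snapshot_b  # in A but not in B

    # Prune ghosts from the task queue so we stop retrying doomed reads.
    task_queue = [key for key in snapshot_a if key not in ghost_files]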

Dealing with Conceived files


Conceived files are files which actually exist but didn't show up in the listing at all.
Again, let A be the result of a listing operation at time t1,
and B be the result of a listing operation at time t2,
where t2 > t1.
The set B - A, i.e. the files which are in B but were not in A, is essentially the list of currently known conceived files. These are the files we would have missed had we done only a single listing operation. Once detected, conceived files are relatively simple to handle: we just add them to our working set and treat them as if they had been part of the original listing. We might need to iterate multiple times (say, till a fixed point) to find all the conceived files.
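
Continuing the hypothetical sketch from the previous section, the conceived files and the revised working set fall out of the same two snapshots:

    conceived_files = snapshot_b - snapshot_a  # in B but not in A

    # Fold them in as if they had been part of the original listing.
    working_set = (snapshot_a - ghost_files) | conceived_files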

It is tempting to ask why we don't simply wait until the system has become consistent before starting the work. In theory that works; in practice we don't know how much time it will take. By starting with whatever information the listing API gives us, we get a head start and can keep revising our task set as further listing calls reveal more. What we get through this approach is correctness, without paying for it with idle waiting.
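
One way to sketch the fixed-point iteration both sections refer to, reusing the hypothetical list_keys helper from above (the settle_rounds and pause_s knobs are purely illustrative, since eventual consistency gives no hard bound):

    import time

    def stable_listing(bucket, prefix, settle_rounds=3, pause_s=5.0):
        # Re-list until the visible key set is unchanged for a few
        # consecutive rounds; stability here is a heuristic, not a proof.
        current = list_keys(bucket, prefix)
        unchanged = 0
        while unchanged < settle_rounds:
            time.sleep(pause_s)
            latest = list_keys(bucket, prefix)
            if latest == current:
                unchanged += 1
            else:
                # Revision step: ghosts drop out, conceived files fold in.
                unchanged = 0
                current = latest
        return current

In a real job we would already be dispatching work from each intermediate value of current rather than waiting for the loop to finish; the loop only shows the detection of the fixed point.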

In conclusion, we can deal with eventual consistency in file listing by repeating the listing operation, detecting ghost and conceived files, and modifying our work queues to take this new knowledge about the listing into account.