Tuesday, October 11, 2011

Multi-threaded vs Multi-process

Cacheismo doesn't use threads. So does redis. Memcached is multi-threaded. How to choose?

Multi-threaded applications are hard to code and understand. Non-blocking multi-threaded applications are close to nightmare. I had the honor of writing multi-threaded non-blocking http proxy @apigee. It was written in c, had JNI bindings with java and could talk to our custom interface with kernel using shared memory (to avoid kernel space to user space data copy).  Having done that, I choose not to use threads in cacheismo.

Scalability of multi-threaded applications with multiple cores is simply a function of how much work can be done in parallel. The lesser the conflicts between the threads, better is the performance. The first problem is distribution of work among the requests. If multiple threads are working  concurrently on same piece of data, it is hard for them to make progress. Memcached uses a single selector thread to bind connections to different threads. It is not optimal. It will become bottleneck when you have thousands of clients trying to connect to the server at the same time. May be all connections of one thread are sleeping whereas all connections of some other thread are killing the cpu. So even though we have some work, few threads are not bothered because it doesn't belongs to them. This is kind of not possible with memcached (it hardly uses cpu), but very much possible with cacheismo running user scripts.  Now consider the hashmap itself. One hashmap, multiple threads, unavoidable lock. Possible solution here would be striping. What about slabs. One possibility is per thread slabs or some form of thread local caching or instead of using lock at slab level, use per slab locks.

In short, to make multi-threaded code work at optimal levels, we have to looking at our objects in finer granularity than what we originally planned. We have to make our design sub optimal to make sure threads make progress, even if at the cost of under utilization of resources.

Multi-process model works the opposite way. Consider your complete code/process as an object and create multiples of them. Have some way to distribute requests to these individual objects/processes. In many cases it is possible, specifically when either clients have knowledge of servers (consistent hashing) or if some intelligent proxy can do this for us (http load balancer). Even this could be non-optimal, but at least it is simple. What else? Memory leaks are easier to manage with multi-process model. Just restart the process. Since memory is managed at process level, it can be easily reclaimed by killing the process and if we have multiple of them, the service is not disrupted for every user. In a multi-threaded model, restart will impact all users. No locks are needed in multi process code, unless we use shared memory. The process boundary gives exclusive access to memory which doesn't need further management.

I don't see multi-process code in any ways inferior to multi-threaded code. It has some benefits like simple to code, simple to debug, simple to maintain. Other beauty is if instead of running 1 process with multiple threads you are running multiple processes with single thread, crash in the first case will cause full system failure, whereas in second case you still have (n-1)/n percent of the system up.  It is easy to upgrade things one at a time because we have more than 1 thing. With single server, no option but to shut it down, upgrade and start.

The only problem I see with multi-process code is duplication of basic data structure and configuration.  For example running each requests in a separate java process in tomcat or jboss will be a really bad idea. There is simply too much shared state - db connection pool, byte code, runtime native code, session information, etc.   The other problem is a bit superficial one; instead of managing one process, we have to manage multiple of them (use scripts?)

If your solution is either stateless or it can be partitioned such that there is no interdependence, multiprocess will make more sense. It is a difficult choice, but just don't use threads if you can survive without them, specially with non-blocking code. Try  cacheismo and see for  yourself.

7 comments:

  1. Hey Rohit,

    Some good thoughts there. Clearly, developing a multi-threaded system that scales with cores is no trivial task keeping in mind all the complexities of shared data contention, locking, non-blocking IO, thread-connection stickiness, scheduling etc.

    While new language constructs/features (Java's Fork/Join, Agents in Functional Programming languages) are a step towards effectively leveraging multiple cores, for most class of problems a process approach could work out really well. Especially with the virtualization and horizontal scaling trend, a process oriented application clearly will be the right way to leverage.

    Ironically, the whole middleware stack that ruled the last decade was meant to solve this problem (among other things) and keep application code agnostic of the threading issues.

    ReplyDelete
  2. Thanks Ramesh. I remember discussing this with you and Ashutosh. Just penned in down ;)
    I used refcounting at multiple places in cacheismo and just the thought of putting locks in all those objects and using atomic increment/decrement operations all over gives me shivers. And then I thought it is so much simple to just run multiple of these processes instead of struggling with threads.

    ReplyDelete
  3. Was listening to Martin Odesky (Scala's author) in one of the podcasts on domain specific functional languages. He asks a very interesting question: Given that we are seeing processors with several hundreds of cores on the bleeding edge, couple of years down the line, it would be a common trend to see. The question then is, assuming that you have a problem that needs all these cores, what programming language/framework one would use to solve. He may be right in that there wouldn't be a single solution because to effectively leverage those many cores, one needs to be able to identify and codify the task parallelism. And he goes on to say that Domain specific languages would be be the best way to do that.

    ReplyDelete
  4. One of my self invented uses case for cacheismo is to be able to do map-reduce or something similar. I have already added support for cacheismo getting data from other cacheismo server. The next step is to add parallel get support. Given that, I believe cacheismo cluster can be thought of as some sort of universal RAM which stores objects accessible via keys instead of address. Consider the problem of top 10 keys in the cluster. One cacheismo server does a parallel get on all other cacheismo server and asks them to return top 10 keys. Merges all keys locally and returns the result. This effectively is what map-reduce does. This could be done with a single server with 1024 cores running 1024 cacheismo instances. I guess with parallel get what we get is a notion of a parallel stack, a true NFA, spread over the network. Since cacheismo uses lua coroutines, it could be doing thousands of such queries at the same time without blocking any threads. I guess the best part of this framework is that you are still in the familiar function call semantics except they are running in parallel on multiple cores/servers.

    ReplyDelete
  5. Interesting. Not directly related, but I see this more a trend in all modern day NoSQL DBs.

    http://blog.fiesta.cc/post/10980328832/walkthrough-a-mongodb-map-reduce-job

    So, in your example of cacheismo map-reduce, it would be nice to see this as a standard map-reduce primitive to deal with some really cool in-memory analytics problems.

    ReplyDelete
  6. Here is how the top keys implementation will look like in cacheismo.

    Create a file called mapreduce.lua in scripts directory with following functions...

    -- sort of static function
    -- takes number of keys to return as argument.
    -- virtual key => mapreduce:topKeys:count

    function topKeys(count)
    local keys = accessCounter:getKeys()
    -- sort keys by access count
    table.sort (keys, function (a,b)
    if (a:getAccessCount() > b:getAccessCount()) then
    return true
    end
    return false
    end)

    local result = {}
    local i = 1
    for (i in 1..count) do
    result[i] = keys[i]
    end
    return result
    end

    -- helper function
    function mergeKeys(total, newMember)
    for k,v in pairs(newMember) do
    local newValue = v + total[k]
    total[k] = newValue
    end
    end

    -- manager function which initiates the map
    -- reduce calculation
    -- virtual key => mapreduce:manageGetTop:count

    function manageGetTop(count)
    local query = {}
    -- creates a parallel query with key as
    -- server and virtual key as value
    for k,v in pairs(servers) do
    query[k] = "mapreduce:topKeys:"..count
    end
    local results = getInParallel(query)
    local total = {}
    for k,v in pairs(results) do
    mergeKeys(total, v)
    end
    table.sort (total)
    local topkeys = {}
    for (i in 1..count) do
    topkeys[i] = total[i]
    end
    end

    ReplyDelete
  7. Neat. Compare this with other non-blocking event-driven programming approaches (such as Node.js) and see how simple it is. May be something you can highlight with code examples in another post :-)

    ReplyDelete