Thursday, December 22, 2011

Economics of a slow judicial system

Let me go first to to non obvious and we will deal with the obvious later.

If it takes 10 or 20 years for the court to give judgement, who benefits from it? I guess it is the lawyers. The question is, is the judiciary slow or is it the lawyers who use the loopholes of the system to make cases last forever. If judiciary was fast and people didn't fight much in the courts, the only people who stand to loose is the lawyers.  Net income of lawyers is dependent on number of concurrent cases they are handling. To maximize their income, they should make sure that none of the cases they fight reach any judgement, because that is like losing a customer and why would a lawyer want it. Just a thought. No facts.

The obvious part is, even if you committed a crime and have enough money, lawyers will keep the judgement out of the way to let you live free.

This is another instance where demand supply completely breaks. The business depends upon the idea of selling justice, but not delivering it..because the moment it is delivered, no more money can be extracted from the customer.

Wednesday, November 30, 2011

Why power outlets are below the desk?

I have one pain in my life that I go through twice a day. The pain of plugging in laptop cable in the morning and taking it out in the evening. For some reason all carpenters/designers seem to find power outlets hidden below the desk more appropriate instead of providing one over the desk.

The reason is offices or the modern offices were invented in the era of desktops and it made sense to keep all the wires hidden below. Once connected, no one needs to touch them ever. Unfortunately most of world has moved on to laptops, but the modern office design has become a little obsolete. 

Friday, October 21, 2011

Use cases for Cacheismo

Cacheimso is a scriptable in-memory lua object/key value cache, written in c and works on most posix compliant systems. It is based on a simple idea of exposing objects with methods instead of opaque values.

  • Simplest use case would be to use cacheismo as a key value cache, just like memcached. It supports the tcp ascii memcached protocol. No need to specify or tune slab sizes. You would be surprised by improvement in your cache hit rate. This comes from using special memory management scheme to optimize memory usage.  Cacheismo Memory Management 
  • Second use case for cacheismo is when you would like to store server side state which is frequently modified. Typically this is where you would use CAS operations in memcached.  Example could be counters, rate limiting, last N users, etc. If you use memcached for storing session objects, it would be essential to map users to specific machines so that session information is not corrupted because of concurrent modifications to session state. With cacheismo you could keep your session object in cacheismo server and let any app server to handle any user request.  Cacheismo allows creating server side objects in lua scripting language which can be atomically modified by use of get operations on special keys called virtual keys.  Cacheismo running on single core can at max do about 80K get operations per second and this includes running a minimal script on every get operation. Your mileage will depend upon complexity of the script and size of your objects.   Sliding Window Counter Quota With Cacheismo Cacheismo Lua API
  • Cacheismo also supports talking to other cacheismo server from the scripting environment. This is based on two functions getFromServer and getInParallel.  These can be used to create synchronous  replication, proxy functionality to route requests to different servers and for in memory parallel computations.  Cacheismo Cluster Using Cacheismo Cluster For Real Time Analytics 
  • Cacheismo also supports querying for existing keys through scripts. This can be used to create replica of an existing system if the current system needs to be shutdown for any reason. Finding Keys In Cacheismo
  • Cacheismo supports server side consistent hashing. If you find it difficult to update all your app machines when new servers are added, you could use this feature to do seamless upgrades. This comes with a network cost. One hop to the first cacheismo server which finds the right node and second to the right node for the key. 

Thursday, October 20, 2011

Education for Education

Education loans are based on the same model as other loans, the difference being that interest rates are less compared to other kind of loans.  The model makes sense since banks are in it for money. But what if this was done by some non profit organization whose only function is to make it easy for people to get educated.

Lets say it cost C1 at time T1 to get education E. Lets say at some time T2 in the future the cost of same education is now C2.  Lets say "payback" amount for an education loan is simply a function of current cost of that education. You can "payback" at any point in time, after 2 years or 10 years or 20 years or do partial payments. Basically what you get at start is "cost of education" and what you payback is also "cost of education".  Essentially I am decoupling "cost of money" which is interest that banks charge over time from "cost of education".

Under normal circumstances, this would have ensured that non profit organization which has funds to support education of N persons, will continue to have funds to support N persons at all times in the future.  When I looked up wikipedia for cost of education is US, it looks like this would have been a very bad deal given that cost of education in US has increased 2.5 times more than inflation (which is a good indicator of interest rate scheme).  I don't know what is going on here.

The economic feedback loop is pretty big here.
 => High cost of education
 => Student debt
          => Unable to pay
                => Economic instability or political unrest
                     => Depression ?
 => Choose not to go for higher education
          => Less qualified people for the job
              => More salaries for qualified people
                     => Now debt for higher education is justified
 => Universities make loss, less students enrole
        => Universities decrease prices
            => Stability - cost of education is justified

It takes years before we know the mistakes and rectify them.  I keep coming back to the same point, money is a bad abstraction. We need a better model for making people help other people now, in belief that they will be helped when needed in future.

Index Based Salaries

Prices change. Some change every day, others may be once a month. Salaries are fixed, unless company decides to change it.

When government decides to increase petrol prices, price of almost everything increases because almost every business depends on transportation. But salaries don't change. When SBI increases cash reserve ratio or base rate, EMI increases, production decreases but salaries don't change.

It is kind of unfair that everyone in business can increase/decrease prices either based on market conditions or increase/decrease in cost of doing business but salaried people don't have that choice. The only choice they have is - talk to your manager or find a new job.

What if salary was a function? Say Hari Sadu has a home loan, drives a car and has children studying in school.  If his salary was linked to his home loan rate, cost of petrol and cost of education, he would be a very happy employee. Just like businesses can adjust their prices, if salaries could be adjusted the same way, it would create a much faster feedback loop for the economy to adjust to the new conditions. If a business decides that it cannot support salary function of some employee and wants to decrease his salary, it will come out clean and open. The current system of things, simply says things are same (salary stays the same) but actually employee is taking a hit and it stays unsaid, unacknowledged.  A 10% salary raise at yearly performance evaluation might be less that inflation or much much less than "impact of inflation" for a given employee.  In this new system, raise is a raise, it gives employee more spending power.

This is obviously a bad move for the companies because it makes one of fixed costs of business, variable. But what it also does is that it brings transparency. Just as businesses understand and adjust to prices of other businesses, it makes sense to understand and adjust to impact of prices on your employees.  If home loan rate increase by HDFC is eating into margins of Wipro or Infosys, they are in a better position of negotiate with HDFC than each employee on his own and then may be HDFC will figure out a deal where Wipro or Infosys charges less for software.  I guess all I am talking about is businesses being more people aware than money aware, because money in itself can do nothing, only people can.

Tuesday, October 18, 2011

Finding Keys In Cacheismo

In the last post I talked about how to do map-reduce like stuff with cacheimso. It occurred to me that their is no method to get all the keys. The global hashmap object now supports new method getPrefixMatchingKeys which as the name implies returns keys which match a certain prefix. Typically objects are stored with following template objectType$objectKey. Thus an object created by file quota.lua with key myKey will be stored with key quota$myKey. To do something with all the quota objects:
local keys = getHashMap():getPrefixMatchingKeys("quota$")
for k,v in pairs(keys) do 

If you would like the quota object itself to support say getAllKeys method then ...add this to quota.lua in scripts directory.

function getAllKeys() 

    local keys = getHashMap():getPrefixMatchingKeys("quota$")
    local result = ""
    for k,v in pairs(keys) do 

Now a get request with key quota:getAllKeys would return a new line separated list of all active quota objects in the map.

This is good enough but probably not very interesting. I am planning to support indexes as first class objects to make it faster to reach interesting objects in quick time. These indexes will automatically remove objects which are deleted or forced to expire because of LRU caching. So if you need to find all the quota objects that are close to quota limit, create an index on quota used value.

Thursday, October 13, 2011

Using Cacheismo Cluster for Real Time Analytics

Cacheismo now supports invoking requests on other server.
  - getFromServer
    This function takes the server name and a key as an argument and returns the result to the script.
  - getInParallel
    This function takes map of server names to list of keys and gets values for all the keys from all the servers in parallel. Once all the results are received, the script gets a map of server names to map of keys and received values.

Here is a simple example code which gets to top accessed keys from the cacheismo cluster using a simplified, framework-less map-reduce.  The contents below should belong to file names mapreduce.lua

-- function called on every cacheismo node to get 
-- the top keys by access count
-- virtual key : mapreduce:topKeys:

function topKeys(count) 
    local keys  = accessCounter:getKeyAccessCount()
    -- sort keys by access count
    table.sort (keys)
    local result = {}
    local i      = 1
    for (i in 1..count) do 
       result[i] = keys[i]
    return result

-- function called on one of the cacheismo nodes 
-- which distributes work to all other nodes using getInParallel
-- virtual key : mapreduce:manageGetTop:10 

function manageGetTop(count) 
     local query = {}
     for k,v in pairs(servers) do 
        query[k] = "mapreduce:topKeys:"..count
     local results = getInParallel(query)
     local total = {}
     for k,v in pairs(results) do 
           for k1,v1 in pairs(v) do 
              total[k1] = v1
     table.sort (total)
     local topkeys = {}
     for (i in 1..count) do 
       topkeys[i] = total[i]

I have taken some liberties with the syntax and their is no accessCounter object by default in cacheismo, but is fairly easy to create in the script.  Note that the above implementation doesn't have ant intermediate nodes, but is trivial to create by calling get for mapreduce:manageGetTop instead of calling mapreduce:topKeys.

Single cacheimso server might be serving thousands of such queries because everything is happening in non-blocking manner but we give user the illusion of synchronous stack using lua coroutines.

Cacheismo Introduction
Cacheismo Lua API
Cacheimso Sliding Window Counter Implementation in Lua
Cacheismo Quota Implementation in Lua
Cacheismo Memory Allocation
Cacheismo Source Code
Cacheismo Support

Tuesday, October 11, 2011

Multi-threaded vs Multi-process

Cacheismo doesn't use threads. So does redis. Memcached is multi-threaded. How to choose?

Multi-threaded applications are hard to code and understand. Non-blocking multi-threaded applications are close to nightmare. I had the honor of writing multi-threaded non-blocking http proxy @apigee. It was written in c, had JNI bindings with java and could talk to our custom interface with kernel using shared memory (to avoid kernel space to user space data copy).  Having done that, I choose not to use threads in cacheismo.

Scalability of multi-threaded applications with multiple cores is simply a function of how much work can be done in parallel. The lesser the conflicts between the threads, better is the performance. The first problem is distribution of work among the requests. If multiple threads are working  concurrently on same piece of data, it is hard for them to make progress. Memcached uses a single selector thread to bind connections to different threads. It is not optimal. It will become bottleneck when you have thousands of clients trying to connect to the server at the same time. May be all connections of one thread are sleeping whereas all connections of some other thread are killing the cpu. So even though we have some work, few threads are not bothered because it doesn't belongs to them. This is kind of not possible with memcached (it hardly uses cpu), but very much possible with cacheismo running user scripts.  Now consider the hashmap itself. One hashmap, multiple threads, unavoidable lock. Possible solution here would be striping. What about slabs. One possibility is per thread slabs or some form of thread local caching or instead of using lock at slab level, use per slab locks.

In short, to make multi-threaded code work at optimal levels, we have to looking at our objects in finer granularity than what we originally planned. We have to make our design sub optimal to make sure threads make progress, even if at the cost of under utilization of resources.

Multi-process model works the opposite way. Consider your complete code/process as an object and create multiples of them. Have some way to distribute requests to these individual objects/processes. In many cases it is possible, specifically when either clients have knowledge of servers (consistent hashing) or if some intelligent proxy can do this for us (http load balancer). Even this could be non-optimal, but at least it is simple. What else? Memory leaks are easier to manage with multi-process model. Just restart the process. Since memory is managed at process level, it can be easily reclaimed by killing the process and if we have multiple of them, the service is not disrupted for every user. In a multi-threaded model, restart will impact all users. No locks are needed in multi process code, unless we use shared memory. The process boundary gives exclusive access to memory which doesn't need further management.

I don't see multi-process code in any ways inferior to multi-threaded code. It has some benefits like simple to code, simple to debug, simple to maintain. Other beauty is if instead of running 1 process with multiple threads you are running multiple processes with single thread, crash in the first case will cause full system failure, whereas in second case you still have (n-1)/n percent of the system up.  It is easy to upgrade things one at a time because we have more than 1 thing. With single server, no option but to shut it down, upgrade and start.

The only problem I see with multi-process code is duplication of basic data structure and configuration.  For example running each requests in a separate java process in tomcat or jboss will be a really bad idea. There is simply too much shared state - db connection pool, byte code, runtime native code, session information, etc.   The other problem is a bit superficial one; instead of managing one process, we have to manage multiple of them (use scripts?)

If your solution is either stateless or it can be partitioned such that there is no interdependence, multiprocess will make more sense. It is a difficult choice, but just don't use threads if you can survive without them, specially with non-blocking code. Try  cacheismo and see for  yourself.

Tuesday, October 04, 2011

Cacheismo Cluster Update 2

Today I could set the value on server 1 and get it back from server 2.  This script of server 2 was changed a bit.

function handleGET(command, key) 
    local cacheItem = getHashMap():get(key)
    if (cacheItem ~= nil) then
   local result = command:getFromExternalServer("", key)
   if (result ~= nil) then
       writeStringAsValue(command, key, result)

This is where I feel lua beats javascript because of its excellent coroutine support.  I could easily change the above code to a while loop and go through all servers in the cluster looking for a value and it would not change the complexity of the code a bit, but try doing it with javascript callbacks and it would look like a mess. 

I am thinking now that the original idea of using dataKey and virtualKey is a bit lame. What I would do now is that I will make server an explicit argument, just like I used above. For consistent hashing we will have an object exposed in lua which can be used to find the correct server via consistent hashing, but it would be optional to use. This gives users the flexibility to use some cacheismo instances as pure proxies and others as caches. Some of the servers can be part of one consistent hashing group and others can be part of another. For some reason if users want to use one of the servers as a naming server to map keys to server, they can use it that way. Essentially users have full flexibility with respect to how to map keys to servers  - consistent hashing, multiple servers hidden behind a proxy(cacheismo), naming server to map keys to servers or any combination of these.  

Other possibilities include replication and quorum. These can be accomplished with the current API itself, but I will add parallel get support to make it as fast as possible. 

Friday, September 30, 2011

Cacheismo Lua API

The scripting capabilities of cacheismo are very powerful. The code for handing commands like get/set/add etc is completely written in lua using the lua interface. Here is the code for handling SET command:

local function handleSET(command) 
  local hashMap   = getHashMap()   
  local cacheItem = hashMap:get(command:getKey())
  if (cacheItem ~= nil) then 
      cacheItem = nil
  cacheItem = command:newCacheItem()
  if (cacheItem ~= nil) then 
     command:writeString("SERVER_ERROR Not Enough Memory\r\n")
  return 0

This is the list of the lua methods which can be called by new scripts

- Global method, return instance of the global hashmap object.

- Sets the logging level. Four levels are defined DEBUG=0, INFO=2,
  WARN=2 and ERR=3. The function takes numeric argument.

The standard table object in lua is extended to support the following methods

- returns a string object which represents the serialized table.

- returns a fully constructed table from the given serialized table string

- returns a deep copy of the given table

The global hashmap object retrieved from getHashMap() supports following methods

- returns a cacheItem object for the given key if found, else nil

- takes a cacheItem object and puts it in the hashMap
- deletes the cacheItem object associated with the given key from the hashMap.

- deletes as many LRU objects as required to free up at least requiredSpace.

The data in the hashMap is stored in the form of cacheItem objects. We need to have
a cacheItem object to store in the map and cacheItem is what we get from the map
when we do a get. The object supports following methods. These objects are
read only.

- returns the key associated with the cacheItem

- returns the number of bytes in the key

- returns the time in seconds since epoch when the current item will expire.

- returns the flags associated with the cacheItem

- returns the size of the data stored in this cacheItem

- returns the data stored in the cacheItem as lua string. This is normally not
used because lua script don't need access to actual data unless using virtual keys.

- deletes the reference to the cacheItem object. Important to call this after get
from hashMap, after using the cacheItem.

This represents the request coming from the client. This can be modified if
required by the script.

- returns the memcached the command being used.
  One of get, add, set, replace, prepend, append, cas, incr, decr, gets, delete,
  stats, flush_all, version, quit or verbosity
- returns the key used in the command. This is nil in case of multi-get and other
  commands which don't have key as argument.

- set the given string newKey as the key for the command object.

- returns the size in bytes for the key

- returns the expiryTime specified in the command.

- sets the expiry time to new value newTime.

- returns the flags specified in the command. It is also used by the verbosity
  command to return the logging level.
- sets the falgs in the command object to new flags.

- returns the delta value used in incr/decr commands

- sets the delta value to new delta value

- returns the boolean noReply from the command.

- sets the value for the noReply

- returns the size of data in the command, in bytes. This works for set, add, etc.

- replaces the data in the command object with the newData string

- returns the data in the command object as lua string

- creates and returns a cacheItem object using the data in the command object.
  This is used with set/add/etc

- writes the cacheItem in the memcached response format on the client socket.
   "VALUE key flags size\r\ndata\r\n"
  cas is not supported.

- writes arbitrary string to the client socket. This is useful for writing
  "END\r\n", "STORED" and other standard memcached response strings.

- returns the number of multi-get keys in the command

- returns a table of all the keys in the multi-get command

Apart from these, few helper methods defined in config.lua are also useful.

writeStringAsValue(command, key, value) 
- writes an arbitrary string in the "VALUE key flags size\r\ndata\r\n" format.
  flags is 0 and size is calculated using string.len

executeReadOnly(command, originalKey, objectType, cacheKey, func, ...) 
- helper function for read only operations on lua tables stored in the cache.
 It retrievs the data from the cache, uses table.unmarshal to construct the
 lua table and calls the function func with this lua table as the first
 argument. Any extra args passed to this function are passed to function func.

executeReadWrite(command, originalKey, objectType, cacheKey, func, ...) 
- helper function for read/write operations on lua tables stored in the cache
  orginal object is deleted and a new object is created with the same key with
  latest data values

executeNew(command, originalKey, objectType, cacheKey, func, ...)
- helper function for creation of new objects based on lua tables. if the key
  is in use, it is deleted before creating the new object

See set.lua, map.lua, quota.lua and swcounter.lua for example usage.

Cacheismo sliding window counter

Sliding window counters are useful for maintaining information about some last n transactions. One example could be a site like pingdom storing last n ping times for a site. Cacheismo comes with a default sliding window counter example which can be modified to suit your needs.  The current implementation supports new, add, getlast, getmin, getmax, getavg.

Note that these are memached get requests with special keys encoding our intent.  The keys follow the following syntax:

Object type is "swcounter". Cacheismo has other object types as well and more can be added by adding new lua scripts. Methods for swcounter object are "new", "add", "getmin", "getmax","getavg", "getlast", Object key we are using is "myCounter". You would need key per counter. Number of arguments would vary depending upon the method. Here new takes 1 arg (number of values to store), add takes 1 arg (value to store). Rest all methods take no arguments.

get swcounter:new:myCounter:2
VALUE  swcounter:myCounter:100 0 8 
-- Create a new counter which stores the last 2 values
get swcounter:add:myCounter:100
VALUE  swcounter:add:myCounter:100 0 7 
--store 100 as first value 
get swcounter:add:myCounter:50
VALUE  swcounter:add:myCounter:50 0 7 
-- store 50 as second value
get swcounter:getmin:myCounter
VALUE  swcounter:getmin:myCounter 0 2
-- get the min entry among all the stored values
get swcounter:getmax:myCounter
VALUE  swcounter:getmax:myCounter 0 3
-- get the max entry among all the stored values
get swcounter:getavg:myCounter
VALUE  swcounter:getavg:myCounter 0 2
-- get the average value for all the stored values

The complete implementation in lua looks like this.
-- sliding window counter 
local function new(windowsize) 
  local  object = {}
  if (windowsize == nil) then
     windowsize = "16"
  object.size   = tonumber(windowsize)
  object.index  = 0
  object.values = {}
  local count   = 0
  for count = 1,object.size do 
object.values[count] = 0
  return object

local function add(object, value) 
   object.index = object.index+1
   if (object.index > object.size) then 
       object.index = 1 
   object.values[object.index] = tonumber (value) 

local function getlast(object)
   if (object.index ~= 0) then
       return object.values[object.index]
   return 0

local function getmin(object)
   local count = 0
   local min   = math.huge
   for count = 1,object.size do
       if (object.values[count] < min) then
          min = object.values[count]
   return min

local function getmax(object)
   local count = 0
   local max   = 0
   for count = 1,object.size do
       if (object.values[count] > max) then
          max = object.values[count]
   return max

local function getavg(object)
   local count = 0
   local total   = 0
   for count = 1, object.size do
        total = total + object.values[count]
   return (total/object.size)

Rate Limiting/Quota with Cacheismo

Cacheismo is a non-blocking single threaded high performance scriptable memcache. Instead of storing bytes or blobs in the cache, it can store objects defined in lua. No need to get the object, modify it and then save it back (with retry).  Directly modify object state in the cache itself by invoking methods of your objects. Because it is single threaded, modifications are automatically atomic. It is best suited for managing volatile application state at super speed (70K requests per second per core). Memcached would be more useful if your objects don't change much. But if they are changed frequently, memcached's update mechanism using retries would give you significantly low performance.

It comes with a default implementation of a quota object. You can change it as it suits you without touching the c code.  Here is how to use it.

Note that these are memached get requests with special keys encoding our intent.  The keys follow the following syntax:

Object type is "quota". Cacheismo has other object types as well and more can be added by adding new lua scripts. Methods for quota object are "new", "addandcheck", "reset". Object key we are using is "userKey". You would need key per user. Number of arguments would vary depending upon the method. Here new takes 2 args, addandcheck takes 1 and reset takes 0.

get quota:new:userKey:10:hour 
VALUE quota:new:userKey:10:hour 0 7 
-- Creates a new quota object in cacheismo, that enforces 10 requests per hour.
get quota:addandcheck:userKey:10
VALUE quota:addandcheck:userKey:10 0 8
-- Checking if user can use 10 units of quota. This succeeds. It is easy to create a default user object if it doesn't exist in this call also, instead of forcing users to always do a new first.
get quota:addandcheck:userKey:1
VALUE quota:addandcheck:userKey:1 0 14
-- Checking if user can use 1 more unit of quota. This fails as user exhausted it in previous call.
get quota:reset:userKey
VALUE quota:reset:userKey 0 8
-- Reset the quota back to 0.

Cacheismo can do around 70K such operations per second running on single core. Like memcached, if single instance is not enough, multiple instances can be used. Cacheismo supports memcached ascii protocol. So any memcached client can be used to access it.

Introduction to cacheismo

-- simple quota implementation 
local quotaTypesMap = { month = 1, day = 1, hour = 1 }

local function findExpiresAt(quotaType)
       local current ="*t")
       if (quotaType == "month") then
           current.month = current.month + 1
    = 1
           current.hour  = 0
           current.min   = 0
           current.sec   = 1
       elseif (quotaType == "day") then
    = + 1
           current.hour  = 0
           current.min   = 0
           current.sec   = 1
           current.hour = current.hour+1
           current.min   = 0
           current.sec   = 1
    return os.time(current)

local function new(limit, quotaType) 
  if (quotaTypesMap[quotaType] == nil) then
      return nil
  local  object    = {}
  object.limit     = tonumber(limit)
  object.type      = quotaType
  object.count     = 0
  object.expiresAt = findExpiresAt(quotaType)
  return object

local function addandcheck(object, value)
   local toUseValue = 1 
   if (value ~= nil) then
      toUseValue = tonumber(value)
   if (os.time() > object.expiresAt) then
       object.count = 0;
       object.expiresAt = findExpiresAt(object.type) 
   if ((object.count + toUseValue) > object.limit) then
       return -1
       object.count = object.count + toUseValue
       return 0

local function reset(object) 
   if (os.time() > object.expiresAt) then
       object.count = 0
       object.expiresAt = findExpiresAt(object.type) 
   object.count = 0
   return 0

Cacheismo Memory Allocation

Cacheismo uses two different kind of memory allocators. One is called chunkpool which is used to store the items and another is called fallocator which is used for request processing. Both have different design goals. chunkpool is designed to be reasonably fast but very compact. It tries to minimize memory wastage. fallocator on the other hand doesn't care about memory wastage but tries to be very fast. Here are the details.

fallocator is used when we are allocating memory temporarily and it is certain that it will be freed very soon. For example when reading data from socket and parsing the request, many small strings are generated but these are released as soon as the request is over, which could be a matter of micro seconds.   Cacheismo creates a fallocator per connection. fallocator not only allocates memory but also keeps track of memory. Thus if because of bad code we miss freeing some memory, it will be reclaimed when the fallocator is destroyed, essentially when the connection is closed.

fallocator internally allocates a big buffer (4KB) and keeps track of how much of the buffer is used. Malloc simply returns buffer+used at the pointer and increments used by size of malloc.  If the current buffer cannot handle the request, new buffer is allocated and memory given from new buffer. Used buffers are kept in a linked list. On free, fallocator decreases the refcount on the buffer which was used to allocate that memory. If the refcount becomes zero, the buffer is freed from the used buffers list. The benefit of this refcount based memory management is that it keeps memory usage to minimum. So if you are doing some malloc/frees in a loop, they will not exhaust the memory because refcount mechanism will ensure that they are released. If the malloc size is bigger than 4KB, fallocator defaults to malloc but keeps track of the buffer.

Moreover the 4KB buffer are cached to avoid going to malloc every time. This scheme of things give us O(1) malloc and  O(1) free in normal case (when  not going to system malloc).   It avoids fragmentation of the memory but could result in more memory being used than required at any point in time. It also gives us some protection against memory leaks.

chunkpool is used for storing the cacheitems and the goal here is to maximize memory utilization. chunkpool imposes two constrains on the application. One, it never allocates more than 4KB of contiguous memory. That is, if the application needs 8KB of memory, it can at max only get 2 4KB blocks and not a single 8KB block. Applications needs to be designed around this limitation. The second one is not a constraint, but an advice.  Apart from malloc, chunkpool exposes another function called relaxedMalloc which can return memory smaller than what application asked for. Thus if application asks for 4KB of memory, it might return 256 bytes or 1KB if it doesn't have a 4KB block. If the returned memory is too small, application can free it and deal with out of memory condition or if possible use large number of small buffers instead of small number of large buffers.

Because of these restrictions, fragmentation of memory doesn't makes the system unusable, but only adds  more memory management overhead. Internally we have slabs of size 16 + n16 upto 4096 bytes. Initially all memory is allocated to the 4KB slab. If applications asks for say 1KB of memory we split the 4Kb buffer into two buffer on of 3Kb and another of 1Kb. The 1 KB buffer goes to the application and the 3KB is stored in the 3Kb slab. Cacheismo chunkpool uses skiplist to keep track of non empty slabs. At runtime if we found that the slab of asked malloc size is empty, it quickly find the next not empty slab, splits the buffer and returns memory to the application. On free, chunkpool tries to merge the free buffers to create bigger buffers.  One limitation that I will fix later is that chunkpool memory uses the first four bytes for memory management. The implication is that any free buffer can only merge with logical next buffer and not logical previous buffer. This would require storing buffer size also at the end of the buffer and not just at the start.  To circumvent this problem, chunkpool requires an extra GC call where it tries to merge 8MB of pages at a time.  This is called once every second.  This is triggered only if at least 12% of memory is free and average chunk size is less than 256 bytes other wise it is a no-op. Even though chunkpool is capable of addressing upto 64GB of memory, it is advisable not to go beyond say 8GB.

chunkpool is a very efficient allocation manager for cache data, when compared with slab allocator.  It saves user the trouble of first thinking about what slab sizes should be used and maintains high HIT Rate when object sizes are changed because of changes in application code.  Cacheismo is not faster than memcached when comparing raw TPS, but when you take HIT Rate into account, it saves you lots of db accesses and increases overall application performance.  You can download it from here.

Wednesday, September 28, 2011

Cacheismo Cluster Update

I have started work on cacheismo cluster. I should have something working in next couple of days. The interface for lua scripts to get values from the cluster would need small change. Instead of only specifying the key, the caller will need to specify both dataKey as well as the virtualKey when looking up for a value. Cacheismo will use the dataKey to decide the server to which the virtualKey would be sent so that we don't end up in a endless loop. For dataKey's lookup is anyways simpler.

The cacheismo cluster specification (set of servers in the cluster) would be exposed via special virtual keys, so that it can be updated just like other objects. Each of the servers will monitor all other servers in the cluster and update its consistent hashing table so that requests are routed appropriately. This places a restriction on the number of servers we can add in a cluster. Maybe at some point what we will need is neste or multi-level consistent hashing where a special node will stand for multiple other nodes, something like a router which connects one network with another.

UPDATE: Cacheismo Cluster Update 2

Monday, September 26, 2011

Cacheismo Next Steps

Most important limitation of current approach is that cacheismo cannot be used in clustered mode. The reason being virtual key hashing might choose a server different from where the actual key is stored.
hash("set:new:myKey") is not equal to hash("set$myKey") is not equal to hash("set:count:myKey")

I did some experiments with lua threads and they look pretty cheap to create. The current plan is to make cacheismo nodes aware of each other and add client side support to cacheismo server. Instead of just looking up the key in local server memory, cacheismo will also do the hashing at server side on the actual key and fetch the results from the responsible cacheismo server.  This would double the latency but ensure that any virtual key can be thrown at any server and it will fetch the results.

For example consider something like "set:intersection:mySet1:mySet2:mySet3".
Lets assume that  mySet1, mySet2 and mySet3 are stored on different servers.
Trivial implementation would be to get mySet1, mySet2 and mySet3 to server executing the request and do the processing. Another possibility is to get sizes of mySet1, mySet2 and mySet3 and execute two parallel "set:intersection:smallest:other1" and "set:intersection:smallest:other2", followed by local intersection over the result.

Since cacheismo is scriptable, it is always possible to change this logic.  I will hide the non-blocking details using lua co-routines so writing such objects will not require "knowing" the asynchronous IO aspects, just some friendly function call like getValueForKey.

It might make it possible to do some in memory map-reduce with recursive calls for virtual keys running parallel stacks spread over the network, executing at the same time.

UPDATE: Cacheismo Cluster Update

Saturday, September 24, 2011

Introducing Cacheismo

Cacheismo is scriptable object cache which can be used as a replacement for memcached. It supports memcache protocol (tcp and ascii only). It automatically manages slab sizes, no user configuration is required.  Slabs are intelligently managed and they reconfigure themselves as per  the usage. PUT never ever fails. Details follow:
- Efficient use of memory 
Slab allocator used in memcached would waste memory if slab sizes and object sizes are not in sync. For example when using slab sizes of 64/128/256/512/1024 bytes object with size 513 is stored in 1024 bytes slab, wasting almost half the memory. Cacheismo tries to  maintain memory used by object as close as possible to the object size
- Reclaimable Memory 
Every byte is reclaimable in cacheismo. PUT never fails. You will never have slab full issues. More on memory management.
- True LRU support. Dependable cache
Memcached will evict items from the slab in which object will be placed. Cacheismo will always use the LRU, irrespective of the size. This ensures that the most important data is always in the cache. No surprises. If something is not in the cache, it either expired or was least used.
- Server Side Scriptable objects
The best part of using cacheismo is that is it fully scriptable in lua. Instead of cluttering your code with workarounds for  memcached read/write semantics, you can move that logic inside  cacheismo.  See Use cases for cacheismo Rate Limiting Quota With Cacheismo Using Cacheismo Cluster For Real-time Analytics
Sample objects map, set, quota and sliding window counters written in lua can be found in scripts directory. You could create  your own objects and place them in the scripts directory. The interface for accessing scriptable objects is implemented via memcached get requests. For example:  get set:new:mykey   would create a new set object refered via myKey get set:put:myKey:a   would put key a in the set myKey get set:count:myKey   would return number of elements in the set get set:union:myKey1:myKey2   would return union of sets myKey1 and myKey2 See scripts/set.lua for other functions. I call this approach virtual keys. Reason for using it ... You can use existing memcached libraries to access cacheismo. Building - cacheismo depends on libevent and lua (luagit can  also be used). Complete implementation is in about 5000 lines  of code and fairly organised. 
Blue line is (memcachedTPS-cacheismoTPS)/memcachedTPS. This essentially measures how much faster is memcached compared to cacheismo. As you can see at max it is 20% faster.
Black line is (cacheismoHitRATE-memcachedHitRATE)/cacheismoHitRATE. This essentially means how much memory efficient is cacheismo compared to memcached. For the first half of the test case we don't see any difference. But in the second half cacheismo is at max about 80% efficient. The reason for this behavior is that memcached can't reassign slabs once they are used whereas cacheismo continues to adjust with changing object sizes. Over time memcached commits memory to some slabs and hence its hit rate decreases over time.
Get source from GitHub
Google Group
UPDATE: Cacheismo Next Steps

Friday, September 23, 2011

What to do when your code suddenly slows down by 60%?

before running the tests multiple times to confirm that this indeed is the case...
before looking into the recent changes...
before looking into the test code...
before checking if swap is being used...
before checking if logging is enabled...
before doubting the kernel upgrade...
before using strace -c to find out unusual system calls...
before using valgrind --tool=callgrind to profile the code...

The first thing to do is to check cat /proc/cpuinfo and check in cpu Mz is same as advertised cpu speed. Laptops usually run in power saving mode. Not sure what is wrong with 64 bit Ubuntu 11.04, running on Lenovo Thinkpad T410. Even with power plugged in, it decided to run at 933Mz instead of 2.53 Gz.
Thanks to google for this link.

Saturday, August 20, 2011

I am not with Anna

Call me pessimist, but I don't see corruption going away any time soon.

Loose definition of corruption would be using your "position" for "personal gains" instead of doing what is "right".  The operating word here is "position". How much corrupt can a beggar be? How much corrupt can a government be? Clearly government holds such an important position that it is very likely to be corrupt.  It pays to be corrupt government.

How about Lokpal? Does Lokpal holds important position? Yes it does and in some sense it holds a position more powerful than the government itself. Does it pays to be a corrupt Lokpal? Handsomely. Almost as much as a corrupt government.  I kind of believe in Murphy's Law as much as I believe in gravity.

Lokpal is as susceptible to corruption as much is government or a government employee. The process of choosing Lokpal is similar to aristocracy.  The difference is instead of choosing the government itself we are using it to choose the Lokpal. In some sense what Anna is saying is that Aristocracy is a better form of governance than democracy.  Is it?

Solving the corruption problem needs fundamental understanding and play between the concepts of "position", "personal gains" and what is "right".

Position:  Power corrupts, absolute power corrupts absolutely. How can we change the power equation?  Democracy gives people the power to change the government every five years. Government gets absolute power for five years. Here is my take on better democracy Dynamic Government.

Personal Gains:  This is usually money but it could be anything that can lead to money in the future. Again it need not be personal. It could easily go to friends and family.  I believe that the concept of money is broken. See Reinventing Money.  India badly needs to have Inheritance Tax.  In some sense money is equivalent to "position" if the person in "position" is corrupt. Hence absolute money for long duration is equivalent to absolute power and will lead to corruption. See  If Money Expired And People Had Choice .

What is Right: This one is hard to crack. Moral is right according to religion. Legal is right according to Law. Rational is right according to reason - but who knows what we are missing and how logical our minds think. Religion and Law are created by mortals only and could be wrong at times. I don't know the answers here but this is the direction I am looking at - On Reality Truth And Levels Of Abstractions and The Idiot Religion

I think corruption is a design issue, design of human interactions, how they help each other to survive together.