Background

At TapSense, we use long-running memcached clusters to cache the data contained in ad requests. This data is written when we fill an ad request and read back later when we record impressions and clicks. One of our clusters had grown over time to 7 nodes. We noticed that the 4 old nodes saw about 10x more GET misses than the 3 new nodes (which we had brought up only a month earlier), and 20-30x more during peak hours, even though (i) each node was comparable, (ii) each node received similar traffic, and (iii) each entry was set to expire after 2 days.
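For context, the access pattern looks roughly like the sketch below. This is a minimal illustration using pymemcache; the node names, key names, and payload shape are assumptions, not our production code.

```python
# Minimal sketch of the caching pattern described above (pymemcache).
# Server list, key names, and payload shape are illustrative assumptions.
import json
from pymemcache.client.hash import HashClient

TWO_DAYS = 2 * 24 * 60 * 60  # every entry expires after 2 days

client = HashClient([("cache-node-1", 11211), ("cache-node-2", 11211)])

def cache_ad_request(request_id: str, data: dict) -> None:
    """Store ad-request data at the time the ad request is filled."""
    client.set(request_id, json.dumps(data).encode("utf-8"), expire=TWO_DAYS)

def lookup_ad_request(request_id: str):
    """Retrieve ad-request data on impression or click; None means a GET miss."""
    raw = client.get(request_id)
    return json.loads(raw) if raw else None
```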


Investigation

We looked at the number of items in each node and found that the old nodes actually held more items, which is counter-intuitive given that they also had more GET misses. Then we remembered that memcached allocates memory in slab classes and that LRU eviction applies within each slab class, not across the whole cache. Entries of different sizes go to different slab classes. That led us to speculate that the newer entries had grown significantly bigger and were landing in a different slab class. The old entries in the smaller slab classes were never being evicted, so the memory they occupied was effectively dead space that could not be reclaimed for the larger new entries, which were then evicted early and produced the GET misses.
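The per-slab statistics are what reveal this. As a rough sketch (assuming direct access to the memcached text protocol on port 11211, and a hypothetical host name), you can dump the standard `stats items` output to see item counts and eviction counters broken down by slab class:

```python
# Sketch: query memcached's text protocol for per-slab-class statistics.
# Host name is an assumption; "stats items" and "stats slabs" are standard
# memcached commands.
import socket

def memcached_stats(host: str, command: str, port: int = 11211) -> dict:
    """Send a `stats ...` command and return the key/value pairs it reports."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(f"{command}\r\n".encode())
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    stats = {}
    for line in buf.decode().splitlines():
        if line.startswith("STAT "):
            _, key, value = line.split(" ", 2)
            stats[key] = value
    return stats

# Per-slab-class item counts and eviction counters for one (hypothetical) old node.
items = memcached_stats("old-cache-node-1", "stats items")
for key, value in sorted(items.items()):
    if key.endswith(":number") or key.endswith(":evicted"):
        print(key, value)
```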


Solution

Memcached does not offer an easy way to purge items by age (at least, none that we are aware of). So we decided to reboot the nodes one by one. It was also a good way to test whether the database backing the cache would hold up if a memcached node went down.
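The rolling restart itself was simple. Below is a sketch of the approach; the node names, the SSH-based restart command, and the fixed wait are all hypothetical stand-ins (in practice we watched the miss graphs before moving on to the next node).

```python
# Sketch of a rolling restart: take down one node at a time and give the
# cache time to repopulate before touching the next node. Node names and
# the restart command are hypothetical; adapt to your deployment tooling.
import subprocess
import time

OLD_NODES = [
    "old-cache-node-1",
    "old-cache-node-2",
    "old-cache-node-3",
    "old-cache-node-4",
]

def restart_memcached(node: str) -> None:
    # Hypothetical: restart the memcached service over SSH.
    subprocess.run(
        ["ssh", node, "sudo", "systemctl", "restart", "memcached"],
        check=True,
    )

for node in OLD_NODES:
    restart_memcached(node)
    # Let the cache warm back up (and the backing database absorb the extra
    # load) before rebooting the next node.
    time.sleep(30 * 60)
```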


Thankfully, it all worked out, as evidenced by the graph below. The misses on each node spiked as we rebooted it and then gradually decreased. In addition, while a node was down, the next node in the sequence saw more misses, as you would expect with consistent hashing, since the downed node's keys get remapped to the next node on the ring. Once a node's misses settled, we rebooted the next one.

[Graph: GET misses per node during the rolling reboots]
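The "next node picks up the misses" behavior falls straight out of consistent hashing: when a node drops off the hash ring, its keys map to the next node clockwise. Here is a toy illustration with one point per node; it is not the hashing scheme our client library actually uses (real clients place many virtual nodes per server).

```python
# Toy consistent-hash ring showing why the *next* node on the ring absorbs
# a downed node's keys. Simplified to one ring position per node.
import bisect
import hashlib

def ring_position(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def build_ring(nodes):
    return sorted((ring_position(node), node) for node in nodes)

def node_for_key(ring, key: str) -> str:
    positions = [pos for pos, _ in ring]
    idx = bisect.bisect(positions, ring_position(key)) % len(ring)
    return ring[idx][1]

nodes = [f"cache-node-{i}" for i in range(1, 8)]
full_ring = build_ring(nodes)
reduced_ring = build_ring(n for n in nodes if n != "cache-node-3")

for key in ("req-101", "req-202", "req-303"):
    before = node_for_key(full_ring, key)
    after = node_for_key(reduced_ring, key)
    # Keys that hashed to cache-node-3 now land on the next node clockwise;
    # everything else stays where it was.
    print(key, before, "->", after)
```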

In the end, all the nodes started behaving similarly, as you would expect.