Safe Cassandra shutdown and restart

If you are a Cassandra user, you probably already know how to stop or restart a Linux service. However, taking a service down abruptly can be a problem, especially if other services depend on it. While Cassandra is very robust and crash-safe (pkill -9 cassandra works fine ;-) ), it’s never a bad idea to do things in a way that minimizes the risk of something going wrong. Another advantage of a clean Cassandra restart procedure is a shorter startup time. Here is how to do it.

The example below refers to a two-node cluster with nodes known as cssa01.michalski.im and cssa02.michalski.im:

root@cssa01:~# nodetool -h cssa01.michalski.im ring
Address Status  State   Load        Effective-Owership  Token                                       
                                                        85070591730234615865843651857942052864      
<ip1>   Up      Normal  318.21 GB   100.00%             0                                           
<ip2>   Up      Normal  318.2 GB    100.00%             85070591730234615865843651857942052864

First I will present the procedure, then I will explain it:

root@cssa01:~# nodetool -h cssa01.michalski.im disablegossip
root@cssa01:~# nodetool -h cssa01.michalski.im disablethrift
root@cssa01:~# nodetool -h cssa01.michalski.im drain

Now Cassandra is ready to go down, and this is the best moment to take a snapshot if you need one. Then you can safely restart it or shut it down, whichever you need.

Now you know what I do to shut down a Cassandra node; here is why:

  1. First we disable Gossip – the protocol nodes use to discover location and state information about the other nodes in the cluster. In each gossip round, every node “talks” to at most 3 other nodes in the cluster. That is not a lot, but the key is that two nodes “talking” to each other exchange information not only about themselves but also about all the nodes they have been in contact with, so information spreads really fast across the cluster. Yes, this is why it is called “Gossip”. So why do we disable it? Because it makes the node look “dead” to the other nodes. There is one important exception you should be aware of: disabling Gossip does not stop Hinted Handoff sessions that had already started while Gossip was enabled, so you might still see traffic on the Gossip port after you turn it off (see: CASSANDRA-4162). See Gossip Architecture on the Cassandra Wiki for more information about the Gossip protocol.
  2. Then we disable Thrift – the lowest-level Cassandra interface exposed to “external” developers. Without going too deep into Cassandra’s internal architecture, it’s enough to say that turning it off stops Cassandra’s RPC server, so the node no longer accepts client requests.
  3. The last step is to perform a drain, which flushes the column families. In other words, it converts Memtables into immutable SSTables, which empties the Commit Log (here we touch on Compaction, which is a different story for a different time). It’s not strictly necessary, but if the Commit Log is not empty, Cassandra replays on startup all the writes stored in it to make sure that no data is missing. After a drain there is nothing left in the Commit Log to replay, which can save you some time.
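
The three steps above can be wrapped in a small shell function so they always run in the same order and stop at the first failure. This is only a sketch: the function name prepare_shutdown and the NODETOOL variable are my own illustrative choices, not part of Cassandra; the three nodetool subcommands are the ones used in the procedure.

```shell
#!/bin/sh
# Sketch only: NODETOOL and prepare_shutdown are hypothetical names.
# NODETOOL lets you point at a specific nodetool binary if it is not on PATH.
NODETOOL="${NODETOOL:-nodetool}"

# Run the pre-shutdown steps in order against the given host.
# Returns non-zero as soon as one step fails, so later steps are skipped.
prepare_shutdown() {
    host="$1"
    for step in disablegossip disablethrift drain; do
        "$NODETOOL" -h "$host" "$step" || {
            echo "step failed: $step" >&2
            return 1
        }
    done
}
```

With this loaded, prepare_shutdown cssa01.michalski.im would run the same three commands as shown earlier; stopping or restarting the service itself is deliberately left out.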

I mentioned that this procedure saves some startup time. To back that up: for the cluster described above, it reduced startup time in a test environment from 63 to 55 minutes (tested twice), which is about 13% faster. Not a huge improvement (I think it could be bigger in production because of larger Commit Logs), but I think it’s worth running one more command during shutdown.
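
For reference, the percentage is simply the time saved divided by the original startup time. A minimal way to check it from the shell, assuming awk is available (the function name speedup_percent is just an illustrative helper):

```shell
# Hypothetical helper: percentage reduction from "before" to "after" minutes.
speedup_percent() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

speedup_percent 63 55   # prints 12.7
```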

Although this procedure works really well for me, I’m curious whether there’s something I missed or something that could be done better. Feel free to contact me if you have a different idea, I would love to hear it!
