Tuesday 20 May 2014

Clustering issues on SQLFire and RabbitMQ

Over the last few months I have been seeing a lot of clustering issues on both RabbitMQ and SQLFire, both of which are open source Pivotal products.
Both products seem to be sensitive to network latency, which can cause split-brain conditions in the cluster and potentially lead to data loss.

To be able to tell when such a cluster issue happens, I have used the following approaches:

1- Integrate Hyperic monitoring with the SQLFire & RabbitMQ components.
2- For SQLFire, we can make use of the following system query:

 cat get_members.sql
select ID,KIND from sys.members order by KIND;

Run this query from the command line:
{HOME}/sf/sqlf run -client-bind-address=${HOSTNAME} -client-port=1527 -user=myapp -password=myapp -file=get_members.sql

Parsing the output gives the current number of cluster members.
If a split happens, the output of this query will be different.
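
As a rough sketch of that parsing step, the helper below counts member rows in the text that the sqlf command prints. It assumes that every member is printed on its own line and that the KIND column contains one of the words locator, datastore or accessor. The function names and the expected-size argument are only illustrative, so adjust the pattern to whatever your own output looks like.

```shell
#!/bin/sh
# Count member rows in the sqlf output read from stdin.
# Assumption: one member per line, with a KIND value that contains
# "locator", "datastore" or "accessor".
count_members() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    grep -c -i -E 'locator|datastore|accessor' || true
}

# Compare the observed member count with the expected cluster size.
# $1 is a hypothetical parameter: the number of members you deployed.
check_members() {
    expected=$1
    actual=$(count_members)
    if [ "$actual" -ne "$expected" ]; then
        echo "WARNING: expected $expected members, found $actual"
        return 1
    fi
    echo "OK: $actual members"
}
```

To use it, pipe the output of the sqlf command above into the check, for example `... -file=get_members.sql | check_members 4` for a cluster that should have four members.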

3- For RabbitMQ, I used a more radical way to do the monitoring.
RabbitMQ nodes are always talking to each other, so the warning is based on the number of connections that each node has towards its sister nodes in the cluster:

    CON_COUNT=`ssh -q rmquser@rmqnode01 netstat -p 2>/dev/null|grep -i est |tr -s " "|cut -d" " -f5,7|grep rmqnode|cut -d"." -f1,4 --output-delimiter=" "|cut -d" " -f1,3 |sort |uniq -c|wc -l`

This gets the number of distinct cluster members that rmqnode01 has established connections to.
The count should be the number of cluster members - 1.

If the number is lower, then we have a split-brain issue.
The RabbitMQ management console also tells you at once that there is an issue.
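
The comparison itself can be wrapped in a small function, sketched here. The function name, its two arguments and the exit code 2 (borrowed from the Nagios convention for "critical") are my own choices for illustration and not something RabbitMQ provides.

```shell
#!/bin/sh
# Compare a peer count with the expected value (cluster size - 1).
# $1: the value computed as CON_COUNT above
# $2: total number of nodes in the cluster (hypothetical parameter)
check_peer_count() {
    con_count=$1
    cluster_size=$2
    expected=$((cluster_size - 1))
    if [ "$con_count" -lt "$expected" ]; then
        echo "CRITICAL: $con_count of $expected peers connected, possible split brain"
        return 2
    fi
    echo "OK: $con_count of $expected peers connected"
    return 0
}
```

For a three-node cluster this would be called as `check_peer_count "$CON_COUNT" 3` from whatever scheduler or monitoring agent runs the check.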

A future improvement is to capture the warning from the RabbitMQ management console directly.
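
One possible way to do that, sketched here and not tested against a real cluster, is to read the JSON that the management plugin serves under /api/nodes and look at the "partitions" list of each node. The sketch assumes that a healthy node reports an empty list, that GNU grep is available (for the -o option), and that plain text matching is good enough. A real JSON parser would be more robust.

```shell
#!/bin/sh
# Count nodes that report a non-empty "partitions" list.
# Reads the JSON of the management API /api/nodes endpoint from stdin.
# Assumption: each node object carries a "partitions" array, which is
# empty when that node sees no network partition.
count_partitioned_nodes() {
    # Extract every "partitions":[...] fragment, drop the empty ones
    # and count what is left. "|| true" hides grep's non-zero exit
    # status when the count is 0.
    grep -o '"partitions":[[:space:]]*\[[^]]*\]' \
        | grep -v -c '\[[[:space:]]*\]' || true
}
```

It could be fed with something like `curl -s -u guest:guest http://rmqnode01:15672/api/nodes | count_partitioned_nodes`, where the credentials and port are placeholders for your own setup. Any result above 0 would mean at least one node sees a partition.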



