Solandra Distributed Searching and Distributed Indexing in Solandra

Hi Solanrdra team. I created a 2 node Solandra cluster and measured query performance (select/update). I found out the 2 node performance is approximately the same as for 1 node one. Could you please clarify if Solandra supports Distributed Searching and Distributed Indexing currently? Are there special solutions for Solandra or it is OK to use Solr solutions, http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching?

Jun 20 '11 13:06 Kirill-Petersburg

Hi, Yes Solandra uses distributed searching internally to fulfill requests.

The solandra.shards.at.once property in solandra.properties allows paralell indexing across shards, so the default of 4 should work for you. Does your indexing code send add requests to both hosts using many threads?

Jun 20 '11 13:06 tjake

I tryed different techniques of sending add requests:

sending all requests to one ip (node ip) sequentially
sending requests from 10 different sources to one ip (node ip)
sending 2 different series of requests to two ip (2 node ip of the cluster) sequentially for each
sending 2 different series of requests to two ip (2 node ip of the cluster) using 10 different sources for each

For each experiment I measured data traffic. 1 and 2 gave about 20% perfomance gain compared to 1 node system. But there were some problems with 3 and 4 due to load balancing. There were different data traffics reached for nodes, e.g 600 doc/sec for 1-st (maximum) node and 150 doc/sec for 2-nd (maximum). Also node disc space usage was unsymmetrical: bin/nodetool -h node1_ip ring Address Status State Load Owns Token … node1_ip Up Normal 70.87 KB 19.39% node2_ip Up Normal 141.38 MB 80.61%

Jun 20 '11 16:06 Kirill-Petersburg

For the unbalanced cluster you need to pre-select your initial_token in cassandra or move the tokens around to make it more balanced:

http://www.datastax.com/docs/0.8/operations/clustering#calculating-tokens

This will put the data on both nodes

Jun 20 '11 17:06 tjake

Thank you Jake, I followed the instruction and succeeded in balancing the Cassandra nodes (50%/50%). Then I made 2 experiments:

sending add requests to one ip (node ip) sequentially
sending add requests to two ip’s of the 2 node cluster simultaneously.

The 2 node system gives a 20% performance gain compared to 1 node system in experiment #1. In experiment #2 the document traffic was stuck for one of the nodes (it was about 10 times less compared to traffic of another node). And the system performance was the same as in experiment #1.

So It seems only one node of the cluster is responsible for indexing. Am I right? Is it possible to combine CPU power of the two nodes?

Jun 21 '11 12:06 Kirill-Petersburg

Is the on disk size roughly even between the 2 nodes now?

It does index using both nodes however the problem is IO. Have you tried this experiment with the bulk loading URL? Add a ~ before the index name. This removes the read before write check needed for updates.

You won't get a 50% improvement going from 1 to 2 nodes but adding more nodes will add to overall indexing throughput

Jun 21 '11 13:06 tjake

The on disk size is not even: $ bin/nodetool -h node1_ip ring Address Status State Load Owns Token 85070591730234615865843651857942052864 node1_ip Up Normal 3.82 GB 50.00% 0 node2_ip Up Normal 1.11 GB 50.00% 85070591730234615865843651857942052864
I'm using the bulk loading. In the experiments I sent series of bulks containing 400-800 documents in one xml (the number is fixed during an experiment).
Could please explane how to add a ~ before the index name? I'm using cURL to insert an xml: URL=http://localhost:8983/solandra/solr_data/update curl $URL --data-binary @$1 -H 'Content-type:text/xml; charset=utf-8'

Jun 21 '11 13:06 Kirill-Petersburg

The bulk url would be http://localhost:8983/solandra/~solr_data/update

If you look at this url in the browser you can get info about which nodes the shard live on: http://localhost:8983/solandra/schema/solr_data

It looks like 3/4 of them are on one node vs 1/2... The shard placement is random so this is quite possible. The more nodes you have the less of a issue this becomes.

Jun 21 '11 14:06 tjake

Hi Jake- I added a 2nd node to an existing single node, 6M doc index. I saw some bootstrapping action, and your suggestion helped to get them to 50/50- thanks. However, index performance got worse (166 to 111 docs/sec) and query performance is basically the same (mostly well under 500ms). For both indexing/querying, I am doing serial requests going to 1 node (I understood from your earlier post that query distribution is handled internally anyway). For shards settings, I'm using all "out of the box" defaults. I'm more curious about the lack of query performance improvement. Can you suggest anything there? Do you expect much of a difference if I start with 2 nodes and an empty index?

Mar 20 '12 23:03 wwwill