Distributed Searching and Distributed Indexing in Solandra
Hi Solanrdra team. I created a 2 node Solandra cluster and measured query performance (select/update). I found out the 2 node performance is approximately the same as for 1 node one. Could you please clarify if Solandra supports Distributed Searching and Distributed Indexing currently? Are there special solutions for Solandra or it is OK to use Solr solutions, http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching?
Hi, Yes Solandra uses distributed searching internally to fulfill requests.
The solandra.shards.at.once property in solandra.properties allows paralell indexing across shards, so the default of 4 should work for you. Does your indexing code send add requests to both hosts using many threads?
I tryed different techniques of sending add requests:
- sending all requests to one ip (node ip) sequentially
- sending requests from 10 different sources to one ip (node ip)
- sending 2 different series of requests to two ip (2 node ip of the cluster) sequentially for each
- sending 2 different series of requests to two ip (2 node ip of the cluster) using 10 different sources for each
For each experiment I measured data traffic. 1 and 2 gave about 20% perfomance gain compared to 1 node system. But there were some problems with 3 and 4 due to load balancing. There were different data traffics reached for nodes, e.g 600 doc/sec for 1-st (maximum) node and 150 doc/sec for 2-nd (maximum). Also node disc space usage was unsymmetrical: bin/nodetool -h node1_ip ring Address Status State Load Owns Token … node1_ip Up Normal 70.87 KB 19.39% node2_ip Up Normal 141.38 MB 80.61%
For the unbalanced cluster you need to pre-select your initial_token in cassandra or move the tokens around to make it more balanced:
http://www.datastax.com/docs/0.8/operations/clustering#calculating-tokens
This will put the data on both nodes
Thank you Jake, I followed the instruction and succeeded in balancing the Cassandra nodes (50%/50%). Then I made 2 experiments:
- sending add requests to one ip (node ip) sequentially
- sending add requests to two ip’s of the 2 node cluster simultaneously.
The 2 node system gives a 20% performance gain compared to 1 node system in experiment #1. In experiment #2 the document traffic was stuck for one of the nodes (it was about 10 times less compared to traffic of another node). And the system performance was the same as in experiment #1.
So It seems only one node of the cluster is responsible for indexing. Am I right? Is it possible to combine CPU power of the two nodes?
Is the on disk size roughly even between the 2 nodes now?
It does index using both nodes however the problem is IO. Have you tried this experiment with the bulk loading URL? Add a ~ before the index name. This removes the read before write check needed for updates.
You won't get a 50% improvement going from 1 to 2 nodes but adding more nodes will add to overall indexing throughput
- The on disk size is not even: $ bin/nodetool -h node1_ip ring Address Status State Load Owns Token 85070591730234615865843651857942052864 node1_ip Up Normal 3.82 GB 50.00% 0 node2_ip Up Normal 1.11 GB 50.00% 85070591730234615865843651857942052864
- I'm using the bulk loading. In the experiments I sent series of bulks containing 400-800 documents in one xml (the number is fixed during an experiment).
- Could please explane how to add a ~ before the index name? I'm using cURL to insert an xml: URL=http://localhost:8983/solandra/solr_data/update curl $URL --data-binary @$1 -H 'Content-type:text/xml; charset=utf-8'
The bulk url would be http://localhost:8983/solandra/~solr_data/update
If you look at this url in the browser you can get info about which nodes the shard live on: http://localhost:8983/solandra/schema/solr_data
It looks like 3/4 of them are on one node vs 1/2... The shard placement is random so this is quite possible. The more nodes you have the less of a issue this becomes.
Hi Jake- I added a 2nd node to an existing single node, 6M doc index. I saw some bootstrapping action, and your suggestion helped to get them to 50/50- thanks. However, index performance got worse (166 to 111 docs/sec) and query performance is basically the same (mostly well under 500ms). For both indexing/querying, I am doing serial requests going to 1 node (I understood from your earlier post that query distribution is handled internally anyway). For shards settings, I'm using all "out of the box" defaults. I'm more curious about the lack of query performance improvement. Can you suggest anything there? Do you expect much of a difference if I start with 2 nodes and an empty index?