_source field goes missing from Elassandra after nodetool rebuild_index

Elassandra version: elassandra-6.8.4.3

Plugins installed: []

JVM version (java -version): java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

OS version (uname -a if on a Unix-like system): Linux CMSNextDB3871 3.10.0-229.14.1.el7.x86_64 #1 SMP Tue Sep 15 15:05:51 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior: I have four Cassandra datacenters and recently migrated them to Elassandra. After running nodetool rebuild_index, I see many documents in Elassandra that have no corresponding row in Cassandra. None of these documents has a _source field.

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including (e.g.) index creation, mappings, settings, query etc. The easier you make for us to reproduce it, the more likely that somebody will take the time to look at it.

Migrated an existing Apache Cassandra v3.11.2.0 cluster with around 40,000,000 documents, following the steps in the Elassandra documentation.
Created a mapping for the fields to be indexed in Elasticsearch.
After 3-4 runs of nodetool rebuild_index --thread 16, many documents had no _source field and were missing every field except the Elasticsearch-specific ones (_id, _type, _index).

Please provide the following information:

elassandra logs (logs/system.logs or /var/lib/cassandra/system.log)
elasticsearch cluster state (curl http://localhost:9200/_cluster/state)
cassandra schema (cqlsh>DESC KEYSPACE <your_keyspace>)
cassandra gossip state (run: nodetool gossipinfo)

system.log cluster_status.log gossipinfo.log keyspace.log

Jun 11 '20 12:06 pankajydv

Please note the issue looks similar to https://github.com/strapdata/elassandra/issues/244, but it doesn't have a resolution.

Jun 11 '20 12:06 pankajydv

Please also note that the count returned by http://localhost:9200/cmsentitydb/_count differs significantly across the 4 datacenters: 43406958, 43458440, 43451846, 35910790.

Jun 11 '20 12:06 pankajydv

This situation usually happens when a row was indexed and later expired at the Cassandra level. For results with an empty _source, please check that the underlying row exists by issuing SELECT * FROM table WHERE PK = _id.
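
A minimal Python sketch of that check, offered as an illustration rather than an Elassandra tool. It assumes the search hits have already been fetched as plain dicts (the `hits.hits` array of an Elasticsearch search response) and that the table has a single-column primary key; the helper names `ids_without_source` and `select_by_pk` are hypothetical, and the `%s` placeholder assumes the DataStax Python driver's parameter style.

```python
from typing import Any, Dict, Iterable, List


def ids_without_source(hits: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the _id of every hit whose _source is absent or empty."""
    return [hit["_id"] for hit in hits if not hit.get("_source")]


def select_by_pk(keyspace: str, table: str, pk_column: str) -> str:
    """Build a parameterised CQL lookup for one row by its primary key.

    Assumption: a single-column primary key, so the Elasticsearch _id
    can be bound directly as the value of pk_column.
    """
    return f"SELECT * FROM {keyspace}.{table} WHERE {pk_column} = %s"


if __name__ == "__main__":
    # Hand-written sample hits; a real run would read them from a search response.
    sample_hits = [
        {"_index": "cmsentitydb", "_id": "a", "_source": {"title": "kept"}},
        {"_index": "cmsentitydb", "_id": "b"},
    ]
    for doc_id in ids_without_source(sample_hits):
        # Table and column names here are placeholders, not taken from the issue.
        print(select_by_pk("my_keyspace", "my_table", "id"), "with value", doc_id)
```

Each printed statement can then be run in cqlsh or through a driver session; an empty result confirms the document is orphaned.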

Jun 11 '20 14:06 vroyer

@vroyer No, the record doesn't exist in the underlying Cassandra table. That is the real issue: we are getting wrong results from the Elassandra index, and there are far too many such records. Is there a way to get rid of all such documents from Elassandra?

Jun 11 '20 14:06 pankajydv

In that situation, you should either delete the index and re-create it so that it only indexes existing rows, or (second scenario) create a new index and switch to it using an ES index alias. (An index rebuild does not delete documents; it just reindexes rows from the SSTables on disk.)

Just keep in mind that Cassandra triggers a single-threaded index build when the first index is created. So, in the first scenario, if you want to rebuild quickly, you'll need to kill the single-threaded index build on each node (nodetool compactionstats + nodetool stop --compaction-id xxxx) and relaunch a nodetool rebuild_index --threads 16 .... And in the second scenario, you'll need to launch the index rebuild...
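
For the alias switch in the second scenario, here is a hedged Python sketch of the request body for the Elasticsearch `_aliases` endpoint, which applies its list of actions atomically. The alias and index names are hypothetical, and the sketch assumes clients already query through an alias rather than the index name directly (an alias cannot share a name with an existing index).

```python
import json
from typing import Any, Dict


def alias_swap_body(alias: str, old_index: str, new_index: str) -> Dict[str, Any]:
    """Build the body for POST /_aliases that moves `alias` to `new_index`.

    Both actions are sent in one request, so searches through the alias
    never see a moment where it points at neither index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: the old index, a freshly built one, and the alias clients use.
    body = alias_swap_body("cmsentitydb_current", "cmsentitydb_v1", "cmsentitydb_v2")
    print(json.dumps(body, indent=2))
```

The printed JSON can be sent with curl to http://localhost:9200/_aliases once the new index has finished building.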

Jun 11 '20 15:06 vroyer

@vroyer - Thanks for the quick response. The first approach is not an option for us because the index is already being used in production. We'll go for the second approach.

But both of these approaches are time-consuming and don't resolve the issue quickly in a production environment. It would be great if Elassandra could keep itself in sync with Cassandra deletes, so that we don't face such issues in the live environment.

Jun 11 '20 16:06 pankajydv

The missing documents were probably removed by previous compactions. You can enable re-index on compaction to get the behaviour you expect, but it significantly increases the cost of compaction, and it's too late right now!

Jun 11 '20 21:06 vroyer

How can I enable 're-index on compaction'? Any references would help.

Jun 12 '20 16:06 pankajydv