elasticsearch-jdbc icon indicating copy to clipboard operation
elasticsearch-jdbc copied to clipboard

Timestamped index is duplicating data from previous timestamped index

Open ILikeToNguyen opened this issue 9 years ago • 0 comments

Scenario: I have a mysql table with a field named executiondate which I've set as my @timestamp. I want a new index created for each new month since the data is fairly large. I set my index as "index" : "'''mysql-'''YYYY-MM" which correctly creates a new index when the month rolled over.

Problem: When the month rolled over, all the data from mysql-2016-09 was also duplicated in mysql-2016-10. I didn't use the statefile option at the moment because the cloud compute instance I was using went down. I was relying on the _id to know not to duplicate the data. However, it looks like the behavior for that is, to check _id for the current index instead of the whole cluster. So now my rec_number 5 row from mysql table is duplicated in mysql-2016-09 and mysql-2016-10.

Question:

  1. Is there a way to have it treat id's as a unique entry across the whole cluster instead of per index? I thought this was the old behavior because I never used statefile as an option before and my data was never replicated, even after restarting jdbcriver. I know this wont happen if i use statefile option since it wont read the old data, but problem is my cloud boxes go bad everyonce in a while which I then need to create a new instance, and i wont have the statefile anymore, thus it will parse all rows in mysql and then I need to rely on _id for not duplicating the data.
  2. Slightly off topic question: Is there a way to have the indexing done based on the @timestamp field instead of the current time on the elasticsearch-jdbc? Problem is, sometimes we create a new elastic cluster for new sql tables and want to bring in a lot of old data. Today, all of the data is brought over into the current months index. For example all rows with @timestamp of August, Sept, Oct are put into the current months index, mysql-2016-10. It would be nice if there is an option to read the @timestamp field and elasticsearch-jdbc will write that to the corresponding index. So elasticsearch-jdbc would see that the @timestamp says August, so it will write(create if doesnt exist) into mysql-2016-08 instead of the current month index, mysql-2016-10.

Using 2.3.4.1 with ES 2.4.0. Pertinent data in my .sh file: { "type" : "jdbc", "schedule" : "0 * * * * ?", "statefile" : "statefile.json", "sql" : [ { "statement" : "select *, Rec_number as _id, executiondate as "@timestamp" from mytable where executiondate > ?", "parameter" : [ "$metrics.lastexecutionstart" ] } ], "index" : "'''mysql-'''YYYY-MM",

ILikeToNguyen avatar Oct 04 '16 16:10 ILikeToNguyen