elasticsearch-jdbc

Duplicate documents created in JDBC river index when pulling data from MySQL database


I'm pulling data from a database table named Categories, which has the columns Cat_id and Cat_name. I currently have 6 rows of test data, and I started a river instance with the following configuration using the Postman HTTP client:

    PUT http://localhost:9200/_river/test/_meta
    {
        "type" : "jdbc",
        "jdbc" : {
            "url" : "jdbc:mysql://localhost:3306/elastic",
            "user" : "root",
            "password" : "password",
            "sql" : "select Cat_id from Categories",
            "index" : "sample",
            "schedule" : "0/10 * * ? * *",
            "autocommit" : true
        }
    }

The problem is that, since I've scheduled the river for every 10 seconds, the query runs every 10 seconds and duplicate documents get created in the index. That is, if my test data is:

   Cat_id      Cat_name
     1          Books
     2          Mobiles
     3          Laptops

my index gets updated every 10 seconds and the same rows are added again on each run, so copies of the data keep accumulating.

I want only changed data to be added to my index, with no duplicates created for my existing data (e.g. Books, Mobiles, Laptops).

PavanRGowda avatar Jun 16 '15 07:06 PavanRGowda

Please use the _id pseudo column in your SQL statement to create the Elasticsearch document IDs. Documents with the same ID are then overwritten on the next run.
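
For example, a statement along these lines (a sketch using the Categories table from above) maps each row's key to the document ID, so subsequent runs overwrite existing documents instead of duplicating them:

    select Cat_id as _id, Cat_id, Cat_name from Categories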

jprante avatar Jun 16 '15 16:06 jprante

Thank you. That partially solved my problem. When I add a new row in my database, the change is reflected in my index; new data gets added. But when I delete a row in my database, the deleted data still remains in my index. Any suggestions on how to update the index when an indexed database row is deleted?

PavanRGowda avatar Jun 16 '15 17:06 PavanRGowda

You have to configure an SQL select statement that selects the IDs to delete and passes an _optype column with the value delete.
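
A minimal sketch of what that could look like, assuming a hypothetical Deleted_categories table that you maintain yourself to record the Cat_id of each removed row:

    -- Deleted_categories is an assumed audit table, not something the river provides
    select Cat_id as _id, 'delete' as _optype from Deleted_categories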

jprante avatar Jun 16 '15 18:06 jprante

You mean to say that I should keep track of the IDs that get deleted in my database and then configure my SQL query to delete those documents from my index? Could you please explain with an example?

PavanRGowda avatar Jun 17 '15 05:06 PavanRGowda

Dear jprante, I have the same problem even though I used the _id pseudo column. My index uses array objects. My mapping:

    mappings: {
        properties: {
            sys_code:    { type: "string",  fielddata: { loading: "eager" } },
            values_id:   { type: "integer", doc_values: false },
            values_name: { type: "string",  fielddata: { loading: "eager" } }
        }
    }

and my SQL:

    SELECT
        "'test'" as _index,
        "type_primary" as _type,
        sus.syouhin_sys_code AS "_id",
        CAST(sus.sys_code AS CHAR) AS sys_code,
        values_id AS "values_id[]",
        values_name AS "values_name[]"
    FROM syouhin_table sus

The result is as below:

    _index: "test",
    _type: "type_primary",
    _id: "21554223",
    _score: 1,
    _source: {
        sys_code: "21554223",
        values_id: [
            "9460418",
            "9460421",
            "9462677",
            "9462682",
            "9462688",
            "9464051",
            "9464052",
            "9464053"
        ],
        values_name: [
            "test1",
            "test2",
            "test3",
            "test4",
            "test5",
            "test6",
            "test7"
        ]
    }
values_id has 8 entries, but values_name has only 7. The reason is that values_id 9462677 and 9464053 share the same values_name, "test3". I think the JDBC importer deduplicates repeated array values and therefore did not index the second "test3" for 9464053 in this case.
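
If the importer is indeed deduplicating repeated array values, one possible workaround (an untested sketch, not a confirmed importer feature) is to make every values_name entry unique by prefixing it with its id, using the standard MySQL CONCAT function:

    SELECT
        "'test'" as _index,
        "type_primary" as _type,
        sus.syouhin_sys_code AS "_id",
        CAST(sus.sys_code AS CHAR) AS sys_code,
        values_id AS "values_id[]",
        CONCAT(values_id, ':', values_name) AS "values_name[]"
    FROM syouhin_table sus

The client would then need to strip the id prefix from values_name when reading the document.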

Could you help me solve the above problem? I am using JDBC importer 2.3.4 with Elasticsearch 2.3.4.

Thank you so much.

tomochabt avatar Feb 02 '17 06:02 tomochabt