
Array data is duplicated when other fields have multiple values

Open tmolin opened this issue 10 years ago • 2 comments

We are having a problem with array data getting duplicated when other fields (in this case the email address) have more than one value. This is not a problem when the data in the array (home address) is not grouped into an array.

This river (inner quotes escaped so the JSON is valid):

```json
{
  "type": "jdbc",
  "jdbc": {
    "sql": "SELECT person.person_id AS \"_id\", person.first_name AS \"firstName\", person.middle_name AS \"middleName\", person.last_name AS \"lastName\", person_email.email_address AS \"emailAddresses\", person_address.ADDRESS_1 AS \"address[street]\", person_address.CITY_NAME AS \"address[city]\", person_address.STATE_PROVINCE_NAME AS \"address[state]\", person_address.POSTAL_CODE AS \"address[zip]\" FROM person LEFT JOIN person_email ON person_email.PERSON_ID = person.PERSON_ID LEFT JOIN person_address ON person_address.PERSON_ID = person.PERSON_ID ORDER BY person.PERSON_ID ASC",
    "type": "person",
    "index": "test",
    "fetchsize": 1000
  }
}
```

With this data:

| _id  | firstName | lastName | email             | street       | city     | state     | zip   |
|------|-----------|----------|-------------------|--------------|----------|-----------|-------|
| 4289 | Frank     | Smith    | [email protected] | 123 Main St. | Superior | Wisconsin | 54880 |
| 4289 | Frank     | Smith    | [email protected] | 123 Main St. | Superior | Wisconsin | 54880 |

Results in this JSON:

```json
{
  "_index": "test",
  "_type": "person",
  "_id": "4289",
  "_version": 1,
  "found": true,
  "_source": {
    "firstName": "Frank",
    "middleName": null,
    "lastName": "Smith",
    "emailAddresses": [
      "[email protected]",
      "[email protected]"
    ],
    "address": [
      { "street": "123 Main St.", "city": "Superior", "state": "Wisconsin", "zip": "54880" },
      { "street": "123 Main St.", "city": "Superior", "state": "Wisconsin", "zip": "54880" }
    ]
  }
}
```

We have not found a way to prevent the array data duplication, and it becomes a significant issue when several fields have multiple values, as the email address does here: the arrays grow as the Cartesian product of all the multi-valued fields.
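The fan-out comes from the two LEFT JOINs: every address row is repeated once per matching email row, so the result set contains one row per (email, address) combination. A minimal sketch that reproduces this with an in-memory SQLite database (the schema is a simplified stand-in for the tables in the issue, not the actual one):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person         (person_id INTEGER, first_name TEXT);
    CREATE TABLE person_email   (person_id INTEGER, email_address TEXT);
    CREATE TABLE person_address (person_id INTEGER, address_1 TEXT);
    INSERT INTO person VALUES (4289, 'Frank');
    -- two e-mail rows, but only one address row
    INSERT INTO person_email VALUES (4289, 'a@example.com'), (4289, 'b@example.com');
    INSERT INTO person_address VALUES (4289, '123 Main St.');
""")

rows = con.execute("""
    SELECT p.person_id, pe.email_address, pa.address_1
    FROM person p
    LEFT JOIN person_email   pe ON pe.person_id = p.person_id
    LEFT JOIN person_address pa ON pa.person_id = p.person_id
""").fetchall()

# 2 email rows x 1 address row = 2 joined rows: the single address shows up twice.
print(len(rows))             # 2
print([r[2] for r in rows])  # ['123 Main St.', '123 Main St.']
```

With two addresses instead of one, the same query would return four rows (2 × 2), which is the "product of all the fields" the report describes.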

Is this a known issue? Is there a workaround?

Thanks, Tony

tmolin avatar Jan 29 '15 01:01 tmolin

Duplicated array fields are, unfortunately, the current behavior. A workaround is not to populate your array fields through the river (use the river only to index the base documents) and then use a separate process to index just the array-based fields.

I think the only way this project could fix this behavior is to de-duplicate the entries of each array based on some convention, such as an `_id` field in each array item.
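The convention described above could also be applied client-side today, as a post-processing step before (re)indexing. A sketch under the assumption that each array item carries an `_id` key; the document shape mirrors the `_source` in the issue, but the `_id` values and the helper name are illustrative, not part of the plugin:

```python
def dedupe_by_id(items, key="_id"):
    """Keep only the first occurrence of each array item, identified by `key`."""
    seen = set()
    out = []
    for item in items:
        item_id = item.get(key)
        if item_id in seen:
            continue
        seen.add(item_id)
        out.append(item)
    return out

doc = {
    "emailAddresses": ["a@example.com", "b@example.com"],
    "address": [
        # both rows came from the same person_address record, hence the same _id
        {"_id": 7, "street": "123 Main St.", "city": "Superior", "state": "Wisconsin", "zip": "54880"},
        {"_id": 7, "street": "123 Main St.", "city": "Superior", "state": "Wisconsin", "zip": "54880"},
    ],
}
doc["address"] = dedupe_by_id(doc["address"])
print(len(doc["address"]))  # 1
```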

mradamlacey avatar Jul 23 '15 15:07 mradamlacey

@mradamlacey @jprante This all comes down to simply adding the joined entity's id (`joined_entity.id`) to the SELECT for the array fields, which would fix the issue. I have just run into it myself. Hope to see that feature implemented.
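Selecting the joined table's primary key does give each array entry a natural identity to collapse on. A sketch of how a consumer of the fanned-out rows could use that key once it is in the SELECT (again an illustrative SQLite schema and column names, not the plugin's own logic):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person_email   (person_id INTEGER, email_address TEXT);
    CREATE TABLE person_address (person_address_id INTEGER, person_id INTEGER, address_1 TEXT);
    INSERT INTO person_email VALUES (4289, 'a@example.com'), (4289, 'b@example.com');
    INSERT INTO person_address VALUES (7, 4289, '123 Main St.');
""")

# The joined entity's primary key (person_address_id) is included in the SELECT.
rows = con.execute("""
    SELECT pe.email_address, pa.person_address_id, pa.address_1
    FROM person_email pe
    LEFT JOIN person_address pa ON pa.person_id = pe.person_id
""").fetchall()

emails, addresses = set(), {}
for email, addr_id, street in rows:
    emails.add(email)
    addresses[addr_id] = {"street": street}  # same key overwrites: no duplicate entry

print(sorted(emails))            # two distinct e-mails survive
print(list(addresses.values()))  # a single address entry survives
```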

Antoniossss avatar Mar 31 '17 10:03 Antoniossss