elasticsearch-jdbc
Array data is duplicated when other fields have multiple values
We are having a problem with array data getting duplicated when other fields (in this case the email address) have more than one value. The duplication does not occur when the home address data is not grouped into an array.
This river (note the escaped quotes around the column aliases, which are required for valid JSON):

{
    "type": "jdbc",
    "jdbc": {
        "sql": "SELECT
            person.person_id AS \"_id\",
            person.first_name AS \"firstName\",
            person.middle_name AS \"middleName\",
            person.last_name AS \"lastName\",
            person_email.email_address AS \"emailAddresses\",
            person_address.ADDRESS_1 AS \"address[street]\",
            person_address.CITY_NAME AS \"address[city]\",
            person_address.STATE_PROVINCE_NAME AS \"address[state]\",
            person_address.POSTAL_CODE AS \"address[zip]\"
            FROM person
            LEFT JOIN person_email ON person_email.PERSON_ID = person.PERSON_ID
            LEFT JOIN person_address ON person_address.PERSON_ID = person.PERSON_ID
            ORDER BY person.PERSON_ID ASC",
        "type": "person",
        "index": "test",
        "fetchsize": 1000
    }
}
With this data:

_id   firstName  lastName  email            street        city      state      zip
4289  Frank      Smith     [email protected]  123 Main St.  Superior  Wisconsin  54880
4289  Frank      Smith     [email protected]  123 Main St.  Superior  Wisconsin  54880
Results in this JSON:

{
    "_index": "test",
    "_type": "person",
    "_id": "4289",
    "_version": 1,
    "found": true,
    "_source": {
        "firstName": "Frank",
        "middleName": null,
        "lastName": "Smith",
        "emailAddresses": [
            "[email protected]",
            "[email protected]"
        ],
        "address": [
            {
                "street": "123 Main St.",
                "city": "Superior",
                "state": "Wisconsin",
                "zip": "54880"
            },
            {
                "street": "123 Main St.",
                "city": "Superior",
                "state": "Wisconsin",
                "zip": "54880"
            }
        ]
    }
}
We have not discovered a way to prevent the array data duplication, and it becomes a significant issue when several fields have multiple values, as the email address does here: each array grows to the Cartesian product of all the multi-valued fields.
Is this a known issue? Is there a workaround?
Thanks, Tony
This is the current behavior; duplicated array entries are possible. A workaround is to not populate the array fields with the river (use the river only to fetch the base documents), and then use a separate process to index the array-based fields.
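As an alternative to a second indexing process, the duplicates can be stripped in a post-processing step before indexing. A minimal sketch (not part of the plugin; `dedupe_arrays` is a hypothetical helper, and the email values are placeholders):

```python
def dedupe_arrays(doc):
    """Remove duplicate entries from list-valued fields in a document.

    Dict entries are compared by their sorted key/value pairs, so identical
    address objects collapse to one while first-seen order is preserved.
    """
    result = {}
    for key, value in doc.items():
        if not isinstance(value, list):
            result[key] = value
            continue
        seen = set()
        deduped = []
        for item in value:
            # Build a hashable fingerprint: dicts become sorted item tuples.
            fingerprint = tuple(sorted(item.items())) if isinstance(item, dict) else item
            if fingerprint not in seen:
                seen.add(fingerprint)
                deduped.append(item)
        result[key] = deduped
    return result

# Illustrative document shaped like the river output above.
source = {
    "firstName": "Frank",
    "emailAddresses": ["frank@example.com", "fsmith@example.com"],
    "address": [
        {"street": "123 Main St.", "city": "Superior", "state": "Wisconsin", "zip": "54880"},
        {"street": "123 Main St.", "city": "Superior", "state": "Wisconsin", "zip": "54880"},
    ],
}
print(dedupe_arrays(source))
```

The duplicated address collapses to a single entry while the two distinct email addresses are kept.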
I think the only way this project could fix this behavior is to de-duplicate the entries in each array based on some convention, such as an _id field in each array item.
@mradamlacey @jprante
This could all come down to simply adding the joined entity's id (joined_entity.id) to the SELECT list for the array fields, which would fix the issue. I have just run into it myself. I hope to see this feature implemented.
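Combined with the _id-per-item convention suggested above, the SELECT might look like the sketch below. This is speculative: the plugin does not currently support it, and ADDRESS_ID / EMAIL_ID are assumed primary key columns of the joined tables.

```sql
SELECT
    person.person_id                    AS "_id",
    person.first_name                   AS "firstName",
    person_email.EMAIL_ID               AS "emailAddresses[_id]",
    person_email.email_address          AS "emailAddresses[address]",
    person_address.ADDRESS_ID           AS "address[_id]",
    person_address.ADDRESS_1            AS "address[street]",
    person_address.CITY_NAME            AS "address[city]",
    person_address.STATE_PROVINCE_NAME  AS "address[state]",
    person_address.POSTAL_CODE          AS "address[zip]"
FROM person
LEFT JOIN person_email   ON person_email.PERSON_ID   = person.PERSON_ID
LEFT JOIN person_address ON person_address.PERSON_ID = person.PERSON_ID
ORDER BY person.PERSON_ID ASC
```

With the join key exposed per array item, the plugin could discard rows that repeat an _id it has already seen for that array instead of appending a duplicate entry.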