
How to avoid "nested document structures may lead to repetitions of the same group."

Open · istvano opened this issue 10 years ago · 19 comments

Hello,

I have configured my river and it works like a charm. It is great. However, I have a complicated query, so I end up with tons of duplicates in the resulting object, as in your example, where the association appears multiple times:

```json
{
  "blog": {
    "name": "Joe",
    "association": [
      { "name": "John", "id": "3917" },
      { "name": "John", "id": "3917" },
      { "name": "John", "id": "3917" }
    ],
    "published": "2014-01-06 00:00:00"
  }
}
```

This is not a blocker, but I am wondering where I could extend your river to avoid it. The reason is that I do lots of nested aggregations, and my bucket counts are incorrect because of the duplicate elements. Also, my query is rather complicated, so sometimes I end up with hundreds of duplicated elements, which does not look nice.

Any hints? It should be fairly simple, and I am happy to add it myself if you tell me at which point this could be implemented in the river.

It should be possible to check whether an element is already in the array and skip adding it again if it is.

Thanks in advance.

istvano avatar Sep 03 '14 09:09 istvano

The challenge is that JSON objects have no identity, so it is not straightforward to detect equality between two instances of {"name" : "John", "id" : "3917"}. I am not sure how the "tabular stream to JSON" conversion could be configured to find object equivalence, or how to define that equivalence properly. Maybe just string comparison of the serialized JSON objects? But what about attribute ordering? This is rather complex, but not unsolvable, so I am thankful for any suggestions on how to treat JSON object equivalence properly.

jprante avatar Sep 03 '14 11:09 jprante
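One way to sidestep the attribute-ordering question raised above is to compare a canonical serialization, i.e. re-serialize each object with its keys sorted before comparing strings. A minimal standalone sketch of that idea (the class and method names are made up for illustration; this is not code from the plugin):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class JsonEquivalence {

    // Build a canonical string for a parsed JSON-like structure:
    // maps are rendered with their keys sorted, so attribute order
    // no longer matters for equality; lists keep their order.
    static String canonical(Object value) {
        if (value instanceof Map) {
            Map<String, String> sorted = new TreeMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                sorted.put(String.valueOf(e.getKey()), canonical(e.getValue()));
            }
            return sorted.toString();
        }
        if (value instanceof List) {
            StringBuilder sb = new StringBuilder("[");
            for (Object o : (List<?>) value) {
                sb.append(canonical(o)).append(',');
            }
            return sb.append(']').toString();
        }
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        Map<String, Object> a = new LinkedHashMap<>();
        a.put("name", "John");
        a.put("id", "3917");
        Map<String, Object> b = new LinkedHashMap<>();
        b.put("id", "3917");   // same content, different attribute order
        b.put("name", "John");
        System.out.println(canonical(a).equals(canonical(b))); // prints: true
    }
}
```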

Hi,

Couldn't we add them first into a Set to make sure there are no duplicates, and then add them to the JSON? I know that is oversimplifying it, but I am wondering: we have the path in the object, so we could keep an individual Set for each path and render them into the object once all the corresponding rows have been processed.

Can you also give me some pointers (to save time) on where in the code you convert a row to JSON, so that I can have a quick look?

Thanks

istvano avatar Sep 03 '14 12:09 istvano

The tabular stream to JSON conversion is an incremental procedure that adds a key/value pair to an existing map. See the merge() method in https://github.com/jprante/elasticsearch-river-jdbc/blob/master/src/main/java/org/xbib/elasticsearch/plugin/jdbc/PlainKeyValueStreamListener.java

jprante avatar Sep 03 '14 13:09 jprante
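Combining the per-path Set idea above with this incremental merge step could look roughly as follows. This is a rough standalone sketch, not the actual merge() method or its signature, and it reuses the hypothetical canonical() helper from the earlier sketch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DedupOnMerge {

    // One Set of canonical forms per array path, so each nested array
    // only ever receives an element it has not seen before.
    private final Map<String, Set<String>> seenPerPath = new HashMap<>();
    private final Map<String, List<Object>> arrays = new HashMap<>();

    // Hypothetical merge step: append 'element' to the array at 'path'
    // only if an equivalent element is not already present.
    void mergeIntoArray(String path, Object element) {
        String key = JsonEquivalence.canonical(element);
        Set<String> seen = seenPerPath.computeIfAbsent(path, p -> new HashSet<>());
        if (seen.add(key)) {
            arrays.computeIfAbsent(path, p -> new ArrayList<>()).add(element);
        }
    }
}
```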

Hi,

Thanks a lot! I guess I could use https://github.com/jprante/elasticsearch-river-jdbc/blob/master/src/test/java/org/xbib/elasticsearch/river/jdbc/support/ValueListenerTests.java and testMultipleValues to test my ideas?

istvano avatar Sep 03 '14 13:09 istvano

All the tests are related, but I assume you use bracket notation, so testArrays() is very close.

jprante avatar Sep 03 '14 15:09 jprante

Hi,

I have created a draft version which passes the tests and seems to be working fine for me. Please note this is a very quick POC. Unfortunately, I could not fully understand the builder around it; the draft is there just to see whether the idea could actually work.

Do you think you could spare 15 minutes sometime to have a quick look?

Thanks

istvano avatar Sep 17 '14 20:09 istvano

Hi,

I am sorry, I forgot to add my commit: https://github.com/istvano/elasticsearch-river-jdbc/commit/3ffdfd5735ab1357f5bae0d1d423259827038d47

istvano avatar Sep 21 '14 13:09 istvano

So this solution will basically collapse multiple identical nodes into one? That would be really nice :+1:

dan-lind avatar Sep 24 '14 09:09 dan-lind

Agreed. I am just getting started with ES and the JDBC river and have run into this. It could really help clean up the data document kept in ES. I haven't looked at the code base at all, so I don't know if this would work, but could you potentially treat a 'multiple_values[_id]' column as special, using its _id property to mark elements as unique? Sort of like how the top-level _id column is used to collapse top-level joins. This may be 100% unfeasible with the current implementation; I just thought I would throw the idea out there.

dhensche avatar Sep 24 '14 13:09 dhensche
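If such a special _id column existed, deduplication would no longer need structural equality at all: elements sharing an _id would simply collapse into one. A minimal sketch of that idea, assuming the nested rows arrive as maps (the "_id" discriminator field is the hypothetical part; nothing here is importer code):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupById {

    // Collapse a list of nested objects by a designated identity field:
    // the first occurrence of each _id wins, later duplicates are dropped,
    // and insertion order is preserved.
    static List<Map<String, Object>> collapseById(List<Map<String, Object>> elements) {
        Map<Object, Map<String, Object>> byId = new LinkedHashMap<>();
        for (Map<String, Object> element : elements) {
            byId.putIfAbsent(element.get("_id"), element);
        }
        return new ArrayList<>(byId.values());
    }
}
```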

Well, if you feel like testing my version, you can build it from my repo. It is not an ideal solution, but it works for me.

istvano avatar Sep 25 '14 11:09 istvano

I have the same issue. I've tested your patch and it works fine!

I think it should be integrated into the master branch (after the tests have passed).

CriztianiX avatar Nov 18 '14 18:11 CriztianiX

What about this issue? Could it be merged into master? If this feature might trigger undesired effects, maybe we could add a parameter to enable/disable it?

Thanks.

improved-broccoli avatar Sep 22 '15 20:09 improved-broccoli

Hi,

Not sure. I used this once for a project, but it is no longer in production. Thanks.

istvano avatar Sep 23 '15 12:09 istvano

Thanks for replying.

I tried to use your patch, then I read here that rivers will be deprecated in the next ES releases, so I moved on to the bulk API :/


improved-broccoli avatar Sep 23 '15 13:09 improved-broccoli

The JDBC importer will be supported in the future.

The river API is to be removed, but that has nothing to do with the bulk API; the JDBC river and the JDBC importer both use the bulk API.

It is still an open question whether the JDBC importer will be able to run as a plugin in Elasticsearch 2.0 so that users can install the importer more easily.

If the patch is considered useful, I can test it and add it to the next version. Pull requests are always welcome.

jprante avatar Sep 23 '15 14:09 jprante

Thanks @istvano, your "workaround" worked for me!

rodrigoma avatar Nov 23 '15 16:11 rodrigoma

Hi Julien,

I think the patch is useful. I have used it in production for quite a while, although it is not beautiful. I did not want to send a pull request because I am not sure about it (whether it was implemented in the right place, etc.).

I am really new to rivers and all that, so I did not want to push low-quality code into the code base.

thanks

istvano avatar Dec 01 '15 11:12 istvano

Having the same issue... The proposed solution has not been added to master yet, right? And since I am very late here, is there any other configuration parameter that handles it?

luckypur avatar Nov 29 '16 10:11 luckypur

I have the problem below; could you help me solve it? I have this mapping:

```json
{
  "mappings": {
    "properties": {
      "sys_code": { "type": "string", "fielddata": { "loading": "eager" } },
      "values_id": { "type": "integer", "doc_values": false },
      "values_name": { "type": "string", "fielddata": { "loading": "eager" } }
    }
  }
}
```

and this SQL expression:

```sql
SELECT
    'test' AS "_index",
    'type_primary' AS "_type",
    sus.syouhin_sys_code AS "_id",
    CAST(sus.sys_code AS CHAR) AS "sys_code",
    values_id AS "values_id[]",
    values_name AS "values_name[]"
FROM syouhin_table sus
```

and my database result:

```
sys_code:    21554223
values_id:   ["9460418", "9460421", "9462677", "9462682", "9462688", "9464051", "9464052", "9464053"]
values_name: ["test1", "test2", "test3", "test4", "test5", "test6", "test7", "test3"]
```


but my index result differs from the database:

```json
{
  "_index": "test",
  "_type": "type_primary",
  "_id": "21554223",
  "_score": 1,
  "_source": {
    "sys_code": "21554223",
    "values_id": [ "9460418", "9460421", "9462677", "9462682", "9462688", "9464051", "9464052", "9464053" ],
    "values_name": [ "test1", "test2", "test3", "test4", "test5", "test6", "test7" ]
  }
}
```

Two values_id entries (9462677 and 9464053) share the same values_name, "test3". Because "test3" was treated as a duplicate, it appears only once in values_name, but I need the index to keep the pairing, like this:

```
_id: "21554223"
values_id:   [ 9462677, 9464053 ]
values_name: [ "test3", "test3" ]
```

How can I solve this problem? Please help, thanks.

Hon

tomochabt avatar Feb 02 '17 10:02 tomochabt
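The root cause here is that values_id and values_name are two parallel arrays that get deduplicated independently, which breaks their positional alignment. A way around it is to decide equality on the (id, name) pair rather than on the name alone, for example by zipping the two columns into one array of objects before any deduplication happens. A minimal standalone Java sketch of that data-modeling idea (illustration only, not importer code):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PairedValues {

    // Zip two parallel arrays into one array of {id, name} objects.
    // Deduplicating these pairs keeps "test3" twice, because its two
    // occurrences carry different ids and are therefore not equal.
    static List<Map<String, String>> pair(List<String> ids, List<String> names) {
        Set<Map<String, String>> pairs = new LinkedHashSet<>();
        for (int i = 0; i < ids.size(); i++) {
            Map<String, String> pair = new LinkedHashMap<>();
            pair.put("id", ids.get(i));
            pair.put("name", names.get(i));
            pairs.add(pair); // the Set drops only fully identical pairs
        }
        return new ArrayList<>(pairs);
    }
}
```

With the id and name kept together in one object, a duplicate is only dropped when both fields match, so the index would retain both "test3" entries alongside their distinct ids.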