elasticsearch-jdbc
How to avoid "nested document structures may lead to repetitions of the same group."
Hello,
I have configured my river and they work like a charm. It is great. Although I have a complicated query so I end up with tons of duplications in the object like in your example: where association is there multiple times.
{
  "blog" : {
    "name" : "Joe",
    "association" : [
      { "name" : "John", "id" : "3917" },
      { "name" : "John", "id" : "3917" },
      { "name" : "John", "id" : "3917" }
    ],
    "published" : "2014-01-06 00:00:00"
  }
}
This is not a problem as such, but I am wondering where I could extend your river to avoid it. The reason is that I do lots of nested aggregations, and my bucket counts are incorrect because of the duplicate elements. Also, my query is rather complicated, so sometimes I have hundreds of duplicated elements, which does not look nice.
Any hints? It would be fairly simple, and I am happy to add it if you tell me at which point this could be implemented in the river.
It should be possible to check whether the same element is already in the array, and skip adding it again if it is.
Thanks in advance.
The challenge is that JSON objects have no identity, so it is not straightforward to detect equality between instances of { "name" : "John", "id" : "3917" }. I am not sure how the "tabular stream to JSON" conversion could be configured to find object equivalence, or how to define the equivalence properly. Maybe just string comparison of the serialized JSON object? But what about attribute ordering? This is rather complex, but not unsolvable, so I am thankful for any suggestions on how to treat JSON object equivalence properly.
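One way to sidestep the attribute-ordering problem is to compare the nested objects as parsed maps rather than as serialized strings: map equality ignores key order, while string comparison of the serialized form does not. A minimal standalone sketch in plain Java (not the plugin's code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JsonEquivalence {
    public static void main(String[] args) {
        // The same logical object, built with a different attribute order.
        Map<String, Object> a = new LinkedHashMap<>();
        a.put("name", "John");
        a.put("id", "3917");

        Map<String, Object> b = new LinkedHashMap<>();
        b.put("id", "3917");
        b.put("name", "John");

        // String comparison of the serialized form is order-sensitive...
        System.out.println(a.toString().equals(b.toString())); // false

        // ...but Map.equals() compares entry sets, so key order is irrelevant.
        System.out.println(a.equals(b)); // true
    }
}
```

So if the tabular stream already represents each nested object as a map, equality is well-defined without worrying about serialization order.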
Hi,
Couldn't we first add them to a Set to make sure there are no duplicates, and then add them to the JSON? I know that is oversimplifying it, but I am wondering: we have the path in the object, so we could keep an individual Set for each path and render the sets into the object once all corresponding rows have been processed.
Can you also give me some pointers (to save time) to where in the code you convert a row to JSON, so that I can have a quick look?
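The Set idea can be sketched with a LinkedHashSet per nested path, which keeps insertion order but drops objects that compare equal. This is a standalone illustration, not the river's actual code:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class SetDedup {
    public static void main(String[] args) {
        // Three rows that produce the same nested object, as in the example above.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("name", "John");
        row.put("id", "3917");

        // One LinkedHashSet per nested path: equal maps are silently dropped,
        // and insertion order of the distinct objects is preserved.
        Set<Map<String, Object>> association = new LinkedHashSet<>();
        association.add(row);
        association.add(new LinkedHashMap<>(row));
        association.add(new LinkedHashMap<>(row));

        System.out.println(association.size()); // 1
    }
}
```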
Thanks
The tabular stream to JSON conversion is an incremental procedure which adds a key/value pair to an existing map. See the merge() method in https://github.com/jprante/elasticsearch-river-jdbc/blob/master/src/main/java/org/xbib/elasticsearch/plugin/jdbc/PlainKeyValueStreamListener.java
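For illustration, a hypothetical mergeOnce() helper (not the plugin's actual merge()) shows where such a duplicate check could sit: before appending a nested object to the array under a key, test whether an equal object is already in the list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeGuard {
    // Hypothetical helper: append a nested object to the array under `key`
    // only if an equal object is not already there.
    static void mergeOnce(Map<String, Object> doc, String key, Map<String, Object> value) {
        @SuppressWarnings("unchecked")
        List<Map<String, Object>> list =
                (List<Map<String, Object>>) doc.computeIfAbsent(key, k -> new ArrayList<Map<String, Object>>());
        if (!list.contains(value)) {   // List.contains() relies on Map.equals()
            list.add(value);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        Map<String, Object> john = new LinkedHashMap<>();
        john.put("name", "John");
        john.put("id", "3917");

        mergeOnce(doc, "association", john);
        mergeOnce(doc, "association", new LinkedHashMap<>(john));
        System.out.println(doc); // {association=[{name=John, id=3917}]}
    }
}
```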
Hi,
Thanks a lot! I guess I could use https://github.com/jprante/elasticsearch-river-jdbc/blob/master/src/test/java/org/xbib/elasticsearch/river/jdbc/support/ValueListenerTests.java and its testMultipleValues method to test my ideas?
All the tests are related, but since I assume you use bracket notation, testArrays() is the closest.
Hi,
I have created a draft version which passes the tests and seems to be working fine for me. Please note this is a very quick proof of concept. "Unfortunately" I could not fully understand the builder around it; the draft is there just to see if the idea could actually work.
Do you think you could spare 15 minutes sometime to have a quick look?
Thanks
Hi,
I am sorry, I forgot to add my commit: https://github.com/istvano/elasticsearch-river-jdbc/commit/3ffdfd5735ab1357f5bae0d1d423259827038d47
So this solution will basically collapse multiple identical nodes into one? That would be really nice :+1:
Agreed. I am just getting started with ES and the JDBC river and have run into this. It could really help clean up the data document being kept in ES. I haven't looked at the code base at all, so I don't know if this would work, but could you potentially treat a 'multiple_values[_id]' column as special, using its _id property to mark rows as unique? Sort of how the top-level _id column is used to collapse top-level joins. This may be 100% unfeasible with the current implementation; I just thought I would throw the idea out there.
Well, if you feel like testing my version, you can build it from my repo. It is not an ideal solution, but it works for me.
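The 'multiple_values[_id]' idea might look roughly like this sketch: collapse nested rows on the value of a designated _id column, the way the top-level _id column collapses joins. Both the column name and the behavior here are hypothetical, not importer features:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IdKeyedDedup {
    public static void main(String[] args) {
        // Hypothetical nested rows where one column acts as a per-object _id.
        List<Map<String, Object>> rows = new ArrayList<>();
        for (String[] r : new String[][] {{"3917", "John"}, {"3917", "John"}, {"3918", "Jane"}}) {
            Map<String, Object> m = new LinkedHashMap<>();
            m.put("_id", r[0]);
            m.put("name", r[1]);
            rows.add(m);
        }

        // Collapse on the _id value alone: the last row with a given _id wins,
        // and insertion order of distinct ids is preserved.
        Map<Object, Map<String, Object>> byId = new LinkedHashMap<>();
        for (Map<String, Object> m : rows) {
            byId.put(m.get("_id"), m);
        }
        System.out.println(byId.size()); // 2
    }
}
```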
I have the same issue. I've tested your patch and it works fine!
I think it should be integrated into the master branch (after the tests have passed).
What about this issue? Could it be merged into master? If this feature might trigger undesired effects, maybe we could add a parameter to enable/disable it?
Thanks.
Hi,
Not sure. I used this once for a project, but it is no longer in production. Thanks.
Thanks for replying.
I tried to use your patch, then I read here that rivers will be deprecated in upcoming ES releases, so I moved on to the bulk API :/
The JDBC importer will be supported in the future.
The river API is to be removed, but that has nothing to do with the bulk API; the JDBC river and the importer both use the bulk API.
It is still open whether the JDBC importer will be able to run as a plugin in Elasticsearch 2.0, so that users can install the importer more easily.
If the patch is considered useful, I can test it and add it to the next version. Pull requests are always welcome.
Thanks @istvano, your "workaround" worked for me!
Hi Julien,
I think the patch is useful. I have used it in production for a long while, although it is not beautiful. I did not want to send a pull request as I am not sure about it (whether it was implemented at the right place, etc.).
I am really new to rivers and all that, so I did not want to submit low-quality code into the code base.
thanks
I am having the same issue... The proposed solution has not been added to master yet, right? And as I am very late here: is there any other configuration parameter that handles it?
I had the problem below; could you help me solve it? I have this mapping:
mappings: {
  properties: {
    sys_code: { type: "string", fielddata: { loading: "eager" } },
    values_id: { type: "integer", doc_values: false },
    values_name: { type: "string", fielddata: { loading: "eager" } }
  }
}
and this SQL expression:
SELECT
"'test'" as _index,
"type_primary" as _type,
sus.syouhin_sys_code AS "_id",
CAST(sus.sys_code AS CHAR) AS sys_code,
values_id AS "values_id[]",
values_name AS "values_name[]"
FROM syouhin_table sus
and my database result is:

sys_code: 21554223
values_id: ["9460418","9460421","9462677","9462682","9462688","9464051","9464052","9464053"]
values_name: ["test1","test2","test3","test4","test5","test6","test7","test3"]

but my index result differs from the database:

_index: "test", _type: "type_primary", _id: "21554223", _score: 1,
_source: {
  sys_code: "21554223",
  values_id: [ "9460418", "9460421", "9462677", "9462682", "9462688", "9464051", "9464052", "9464053" ],
  values_name: [ "test1", "test2", "test3", "test4", "test5", "test6", "test7" ]
}

Two values_id entries (9462677 and 9464053) have the same values_name ("test3"), and the duplicate "test3" was dropped from values_name, but I need the index to contain:

_id: "21554223",
values_id: [ 9462677, 9464053 ],
values_name: [ "test3", "test3" ]

How can I solve this problem? Please help, thanks.
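One observation that may help frame this problem: deduplicating each column independently loses the id-to-name pairing, whereas deduplicating the (id, name) pairs jointly keeps both "test3" entries, because each is paired with a different id. A standalone sketch of that difference (not importer behavior):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PairDedup {
    public static void main(String[] args) {
        String[] ids   = {"9460418","9460421","9462677","9462682","9462688","9464051","9464052","9464053"};
        String[] names = {"test1","test2","test3","test4","test5","test6","test7","test3"};

        // Deduplicating names alone drops the second "test3"...
        Set<String> nameSet = new LinkedHashSet<>(Arrays.asList(names));
        System.out.println(nameSet.size()); // 7

        // ...but deduplicating (id, name) pairs jointly keeps all eight entries,
        // since no pair repeats: "test3" appears with two different ids.
        Set<List<String>> pairs = new LinkedHashSet<>();
        for (int i = 0; i < ids.length; i++) {
            pairs.add(Arrays.asList(ids[i], names[i]));
        }
        System.out.println(pairs.size()); // 8
    }
}
```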
Hon