Update LuceneQuerySource.cs
Fix issue #7665. Lucene document API now handles case where Lucene document contains duplicate fields names, representing a field that has multiple values. # #
Example of use case please.
To replicate the original bug prior to the update in this pull request (the exception is detailed in issue #7665):
- Create a taxonomy containing at least two terms.
- Create a ContentType with a taxonomy field using the above taxonomy that allows multiple terms to be selected.
- Create a ContentItem of the type in 2 above.
- Create a Lucene index including the ContentType in 2 above.
- Access the Lucene document API: /api/lucene/documents?indexName={name of lucene index from bullet 4}&query={%20"query":%20{%20"match_all":%20{}%20}%20}
Note that you may also need to set permissions so that you can access the Lucene document API from bullet 5.
Upon hitting the API the original code throws an exception because it does not handle the case when a lucene document contains more than one field with the same key, which appears to be the way multivalued taxonomy fields are added to the Lucene index. In short, the original code tries to add multiple JProperty objects having the same property name (i.e the name of the Lucene document field's key) to the JObject.
The updated code groups the Lucene document fields by field name and adds JPropery objects with appropriate values, i.e. either a string for groupings with only 1 value or IEnumerable
Hitting the API endpoint after the updates in the pull request are applied results in a successful serialization of the Lucene document. However the serialized json can now contains arrays of values so it is technically no longer in the typical Lucene format, which allows duplicate keys. It's possible there may be a better way to serialize this but I can't currently think of one.
So basically you are storing the JSON value in Lucene?
In this test example the taxonomy term ids are being stored (no json) and that is enough to break the API. For taxonomies this may not be all that useful since it's not storing the taxonomy term's text, but just the ids. But there are other cases where storing content field values might be useful and/or help with performance by not having to hit the database to pull back the entire contentitem when all you need is one or two field values, which might be multivalued, for use by a consumer of the API.
As an example, I'm currently creating a key/value pair field that allows multiple key/value pairs to be added to a content item. Storing those values can be useful for retrieving a list of content item titles and URLs along with the associated key/value pairs through the API without having to load the entire content item. I suspect there are many other use cases.
At the very least, since it is an option even for some built in fields, adding a stored multi-valued field to the index shouldn't cause an exception when hitting the Lucene document API. The pull request update ensures this doesn't happen and serializes the Lucene document in a way that makes sense to a consumer of json, while also capturing all of the stored data.
Alright thanks for the details. Will review it when I get time.
Can you check @Skrypt to see if this is still applicable and if we'd want to merge it after fixing the merge conflict? And do we need to do something similar for Elasticsearch too?
I've assigned myself to review this one. I need to test it to see how it will improve or break actual indexing.
Thanks!
This pull request has merge conflicts. Please resolve those before requesting a review.
Any news, @Skrypt?
Yeah, sorry, the news is I'm busy with Media Gallery and trying to find time to think about these when not fixing other bugs :smile:
OK, please check it out when you can :).
@Skrypt could you check on this, please?
It seems that this pull request didn't really move for quite a while. Is this something you'd like to revisit any time soon or should we close? Please comment if you'd like to pick it up.
Closing this pull request because it has been stale for very long. If you think this is still relevant, feel free to reopen it.