xp
Stemming only works on allText in content layer
I have tried https://github.com/ComLock/app-stemming-example on XP 7.6.1 and stemming doesn't work.
Current docs I can find: https://developer.enonic.com/docs/xp/stable/storage/indexing#stemmed https://developer.enonic.com/docs/xp/stable/storage/noql#stemmed
Future doc supposed to work: https://developer.enonic.com/docs/xp/next/storage/indexing#languages
https://github.com/ComLock/app-stemming-example/blob/master/src/main/resources/main.es#L13-L18
fulltext: true, // Needed for stemming?
includeInAllText: true, // Needed for stemming?
languages: ['no'],
stemmed: true // Not reflected in node, nor in documentation, so some core dev must have told me about this?
https://github.com/ComLock/app-stemming-example/blob/master/src/main/resources/main.es#L114
query: stemmed('_allText', '${word}', 'OR', 'no')
When using data toolbox I can see there is no _alltext._stemmed_no or any other fieldName._stemmed_no under Display Search Index Document
Also, when exporting and inspecting node.xml, it doesn't look anything like the _indexConfig. So the export format is vastly different from the node JSON. Why???
When making a normal content site and setting the language to "no", there is an _alltext._stemmed_no. But when looking at the node JSON, there are no languages set under _indexConfig!
This is what I think makes stemming work for content in the exported node.xml
<allTextIndexConfig>
<languages>
<language>no</language>
</languages>
</allTextIndexConfig>
- There is a passage in the new doc: "While setting the language for the content will only index the _allText field, setting the languages in the node config will create stemmed indices for all mapped properties. See node create function." This means we don't create a stemmed _allText index for simple nodes, only for content. So the stemmed('_allText') function will return nothing for a node.
- There is a lie in the new doc: a custom index cannot be created for the _name field, it's always fulltext. So it can't be stemmed. The doc will be fixed. (Most default node properties, the ones starting with "_", have predefined indices, so _indexConfig won't affect them.)
- I don't know why the XML export format differs from the JSON lib format. It looks like the XML was carried over from older Enonic XP versions and the JSON was introduced later. Not everything that is good for JSON is suitable for XML.
Documentation issue created https://github.com/enonic/doc-xp/issues/307
@sigdestad Can you have a look at this. If I understand correctly nothing has changed, stemming is still impossible for the node layer?
@rymsha something is obviously not working as expected. Could we arrange a meeting to clarify things related to stemming for med and CWE?
@ComLock from what I see here, you are still trying to stem the _name field. This will not work, as Slava described in the comment above (and this is what he fixed in the docs too).
If you have an example where stemming doesn't work for a field that is supposed to work, please commit this to your app's repo and we will look at the code.
I have now updated the example, and I can see there is a stemmed index, but still no hits. https://github.com/ComLock/app-stemming-example/blob/master/src/main/resources/main.es#L47
"property._stemmed_no": [
"havnedistriktene"
],
Are there automatic regression tests for stemming on the node layer somewhere? It would be nice to look at some working example code (since my example code has flaws in it).
custom index cannot be created for the _name field, it's always fulltext
What does that mean? You can't both have a fulltext and stemmed index of a field? So I have to set fulltext to false in indexConfig, in order for stemming to work? Nah it works with fulltext: true
Got it working, looking into why. Maybe connection.refresh();
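For anyone following along, here is a minimal sketch of the create, refresh, query order suspected above. It assumes `connection` is the object returned by `connect()` from /lib/xp/node; the helper name `createAndSearch` and the `count` value are made up for illustration, and whether the refresh was really the missing piece is not confirmed in this thread.

```javascript
// Sketch only. `connection` is assumed to be a /lib/xp/node connection,
// passed in as a parameter so the function can also be run against a stub.
// `createAndSearch` is a hypothetical helper name, not part of XP.
function createAndSearch(connection, nodeParams, query) {
  // Create the node (nodeParams would carry the data and _indexConfig).
  const node = connection.create(nodeParams);

  // Refresh before querying, on the assumption that a freshly created
  // node may not be visible to search until the index is refreshed.
  connection.refresh();

  // Run the NoQL query and collect the ids of the hits.
  const result = connection.query({ query, count: 10 });
  return {
    createdId: node._id,
    hitIds: result.hits.map((hit) => hit.id)
  };
}
```

In an XP app the connection would come from something like `connect({ repoId, branch })`, with repo and branch values of your own; those are not shown here.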
So stemmed('_allText') function will return nothing for node.
@sigdestad If I understand correctly even though I say includeInAllText: true on some field, there will be no _alltext._stemmed_no. I can live with that in explorer. But might be something we want in the future?
So, includeInAllText just indicates if a property should be included when creating the _allText "virtual field". What we need is that _allText gets stemmed.
Something like: if any field indexConfig (including default) has both includeInAllText: true and any lang in languages: [], then an _alltext._stemmed_LANG needs to be created per LANG.
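To make the proposed rule concrete, here is a rough sketch of it as a plain function. This is not XP behavior and not an XP API: `proposedAllTextStemmedIndices` is a hypothetical name, and the `_indexConfig` shape (`default` plus `configs: [{ path, config }]`) is assumed from the node JSON discussed in this issue.

```javascript
// Hypothetical helper illustrating the proposal only; XP does not do this.
// Returns the _alltext._stemmed_LANG index names the rule would create.
function proposedAllTextStemmedIndices(indexConfig) {
  // Look at the default config and at every per-path config.
  const candidates = [indexConfig.default].concat(
    (indexConfig.configs || []).map((entry) => entry.config)
  );

  const languages = new Set();
  for (const config of candidates) {
    // Template names such as 'byType' are strings and carry no languages.
    if (!config || typeof config !== 'object') continue;
    // Only configs that feed _allText contribute their languages.
    if (config.includeInAllText === true) {
      for (const lang of config.languages || []) languages.add(lang);
    }
  }

  // One stemmed _alltext index per distinct language, in sorted order.
  return Array.from(languages).sort().map((lang) => `_alltext._stemmed_${lang}`);
}
```

A field with `includeInAllText: false` contributes nothing under this rule, even if it lists languages, since its value never ends up in _allText.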
includeInAllText and languages are separate configs: includeInAllText indicates whether the value of the mapped field/fields will be added to the _allText index, and languages sets the array of stemmed indices for the mapped field/fields only. Adding a stemmed index for the special _allText field was designed as a content-level thing only, for the single language field value of the content.
So if you create a node with node-lib, the only way to add stemmed indices, for now, is to set them for particular fields with the languages property.
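As an illustration of the per-field approach, here is a sketch that builds an `_indexConfig` entry with `languages` set and the matching stemmed query string. The helper names are hypothetical, the field name `property` and the word are taken from the example app, and the full set of config keys follows the node JSON layout as I understand it, so treat the exact keys as an assumption.

```javascript
// Hypothetical helper: per-path index config with stemming languages set.
function stemmedFieldConfig(path, languages) {
  return {
    path: path,
    config: {
      decideByType: false,
      enabled: true,
      nGram: false,
      fulltext: true,
      includeInAllText: true,
      path: false,
      indexValueProcessors: [],
      languages: languages
    }
  };
}

// Hypothetical helper: NoQL stemmed() expression for one field and language.
function stemmedQuery(field, word, language, operator) {
  // No escaping is attempted here, so words containing quotes are rejected.
  if (String(word).indexOf("'") !== -1) {
    throw new Error('word must not contain single quotes');
  }
  return `stemmed('${field}', '${word}', '${operator || 'OR'}', '${language}')`;
}

// Would be passed as _indexConfig when creating the node with node-lib.
const indexConfig = {
  default: 'byType',
  configs: [stemmedFieldConfig('property', ['no'])]
};

// Queries the field itself, not _allText.
const query = stemmedQuery('property', 'havnedistriktene', 'no');
// query is: stemmed('property', 'havnedistriktene', 'OR', 'no')
```

Note that the query targets `property` directly; per the explanation above, pointing it at `_allText` would return nothing for a plain node.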
But doesn't content API use node API directly? How can it do something else than what node api supports?
Basically, what I would like to know is how to specify that _allText should be stemmed?
Yes, the content API uses the node API, but the _allText stemming feature is done in the inner core-content module, and this functionality hasn't been exposed in either the content or the node JS lib. It was decided to make it a content-specific feature. So there is no way to influence _allText indexing directly for now.
Ok, so we basically need to define a proper solution for this in the node API, implement and document it.
There is at least one known "problem" with this in the export/import xml that we also need to look into
It is still not exactly clear what to do...
- proper solution for "this", but what is "this"? Do we want stemming on other fields?
- what is exactly wrong with import/export? does it not preserve stemming configuration?
- this -> There is no way of specifying stemming for _allText via node API
- import/export -> currently uses an undocumented obscure format for transferring stemming info (at least for _allText)
- Need to consider if it is breaking or not...
I don't care that much about the import/export syntax.
But since Explorer 4 uses _alltext as the default search, it would be nice if stemming of _alltext on the node layer worked from the get-go, without people having to boost/query specific stemmed fields in order for stemming to work...
So I think the label of Documentation is misleading. For me this is rather a Feature Request or similar.