streamline
SAM event sampling search by key should support wildcard
Right now "Search By Key" seems to do an exact match, so the user needs to provide the full event text to be able to search.
We need to enable wildcard search here to make it more useful.
It doesn't need the full text to search. I have a working integration test which searches by a "word" in the doc. (Please refer to the internal integration tests.)
But you need to be aware that this doesn't ensure every word or sentence is searchable. It depends on how Solr breaks the doc into tokens and indexes them. I'm not sure exactly how Solr tokenizes, but I have some experience with search services.
If we want to know more details, we need to take a look at https://lucene.apache.org/solr/guide/6_6/understanding-analyzers-tokenizers-and-filters.html
And we just pass the search string to Solr, which means Solr treats it not as plain text but as a query. It supports Solr features like boolean search and wildcards.
So I have to say there is no way to guarantee that a doc is found when we query with a substring. The search engine doesn't guarantee a match on a substring. We have to live with that as long as we leverage a search engine.
@HeartSaVioR, I'm not sure about the default behavior of Solr, but I was never able to see results when I entered substrings in "Search by Key". However, I think I saw results when searching with partial ids in "Search by Id".
We use a wildcard here but not here. Does it matter? Do we need any special setting to search on partial values by default?
Assuming you're ingesting truck events: please search "KC Via", "KC Via Columbia", "KC Via Hanibal" which are working for me.
Boolean search needs to be guarded with () like ("KC Via " AND "Hanibal").
It looks like the English tokenizer basically splits docs by word, meaning it splits on whitespace, not on non-alphabetic characters.
Please search with `routeName=Saint` and `routeName=`. The latter only returns events whose KV is `routeName= blabla` (NOTE: the space). Events whose KV is `routeName=blabla` are not returned by the latter query.
While `truckId=74` doesn't return any result, `truckId=74,` (with the trailing comma) does, and `truckId=74*` does too.
Please note that `*` applies to a search term (token), not to the whole content.
https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html
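The token behaviour described above can be imitated with a small Python sketch. This is only a simulation under the assumption that the text is split on whitespace alone; it does not call Solr, and the event text and the helper names `whitespace_tokens` and `term_matches` are hypothetical.

```python
from fnmatch import fnmatchcase


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays attached to each token."""
    return text.split()


def term_matches(term, text):
    """True if any single token matches the term; '*' may only span one token."""
    return any(fnmatchcase(token, term) for token in whitespace_tokens(text))


# Hypothetical event text, not a real SAM event.
event = "truckId=74, routeName=Saint Louis"

print(whitespace_tokens(event))              # tokens keep the trailing comma
print(term_matches("truckId=74", event))     # exact term, no comma
print(term_matches("truckId=74,", event))    # exact term including the comma
print(term_matches("truckId=74*", event))    # wildcard inside one token
print(term_matches("truckId*Louis", event))  # wildcard cannot span tokens
```

In this model the first lookup fails because the indexed token is `truckId=74,` rather than `truckId=74`, which is consistent with the results reported above.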
To summarize, if we want to let users enjoy the full features of Solr, we shouldn't touch the search string. If we just want to restrict the query string to simple search, we could wrap the search string with double quotes and `*`.
@arunmahadevan Which one do we want to provide to users?
So finding an event id within root ids (and parent ids) requires wrapping it with `*`, because the id is followed by a `,` in the JSON representation. Solr treats those fields as plain text, whereas they are actually text in JSON format. We should be aware of that and handle it accordingly.
> Assuming you're ingesting truck events: please search "KC Via", "KC Via Columbia", "KC Via Hanibal" which are working for me.
I think the above would work only if these are separated by whitespace. E.g. when querying for "KC Via", driverName = KC Via would match but driverName=KC Via would not.
I don't think we can expect users to figure out the Solr syntax and how Solr tokenizes the log.
IMO, for the search by key:
- We enable wildcard search by default, i.e. if the user enters `foo` we internally query for `*foo*`, `foo bar` -> `*foo bar*`, and so on.
- Escape the special chars. From the Solr docs these are `+ - && || ! ( ) { } [ ] ^ " ~ * ? : /`. These need to be prefixed with `\`.
- Provide some setting or checkbox to enable "Use solr syntax". In this case we disable the wildcard and use the literal query. Here we also don't escape any characters and expect the user to do so.
For the search by ID:
- Automatically add a wildcard for the event id as well. Right now we show results for partial ids for parent ids and root ids but not for the event id, which is not consistent.
- If "Use solr syntax" is turned on, we don't do anything.
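The two proposals above could be sketched roughly as follows. This is a minimal Python illustration, not the Streamline implementation: the function names, the `use_solr_syntax` flag and the field names (`eventId`, `rootIds`) are hypothetical. It also escapes the backslash itself, which is an assumption beyond the character list quoted above, and it leaves whitespace untouched, following the `foo bar` -> `*foo bar*` example.

```python
# Characters from the list quoted above; '&' and '|' are escaped individually,
# and the backslash itself is added as an assumption.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:/\\')


def escape_special_chars(text):
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


def build_key_query(user_input, use_solr_syntax=False):
    """Search by key: wrap the escaped input in wildcards unless raw syntax is on."""
    if use_solr_syntax:
        return user_input  # passed through untouched; the user escapes as needed
    return "*" + escape_special_chars(user_input.strip()) + "*"


def build_id_query(field, id_input, use_solr_syntax=False):
    """Search by id: same wildcard wrapping, scoped to one (hypothetical) field."""
    if use_solr_syntax:
        return id_input
    return field + ":*" + escape_special_chars(id_input.strip()) + "*"


print(build_key_query("foo"))
print(build_key_query("foo bar"))
print(build_key_query("a:b"))
print(build_key_query('("KC Via " AND "Hanibal")', use_solr_syntax=True))
print(build_id_query("eventId", "1f3a"))
```

Treating event id, parent ids and root ids through the same `build_id_query` path is what would make partial-id search behave consistently across the three. How Solr parses a wildcard term that contains a space is not addressed by this sketch.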