OrchardCore Elasticsearch feature :sparkles::boom: (Lombiq Technologies: OSOE-83)

Fixes #4316

How to use:

Install Elasticsearch 7.x with Docker compose

Elasticsearch v7.17.5 Docker Compose file :

Copy this file in a folder named Docker somewhere safe.
Rename this file extension as .yml instead of .txt.
Open up a Terminal or Command Shell in this folder.
Execute docker-compose up to deploy Elasticsearch containers.

Advice: don't remove this file from its folder if you want to remove all their containers at once later on in Docker desktop.

You should get this result in Docker Desktop app :

Set up Elasticsearch in Orchard Core

Add Elastic Connection in the shell configuration (OrchardCore.Cms.Web appsettings.json file)

"OrchardCore_Elastic": {
    "ConnectionType": "SingleNodeConnectionPool",
    "Url": "http://localhost",
    "Ports": "9200"
}

Start an Orchard Core instance with VS Code debugger
Go to Orchard Core features, Enable Elasticsearch.

Implementation details

Analyzed and Stored Properties are not very meaningful in context of Elasticsearch.

Analyzed

Analyzed is default for strings in Elastic Search. By default all string fields are stored twice in elastic as "analyzed" and stored as "text" field type of elastic and again stored as is as in a "keyword" field type of elastic.

So we will have a field called ContentItemId(Text) analyzed and another called ContentItemId.Keyword(as is as) to match on exact values using TermQuery for fields like ContentItemId or emails (Elastic Stores text fields in 2 fields analyzed vs not analyzed, a field ContentItemId.Keyword is created automatically)

Elasticsearch documentation: https://www.elastic.co/blog/strings-are-dead-long-live-strings https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-index-search-time.html

Stored

Stored is really an overhead and only required if we are processing thousands of large documents. By default Elastic will store the entire document into a field called _source and retrieves them when asked them from Index itself.

Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.12/search-fields.html

DSL Query Syntax

It is suggested to always use MatchQuery instead TermQuery for text fields in Elastic, where fully confident use (.Keyword) fields for exact match with TermQuery. (e.g. matching id, or fields like email, phone number, hostname)

Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html

NEST CheatSheet:

https://github.com/mjebrahimi/Elasticsearch-NEST-CheatSheet-Tutorials/blob/master/README.md

TODO

[x] Refactor Search form to support both Lucene and Elastic. :construction:
[x] Restructure project names using OrchardCore.Search.X :recycle: :truck:
[x] Add Advance Support for multilingual search. (Equivalent of Lucene Analyzers) :sparkles:
[x] Refactor UI to hide Analyzed as Indexing Options for Fields as it is now the default for Lucene too. :recycle: :lipstick:
[x] Refactor ContentIndexSettings as IContentIndexSettings : See https://github.com/OrchardCMS/OrchardCore/pull/10515 (Will not do)
[x] Move ContentIndexSettings drivers to the OrchardCore.Indexing module so that they are common to every Search provider.
[x] Replace Query type with Nest.IQuery in the Elastic Search module. We should not parse the Queries with the OC Lucene Parser. Nest.IQuery is on par with Elasticsearch Query DSL.
[x] Recipe backward compatibility refactor. (migration) "not necessary anymore"
[x] UserName/Password configurations
[x] Fix ElasticContentPickerResultProvider by adding a method in the ElasticIndexManager that allows to delegate the _elasticClient and thus allowing using fluent Queries.
[x] Add Lucene "keyword" implementation.
[x] Fix Elasticsearch document Id mapping (was not able to remove older document by Id when publishing a new ContentItem).
[x] Add Advance Elastic Configuration. (Cluster connection strings) - Optional
[x] Configure error handling. See : https://github.com/elastic/elasticsearch-net/issues/1793
[x] Change how we index custom fields like "Normalized, Sanitized, Inherited" by using an underscore instead of a dot as a separator else Elasticsearch fails to bind the data because of already used Mapping.
[x] Add an index setting for allowing to store "_source" on the index for Elasticsearch. Defaults to true.
[x] Using Elasticsearch deserializer to parse JSON queries
[x] Make Analyzers configuration work on each index.
[x] Background Task. Indexing state (Last Task Id) persisted in each Elasticsearch index.
[x] Secure ElasticIndexManager methods to operate only on current tenant indices.
[x] Module documentation. :memo:

Migration

Manual migration to apply to get back Lucene Indices Settings and Queries.

  UPDATE Document SET Content = REPLACE(content, '"$type":"OrchardCore.Lucene.LuceneQuery, OrchardCore.Lucene"', '"$type":"OrchardCore.Search.Lucene.LuceneQuery, OrchardCore.Search.Lucene"')
  WHERE  [Type] = 'OrchardCore.Queries.Services.QueriesDocument, OrchardCore.Queries'

  UPDATE Document SET [Type] = 'OrchardCore.Search.Lucene.Model.LuceneIndexSettingsDocument, OrchardCore.Search.Lucene'
  WHERE [Type] = 'OrchardCore.Lucene.Model.LuceneIndexSettingsDocument, OrchardCore.Lucene'

Breaking Changes

IndexingConstants changes :

DisplayTextAnalyzedKey = "Content.ContentItem.DisplayText_Analyzed" instead of "Content.ContentItem.DisplayText.Analyzed"

DisplayTextNormalizedKey = "Content.ContentItem.DisplayText_Normalized" instead of "Content.ContentItem.DisplayText.Normalized"

ContentType will now be searchable with a Term query by using "Content.ContentItem.ContentType.keyword"

Taxonomies module indexing

".Inherited" = "_Inherited"

Queries migration

Elasticsearch maps automatically the data which means that Text fields will always be Analyzed. You can now access the "Stored" value of that Text field by using ".keyword" as a suffix to your field name. This means that you can now use a TermQuery on that ".keyword" field and a MatchQuery on the basic field name.

Notes

There is a new client for Elasticsearch 8.x. Nest has been renamed to Elastic.Clients.Elasticsearch.

Also, OpenSearch will not work with this PR.

There is an initiative with moving Elastic.Clients.Elasticsearch to OpenSearch here :

https://github.com/opensearch-project/opensearch-net

This means that eventually OpenSearch and Elasticsearch will need to become 2 distinct features.

So to summarize compatibility :

Nest = OpenSearch 1.x, Elasticsearch 7.x, Elasticsearch 8.x (in compatibility mode) https://github.com/elastic/elasticsearch-net/tree/7.17

Elastic.Client.Elasticsearch (replacing Nest) = Elasticsearch 8.x https://github.com/elastic/elasticsearch-net https://www.nuget.org/packages/Elastic.Clients.Elasticsearch/

OpenSearch.Client (Nuget package not released yet) = OpenSearch 2.x https://github.com/opensearch-project/opensearch-build/issues/2051

Jan 22 '22 00:01 Skrypt

Squashed commits.

Mar 30 '22 04:03 Skrypt

Any news on this by chance?

May 31 '22 18:05 Piedone

Ping me on Teams or else when you get time.

May 31 '22 19:05 Skrypt

Is there any problem with this PR?

Jun 18 '22 16:06 hyzx86

I'm about to start working on it. I'm still on vacation ...

Jun 18 '22 17:06 Skrypt

found some problem:

[x] Document total never greate than 10, https://github.com/OrchardCMS/OrchardCore/pull/11999
[x] Need check ES Connection before oprate es index . Now， When create elastic index before es service start , the operation will success , but it not be work and can't delete or update

Jul 12 '22 11:07 hyzx86

@piedone I'm starting working on this PR today. @hyzx86 Not sure what you mean about this :

Need check ES Connection before oprate es index . Now， When create elastic index before es service start , the operation will success , but it not be work and can't delete or update

Jul 12 '22 14:07 Skrypt

@Skrypt

I stop my es service
create a es index
After a long time, the operation will succeed

But in fact, it is a wrong check For the following reasons , So I don't think we should just return a Boolean value here. Maybe we can throw an exception or directly return ExistsResponse

Jul 12 '22 14:07 hyzx86

Should we check the link status before each operation of ES?

Or throw an exception every time the API link fails

Jul 12 '22 14:07 hyzx86

We need to prevent allowing to create indexes if we don't have a successfully working ES connection. This is more a UI thing in that case. We need to display a notification that there is no ES connection working when going to the page where we can add indexes. Also, we need to throw an exception as a safety net on these methods.

Jul 12 '22 14:07 Skrypt

Search feature modules changes explanations with sound :

https://user-images.githubusercontent.com/3228637/180492337-011bb4f4-f433-4b6b-9d16-cc1c5bcd2c67.mp4

Jul 22 '22 17:07 Skrypt

Awesome! Would it be possible to merge the Lucene and Elastic UX, to make Elastic a drop-in replacement for the search backend? Like how you manage Media files the same regardless they're stored locally or in Azure Blob Storage.

Jul 22 '22 20:07 Piedone

They are complementary. Both can be used at the same time. But of course, you could use only Elastic Search if that's what you want instead of Lucene.

Right now, there are discrepancies between both implementations. You can't use a Lucene Query and copy/paste it as an Elastic Query because everything is "analyzed" with Elastic Search. So a Term Query becomes a Match Query. But, to remove this discrepancy we would need to make the Lucene implementation work like Elastic Search which would take more time.

I will take some more time to make sure we are doing the proper thing here by setting every document as "Analyzed" with Elastic Search. But the thing is that with Elastic Search everything is also always stored as I documented in the first post.

So, right now, we can't just switch from Lucene to Elastic Search seamlessly unless we would keep the same ContentIndexSettings for both implementations. I removed the "Stored" option for Elastic because it was unnecessary.

Also, another thing to take into consideration. We parse Lucene Queries with our own QueryParser implementation which follows Lucene 4.0 standards and Nest uses Lucene 7.x. not to mention that Lucene 8.x will use a newer implementation of Nest which is getting renamed to Elastic.Clients.Elasticsearch https://github.com/elastic/elasticsearch-net

Lots of moving parts with Elastic compared with Lucene. Maybe it is better to keep Lucene and Elastic implementations unique for now.

Jul 22 '22 22:07 Skrypt

From this, it seems to me that if these two are independent features, we'll have two increasingly diverging Lucene-based search implementations. Why I see this as an issue because while Elastic offers a lot more than vanilla Lucene, I'd also consider it a "production" version of search, since otherwise, you need to employ workarounds in a production app unless you have a single instance and that's all to your environment (like with staged publishing, blue-green deployment, multi-node hosting). Contrast this with the Lucene module which just works without any external dependencies and is thus perfect for local development and testing. Again, like local storage-Azure Blob Storage: You can use both anywhere in principle, but most possibly you want to use local storage while you write the code, and simply switch to Blob Storage in production.

I don't have any insights into how, when, or even whether it makes sense to tackle this, but wanted to share this viewpoint.

Jul 22 '22 22:07 Piedone

I believe that the Lucene implementation, since we created our own Query Parsers, will always be unique.

Also, you can't assume that a Query that works in Lucene.NET (4.0) will still work with Elastic Search 7.x (which uses Lucene 7.x). So while we could try to make an effort to make it work, there will always be discrepancies. If Lucene.NET was on par with Lucene (java) version then it could work but it is not the case. Right now, our Query Parsers try to follow the Elastic Search documentation examples but we did not implement every Query type yet. So going back and forth from Elastic to Lucene will cause issues too. Because of missing Query implementations in Lucene.NET 4.0 or deprecated Query implementations in Elastic.

I see it more like that: if you want to rely on Elastic Search then use a Docker container locally to develop your website if that's what you need to use eventually on production. Else, if you built Lucene Queries and want to move them to Elastic Search you will have to adapt these Queries. Most of them will be fundamentally quite the same.

I see the Lucene module as more like the basic implementation of a search provider that cost nothing more to host. That's the lightweight version of a "Search Engine" that covers most basic needs and that most people will use as the default one. It could become more eventually too by allowing us to scale it as a service like Elastic Search is, but here and now, we can't compare it with the "product" and engineering effort that is Elastic Search.

Here the difference with the Storage modules is that they don't have to deal with Query discrepancies. :smile:

But, I will look at the "Stored" option and maybe insert the concept back in ElasticSearch, even if it doesn't make sense; if that's what we want.

Jul 22 '22 23:07 Skrypt

In any case, I'd only aim for a common subset of features, aimed at supporting the most common scenarios, between the two implementations. So, I'm not arguing for complete feature parity, Elastic will always offer more, but have e.g. what's in our docs as the lowest common denominator (even if that means introducing breaking changes to the Lucene module).

Perhaps it would be feasible to have something like a "compatibility mode" per Query in the Elastic module that uses our query parser, and which is thus compatible with Lucene module queries? And "native mode", where you just use what's built into Elastic directly.

Jul 23 '22 09:07 Piedone

Our current Query Parsers should work with ElasticSearch. Right now the issue is more about the fact that we index differently in Lucene than in Elastic. We don't have an option to "Store" only in Elastic because it would be redundant.

So basically, when we store only in Lucene then we need to use a "TermQuery". So, in Elastic this "TermQuery" becomes a "MatchQuery" because the same indexed field is now "Analyzed". To be able to do a "TermQuery" in Elastic you need to add the "_source" option to your Query. But that would need to be tested to make sure everything works. I believe that this is where we are at now.

https://www.elastic.co/guide/en/elasticsearch/reference/7.12/search-fields.html

Jul 23 '22 16:07 Skrypt

If this is the only difference than it seem to me this can easily be handled, if with nothing else, then changing the query type (i.e. the query type string) before parsing for Elastic?

Jul 24 '22 17:07 Piedone

Ok, I think I understand why we added the "Stored" option for Lucene. Normally, you should never need to "Store" a ContentField. So, the "Stored" option has been added so that we can make any ContentField a technical value like a "GUID". You don't want to "Analyze" a "GUID" because you want to be able to use a "Term Query" with it simply. Here, analyzing a GUID doesn't make sense. The option has been added so that we can have "text enumerations" and filter by them using the TermQuery.

Here, you can read the documentation of the TermQuery that specifies that you should avoid it on "text" fields because they are all "Analyzed".

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html

By default, Elasticsearch changes the values of text fields as part of [analysis] (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html).

But, the difference here with Elasticsearch is that they don't allow you to "Store" a ContentField. It is "by design" that for example an Integer or a Float is "Stored" only. So, Elasticsearch decides if the ContentField will be "Analyzed" or "Stored" based on the CLR type passed.

Here, the issue is that we allowed overriding the ContentField natural indexing by "CLR types" by adding the "Stored" option on the ContentIndexSettings for Lucene. So, generally, Elastic and Lucene should behave the same when indexing per CLR type and when a ContentField is not overridden with the "Stored" option.

Now, we can't really remove the "Stored" option that we added for Lucene because people are using this option and it has been useful for some use cases. We need to replace it. What we would need to do is to do just like ElasticSearch does and to always "Store" the original value in the same index or a second one (need to understand their design first). Make sure that the "_source" option on the Lucene Query Parser work (implement it). Also, provide documentation on the breaking change if needed. So, basically, we would need to refactor the Lucene implementation to mimic what Elasticsearch does. That's would be the best approach.

But for now, yes, we can document that the "Stored" option is not available with Elasticsearch and that if someone used it on a TextField then he will need to migrate his Queries for Elasticsearch by using the MatchQuery instead. Also, if he used a TextField as a technical value then he might be better off using the "_source" option on the Query.

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html

So, yes, we could eventually refactor Lucene so that we can be able to move from one to the other more easily.

Jul 24 '22 19:07 Skrypt

And here another one about "keyword" data type in ElasticSearch.

https://codecurated.com/blog/elasticsearch-text-vs-keyword/

So basically that would be a second option to take into consideration instead of always storing the original value as ElasticSearch does. In fact, always storing the original value is an option in ElasticSearch.

Jul 24 '22 20:07 Skrypt

Here is what Drupal does :

https://www.drupal.org/node/655724

Drupal	Orchard Core Lucene	ElasticSearch
UnStored	Analyzed	Analyzed naturally by CLR Type passed
Keyword	Stored	https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html
UnIndexed	Not indexed	Not indexed
Text	Analyzed as a TextField and not a StringField, Stored as _source ??? Our Text indexation uses a TextField .	Analyzed but stored in _source
Binary	Stored as _binary ??? Not implemented	https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html

Jul 24 '22 20:07 Skrypt

So perhaps venture into refactoring Lucene once this PR is merged?

Jul 24 '22 23:07 Piedone

Elasticsearch allows manual mapping of CLR types :

https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/fluent-mapping.html#auto-map-with-overrides

Jul 25 '22 03:07 Skrypt

Ok, more on all of this. Right now with ElasticSearch we do not use explicit mapping. But here is a detail I did not know of: Since ElasticSearch 5.0 the string type was replaced by text and keyword types. Since then when you do not specify explicit mapping, for a simple document with string:

{
  "some_field": "string value"
}

below dynamic mapping will be created:

{
  "some_field": {
    "type" "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

As a consequence, it will both be possible to perform full-text search on some_field, and keyword search and aggregations using the some_field.keyword field.

So, that means that if we want to be able to switch from Lucene to ElasticSearch seamlessly we would just need to remove the "Stored" option on the ContentIndexSettings and always "Store" the values in a nested ContentField.keyword field.

So as a reaction to this information I remove the IContentIndexSettings from the PR implementation and moved the drivers to the OrchardCore.Indexing module so that they be common to every SearchProviders.

Now, I believe that we can enforce type mapping in ElasticSearch but that can be done also directly on the server if needed. And we do not Store every field in Lucene because we wanted to keep the indexes smaller but I think that would be a better thing to have a nested "keyword" field with a maximum of 256 chars like ES does because it is its default behavior.

Jul 25 '22 18:07 Skrypt

I've just seen your edits to your last comment. Sounds awesome!

Jul 27 '22 14:07 Piedone

Just tested this with the ES cloud service with a CloudId and it works.

Jul 27 '22 18:07 Skrypt

Fantastic!

Jul 27 '22 18:07 Piedone

Also noticed from testing with a database that has around 30k content items that Elasticsearch is really fast at rebuilding the indices. Way faster than Lucene; maybe there is still room for improvement with the Lucene implementation.

Aug 01 '22 15:08 Skrypt

I've been thinking about ways to make the ISearchProviders switchable seemlessly. Right now, the Lucene and Elastic Queries are bound to an Index at all times. We could have a method that calls a "Named Query" with an index name and it could work if the indices would be named the same in Elastic and Lucene but else this is a really "loose" coupling that can lead also to more confusion because index names are set by users.

That's why for now I kept it as 2 different types of Queries : "Lucene" and "Elastic". Also, allowing to have both Search providers to be enabled at the same time allows migrating progressively Queries from one provider to another.

However, the Search module provides a configuration to select a search provider, so this way it can be switched seamlessly. I think that it could also be set by default to a prioritized search provider by configuration if needed per module.

Aug 01 '22 16:08 Skrypt

Could you please create a new issue for us to track the migration story, and thus address necessary improvements later?

Aug 01 '22 19:08 Piedone