Orleans Indexing and Lucene.Net

Open KSemenenko opened this issue 4 years ago • 11 comments

Anyway, I've been waiting for this functionality for years, and I've thought a lot about it. After studying the original documents and the code, I came to the conclusion that it would be better to use #3 Lucene.Net for indexing. Elasticsearch, for example, is built on Lucene. With grains, index clustering is easy to do. On GitHub there is code that supports LINQ queries, and code for storing index files in Azure Storage. I think it is possible to use a special grain that keeps track of a certain type of grain and indexes the necessary fields. I could also use a service, and thus have indexes on each silo.

So as soon as I had time, I built a prototype.

What do you think about Lucene.Net? @ReubenBond @sergeybykov @philbe If you like this idea I can go ahead and make this code stable.

KSemenenko avatar Sep 05 '21 20:09 KSemenenko

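// Requires: using System.Threading; (for Interlocked)
public async Task GrainTest()
{
    // The grain is instantiated directly instead of going through a silo,
    // which keeps this prototype test simple.
    var grain = new IndexGrain();

    await grain.OnActivateAsync();

    // count is incremented from two concurrent tasks, so it is updated atomically.
    int count = 0;
    int foundCount = 0;

    await Task.WhenAll(Task.Run(async () =>
    {
        // Writer: index 150 documents, one per value of i.
        for (int i = 0; i < 150; i++)
        {
            var doc = new GrainDocument(i.ToString());
            doc.LuceneDocument.Add(new StringField("property", $"i={i}", Field.Store.YES));
            await grain.WriteIndex(doc);
            Interlocked.Increment(ref count);
        }
    }),
    Task.Run(async () =>
    {
        // Reader: give the writer a head start, then query 300 values.
        // Only the 150 written values can match, provided the writer
        // has finished by the time each one is queried.
        await Task.Delay(1000);
        for (int i = 0; i < 300; i++)
        {
            var doc = await grain.QueryByField("property", $"i={i}");
            Interlocked.Increment(ref count);

            if (doc.TotalHits > 0)
            {
                foundCount += 1;
            }
        }
    }));

    await grain.OnDeactivateAsync();

    count.Should().Be(450);
    foundCount.Should().Be(150);
}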

In this test, of course, I create the index fields directly with Lucene.Net, which is not convenient. For all of this you should write wrapper methods, and add LINQ for queries. Then we'll have something even better than ElasticSearch.

KSemenenko avatar Sep 05 '21 20:09 KSemenenko

I tried to implement Lucene full-text search based on Orleans and cloud storage providers and I kind of failed. The problems I faced were around performance:

  1. You need a central storage for your Lucene indexes. You can implement the index Directory on top of Azure Blob Storage or similar, but it is relatively slow. In my experience it was much faster to periodically back up the snapshots by putting them in an archive and sending it over. In combination with a remote disk that works as a backup, writes are not safe.
  2. Lucene is not built for committing each document individually. If you want high performance, you need to make the changes in batches.
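
The batching idea in point 2 can be sketched roughly as follows. This is only an illustration, not the Squidex implementation: `BatchingIndexWriter` is a hypothetical helper, and the commit callback stands in for whatever actually writes to the index (with Lucene.Net that would typically be several `IndexWriter.AddDocument` calls followed by a single `Commit`).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Lucene.Net or Orleans): buffers documents
// and hands them to a commit callback in batches instead of one by one.
public sealed class BatchingIndexWriter<TDoc>
{
    private readonly List<TDoc> _pending = new List<TDoc>();
    private readonly int _batchSize;
    private readonly Action<IReadOnlyList<TDoc>> _commitBatch;

    public BatchingIndexWriter(int batchSize, Action<IReadOnlyList<TDoc>> commitBatch)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        _batchSize = batchSize;
        _commitBatch = commitBatch ?? throw new ArgumentNullException(nameof(commitBatch));
    }

    // Number of documents that are buffered but not yet committed.
    public int PendingCount => _pending.Count;

    // Buffers the document and commits once the batch is full.
    public void Add(TDoc doc)
    {
        _pending.Add(doc);
        if (_pending.Count >= _batchSize) Flush();
    }

    // Commits whatever is buffered; meant to be called on a timer
    // or when the grain deactivates.
    public void Flush()
    {
        if (_pending.Count == 0) return;
        _commitBatch(_pending.ToArray());
        // Cleared only after a successful commit, so a failed batch can be retried.
        _pending.Clear();
    }
}
```

Anything still sitting in the buffer is lost if the process dies before `Flush` runs, which is the same durability trade-off described above.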

Achieving high performance and stability is very challenging, especially because Orleans applications are deployed much more often than a database. If you can achieve that, it would be great, but I lost data from time to time and therefore decided to go with Elastic or a database full-text system.

SebastianStehle avatar Sep 05 '21 21:09 SebastianStehle

@SebastianStehle Can I ask you to share the details of your implementation? Did you store/load the data in memory on activation and deactivation of the grain? Did you have only one index, or did you use MultiIndex?

KSemenenko avatar Sep 06 '21 15:09 KSemenenko

Hi, my implementation has been removed by now, but it is open source: https://github.com/Squidex/squidex/tree/8e088beb1c91626d1f67ec8a09f2b80740639054/backend/src/Squidex.Domain.Apps.Entities/Contents/Text/Lucene

  1. I had multiple indexes, one grain per index.
  2. The index was loaded from a central store like S3 to a local folder on activation.

I think the most important class is this one: https://github.com/Squidex/squidex/blob/8e088beb1c91626d1f67ec8a09f2b80740639054/backend/src/Squidex.Domain.Apps.Entities/Contents/Text/Lucene/IndexManager.cs

It manages the indexes in case a grain gets deactivated and the index is not committed.

SebastianStehle avatar Sep 06 '21 16:09 SebastianStehle

@KSemenenko Conceptually, I don't see a problem with the idea. My intuition is more aligned with @SebastianStehle though. For a limited scale and load, holding and updating indices in memory will probably work. But in a production setting I'd be nervous about the lack of separation of concerns and about sharing memory/CPU resources with Lucene.Net in the same process. For production use, I'd look at offloading indexing to something like Elastic, or at least hosting the Lucene.Net indexing code in a separate process.

Disclaimer: I've never used Lucene.Net. My thoughts here are purely intuitive speculation, FWIW.

sergeybykov avatar Sep 11 '21 20:09 sergeybykov

That's an interesting thought, @sergeybykov. Maybe we need some kind of abstraction then, like the one you did for storing state.

One interface for writing data, and one that exposes IQueryable for making queries. Then do a basic in-memory implementation, for example backed by a List, and after that implement the interfaces for Redis, Cosmos DB and other databases?
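
A minimal sketch of what such an abstraction could look like, assuming nothing about the eventual Orleans.Indexing API: the names `IIndexWriter<T>`, `IIndexReader<T>` and `InMemoryIndex<T>` are made up for illustration, and the in-memory version uses a Dictionary keyed by grain id rather than a plain List so that a rewrite replaces the old entry.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical write side of the abstraction.
public interface IIndexWriter<T>
{
    Task WriteAsync(string grainId, T entry);
    Task RemoveAsync(string grainId);
}

// Hypothetical read side: exposes IQueryable so callers can use LINQ.
public interface IIndexReader<T>
{
    IQueryable<T> Query();
}

// Basic in-memory implementation; a Redis or Cosmos DB provider
// would implement the same two interfaces.
public sealed class InMemoryIndex<T> : IIndexWriter<T>, IIndexReader<T>
{
    private readonly object _gate = new object();
    private readonly Dictionary<string, T> _entries = new Dictionary<string, T>();

    public Task WriteAsync(string grainId, T entry)
    {
        lock (_gate) { _entries[grainId] = entry; }
        return Task.CompletedTask;
    }

    public Task RemoveAsync(string grainId)
    {
        lock (_gate) { _entries.Remove(grainId); }
        return Task.CompletedTask;
    }

    public IQueryable<T> Query()
    {
        // Query a snapshot so that enumeration is not affected by later writes.
        lock (_gate) { return _entries.Values.ToList().AsQueryable(); }
    }
}
```

A caller would then write something like `index.Query().Where(x => ...)` without knowing which provider sits behind the interface.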

KSemenenko avatar Sep 11 '21 20:09 KSemenenko

Although, for example, I keep my silos in a Kubernetes cluster and I have no problem adding a couple of virtual machines. Right now I use Cosmos DB to store the index. I don't really like this solution, and I would still like to build a solution with proper indexes.

KSemenenko avatar Sep 11 '21 20:09 KSemenenko

Yes, an interface with pluggable implementations would be the way to go.

sergeybykov avatar Sep 11 '21 21:09 sergeybykov

When you talk about indexes, you basically have two options:

  1. Do everything in memory and use things like Dictionary or SortedDictionary in C#.
  2. Try to find a solution that also works well when the majority of the data is still on disk, e.g. B+ trees or inverted indexes.

Lucene and databases use the second approach because the goal is to work with large data sets.
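
To make the first option concrete, here is a rough sketch of a purely in-memory index. The class `InMemoryFieldIndex` and the indexed fields (a city and an age) are hypothetical; a Dictionary serves exact-match lookups and a SortedDictionary keeps keys ordered for range lookups.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory index over two made-up fields.
public sealed class InMemoryFieldIndex
{
    // Exact-match index: city -> grain ids.
    private readonly Dictionary<string, HashSet<string>> _byCity =
        new Dictionary<string, HashSet<string>>();

    // Ordered index: age -> grain ids, enumerated in ascending key order.
    private readonly SortedDictionary<int, HashSet<string>> _byAge =
        new SortedDictionary<int, HashSet<string>>();

    public void Add(string grainId, string city, int age)
    {
        if (!_byCity.TryGetValue(city, out var cityIds))
            _byCity[city] = cityIds = new HashSet<string>();
        cityIds.Add(grainId);

        if (!_byAge.TryGetValue(age, out var ageIds))
            _byAge[age] = ageIds = new HashSet<string>();
        ageIds.Add(grainId);
    }

    public List<string> FindByCity(string city) =>
        _byCity.TryGetValue(city, out var ids) ? ids.ToList() : new List<string>();

    // Walks the keys in order and stops after max. This still scans from the
    // smallest key, so it is a simplification, not a true range seek.
    public List<string> FindByAgeRange(int min, int max) =>
        _byAge.SkipWhile(p => p.Key < min)
              .TakeWhile(p => p.Key <= max)
              .SelectMany(p => p.Value)
              .ToList();
}
```

Everything here lives in process memory, so it is limited by the size of the silo and has to be rebuilt or reloaded after a restart.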

I thought the goal of this project is to work on the Key-Value stores and follow the first approach. If we use the database for queries, why do we need Orleans Indexing at all? It would be far easier and more efficient to use stored states directly, perhaps with a mapping function for indexes? https://github.com/sebastienros/yessql/wiki/Tutorial#creating-mapped-index

A separate index has the big problem that it can get out of sync with the original data, especially if you do not use transactions.

SebastianStehle avatar Sep 12 '21 14:09 SebastianStehle

To give you an example, I have thousands of users who all have a geo position. All communication with users goes through their grain, because it is the only source of up-to-date data. I have the user's location in the database, but that is just storage between activations of the grain. So I want to find all the users in an area and get their grain IDs. Right now I have a table in Cosmos DB in which I store the geo position and the grain ID, and every time the geo position changes, I have to update the table in the database.

I see indexing as a convenient abstraction over storage/database, and as a fairly powerful search system. Yesterday I thought it would be cool to have the grain itself take care of the index updates. For example, we could write a post handler that writes the variables marked with an attribute to the index when the grain method has finished.

We could generate something like INotifyPropertyChanged and watch for changes to the variables. Or something like that.

Well, in general, it's as abstract as the state of a grain, but only for indexing.
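
As a rough illustration of the attribute idea, the sketch below collects the values of marked properties via reflection and reports which of them changed between two snapshots. Everything in it is hypothetical (`IndexedAttribute`, `IndexedPropertyReader`, `UserState`); in Orleans, the "post handler" could plausibly be an incoming grain call filter that takes a snapshot before and after the call, but that wiring is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical marker for properties that should be written to the index.
[AttributeUsage(AttributeTargets.Property)]
public sealed class IndexedAttribute : Attribute { }

public static class IndexedPropertyReader
{
    // Takes a snapshot of all readable public properties marked with [Indexed].
    public static Dictionary<string, object> Read(object state)
    {
        if (state == null) throw new ArgumentNullException(nameof(state));
        return state.GetType()
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => p.CanRead
                        && p.GetIndexParameters().Length == 0
                        && p.GetCustomAttribute<IndexedAttribute>() != null)
            .ToDictionary(p => p.Name, p => p.GetValue(state));
    }

    // Returns only the entries whose value differs from the earlier snapshot.
    public static Dictionary<string, object> Diff(
        Dictionary<string, object> before, Dictionary<string, object> after) =>
        after.Where(kv => !before.TryGetValue(kv.Key, out var old) || !Equals(old, kv.Value))
             .ToDictionary(kv => kv.Key, kv => kv.Value);
}

// Hypothetical grain state for the geo position example.
public sealed class UserState
{
    [Indexed] public double Latitude { get; set; }
    [Indexed] public double Longitude { get; set; }
    public string DisplayName { get; set; }
}
```

A handler would call `Read` before and after the grain method, and send the result of `Diff` to the index only when it is non-empty.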

KSemenenko avatar Sep 12 '21 14:09 KSemenenko

So you are talking about abstracting this into a custom grain. Then I am on your side ;)

SebastianStehle avatar Sep 13 '21 07:09 SebastianStehle