
DiskIndexes are not discovered by CQEngine on startup

Open codingchili opened this issue 4 years ago • 6 comments

Hello,

I've been having issues with the disk persistence and might have figured out some things.

a) SQLite attribute indexes are not loaded by CQEngine on startup - this means that CQEngine will not index these attributes for new objects that are added to the collection/main table until addIndex is called.

Does this make sense? Is there any way to have CQEngine discover these indexes? Probably not, right? It would need the specific Attribute implementation.

Steps for reproducing:

  1. Create a new DB, add some items, add some indexes.
  2. Items that exist before the index is created will be indexed.
  3. Stop the application.
  4. Start the application, add an item and query on any of the non-primary attributes and get no hits. Alternatively, just inspect the SQLite database and notice that index entries are missing for the new object.
  5. Use addIndex to add indexes as in step 1, add another object, and now it's indexed properly. However, the object from step 4 is still not indexed.

Would using re-indexing on init help? I'd guess the object will be properly indexed then, but only after step 5? (this would work for me since the index would be recreated before the query is executed.)

I'm experiencing this issue since I'm not adding all the indexes up front, I'm adding indexes whenever a query is made on an attribute that is not yet indexed. Default action of re-init was changed somewhere in 2.10, which might be why I'm having more issues with this.

b) I was also having issues when retrieving an object by querying on the primary key and then removing/updating that object. The object was found but the new object key didn't match for some reason and the object wasn't removed, which kind of broke my app hehe.

In a quick test I changed from HashMap to LinkedHashMap and that seems to work; I guess Kryo serialization/deserialization is not deterministic for HashMap?
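One plausible mechanism behind the HashMap observation, sketched with the standard library only (this is not a confirmed diagnosis, and no Kryo code is involved): two HashMaps that are equal can still iterate in different orders, so any serializer that writes entries in iteration order can produce different bytes for them. The `encode` helper below is a hypothetical stand-in for such a serializer, and the ordering shown relies on how OpenJDK chains colliding keys.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class MapOrderDemo {
    // Hypothetical stand-in for a serializer that writes entries in iteration order.
    static String encode(Map<String, Integer> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : map.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        return sb.toString();
    }

    // Inserts the keys in the given order and returns the same map.
    static Map<String, Integer> fill(Map<String, Integer> map, String... keys) {
        for (String key : keys) {
            map.put(key, key.length());
        }
        return map;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" have the same hashCode, so they share one bucket and
        // are chained in insertion order.
        Map<String, Integer> a = fill(new HashMap<>(), "Aa", "BB");
        Map<String, Integer> b = fill(new HashMap<>(), "BB", "Aa");
        System.out.println("maps equal:      " + a.equals(b));
        System.out.println("encodings equal: " + encode(a).equals(encode(b)));

        // A LinkedHashMap copy keeps the order it was filled in.
        Map<String, Integer> copy = new LinkedHashMap<>(a);
        System.out.println("copy keeps order: " + encode(copy).equals(encode(a)));
    }
}
```

If the serialized form of an object is used as (part of) its identity, a difference like this would be enough for a remove or update to miss the stored object.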

Using CQEngine 3.4.0.

Best Regards

codingchili avatar Aug 05 '19 13:08 codingchili

Can you share your code?

muhdkhokhar avatar Aug 08 '19 19:08 muhdkhokhar

This is how the db/indexes are set up:

https://github.com/codingchili/chili-core/blob/master/core/main/java/com/codingchili/core/storage/IndexedMapPersisted.java

addIndex will be invoked dynamically throughout the application's lifetime.

This is the implementation of add/update/remove:

https://github.com/codingchili/chili-core/blob/master/core/main/java/com/codingchili/core/storage/IndexedMap.java

I had a test case for the first scenario; I'll try to clean it up and add it.

codingchili avatar Aug 08 '19 20:08 codingchili

This is the expected behaviour currently (and in summary @codingchili - your analysis is correct)...

If you have some disk indexes configured prior to application shutdown (which you will have added via the collection.addIndex() method), then currently CQEngine expects you to call the addIndex() method in the same way (one time for each index) at application startup.

At application startup, that will then cause each of the disk indexes to recover to the same state they were in, prior to the last shutdown.

Specifically, when the collection.addIndex() method is called, that will in turn call the index.init() method, which in turn will call doAddAll() which optionally gives the disk index the opportunity to rebuild itself completely from the contents of the collection. That process is called index reinitialization.

Some history

Prior to CQEngine 2.11.0, the behaviour was that each disk index would completely reinitialize (rebuild) itself at application startup whenever the addIndex() method was called.

However, since CQEngine 2.11.0 (~June 2017) the behaviour was changed, such that at application startup each disk index would instead check if its data structures had been persisted to disk previously.

  • If the data structures had been persisted previously, the index would then skip the reinitialization step, with the (usually correct) assumption that it was already in-sync with the contents of the collection.
  • If the index found that its data structures had not been persisted previously, then it would proceed with the reinitialization step to completely rebuild itself.
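The two cases above can be modelled with a toy class. This is only an illustration of the decision, not CQEngine's actual code: `ToyDiskIndex`, its constructor argument (standing in for a table found on disk) and the `reinitPreexisting` flag are all hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the startup decision; not CQEngine's implementation. */
class ToyDiskIndex {
    // Stands in for the index table on disk: attribute value -> object ids.
    private final Map<String, Set<Integer>> table;
    private final boolean existedOnDisk;

    /** Pass null to model an index that was never persisted. */
    ToyDiskIndex(Map<String, Set<Integer>> persistedTable) {
        this.existedOnDisk = persistedTable != null;
        this.table = existedOnDisk ? persistedTable : new HashMap<>();
    }

    /**
     * Called when the index is added to the collection.
     * Returns true if the index was rebuilt from the collection contents.
     */
    boolean init(Map<Integer, String> collection, boolean reinitPreexisting) {
        if (existedOnDisk && !reinitPreexisting) {
            return false; // skip: assume the table is already in sync
        }
        table.clear();
        collection.forEach((id, value) ->
                table.computeIfAbsent(value, k -> new TreeSet<>()).add(id));
        return true;
    }

    Set<Integer> lookup(String value) {
        return table.getOrDefault(value, Collections.emptySet());
    }
}
```

In this model, an object that was added to the collection while the index was not registered stays invisible to `lookup` after a skipped `init`, which matches the missing index entries described in the original report.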

Reinitialization can be a very expensive/slow operation for large collections, so that's why the default behaviour was changed. However, you can still request the previous behaviour (to reinit preexisting indexes) if you wish by setting a system property as discussed in the release notes.
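A minimal sketch of setting that property from code, using the property name quoted later in this thread (`cqengine.reinit.preexisting.indexes`). It assumes the value may be read only once, so it is set before any CQEngine class is touched; passing `-Dcqengine.reinit.preexisting.indexes=true` on the JVM command line is the equivalent without code. The `ReinitFlag` class is a hypothetical helper.

```java
class ReinitFlag {
    // Property name as quoted in this thread; check the release notes for your version.
    static final String PROPERTY = "cqengine.reinit.preexisting.indexes";

    /** Requests the pre-2.11.0 behaviour: rebuild disk indexes when they are added. */
    static void enable() {
        System.setProperty(PROPERTY, "true");
    }

    public static void main(String[] args) {
        enable(); // do this first, before creating the collection
        // ... create the collection and call addIndex() for each disk index here ...
        System.out.println(PROPERTY + "=" + Boolean.getBoolean(PROPERTY));
    }
}
```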

So, what can you do?

If you want to integrate with the new behaviour (since that version), the application needs to keep track of which disk indexes it had added, and call the addIndex() method accordingly at application startup. In most applications which add indexes programmatically, keeping track of which indexes were added is a no-op. Alternatively, the application can set that system property to configure CQEngine to reinit existing disk indexes by default.
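For applications that add indexes dynamically, one way to keep track is a small registry file that is replayed at startup. The sketch below is an assumption about how an application might do this, not a CQEngine feature: `IndexRegistry` is a hypothetical class, it stores attribute names only, and the callback passed to `replay` is where the application would map each name back to its own Attribute and call collection.addIndex().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical helper: remembers which attributes had a disk index. */
class IndexRegistry {
    private final Path file;
    private final Set<String> names = new LinkedHashSet<>();

    IndexRegistry(Path file) {
        this.file = file;
        try {
            if (Files.exists(file)) {
                names.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Records the attribute name on disk; returns true if it was not known before. */
    boolean register(String attributeName) {
        if (!names.add(attributeName)) {
            return false;
        }
        try {
            Files.write(file, names, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    /** At startup: invoke the callback once per remembered attribute name. */
    void replay(Consumer<String> addIndex) {
        names.forEach(addIndex);
    }
}
```

At startup the application would construct the registry and call `replay` before running any queries; whenever it adds an index on the fly, it would also call `register` with the attribute's name.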

I totally agree, this does make it difficult for applications which add disk indexes dynamically of course.

So you might reasonably ask: why doesn't CQEngine just remember which indexes it had?

Part of it is that CQEngine originally started out as an in-memory collection only, so by definition, there was no need to remember this stuff.

However, another reason is that it's not completely straightforward for CQEngine to remember which indexes were in place: because indexes are built on attributes and attributes are application-defined functions or lambdas. We would need a way to serialize attributes or lambdas to disk, in order to restore them later.

Attributes could either be serialized using Java serialization (which requires the attribute, and any objects it references, to implement the Serializable interface). Or, they could be serialized with Kryo which doesn't have such a requirement. However Kryo tends to support the serialization of simple POJOs, rather than complex objects which use inheritance and might reference other objects in non-trivial ways.
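To illustrate the Java serialization option with the standard library only: a lambda can be written to bytes and restored if its target type is Serializable, while a plain lambda cannot. This is a sketch of the constraint described above, not a proposal for how CQEngine would implement it; `SerializableFunction` and the helper methods are hypothetical names, and the Kryo option is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

class AttributeSerializationSketch {
    /** A function type that is also Serializable, so lambdas assigned to it can be written out. */
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    static byte[] toBytes(Object value) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) { // includes NotSerializableException
            throw new UncheckedIOException(e);
        }
    }

    @SuppressWarnings("unchecked")
    static <T> T fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (T) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes a lambda, restores it, and applies the restored copy. */
    static int roundTripLength(String input) {
        SerializableFunction<String, Integer> length = s -> s.length();
        Function<String, Integer> restored = fromBytes(toBytes(length));
        return restored.apply(input);
    }

    /** A lambda whose target type is not Serializable cannot be written. */
    static boolean plainLambdaFails() {
        Function<String, Integer> plain = s -> s.length();
        try {
            toBytes(plain);
            return false;
        } catch (UncheckedIOException e) {
            return true;
        }
    }
}
```

Note that restoring a serialized lambda still requires the class that defined it to be on the classpath, so this stores a reference to application code rather than the code itself.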

So since this wasn't really so straightforward (it requires experimentation with both options) and therefore might be time consuming, I've simply given priority to implementing other features rather than this one.

If anyone would like to help out with this feature, help would be welcome!

npgall avatar Aug 08 '19 22:08 npgall

So if I set cqengine.reinit.preexisting.indexes to true, then it will automatically rebuild the index when I start up?

muhdkhokhar avatar Aug 09 '19 20:08 muhdkhokhar

Will it impact retrieval performance, or is it just a one-time cost at server startup?

muhdkhokhar avatar Aug 09 '19 20:08 muhdkhokhar

@muhdkhokhar the index won't be rebuilt until you call .addIndex, so the object won't appear in searches until the index is added - but when you do call .addIndex the index will be rebuilt and the attribute added to the index. If you don't set re-indexing, the index will not be rebuilt (if it already has a table) and queries on the attribute will never find that object or any other object added during that window.

It will impact the performance once, when .addIndex is called.

@npgall thanks for the detailed answer and the confirmation. I've been having issues with ghost objects/duplicates for the last three years using CQEngine and this finally explains why. Should the documentation be updated to clarify this? Or have I missed something?

That sounds like a useful and fun feature, I'll probably think about it for a while and if I can figure out something that makes sense I might try it. Now I've just refactored a bit and don't really have a need for adding indexes on the fly.

Some thoughts so far

  • would it be okay to store the serialized attribute in the sqlite_master table?
  • backwards compatibility; if the attribute cannot be serialized keep default behavior.
  • not sure if opt-in or opt-out of serialized attributes, or if it's even per attribute/collection.
  • calling addIndex with an attribute that doesn't match the serialized form should trigger a re-index.
  • I like kryo, so I'll probably start with that.

codingchili avatar Aug 09 '19 22:08 codingchili