Deprecate IndexerInterface#index(Iterable<StorageInputStream>)? Optimize batch indexing?

Open Enet4 opened this issue 10 years ago • 3 comments

Indexer interfaces currently need to implement 2 methods related to indexation:

public Task<Report> index(StorageInputStream file);

public Task<Report> index(Iterable<StorageInputStream> files);

The second one is only an aggregation of multiple indexation tasks, which brings no advantage when performed in the plugin itself (it's also an open door to mistakes). I propose that we deprecate it in one of our future minor or major revisions of Dicoogle, so that it can later be removed smoothly.

This should also favour a balanced usage of threads for tasks after we solve #140.

Enet4 avatar Oct 27 '15 18:10 Enet4

I have talked to @fmgvalente on the subject, and he argued that index(Iterable<_>) can improve performance because a plugin implementation can commit the documents to the database in a batch (as in, performing commit once per batch instead of once per file).
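To make the batched-commit argument concrete, here is a minimal sketch. It does not use the Dicoogle API: `CountingStore` and `BatchCommitSketch` are hypothetical stand-ins, files are plain strings, and a "commit" is just a counter increment standing in for the expensive I/O step.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a plugin's backing store; not part of Dicoogle.
class CountingStore {
    private final List<String> pending = new ArrayList<>();
    private int committedDocuments = 0;
    private int commits = 0;

    // Stages a document without persisting it.
    void add(String document) {
        pending.add(document);
    }

    // Persists everything staged so far; assumed to be the costly step.
    void commit() {
        committedDocuments += pending.size();
        pending.clear();
        commits++;
    }

    int commitCount() {
        return commits;
    }

    int committedDocumentCount() {
        return committedDocuments;
    }
}

class BatchCommitSketch {
    // One commit per file: N files cost N commits.
    static void indexOneByOne(CountingStore store, Iterable<String> files) {
        for (String file : files) {
            store.add(file);
            store.commit();
        }
    }

    // One commit per batch: N files cost a single commit.
    static void indexAsBatch(CountingStore store, Iterable<String> files) {
        for (String file : files) {
            store.add(file);
        }
        store.commit();
    }
}
```

Both methods end up with the same documents committed; only the number of commits differs, which is where the potential saving would come from if commits dominate the cost.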

It's a tricky decision. On the one hand, a batched commit can make a significant impact on performance if I/O is the bottleneck. On the other hand, not every case of batched indexation is effectively treated as one (we got #139 for an equivalent reason), and the responsibility for parallelizing a batched indexation task is not clearly defined (it should lie with either the plugin or the plugin controller, preferably not both). Right now, if I want 4 threads to index a directory with the CBIR indexer, I need to call the index service 4 times on 4 partitions of the directory. Also note that I/O may no longer be the bottleneck when it comes to visual feature extraction.
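The workaround of calling the index service once per partition can be sketched roughly as follows. This is only an illustration under assumptions: `PartitionedIndexing` is a hypothetical helper, the `batchIndexer` function stands in for a call to the batch indexing method and returns the number of files it handled, and the round-robin split is one arbitrary way to partition a directory listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper; not part of Dicoogle.
class PartitionedIndexing {
    // Splits the files into `parts` partitions in round-robin order.
    static <T> List<List<T>> partition(List<T> files, int parts) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            out.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            out.get(i % parts).add(files.get(i));
        }
        return out;
    }

    // Runs one batch call per partition on its own thread and returns the
    // sum of the per-partition results.
    static <T> int indexInParallel(List<T> files, int threads,
                                   Function<List<T>, Integer> batchIndexer) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<T> part : partition(files, threads)) {
                Callable<Integer> job = () -> batchIndexer.apply(part);
                futures.add(pool.submit(job));
            }
            int total = 0;
            for (Future<Integer> future : futures) {
                total += future.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In this arrangement the caller owns the parallelism and each partition is still a batch from the plugin's point of view, which is exactly the split of responsibilities that is currently left implicit.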

Making this issue a question so that we make a decision together. We have at least 3 choices:

  • Deprecate the method anyway, for the reasons mentioned earlier.
  • Deprecate the method and introduce a way to tell the indexer whether to commit the results or not. The core can later be optimized with a waiting queue (or similar) that identifies batches of files, making sure that a commit is performed at the end of each batch. This would preferably be an additional commit method, although a special argument in the main method would probably work as well.
  • Keep both methods, asserting that they commit the results on conclusion. With the optimization mentioned in the previous point, we can also improve how we pass batches to the indexer, thus supporting parallelized processing. When we move to Java 8, the batched indexation method can have the default implementation of calling the other method for each file.
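For discussion, the second and third choices could look roughly like this. These are simplified, hypothetical shapes rather than the actual Dicoogle interfaces: files are plain strings and reports are strings, whereas the real methods take `StorageInputStream` and return `Task<Report>`. The names `CommitAwareIndexer`, `DefaultBatchIndexer` and `StagingIndexer` are made up for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Choice 2 (hypothetical shape): single-file indexing plus an explicit
// commit that the core would call once at the end of a batch.
interface CommitAwareIndexer {
    void index(String file); // stages the file without committing
    void commit();           // persists everything staged so far
}

// Choice 3 (hypothetical shape): both methods are kept, and the batch method
// gets a Java 8 default implementation that calls the single-file method.
interface DefaultBatchIndexer {
    String index(String file);

    // Plugins able to commit once per batch would override this default.
    default List<String> index(Iterable<String> files) {
        List<String> reports = new ArrayList<>();
        for (String file : files) {
            reports.add(index(file));
        }
        return reports;
    }
}

// Toy implementation of choice 2 that only counts what it commits.
class StagingIndexer implements CommitAwareIndexer {
    final List<String> staged = new ArrayList<>();
    int commits = 0;
    int committedFiles = 0;

    @Override
    public void index(String file) {
        staged.add(file);
    }

    @Override
    public void commit() {
        committedFiles += staged.size();
        staged.clear();
        commits++;
    }
}
```

With the default method, a plugin that has nothing to gain from batching only implements the single-file method, while the explicit commit variant moves the decision of when a batch ends into the core.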

Enet4 avatar Nov 14 '15 22:11 Enet4