dicoogle Bulk unindexing for IndexerInterface

This proposes an extension to the IndexerInterface so that API consumers can request multiple items to be unindexed at once. Resolves #594.

This extension to the API should be carefully reviewed and evaluated to ensure that implementers can use it properly and that it can be well integrated into existing workflows.

Breaking changes are minimal. It is only a problem if an existing plugin already has a method with the exact same signature, which is unlikely. Hence, we are in position to deliver this in 3.4.0, or postpone to version 4 anyway.

To ensure some level of quality in this suggestion, we should have:

[ ] at least one proof of concept plugin implementation
- We could take this as an opportunity to move forward with hosting the open plugins on GitHub.
[x] at least one example that the Dicoogle core can consume this method in a usable way without downsides
- I suggest an extension to the unindexing service so that it accepts multiple URIs

Problems resolved in the latest version:

[x] No way to retrieve the task's unindexing progress from another thread. Now has a second parameter as a progress callback.
[x] Even if the indexer periodically flushes some unindexing operations during the task, other routines cannot work with URIs which have already been unindexed before the task ends completely. This could pose an issue in file removal tasks, because it requires unindex to finish before we start cleaning up the files. The operation is now asynchronous.

Known caveats:

~~It is not clear whether we can properly constrain the number of parallel unindexing tasks by pushing this to our indexing task pool. The current indexing task manager depends on task objects returning a different kind of report, which UnindexReport does not implement. Maybe it should be subclassed from Report?~~ As of the latest version, unindexing tasks can be dispatched by the indexer thread pool.
Having a second parameter to keep track of progress is quirky. One would naturally expect an UnindexTask<T> to extend Task<T>, but it is very different from indexing tasks and would make it even harder to attach to the task pool.
The resulting object is too complex and granular. While there is a lot of power in being able to specify an error for each URI, it is not always the case that files will be unindexed independently, so it may be hard to build such results in batched unindexing.

Jul 21 '22 10:07 Enet4

My opinion is to have it in 3.2.0. Thinking aloud: my main concern about the current new API is the not capability to report the URIs unindexed (or perhaps the ones it failed). But I can take a look in our use case and see how it will look like.

Jul 22 '22 08:07 bastiao

my main concern about the current new API is the not capability to report the URIs unindexed (or perhaps the ones it failed).

That is true. I feared that including that information would overcomplicate the interface, but the cost of not knowing which files were unindexed and which ones were not is possibly too high. I will revise this.

Jul 22 '22 08:07 Enet4

do you have plans to "un-draft" this PR?

Aug 25 '22 07:08 bastiao

do you have plans to "un-draft" this PR?

I have updated the PR with a plan. It should not become ready for review until there is evidence that this API works well for implementers and consumers. We also have two known problems and further thinking is needed to either resolve them or just accept them as not enough of a problem to merit blocking the extension. Feedback is more than welcome.

Aug 25 '22 09:08 Enet4

Updated the API to include a progress callback. I updated the root message with known caveats.

Oct 15 '22 13:10 Enet4

The unindex bulk API can be beneficial in certain scenarios, and we are currently approaching a milestone where new APIs will be incorporated. It would be appreciated if you could take into account the undraft this pull request, as it has been pending for quite some time. If necessary, we can revisit and iterate on it at a later stage.

Jul 08 '23 15:07 bastiao

The unindex bulk API can be beneficial in certain scenarios, and we are currently approaching a milestone where new APIs will be incorporated. It would be appreciated if you could take into account the undraft this pull request, as it has been pending for quite some time. If necessary, we can revisit and iterate on it at a later stage.

I have integrated bulk unindexing to the core unindex Web service. It would be better to also have a proof of concept implementation other than the default.

Jul 21 '23 11:07 Enet4

I added this point to the known caveats.

The resulting object is too complex and granular. While there is a lot of power in being able to specify an error for each URI, it is not always the case that files will be unindexed independently, so it may be hard to build such results in batched unindexing.

After trying to implement this interface, I am convinced that the API needs to be redesigned to be easier to implement.

Mar 25 '24 11:03 Enet4

I have readjusted the Unindex report so that each failure may specify multiple URIs, which is a better reflection of what will happen in most implementations.

Apr 17 '24 15:04 Enet4

dicoogle dicoogle copied to clipboard

Bulk unindexing for IndexerInterface

dicoogle
dicoogle copied to clipboard