4cat
Make it possible to pause and resume search workers
In most cases it should be possible to interrupt and resume searches, especially when requesting data from an API in chunks: when a search is interrupted, the worker should keep track of the last completed chunk and pick up from there. At the moment, partial results are wiped and the search has to be restarted from scratch, which is quite annoying for large datasets and, in case of a crash, also wastes API quota (e.g. Twitter's).
I think this can only be done with custom resuming code per processor. The logic of how to determine what part of the processing to skip (if any) and how to find the place in the results file to continue writing data, et cetera, is just too different from processor to processor.
This could be done with an optional resume() method in processor classes. If it is present, partial data files are not removed when 4CAT is restarted, but cleaning up is delegated to the processor, which could choose to not clean up, and instead implement logic in process() to skip part of the code or enter the data collection/processor loop partway through.
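To make the idea concrete, here is a rough sketch of how the optional resume() hook could work. Everything in it is hypothetical and not 4CAT's actual code: `Processor`, `ChunkedSearch`, `prepare_after_restart()` and the `results_path` attribute are made-up names, and it assumes each chunk is written to the results file as exactly one complete line.

```python
import os


class Processor:
    """Hypothetical stand-in for a processor base class (not 4CAT's real one)."""

    def __init__(self, results_path):
        self.results_path = results_path

    def process(self):
        raise NotImplementedError


class ChunkedSearch(Processor):
    """Hypothetical processor that opts in to resuming by defining resume()."""

    # Stand-in for chunks that would normally be fetched from an API
    CHUNKS = ["chunk-0", "chunk-1", "chunk-2"]

    def __init__(self, results_path):
        super().__init__(results_path)
        self.first_chunk = 0

    def resume(self):
        # Assumption: one line per finished chunk, so the line count
        # tells us how many chunks can be skipped.
        if os.path.exists(self.results_path):
            with open(self.results_path) as infile:
                self.first_chunk = sum(1 for _ in infile)

    def process(self):
        # Append, so results written before the restart are kept
        with open(self.results_path, "a") as outfile:
            for chunk in self.CHUNKS[self.first_chunk:]:
                outfile.write(chunk + "\n")


def prepare_after_restart(processor):
    """Wipe partial results unless the processor defines resume()."""
    resume = getattr(processor, "resume", None)
    if callable(resume):
        resume()  # clean-up (or not) is delegated to the processor
    elif os.path.exists(processor.results_path):
        os.remove(processor.results_path)
```

Processors without a resume() method keep the current behaviour (partial file removed, start from scratch), so existing processors would not need to change. A real implementation would also have to deal with a half-written last line after a crash, which this sketch ignores.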
An open question is how a restarted processor would know how much it has already done. It could use the dataset's parameters field to store this data, but progress is not a dataset parameter. The progress field could be redefined to not just contain a number but more fine-grained progress information. Or a new field could be added.
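As an illustration of the "new field" option, the sketch below keeps resume information in its own field next to parameters and the numeric progress. It is only an assumption about how this could look: `DatasetRecord`, `checkpoint` and `save_checkpoint()` are hypothetical names, not part of 4CAT's dataset class, and JSON is used here just as an example serialisation.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical dataset record; not 4CAT's actual dataset class."""

    parameters: dict                                 # what the user asked for
    progress: float = 0.0                            # still a plain number, 0.0 to 1.0
    checkpoint: dict = field(default_factory=dict)   # new field: resume state

    def save_checkpoint(self, chunks_done, total_chunks, **extra):
        # Fine-grained state goes in `checkpoint`; `progress` is derived from it
        self.checkpoint = {"chunks_done": chunks_done, **extra}
        self.progress = chunks_done / total_chunks

    def to_json(self):
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw):
        return cls(**json.loads(raw))
```

A restarted processor could then load the record and read e.g. `checkpoint["chunks_done"]` or an API cursor stored via `extra`, while parameters stays limited to actual query parameters and progress keeps its current meaning.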