ISIS Reflectometry GUI auto-processing stops if file cannot be found
Original reporter: Stephen H at ISIS on behalf of Reflectometry Group
Describe the bug
The ISIS Reflectometry GUI auto-process feature looks up new runs in the journal viewer, automatically transfers them to the main table on the Runs tab, and reduces them. This works fine as long as the run file returned in the search is available for the reduction that follows. If it is not, a file-not-found error is logged and the row reduction fails. The GUI does not appear to attempt to reduce the row again.
This never previously caused problems because new runs only appeared in the journal viewer when the run file was available in the archive. With recent changes to IDAaaS, runs now seem to appear in the journal viewer before they are available in the instrument data cache. This means that users who don't or can't mount the archive encounter this issue when trying to auto-process new data.
To Reproduce
To reproduce this properly you would need to be auto-processing a currently running experiment on IDAaaS without the archive mounted, with permissions to access the experimental data and the data folder added to your Mantid user directories.
To simulate you can do the following:
- Find some test runs for a given experiment.
- Put some but not all into a folder on your local machine.
- Turn off the archive in Mantid and open the ISIS Reflectometry interface.
- On the Runs tab, enter the experiment RB number into the `Investigation ID` field and enter the cycle into the `Cycle` field. Click the `Autoprocess` button.
- The search results should populate with the full list of runs for the experiment. Only the files that can be found will reduce; the rest will result in errors. Adding the missing data files into the folder afterwards will not result in them being found and reduced.
Expected behavior
The auto-processing should be able to handle a delay in the files becoming available. There are a few ways this could be done, but as a start-point we could investigate preventing runs found in the search from being added to the search results until the file is available. We could log at debug level if a run has been found when the corresponding file cannot be accessed. Once the file is available, adding it to the search results at that point should in theory automatically trigger the rest of the process.
Platform/Version (please complete the following information):
- This has most likely been an issue since the introduction of the feature.
Additional context
From investigations so far, the auto-process feature works as follows:
- Clicking the `Autoprocess` button results in a call to `QtRunsView::on_actionAutoreduce_triggered`. This checks for new runs, and stops any polling for new runs while it is doing so.
- After getting search results, any new runs are added to the table in `QtCatalogSearcher::convertJournalResultsTableToSearchResults`.
- This all happens asynchronously, and when the search and the reduction have finished there is a call to `RunsPresenter::autoreductionCompleted()`. This starts the polling again, which makes a new call to `RunsPresenter::notifyCheckForNewRuns()` every 5 seconds to check for new runs.
I've had a look at only returning runs in the search results that are available to be reduced, however this requires looking up the file for each search result. Predictably, the file finding is far too slow, so this won't be suitable as a solution.
Another option is to set the option to re-reduce failed rows when autoreducing, which would mean the row would initially fail and throw an error but would then be reduced the next time we poll for new runs. This is a quick fix but would be confusing for users, so I think we want to look at alternative solutions.
Martyn has suggested considering a retry with exponential back-off approach, which would definitely be worth a look.
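A retry with exponential back-off could look something like the sketch below. The function name and the shape of `find_file` are assumptions for illustration; the real implementation would hook into whatever file-finding call the reduction uses:

```python
import time

def find_with_backoff(find_file, run, max_attempts=5, base_delay_s=1.0,
                      sleep=time.sleep):
    """Retry a file lookup, waiting 1s, 2s, 4s, ... between attempts.

    find_file: callable taking a run identifier and returning a path or
    None (hypothetical stand-in for the GUI's file finding).
    sleep is injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(max_attempts):
        path = find_file(run)
        if path is not None:
            return path
        if attempt < max_attempts - 1:
            sleep(base_delay_s * (2 ** attempt))
    return None
```

With the defaults above, five attempts span 1 + 2 + 4 + 8 = 15 seconds of waiting, which happens to match the copy-frequency figures discussed below; the constants would need tuning against the real IDAaaS timings.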
For context, IDAaaS polls for new files in the archive at a set frequency and then copies them into the instrument data cache. They have temporarily reduced the polling interval to 2 seconds (it was originally 15 seconds). For the average INTER file the copy takes about 2 seconds, which means there is currently approximately a 4-second delay between files appearing in the archive/journal viewer and being available in the instrument data cache. Larger files take longer to copy, so the delay can be longer for those.
The linked PR (#36895) reduced how frequently the NR GUI looks for new files to every 15 seconds, which has resolved the problem in almost all cases. The work left on this issue is therefore to make the autoprocessing feature a bit more robust if there are exceptions or any future changes to IDAaaS settings that impact the timing.
We started having issues with this again recently because IDAaaS needed to reduce the frequency of copying files from the archive into the instrument data cache. They have temporarily increased the frequency again, and I've confirmed with them that, over the long term, a copy frequency of around 15 seconds should be about right most of the time, but this can't be guaranteed. Therefore, as part of this work we should increase our default polling interval so that it can handle a copy time slightly longer than 15 seconds.
It might be interesting to investigate (even if only to rule out) if there is any feasible way we could eliminate this timing issue altogether by using the instrument cache index files that IDAaaS have added to identify new runs. The following requirements would need to be met:
- Performance of the auto-processing feature must not be negatively impacted/too slow.
- Having identified new runs, we must be able to get hold of the run title to return as part of the search.
- The auto-processing feature should still work when not on IDAaaS (i.e. the archive would be searched directly in that case).
It may not be possible to achieve all of these things, but some prototyping might be useful to be confident about whatever final solution we choose.
@cailafinn, apologies that this has come up after you've already made a start on this. Do you think the above investigation would be do-able at this stage before we settle on the exponential back-off solution?
The exponential back-off was actually pretty trivial to implement, so it might still be usable, but I'll park what I've got for now and look into data cache indexing.
Thank you @cailafinn, and apologies again for the lateness of this. The exponential back-off may still be worth adding either way then, but will be interesting to know what you find.