Concurrent operations on `.xlsx` corrupted the log file

mathias-sm opened this issue 3 years ago • 7 comments

To speed things up, I distributed my subjects' processing across different computers simultaneously. The jobs all failed at more or less the same time, I suspect because of a corrupted `_log.xlsx` file.

All failed when calling `save_logs`:

...
File "[...]/mne-bids-pipeline/config.py", line 3523, in save_logs
    book = load_workbook(fname)
...
File "[...]/lib/python3.10/zipfile.py", line 1362, in _RealGetContents
    raise BadZipFile("Bad magic number for central directory")

After that failure, any script would fail when reaching `save_logs` unless I either (i) removed the `.xlsx` file, or (ii) replaced the `save_logs` function in config.py with a no-op like `return None`. This suggests that the log file was corrupted, which I assume came from concurrent operations on it across computers.

A simple fix could be to have one log file per subject, or to use an append-only format, maybe plain CSV, so that this kind of error can't occur?
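
For illustration, a minimal sketch of what per-subject, append-only CSV logging could look like (the function and file names here are made up, not the pipeline's actual `save_logs`):

```python
import csv
from pathlib import Path


def append_log_row(deriv_root, subject, step, success, message=""):
    # One CSV per subject, opened in append mode; no other job ever touches this file.
    log_path = Path(deriv_root) / f"sub-{subject}_log.csv"
    write_header = not log_path.exists()
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["subject", "step", "success", "message"])
        writer.writerow([subject, step, int(success), message])
```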

Or is distributing `python run.py --steps=[..] --subject=X` across several machines, one for each subject, not really supported?

mathias-sm avatar Jun 24 '22 10:06 mathias-sm

Appending can still lead to race conditions like these. Maybe we should use an SQLite DB instead, but @agramfort is going to hate this.
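
For what it's worth, a rough sketch of the SQLite idea (table and column names are hypothetical); SQLite serializes writers via its own file locking, so concurrent jobs block briefly instead of corrupting the store:

```python
import sqlite3


def save_log_row(db_path, subject, step, success, message=""):
    # The connection context manager commits on success and rolls back on error;
    # the timeout lets a job wait for another writer to release the lock.
    with sqlite3.connect(db_path, timeout=30) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS logs "
            "(subject TEXT, step TEXT, success INTEGER, message TEXT)"
        )
        conn.execute(
            "INSERT INTO logs VALUES (?, ?, ?, ?)",
            (subject, step, int(success), message),
        )
```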

hoechenberger avatar Jun 24 '22 11:06 hoechenberger

@mathias-sm how did you distribute the computation? with dask?

agramfort avatar Jun 25 '22 09:06 agramfort

No, I'm using the lab cluster, which uses the Portable Batch System (PBS) to schedule jobs across nodes. I set it up to create one job per subject, using the `--subjects="X"` syntax of the runner. PBS then starts these jobs on various machines according to resource availability: typically it starts all my jobs at the same time, over 5 to 10 different computing nodes. I realized that for some steps this gives undesired behavior (e.g. the group-level steps, which only look at the given subject when passed this argument), so I exclude those steps and run them separately at the end.

mathias-sm avatar Jun 27 '22 10:06 mathias-sm

hack hack hack :)

a sustainable way forward would be to use something like this

https://github.com/facebookincubator/submitit/issues/25

or to use dask...
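
Roughly, a submitit-based version could look like the sketch below; `process_subject` is a hypothetical wrapper around the per-subject pipeline steps, and note that submitit mainly targets Slurm, so it may not map directly onto a PBS cluster:

```python
import submitit


def process_subject(subject):
    # hypothetical: run the pipeline steps for one participant
    ...


subjects = ["01", "02", "03"]
executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(timeout_min=240, mem_gb=16, cpus_per_task=4)
jobs = executor.map_array(process_subject, subjects)  # one job per subject
results = [job.result() for job in jobs]  # blocks until all jobs finish
```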


agramfort avatar Jun 27 '22 13:06 agramfort

I didn't realize this would look hack-ish. I assumed that if I can run the pipeline on a single participant, then I can just distribute participants over nodes and be done with it. I guess I could learn submitit or dask, but I already had a 10-line bash script that takes a file as input and sends each line as a job to the scheduler, which then dispatches them wherever. I have used that for many different pipelines, written in different languages, and it's been quite useful since many pipelines are "trivially" parallelizable over subjects until late stats.

Anyway, this may not need a fix; I was just raising it since this is a recent addition to the pipeline.

mathias-sm avatar Jun 27 '22 14:06 mathias-sm

It's a hack because your scheduler does not seem to be aware of the nature of the processes it's running, hence you observe race conditions like this. If you want to do this "properly", each job needs to create its own output directory. This pipeline is not intended for this type of usage: we basically collect all processing results in a single place and aggregate them.

Or we need yet another "watch" process, which could also be a database server that accepts the results from all jobs and ensures all data is written properly.

hoechenberger avatar Jun 27 '22 18:06 hoechenberger

what you could try is to run dask workers over SSH to see if it works?
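
For reference, dask's distributed scheduler ships an `SSHCluster` that can spread workers over machines reachable by SSH; a minimal sketch with hypothetical hostnames and a hypothetical `process_subject` function:

```python
from dask.distributed import Client, SSHCluster


def process_subject(subject):
    # hypothetical: run the pipeline steps for one participant
    ...


# First host runs the scheduler, the rest run workers; all need passwordless SSH.
cluster = SSHCluster(["node01", "node02", "node03"], worker_options={"nthreads": 4})
client = Client(cluster)
futures = client.map(process_subject, ["01", "02", "03"])
results = client.gather(futures)  # wait for all subjects to finish
```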


agramfort avatar Jun 28 '22 08:06 agramfort

We use filelock nowadays for report generation; we should just use it for this, too.
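
A minimal sketch of what that could look like around the workbook write (function name and paths are illustrative, not the pipeline's actual code); whichever job acquires the lock finishes its read-modify-write before the next one touches the file:

```python
from filelock import FileLock
from openpyxl import load_workbook


def save_logs_locked(fname, rows):
    # A sidecar .lock file guards the shared workbook against concurrent writers.
    with FileLock(str(fname) + ".lock", timeout=60):
        book = load_workbook(fname)
        sheet = book.active
        for row in rows:
            sheet.append(row)
        book.save(fname)
```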

larsoner avatar Dec 02 '22 15:12 larsoner

I think this was fixed in #736

larsoner avatar Jun 28 '23 16:06 larsoner