Heartbeat race condition for multiple runs
I tested this with FileStorageObserver, but it can probably affect other observers as well. The heartbeat race condition can result in observer failures or garbage data in the metrics file/records.
Steps for reproduction (with breakpoints):
- Execute `Experiment.run()` multiple times (sequentially) for an Experiment with a single FileStorageObserver
- Set breakpoint A on the last line of `IntervalTimer.run()` (`self.func()`, line 470 in `sacred/utils.py`)
- Set breakpoint B (enabled after breakpoint A is hit) in `FileStorageObserver.log_metrics`, where the saved metrics are loaded (`saved_metrics = json.load(open(metrics_path, 'r'))`, line 219 in `sacred/observers/file_storage.py`)
- Resume the thread stopped at breakpoint A until it reaches breakpoint B
Note that at breakpoint B we have two different Run instances (visible in the stack) but the same observer, which has already been updated for the second Run (e.g. its `self.dir` has changed). So the metrics.json file of the second Run will be accessed and modified by both threads. This can result in a FileStorageObserver failure (when one thread writes while the other tries to read and fails, since `json.dump` writes the file in chunks) or in garbage data in metrics.json.
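For illustration, here is a self-contained toy sketch (not Sacred code) of the same interleaving: one shared observer-like object whose `dir` is re-pointed per run, plus a "heartbeat" thread from the first run that is still alive when the second run starts.

```python
import json
import threading
import time
from pathlib import Path


class ToyObserver:
    """Stand-in for a shared FileStorageObserver whose self.dir changes per run."""

    def __init__(self, basedir):
        self.basedir = Path(basedir)
        self.dir = None

    def started_event(self, run_id):
        self.dir = self.basedir / str(run_id)      # re-pointed for every new run
        self.dir.mkdir(parents=True, exist_ok=True)
        (self.dir / "metrics.json").write_text("{}")

    def log_metrics(self, metrics):
        path = self.dir / "metrics.json"           # whatever run is active *now*
        saved = json.loads(path.read_text())
        saved.update(metrics)
        path.write_text(json.dumps(saved))


obs = ToyObserver("/tmp/toy_runs")
obs.started_event(1)

# A late heartbeat belonging to run 1 ...
late_heartbeat = threading.Thread(
    target=lambda: (time.sleep(0.1), obs.log_metrics({"loss": [1.0]}))
)
late_heartbeat.start()

# ... while run 2 has already re-pointed the shared observer.
obs.started_event(2)
late_heartbeat.join()

# Run 1's metrics end up in run 2's metrics.json.
print((Path("/tmp/toy_runs") / "2" / "metrics.json").read_text())
```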
Steps for reproduction (without breakpoints):
- Execute `Experiment.run()` multiple times (sequentially) under heavy CPU load (e.g. run `dd if=/dev/urandom | pigz >> /dev/null`).
- Make sure that saving metrics.json takes some time. Heavy CPU load combined with a slow metrics.json dump can prevent the heartbeat thread (IntervalTimer) from finishing before the next Run starts. A rough reproduction skeleton is sketched below.
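A repro skeleton along these lines should trigger it eventually; names and numbers here are illustrative, and depending on your Sacred version you may need `FileStorageObserver.create(...)` instead of the constructor.

```python
from sacred import Experiment
from sacred.observers import FileStorageObserver

ex = Experiment("heartbeat_race")
ex.observers.append(FileStorageObserver("runs"))   # one shared observer instance


@ex.main
def main(_run):
    # Log enough metric points that serializing metrics.json takes noticeable time.
    for step in range(50000):
        _run.log_scalar("loss", 1.0 / (step + 1), step)


if __name__ == "__main__":
    for _ in range(20):
        ex.run()   # sequential runs, executed while the machine is under heavy CPU load
```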
Original case scenario:
Running Experiments with multiprocessing.pool.Pool and reusing processes.
Problem origin:
Run._stop_heartbeat() waits only 2 s for the heartbeat thread to end, and that is sometimes not enough (heavy CPU load, complex metrics).
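For context, the stop path amounts to joining the heartbeat thread with a short timeout; the following is a simplified paraphrase (attribute names may differ from the actual Sacred source):

```python
def _stop_heartbeat(self):
    self._heartbeat.cancel()           # ask the IntervalTimer thread to stop
    self._heartbeat.join(timeout=2)    # ...but wait at most 2 seconds for it
    # If the final heartbeat (dumping metrics.json under load) needs longer than
    # 2 s, join() returns with the thread still alive, and it keeps running into
    # the next sequential run that reuses the same observer instance.
```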
Possible solutions (that I can think of):
- Get rid of the 2 s timeout when joining the heartbeat thread (risk of waiting forever if something goes wrong?)
- Extend the timeout and kill the thread if it is still alive (possible data loss); a sketch follows below
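A sketch of option 2, assuming the timeout is made configurable (the parameter name is made up, and `self.run_logger` stands for whatever logger the Run uses); note that Python has no safe way to forcibly kill a thread, so "kill" in practice means warning and abandoning it:

```python
def _stop_heartbeat(self, heartbeat_join_timeout=60):
    self._heartbeat.cancel()
    self._heartbeat.join(timeout=heartbeat_join_timeout)
    if self._heartbeat.is_alive():
        # There is no clean thread kill in Python: warn and move on, accepting
        # that the data from the last heartbeat may be lost or incomplete.
        self.run_logger.warning(
            "Heartbeat thread did not stop within %s s; abandoning it.",
            heartbeat_join_timeout,
        )
```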
Workaround:
Execute `Experiment.run()` in a new process every time (even for sequential execution), e.g. as sketched below.
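A minimal sketch of that workaround, assuming an `ex` Experiment object defined at module level (as in the earlier repro skeleton):

```python
import multiprocessing as mp


def run_once():
    ex.run()   # `ex` is the Experiment defined at module level


if __name__ == "__main__":
    for _ in range(20):
        p = mp.Process(target=run_once)
        p.start()
        p.join()   # still sequential, but a straggling heartbeat dies with its process
```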
Oh, that is interesting. I think you are right, this is because _stop_heartbeat does not actually make sure the heartbeat is stopped.
I dislike option 1, because then experiments might never finish if the heartbeat blocks. But we can certainly increase this time and kill it afterwards (option 2). I think up to a minute should be OK. Potentially make it configurable.
I see two more possibilities. First: ensure old heartbeats are stopped right before starting a new run. That way we only kill the heartbeat when we need to start a new one and leave it alone otherwise. That would make the experiment the central place for knowing which run is active. In theory `ex.current_run` should provide just that, but it should be carefully checked for thread-safety if we want to rely on it in this context.
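Roughly, that first idea could look something like this (attribute and method names are hypothetical):

```python
def _ensure_previous_heartbeat_stopped(experiment):
    # Hypothetical helper, called right before a new Run is created.
    prev = experiment.current_run
    if prev is not None and prev._heartbeat is not None:
        prev._heartbeat.cancel()
        prev._heartbeat.join()   # no timeout: wait here, and only here, as long as needed
```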
Second: we could make a copy of the observers for each run. That way we would not reuse the same FileStorageObserver, and the copies should be able to do their business without interference. But this bug might cause problems beyond the observers, so I'm not sure this is a good solution.
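As a hedged sketch of that second idea (whether deepcopy is safe for every observer type, e.g. ones holding open DB connections, would need checking):

```python
import copy


def observers_for_new_run(experiment):
    # Hypothetical: give every Run its own observer copies, so a late heartbeat
    # from run N can only ever touch run N's observer state.
    return [copy.deepcopy(obs) for obs in experiment.observers]
```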