
Possible data race when saving large results in file storage

Open alex-van-vliet opened this issue 3 years ago • 3 comments

Hi,

When returning a large result from the main function (one that takes several seconds to serialize), there seems to be a data race on the run_entry object while the heartbeat is saving the file.

For example, with:

import numpy as np
import sacred
from sacred.observers import FileStorageObserver

experiment = sacred.Experiment('Exp')
experiment.observers.append(FileStorageObserver('tmp'))

@experiment.automain
def main():
    return np.random.rand(10000, 10000)

I get the following error:

Warning - Exp - An error ocurred in the '<...FileStorageObserver ...>' observer: dictionary changed size during iteration
Warning - Exp - The observer '<...FileStorageObserver...>' failed at some point during the run.

I am using sacred 0.8.2.

My understanding of the issue is the following:

There is a heartbeat thread that triggers the heartbeat_event in the observer, which calls save_json on the run_entry. The problem is that this takes some time: while save_json is flattening the run_entry, the job completes and calls completed_event, which adds stop_time to the run_entry. Because stop_time was not present before, the run_entry gains a new key, so the flatten in the heartbeat fails because it is still iterating over the run_entry.
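For illustration, here is a small standalone sketch of what I believe is happening (hypothetical names and sizes, not sacred's actual code): one thread iterates over a dict while another thread adds a key, and CPython raises the same "dictionary changed size during iteration" error.

import threading
import time

# Stand-in for the run_entry; large enough that iterating over it takes a noticeable amount of time.
run_entry = {f"key_{i}": i for i in range(2_000_000)}

def flatten(entry):
    # Stand-in for the slow serialization done by save_json.
    return [(key, value) for key, value in entry.items()]

def heartbeat():
    try:
        flatten(run_entry)
    except RuntimeError as error:
        # RuntimeError: dictionary changed size during iteration
        print("heartbeat failed:", error)

thread = threading.Thread(target=heartbeat)
thread.start()
time.sleep(0.05)                       # let the iteration get underway (timing-dependent)
run_entry["stop_time"] = time.time()   # completed_event adds a new key
thread.join()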

Now technically, because of the exception handling, the completed_event will overwrite run.json anyway, so the file does not seem to end up corrupted. But this could still create other issues I am not aware of.

alex-van-vliet avatar Jun 03 '22 00:06 alex-van-vliet

Hi @alex-van-vliet! Thanks for reporting. From your description, this can also happen with a smaller run_entry; it just becomes more likely the longer serialization takes.

I think the only viable solution would be a lock. This would have the drawback of blocking the main thread whenever an event overlaps with saving the heartbeat event. Would you be willing to implement/test this and provide a pull request?
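As a very rough sketch of the lock idea (a hypothetical wrapper, not a patch to sacred's internals), the same lock would have to be shared between the heartbeat handler and the other event handlers so that the run_entry is never mutated while it is being serialized:

import threading

class LockedObserverProxy:
    # Serializes heartbeat saves and run-entry mutations for a wrapped observer.

    def __init__(self, observer):
        self._observer = observer
        self._lock = threading.Lock()

    def heartbeat_event(self, info, captured_out, beat_time, result):
        # Holds the lock while run.json is written.
        with self._lock:
            self._observer.heartbeat_event(info, captured_out, beat_time, result)

    def completed_event(self, stop_time, result):
        # Blocks the main thread until any in-flight heartbeat save has finished.
        with self._lock:
            self._observer.completed_event(stop_time, result)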

BTW, what are you saving that is that large that it takes several seconds to serialize? Maybe another way of saving your results could solve the issue for you.
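For example (a rough, untested sketch assuming numpy and a throwaway temporary directory), writing the array to disk and registering it with add_artifact avoids serializing it into run.json entirely:

import os
import tempfile

import numpy as np
import sacred
from sacred.observers import FileStorageObserver

experiment = sacred.Experiment('Exp')
experiment.observers.append(FileStorageObserver('tmp'))

@experiment.automain
def main():
    result = np.random.rand(10000, 10000)
    # Save the array to a file and attach it as an artifact instead of returning it.
    path = os.path.join(tempfile.mkdtemp(), 'result.npy')
    np.save(path, result)
    experiment.add_artifact(path, name='result.npy')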

thequilo avatar Jun 03 '22 07:06 thequilo

Yes, that is true.

I am not familiar enough with the codebase: could any other event clash with the heartbeat? If not, maybe completed_event could be triggered only after the heartbeat thread has been joined? In that sense, this seems similar to #345. I agree that one solution could be to remove the timeout on stopping the heartbeat, but I don't think killing the thread is a good option, as it could corrupt data. Looking at the observer interface, artifact_event/resource_event could perhaps also race with the heartbeat? In that case the above solution would not fix this issue and a lock would be needed anyway.

We would need to lock every loop that iterates over the observers (basically all of the emit functions), right, to prevent one observer from receiving a different value than another? Then we could probably also remove the timeout, because if the heartbeat blocks it will never release the lock and the experiment will never finish anyway. But this raises another question from #345: what if the observer is reused in another experiment and the heartbeat from one run clashes with an event from the second run? (I have never had this use case, so I can't really comment.)
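For reference, a rough sketch of what locking the emit loops could look like (purely hypothetical helper names, not sacred's actual internals):

import threading

class RunWithLockedEmits:
    def __init__(self, observers):
        self.observers = observers
        self._emit_lock = threading.Lock()

    def _emit(self, event_name, *args, **kwargs):
        # Every observer sees the event with the same run_entry state, and the
        # heartbeat thread cannot save while another event is being emitted.
        with self._emit_lock:
            for observer in self.observers:
                handler = getattr(observer, event_name, None)
                if handler is not None:
                    handler(*args, **kwargs)

    def _emit_heartbeat(self, info, captured_out, beat_time, result):
        self._emit('heartbeat_event', info, captured_out, beat_time, result)

    def _emit_completed(self, stop_time, result):
        self._emit('completed_event', stop_time, result)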

I don't have the time right now to implement this, but I can try when I have.

Just a lot of segmentations for further analysis. Thanks to the exception handling, the completed event is called anyway and saves the results properly, so in this case no data is corrupted, but I could indeed save them to a file instead.

alex-van-vliet avatar Jun 03 '22 14:06 alex-van-vliet

This could cause the behavior that I and others have observed, where in some cases the info dict is not fully saved after completion.

vnmabus avatar Aug 23 '23 10:08 vnmabus