seml
Bug: Deleting a failed experiment does not delete all saved source files in the mongodb.
When deleting a failed experiment, not all associated source files saved in the mongodb collections `fs.files` and `fs.chunks` are deleted. Only the files stored when staging the experiment are deleted, not those stored when starting/running it (i.e. the actual experiment script). The only consequence of the bug is that these two collections become cluttered over time.
Expected Behavior
Delete all source files associated with an experiment in the mongodb. This includes the source files saved during staging (listed in the experiment's mongodb entry under seml->source_files) and those saved when running the experiment (listed under experiment->sources).
Actual Behavior
Only the entries in `fs.files` and `fs.chunks` that correspond to the source files saved during staging and listed in seml->source_files are deleted.
Steps to Reproduce the Problem
- Count the elements in the `fs.files` and `fs.chunks` collections
- Add an experiment (which will fail) using `seml mycollection add myconfig`
- Run the experiment using `seml mycollection start`
- Delete the experiment using `seml mycollection delete`
- Count/inspect the elements in the `fs.files` and `fs.chunks` collections
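The first and last steps can be sketched in Python as follows. This is only a minimal illustration, not part of seml: the helper names `count_gridfs_documents` and `gridfs_growth` are hypothetical, and `db` is assumed to behave like a pymongo `Database` (indexing by collection name, then `count_documents`).

```python
from typing import Any, Dict

# The two collections GridFS uses by default.
GRIDFS_COLLECTIONS = ("fs.files", "fs.chunks")


def count_gridfs_documents(db: Any) -> Dict[str, int]:
    """Count the documents in each GridFS collection.

    `db` is assumed to behave like a pymongo Database, i.e. `db[name]`
    returns a collection object offering `count_documents(filter)`.
    """
    return {name: db[name].count_documents({}) for name in GRIDFS_COLLECTIONS}


def gridfs_growth(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return how many documents each collection gained between two counts."""
    return {name: after[name] - before[name] for name in before}
```

With pymongo, `db` could be obtained via `pymongo.MongoClient(uri)[database_name]`. Calling `count_gridfs_documents(db)` before adding the experiment and again after deleting it, then passing both results to `gridfs_growth`, should yield non-zero values if the bug is present.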
Specifications
- Version: 0.3.6
- Python version: 3.9.7
- Platform: Linux and Mac OS
Hi, Thanks for opening this issue. I have some clarifying questions.
What do you mean by "those saved when running the experiment"? Are you referring to the source files (optionally) uploaded by `sacred`?
Currently, we're only cleaning up those files that `seml` actually uploads. Could there be unintended side effects, such as deleting source files (added externally by another tool) that the user does not want or expect to be deleted?
I am referring to both the source files uploaded by `seml` when staging the experiment and the source files uploaded by `sacred` when running the experiment.
As seml is built upon sacred, and deleting an experiment in seml means deleting the corresponding mongodb entry including all the information the sacred experiment stored in it, I would expect seml to also delete the source files saved by sacred. I do not see any unintended side effects of deleting these sources, as they are orphaned in the sense that no corresponding seml/sacred experiment exists in the mongodb anymore.
As an example of another project doing this: `omniboard` connects directly to the mongodb and displays the experiments saved by sacred (or also seml :)). If you delete a sacred experiment with `omniboard`, it not only deletes the mongodb entry in the used collection but also automatically deletes the saved sources.
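To make "orphaned" concrete, here is a minimal sketch of how such files could be identified. It is not seml's or omniboard's actual implementation. It assumes that seml->source_files and experiment->sources are lists of `[filename, file_id]` pairs and that sacred artifacts are stored under an `artifacts` key as dicts with a `file_id` field; both function names are hypothetical.

```python
from typing import Any, Dict, Hashable, Iterable, Set


def referenced_file_ids(experiment: Dict[str, Any]) -> Set[Hashable]:
    """Collect the GridFS file ids that one experiment document refers to.

    Assumed layout (not verified against seml or sacred):
      seml.source_files  -> list of [filename, file_id] (saved when staging)
      experiment.sources -> list of [filename, file_id] (saved when running)
      artifacts          -> list of {"name": ..., "file_id": ...}
    """
    ids: Set[Hashable] = set()
    for _name, file_id in experiment.get("seml", {}).get("source_files", []):
        ids.add(file_id)
    for _name, file_id in experiment.get("experiment", {}).get("sources", []):
        ids.add(file_id)
    for artifact in experiment.get("artifacts", []):
        ids.add(artifact["file_id"])
    return ids


def orphaned_file_ids(
    stored_ids: Iterable[Hashable],
    experiments: Iterable[Dict[str, Any]],
) -> Set[Hashable]:
    """Return the stored file ids that no remaining experiment references."""
    referenced: Set[Hashable] = set()
    for experiment in experiments:
        referenced |= referenced_file_ids(experiment)
    return set(stored_ids) - referenced
```

Here `stored_ids` would be the `_id` values found in `fs.files`, and `experiments` the documents of every collection that shares the database. Checking all collections matters: a file referenced by an experiment in a different collection is not orphaned and should be kept.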
Just to add to this: as far as I can tell from looking at the code, the issue likely also affects artifacts added with `sacred` during the experiment. Since these artifacts can be relatively big compared to source files, the issue could bloat the database very quickly.
Since the leftover artifacts were taking up large amounts of space in our database, I wrote a small script to clean up the DB. You can take a look and try it yourself here: https://gist.github.com/HenniOVP/fc2e54ea56abaf291ee8dab17b5e5f19
It appears to work as intended, but I would advise caution when using it, since deleted files are gone permanently.
See this PR. It addresses both the issue of source files added by Sacred and also improves the workflow for purging orphaned files from the MongoDB (similar in spirit to the notebook linked by @HenniOVP). Feel free to comment on the PR :)
@saper0, @HenniOVP can this issue be closed?
From my perspective you can close the issue, since the PR by @danielzuegner seems to have resolved the problem :)