Possible data provenance functionality
With @JonasIsensee, @tamasgal and @sebastianpech we discussed that smaller groups of scientists may not find it sensible to opt for large data management software such as CaosDB. But it would still be great to have basic data provenance for forms of data outside `.bson`.
`.bson` and similar formats are covered satisfactorily by DrWatson due to the automatic addition of git info and of the source file that generated them. This is not possible for e.g. figures or CSV files.
What could be possible is to have a central file, next to Project.toml, that is also `.toml` or `.yml` based and works as a dictionary. It maps unique identifiers to a set of properties, the first of which is `file`, which just contains the file path relative to the project main folder. The advantage of using `.toml` is that it is human readable and can be searched with Ctrl+F. Notice that specialized parameter searches are more suited for the result of a function like `collect_data` and thus do not need to be considered for this functionality.
Other properties could be added, like source file used, date produced, `savename` of parameters used, author, git commit, etc.
All in all this is a great compromise between the complexity of a full data manager and having data provenance for figures, CSV, etc.
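Purely to make the idea concrete, here is a minimal sketch of what writing such an entry could look like, assuming Julia's TOML and Dates standard libraries plus DrWatson's `projectdir`, `savename` and `gitdescribe`. The file name `provenance.toml`, the helper `record_provenance` and the field names are all made up for illustration; here the project-relative path is used directly as the key, as is suggested further down the thread.

```julia
using DrWatson, TOML, Dates

# Hypothetical helper: add or overwrite a provenance entry for `file` in a
# central provenance.toml living next to Project.toml. All names illustrative.
function record_provenance(file, params, source)
    provfile = projectdir("provenance.toml")
    db = isfile(provfile) ? TOML.parsefile(provfile) : Dict{String,Any}()
    db[relpath(file, projectdir())] = Dict(
        "source"   => source,                    # script that produced the file
        "date"     => string(now()),             # when it was produced
        "savename" => savename(params),          # human-readable parameter string
        "commit"   => gitdescribe(projectdir()), # assumes the project is a git repo
    )
    open(provfile, "w") do io
        TOML.print(io, db)
    end
end

# Example: record_provenance(plotsdir("myfig.png"), Dict(:a => 1, :b => 2), "scripts/make_fig.jl")
```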
Actually, I don't immediately see why the mapping should map unique identifiers. It seems to me that the format could directly map a file name (with its relative path) to the dictionary. The file name is also unique, after all.
> The advantage of using `.toml` is that it is human readable and can be searched with Ctrl+F.

I think one advantage of using a binary file format is that we can attach Julia types as metadata. So I could theoretically attach the parameter config dict that led to this specific file directly, instead of converting it into a string beforehand.
The search functionality must then of course be implemented in DrWatson.
> Actually, I don't immediately see why the mapping should map unique identifiers. It seems to me that the format could directly map a file name (with its relative path) to the dictionary. The file name is also unique, after all.
I think using the filenames as identifiers is fine. It also suggests that the database file is only used for storing metadata for files and not for storing arbitrary data entries.
> I think one advantage of using a binary file format is that we can attach Julia types as metadata. So I could theoretically attach the parameter config dict that led to this specific file directly, instead of converting it into a string beforehand.

I thought about this as well, but it has a significant downside: file size will explode quickly...? We should compare. In fact, if we use a central `.bson` file as the provenance store, we can do this provenance thingy pretty much immediately. Writing a Julia function that does this isn't a big deal...
> The search functionality must then of course be implemented in DrWatson.

This is really hard to do though, and probably not worth the effort. Searching within dictionaries of arbitrary type is also dubious: if a user gives `"p"`, you have to search all keys, and all values that could potentially include `"p"` or `:p`, and fields of custom types as well. Too complicated, I feel, and it would be a pain in the butt to debug for all possible use cases.
See #152 for a quick and dirty sketch of the idea.
> Writing a Julia function that does this isn't a big deal...
Definitely not too difficult to do. Fitting it into the DrWatson workflow is a little harder.
So one fundamental question: Would this functionality replace `savename`?
> Would this functionality replace savename?

What, never! I use `savename` for figure titles :D
> What, never! I use savename for figure titles :D
Clever :)
So we would promote two approaches that have a similar purpose:

1. The current one, with `savename`, `tagsave`, ..., which only fully works if you can store the metadata alongside your results. (Though `savename` is universal, that's what's so nice about it, so storing the parameter set can always be done, no matter the file type; see the sketch below.)
2. Storing all info about the simulation in the central database file.
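To make the comparison concrete, here is a minimal sketch of the first approach with the existing `savename`/`@tagsave` workflow; the parameter names and paths are made up, and saving `.bson` files assumes a backend such as BSON.jl is loaded.

```julia
using DrWatson
@quickactivate "MyProject"   # assumes an activated DrWatson project called "MyProject"
using BSON                   # saving backend assumed for .bson files

params = Dict(:model => "sir", :beta => 0.3, :N => 1000)

# savename turns the parameter set into a unique, human-readable string,
# no matter what file type the results end up in.
fname = datadir("sims", savename(params, "bson"))

result = Dict("params" => params, "x" => rand(10))

# @tagsave saves the result and automatically attaches the git commit
# (and the script that produced it) to the saved dictionary.
@tagsave(fname, result)
```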
> So we would promote two approaches that have a similar purpose
yeap, precisely.
Before we go off talking about implementation details, I believe we should think clearly about what we actually want from this software and what we need for it to truly add value to the workflow (or to reproducibility). For example: the filename may be unique enough to identify a file, but it can't tell whether the file has been modified or overwritten. In that case hash values would be good.
Also: how can we design this to keep it as extensible as possible?
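For the modification-detection idea, a minimal sketch using Julia's SHA standard library (where and how the hash would be stored is left open; `filehash` is a hypothetical helper):

```julia
using SHA

# Hash the file contents. If a stored hash differs from a freshly computed one,
# the file was modified or overwritten after its metadata was recorded.
filehash(path) = open(io -> bytes2hex(sha256(io)), path)
```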
> 2. Storing all info about the simulation in the central database file
Saving into a central file is dangerous when generating data on multiple workers in parallel. I think collecting the metadata into a database after the fact would be safer.
> I think collecting the metadata into a database after the fact would be safer.

Yeah, but that is why external, complicated, detached database-server software like CaosDB exists. Like you said, we really have to consider what we want to do. I think we all agree that we don't want this to lead to any heavy dependencies...
We should also discuss whether we want to match existing functionalities. Personally, I don't see a reason to try and match the complexity (and capabilities) of those data management tools, as there are already several packages that provide such options. The same thing holds for being able to tell if a file was modified or overwritten: again, it can be managed by such advanced software.
It is also a matter of effort: I definitely can't spend a lot of time on this.
> I believe we should think clearly about what we actually want from this software and what we need for it to truly add value to the workflow (or to reproducibility).
Yes. DrWatson is all about making life easy, so let's start there.
> Saving into a central file is dangerous when generating data on multiple workers in parallel. I think collecting the metadata into a database after the fact would be safer.
Good point. Also for me it's not 100% clear when to save to memory and when to actually write the database file. This can become pretty complex. How are we dealing with large IO operations? Can they occur?
Just an idea: what about not having a single file, but one file for each file in the folder structure (similar to git)? They could be in a folder that's also in the .gitignore. For the user it makes no difference.

> Just an idea: what about not having a single file, but one file for each file in the folder structure (similar to git)? They could be in a folder that's also in the .gitignore. For the user it makes no difference.
How do you make this user-readable? That is the entire point: you need a format that the user can read, in order to see which commit and/or which parameters led to the creation of the file.

> How do you make this user-readable? That is the entire point: you need a format that the user can read, in order to see which commit and/or which parameters led to the creation of the file.
Fair point. So no BSON format then either.
> How do you make this user-readable? That is the entire point: you need a format that the user can read, in order to see which commit and/or which parameters led to the creation of the file.
I'm not sure I agree with this.
We have `collect_results`. I think it would be an option to use `collect_results` to aggregate the metadata into `DataFrame`s. That should make it searchable and human readable. (And there are multiple Julia packages to help with displaying these in Electron windows or the browser.)
In that case it also shouldn't matter whether we put the metadata right next to the real data or into a separate folder tree.
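For reference, this is roughly what that looks like with the current `collect_results`; the `data/sims` directory and the `beta` column are placeholders.

```julia
using DrWatson, DataFrames
@quickactivate "MyProject"   # assumes an activated DrWatson project

# Load every result file under data/sims and aggregate the stored
# (parameter => value) pairs into a DataFrame, one row per file.
df = collect_results(datadir("sims"))

# The table is then searchable like any other DataFrame,
# e.g. (assuming a `beta` column exists):
sub = filter(row -> row.beta == 0.3, df)
```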
Or, as a completely separate alternative, one could find out if something like CaosDB can be made much more accessible. Provide a binary wrapper package where you just call `caosdb()` to start such a container. (Last time I tried it, they had fully functional setups inside Docker.)
So just to wrap this up, these are the options we are currently talking about:
- single file vs. multiple files
- plain text vs. binary
- external database
single file: As @JonasIsensee pointed out, a single file is tricky when it comes to running simulations in parallel.
multiple files: I kind of like this approach, because you don't have to write large chunks of data every time you add new metadata to a file. It also supports parallelism. For me it's also currently the only option to store commit info, as I pointed out in https://github.com/JuliaDynamics/DrWatson.jl/issues/153.
plain text: I get that it's nice to be able to search it with just a text editor. Also, if I have multiple files I can do that by using grep, ripgrep, ag, ... or just cat all the files and pipe the output into a new file for searching.
binary file: Has the advantage of not being restricted to string-representable metadata. Searching must be done from within DrWatson (e.g. through `collect_results`) or any extra piece of software that supports loading that format.
I thought about this overnight and here is my conclusion:
In many points of the documentation I've actively tried to point out that DrWatson is not a data manager. I honestly think this is a good idea, because there are advanced and good data managers out there. What we are talking about here is making DrWatson a data manager. I don't have a problem with that, but we have to be aware that the competition in data management is very high: we would have to work really, really hard to make it as good as other data management software. Of course, we might not care to make it as good. But we would definitely care about making it sufficiently good, and given the existing complexity of data management, this will still be very hard.
There is CaosDB, which is good; people have tried hard to make it good, did research on it, etc. We are also lucky to personally know every member of the dev team. My opinion is to simply integrate CaosDB with Julia (I don't know if that works at the moment; I don't think so) and make it work well with DrWatson. DrWatson will become a dependency of CaosDB, not the other way around.
This way, if someone wants truly advanced data provenance, etc., they can use CaosDB. It is clear to me that a scientific project manager is always necessary, while a database isn't: you need to have a scientific project to get the data.
The point I am trying to make is that we should be careful not to re-invent the wheel. If you read the CaosDB paper, there are already hundreds of ideas on how to do data management.
We can contact Alex and ask for help in the integration as well. @salexan2001
Hi, I think that is a good idea and I will definitely help with the integration.
- After lots of (internal) discussions about the structure of the caosdb-client-libraries (R, Python, Julia, C++), we have come to the conclusion that a C++ lib with high-level bindings to the other languages is the best option. We are currently working on this in this repository https://gitlab.com/caosdb/caosdb-cpplib and will hopefully come up with a first release in the coming months.
- There is also much recent progress on a single-user docker container for CaosDB, so that it is possible to easily run the server on a single machine (which is probably very helpful for numerical simulations). @quazgar is working on this. Is there already a release scheduled?
- Maybe interesting: We have just published our approach for a standardized file system layout here: https://doi.org/10.3390/data5020043 However, this approach is more focused on standardizing file systems in the absence of more specialized standards, so it might be too generic here. The advantage is that the default CaosDB crawler can understand the structure, making the implementation simpler.
- One advantage of CaosDB is that it does not really care about the structure of the files, so all of the above versions (single file, multiple files, ...) would work. You could in principle also mix these approaches or support multiple of these (in case there is no universal "best" option).
Hi all!
> There is also much recent progress on a single-user docker container for CaosDB, so that it is possible to easily run the server on a single machine (which is probably very helpful for numerical simulations). @quazgar is working on this. Is there already a release scheduled?
I am currently working on a Debian package indeed. The package will include a Docker image, sensible default configuration and a daemon script to start up a CaosDB-in-Docker instance in the background. Our rationale: We want to be as independent as possible from specific host system settings. Of course everyone is free to build a leaner package, if they find the time.
As for a tentative release schedule: I hope that we can name a date next week after checking what else is on our agenda. And I must say I am impressed by your plans and looking forward to seeing CaosDB used in DrWatson :smiley:
Sorry for coming late to the party. I'd like to show you something that is implemented in a framework I use sometimes; this is basically the outline of what we are currently aiming for in KM3NeT: https://cta-observatory.github.io/ctapipe/examples/provenance.html
The example above shows the "manual usage". I think such provenance tracking could be easily hooked into existing DrWatson functions.
@Datseris @tamasgal @JonasIsensee Out of curiosity, and maybe a bit of boredom, I started coding a simple metadata and parallel simulation extension for DrWatson (https://github.com/sebastianpech/DrWatsonSim.jl).
I explain the two main use cases in the README, though I kinda don't like the simulation syntax yet, I will give it a try and will likely adapt it. I currently see the project more as a way of checking if such a functionality might improve my workflow.
About the implementation. Initially I simply wanted to store BSON files with metadata in a `.metadata` folder. However, as I was aiming to support parallel running jobs, I needed some locking mechanism to generate unique ids and also update the index without race conditions (the index is used to drastically improve the querying speed). The locking works surprisingly well. Only one detached process is allowed to update the index and get a new id, and multiple detached processes are allowed to read, unless one process is writing. Even in the worst cases I could currently produce, I don't have any deadlocks or race conditions.
I decided to use the method with incrementing unique ids, because I use those ids in the second scenario for keeping track of simulation runs (e.g. every new run generates a new folder based on the id). Nevertheless, one id is always related to one file only and vice versa. This is just a design decision; theoretically the implementation supports file-independent metadata storage.
In general, the package is built around DrWatson (or a Julia project at least). For example, I only store paths relative to the project directory, so the metadata folder can be used on other devices as well.
Let me know what you think. I'll keep testing the new workflow, maybe it turns out it's not such a necessary feature after all.
> and maybe a bit of boredom
damn, I have an entire pipeline of projects for JuliaDynamics if you are interested! :D
Jokes aside, thanks a lot for sharing, this seems promising. I will read it in detail and will discuss further on our next meeting! @JonasIsensee , @tamasgal if you guys have some spare time please have a look as well and we can all talk about it! :)
Boredom in a topic-related sense. So quite likely procrastination, actually :)
Hey @SebastianM-C , this is a really neat idea! I'm definitely going to try this out at some point.
Some questions: Could this be integrated with a cluster queue? How would the queue jobs connect to the metadata guard in that case?
What would happen if the parent process, a.k.a. the one guarding the metadata, dies in the meantime? I guess there should probably be a fallback to save the metadata file in the same folder as the data.
> What would happen if the parent process, a.k.a. the one guarding the metadata, dies in the meantime?

Once the metadata file is created, reading and writing is no problem. I assume that if you have parallel processes with IO that read and write the same file, i.e. access the same metadata, you have taken some precautions yourself to not have race conditions.
> Could this be integrated with a cluster queue? How would the queue jobs connect to the metadata guard in that case?

I've been thinking about this as well, and in the current implementation it's not possible without limitations. If you can ensure that all jobs access the same folder, it works though.
This makes me wonder whether it might be a better decision to store the metadata with the actual file, e.g. for somefile there is a .somefile.metadata in the same folder. The interface could stay the same, and for the simulation part one would need an alternative method for generating the contiguous ids.
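That sidecar naming scheme could be as simple as the following hypothetical helper (not part of DrWatsonSim.jl; the name is made up):

```julia
# For data/sims/run1.bson this returns data/sims/.run1.bson.metadata
metadatapath(file::AbstractString) =
    joinpath(dirname(file), "." * basename(file) * ".metadata")
```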
@JonasIsensee @Datseris I changed the method for storing metadata. I thought about the cluster computations and the problem of merging the metadata. With incrementing ids this is pretty tough, and one needs an extra step for importing data. What I'm doing now is basically using the `.metadata` folder as my index. I generate a hash that's unique for every path in the project directory and use that as the name for the metadata file. This way lookup for a known path is very simple and an O(1) operation. Searching for a field value is still O(n).
The huge advantage of this method is that one can now merge two metadata folders just by copying the content over, with no need to take care of updating any ids.
To get a unique folder id for a simulation run, I now just consider the one folder that holds all the simulation folders and pick the smallest positive integer that has not been taken by another simulation run.
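A minimal sketch of that path-hashing lookup, assuming Julia's SHA standard library and DrWatson's `projectdir`; the `.metadata` folder name follows the comment above, while the helper name and file extension are made up:

```julia
using SHA, DrWatson

# Name the metadata file after a hash of the project-relative path, so that
# finding the metadata for a known file is a single hash computation.
function metadatafile(file)
    key = bytes2hex(sha256(relpath(abspath(file), projectdir())))
    return projectdir(".metadata", key * ".bson")
end
```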
@sebastianpech I've just checked out your code. I think it seems great. I would still like you to explain the function `@run` live, I had trouble understanding it fully. I hope this functionality can work at the even more fundamental level of having only a single "parameter set" so that `dicts` is not actually necessary.
I think I understood that all processing is project-directory-relative. However, the end of the readme states: If p is a relative path, make it absolute using abspath, otherwise leave p as it is. This confused me a bit and would be another point to clarify.
Interesting indeed. I have to look at it closer, although my workflow is quite different. It would probably be cool to demo this in the next call?
@tamasgal

> Interesting indeed. I have to look at it closer, although my workflow is quite different. It would probably be cool to demo this in the next call?
@Datseris

> I would still like you to explain the function `@run` live, I had trouble understanding it fully. I hope this functionality can work at the even more fundamental level of having only a single "parameter set" so that `dicts` is not actually necessary.

Yes, I can show you some stuff tomorrow. `@run` works for a single parameter set also, but I'm not fully satisfied with its flexibility yet; maybe you guys come up with a better idea.

> I think I understood that all processing is project-directory-relative. However, the end of the readme states: If p is a relative path, make it absolute using abspath, otherwise leave p as it is. This confused me a bit and would be another point to clarify.

My idea here was that considering paths relative to the cwd allows a workflow where you can quickly look up metadata for a file in the current folder. So if I'm e.g. in the plots directory, I just need to spawn Julia and do:
```julia
using DrWatson
@quickactivate
using DrWatsonSim
Metadata("myplot.png") # Also autocompletion of the path works here
```
However, to look up the metadata I need the hash for the projectdir-relative path. So I first make the path absolute and then relative again, but now relative to the projectdir.
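In code, that normalization step is essentially the following sketch, where `p` is any user-supplied path and `normalize_key` is a made-up name:

```julia
using DrWatson

# Make a user-supplied path canonical: absolute first (if needed),
# then relative to the project directory, which is what gets hashed.
normalize_key(p) = relpath(isabspath(p) ? p : abspath(p), projectdir())
```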
From the above discussion it's still not entirely clear to me what the desired outcome is for this. Is version control at all a part of what you're going for here? Are you simply trying to create a way of intuitively accessing the contents of DrWatson.jl's "data" folder or is it all of the folders that you could end up writing things to (plots, data, notebooks)?