snakebids

Automatically create pybidsdb

Open pvandyken opened this issue 2 years ago • 0 comments

In recent versions of snakemake, the app will start sub-instances of snakemake under many different circumstances, including cluster submission, use of the run directive, use of shadow, and probably others. Every time one of these sub-instances is started, generate_inputs is called and, if no database exists, the dataset is indexed again. For large datasets, this is a huge problem: even if indexing takes just a few minutes, that cost is multiplied over the hundreds of times generate_inputs could potentially be called. Thus, for large datasets and complex workflows, a pybids database is essentially mandatory.

We could just expect users to make their own pybids database, but for downstream users, this is inconvenient. I thus propose a mechanism for automatically making a database when none is provided.

I would probably conceive of this as a snakebids "extra" (see #255), so it's something app developers would need to opt into. Basically, if no database is provided, then every time run.py is run, it will use pybids to index the dataset and put the database into the .snakebids folder (which we'd need to create) in the output dir. This database will be passed on to snakemake and generate_inputs().

Importantly, every time run.py is called, the database will be regenerated, regardless of whether it already exists. This preserves the existing expectation that the dataset is re-indexed every time the app is run.

Thus, if users want a persistent database across runs, they'll still need to make their own, but within a single run, we can take advantage of the speed boosts of a database.
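As a rough sketch of the regenerate-on-every-run behaviour described above (not an existing snakebids API), something like the following could work. The function name `refresh_pybidsdb`, the `pybidsdb` subfolder name, and the `index_fn` callable are all hypothetical; `index_fn` stands in for whatever actually does the indexing, so the sketch does not depend on pybids being installed.

```python
import shutil
from pathlib import Path
from typing import Callable


def refresh_pybidsdb(output_dir: Path, index_fn: Callable[[Path], None]) -> Path:
    """Rebuild the database under <output_dir>/.snakebids on every call.

    `index_fn` is a hypothetical hook that receives the database path and is
    expected to create the database there. With pybids it might look like:
        lambda db: BIDSLayout(bids_dir, database_path=db, reset_database=True)
    """
    db_dir = Path(output_dir) / ".snakebids" / "pybidsdb"  # folder name is an assumption

    # Always discard any database left over from a previous run, so each
    # invocation of run.py starts from a fresh index.
    if db_dir.exists():
        shutil.rmtree(db_dir)

    # Make sure the .snakebids folder exists before indexing into it.
    db_dir.parent.mkdir(parents=True, exist_ok=True)

    index_fn(db_dir)
    return db_dir
```

The returned path is what would then be handed on to snakemake and generate_inputs(), so that sub-instances find an existing database instead of re-indexing.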

One concern is how exactly to make the db. On NAS, this can be extremely slow. On AllianceCan, the /tmp dir is mounted in memory, and I would presume in general that /tmp dirs are on more readily accessible storage locations. So it might be enough to always make the database on /tmp then copy it to .snakebids/.
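The build-on-/tmp-then-copy idea could be sketched roughly as below. This is only an illustration under assumptions: `build_db_via_tmp` and `index_fn` are hypothetical names, and the temporary location is whatever Python's `tempfile` module picks (it honours `TMPDIR` and usually falls back to /tmp), which may or may not be fast storage on a given system.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def build_db_via_tmp(dest: Path, index_fn: Callable[[Path], None]) -> Path:
    """Index into a temporary directory, then copy the result to `dest`.

    `index_fn` is a hypothetical hook that creates the database at the
    path it is given (e.g. by wrapping a pybids indexing call).
    """
    dest = Path(dest)
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp) / "pybidsdb"

        # The slow, write-heavy indexing happens on temporary storage.
        index_fn(staging)

        # Replace any previous database at the destination, then copy the
        # finished database over in one pass.
        if dest.exists():
            shutil.rmtree(dest)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(staging, dest)
    # The staging directory is removed when the `with` block exits.
    return dest
```

One trade-off to keep in mind: the database has to fit on the temporary storage, which matters if /tmp is mounted in memory.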

pvandyken, Feb 22 '23 20:02