sortmerna

Ref db indexes and tmp files should not be stored in HOME directory by default

Open ppericard opened this issue 4 years ago • 4 comments

By default, SMR should not write to any location other than the current working directory or the directory given by the --aligned or --workdir parameter. On most bioinformatics HPC systems, users must work and write data in dedicated project directories, designed for fast access and large amounts of storage. In these configurations, HOME directories usually have little storage, and running jobs in them is forbidden because they are not designed for analysis. The same applies to bioinformaticians working on a personal computer with a drive dedicated to analysis and storage, and a HOME directory on a different partition or drive with limited capacity. In both situations, a program that stores data in the HOME directory by default is a real problem and can lead to crashes or interrupted jobs.

In another issue (#233), I described how index files could be stored in specific locations and why users need complete control over their naming and storage location.

As for tmp files (the kvdb dir, for example), the standard and expected way of dealing with them is to write them in the output directory (the CWD by default, or the directory set by the output parameter, which is --aligned for SMR).

Moreover, these tmp files should never prevent users from re-running a job. This has already been discussed in #212, but I'm not sure the proposed solution of automatically removing these files at the start of a new run would work either.

By definition, tmp files should be removed automatically at the end of a run that generated them.

The standard way to deal with tmp files (at least in bioinformatics) is to create them in the output directory using a unique name for each run. The name could be composed of the output basename and a unique id (a pid, a random number, etc.), and the tmp files could be either plain files in this directory or a sub-directory. Something like:

sortmerna --ref ref.fa --reads reads.fq --aligned path/to/output/basename

and files in path/to/output:

basename.fq
basename.sam
basename.163596781325.kvdb/
basename.163596781325.tmp_0

SMR needs to clean up these tmp files when a run completes. If a run crashes, the unique names mean a new job can still be started without being blocked by leftover tmp files.
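
A minimal sketch of the naming and clean-up scheme described above, written in Python purely for illustration. It is not SMR's code; `run_tmp_dir` is a hypothetical helper, and the `.kvdb` suffix only mirrors the example listing.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def run_tmp_dir(aligned_prefix, suffix=".kvdb"):
    """Create a uniquely named tmp directory beside the output basename.

    aligned_prefix plays the role of the value passed to --aligned,
    e.g. "path/to/output/basename". (Hypothetical helper, not part of SMR.)
    """
    out_dir, basename = os.path.split(aligned_prefix)
    out_dir = out_dir or os.getcwd()  # no directory given: fall back to the CWD
    os.makedirs(out_dir, exist_ok=True)
    # mkdtemp picks an unused random name of the form basename.<random>.kvdb
    tmp_dir = tempfile.mkdtemp(prefix=basename + ".", suffix=suffix, dir=out_dir)
    try:
        yield tmp_dir
    finally:
        # Runs both on normal completion and when the body raises an exception
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

If the process is killed outright, the `finally` block never runs and the directory is left behind, but because each run gets a fresh name, the leftover cannot block the next job.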

ppericard Apr 20 '20 08:04

I just realized that SMR output files are also stored in the HOME directory by default. This is even worse. Everything should be written to the CWD by default, or to the --workdir path set manually by the user.

ppericard Apr 20 '20 10:04

I suggest that the default --workdir be $(pwd)/sortmerna_wkdir_xxxxxxxxxxxx/, where xxxxxxxxxxxx is a unique string that sorts in the order of successive runs (for example, the date and precise time plus the pid, in case several jobs start in the same second).

Then the default structure can be:

WORKDIR/
    kvdb/
    out/
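
One way such a sortable name could be built, sketched in Python for illustration only. The function names and the exact timestamp layout are assumptions, not something SMR implements; the fixed-width timestamp is what makes a plain lexicographic sort match chronological order.

```python
import os
from datetime import datetime


def default_workdir(parent=None, now=None, pid=None):
    """Return <parent>/sortmerna_wkdir_<timestamp>_<pid> (hypothetical naming)."""
    parent = parent or os.getcwd()
    now = now or datetime.now()
    pid = os.getpid() if pid is None else pid
    # Fixed-width date + time with microseconds, so names sort chronologically
    stamp = now.strftime("%Y%m%dT%H%M%S%f")
    # The pid separates jobs that start at the very same instant
    return os.path.join(parent, f"sortmerna_wkdir_{stamp}_{pid:08d}")


def create_workdir(parent=None):
    """Create the WORKDIR/kvdb and WORKDIR/out layout and return WORKDIR."""
    workdir = default_workdir(parent)
    # No exist_ok: fail loudly rather than silently reuse another run's directory
    os.makedirs(workdir)
    os.mkdir(os.path.join(workdir, "kvdb"))
    os.mkdir(os.path.join(workdir, "out"))
    return workdir
```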

ppericard Apr 20 '20 10:04

I had a similar issue when running multiple SMR runs in parallel with Snakemake. It is important to specify a separate working directory for each run; otherwise, runs that share the default working directory will fail.

    sortmerna -a {threads} -ref {input.silva_fasta} -reads {input.fastq} \
     -aligned {params.aligned} -other {params.depleted} -workdir {params.wdir} -fastx -v
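
Outside Snakemake, the same idea can be sketched in plain Python: derive every path, including the working directory, from the sample name so that no two runs share one. The sample names, file paths and the `sortmerna_command` helper are hypothetical; the flags are simply the ones used in the command above.

```python
import os
import shlex


def sortmerna_command(sample, fastq, ref_fasta, out_root, threads=4):
    """Assemble one command line whose workdir is private to `sample`."""
    run_dir = os.path.join(out_root, sample)
    return [
        "sortmerna",
        "-a", str(threads),
        "-ref", ref_fasta,
        "-reads", fastq,
        "-aligned", os.path.join(run_dir, "aligned"),
        "-other", os.path.join(run_dir, "depleted"),
        "-workdir", os.path.join(run_dir, "wdir"),  # one workdir per sample
        "-fastx", "-v",
    ]


# Hypothetical inputs, for illustration only
samples = {"sampleA": "reads/sampleA.fq", "sampleB": "reads/sampleB.fq"}
commands = [
    sortmerna_command(name, fq, "silva.fasta", "results")
    for name, fq in samples.items()
]
for cmd in commands:
    print(shlex.join(cmd))  # print only; nothing is executed here
```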

mweberr Apr 13 '22 13:04

#372 is another example of how using the home directory by default to store tmp files can go wrong. When SortMeRNA is integrated into a workflow manager such as Nextflow, the default location leads to cache problems and confusion between re-runs of the same SortMeRNA step with different inputs.

ppericard May 09 '23 07:05