reg-gen
Allow all input files to be specified on the command line
Hi, this is a nice piece of software. I was wondering if it would be possible to let the user control where the input data comes from, instead of just using the rgtdata directory/config file/environment variable. I already have several copies of hg19 scattered across my machine and it would be great if I didn't have to keep an (uncompressed) copy of the reference genome for every bioinformatics tool that I use.
Thanks again!
Hi, thanks for the suggestion. It would have very low priority, and it's also a bit tricky: rgtdata guarantees that the genome, annotations, chromosome sizes and so on are all uniform for the chosen organism. Passing the genome file as an option but forgetting to set --organism accordingly would cause inconsistencies with some (all?) of the RGT tools.
A couple of easy solutions, for users with your requirements, are the following:
- Instead of downloading the genomic data, add a symlink to the genome, so you don't have to have duplicate copies. RGT won't know the difference. We could even add an option to setupGenomicData to let you "import"/link a genome file, but it's really trivial to use ln in the terminal yourself.
- If data.config.user can process absolute paths (I will check if that's the case; if not, I think it's an easy fix), then the problem is already solved. You simply make a genome section in data.config.user with something like:
[hg19]
genome: /mnt/data/genomes/hg19.fasta
and then you can remove the one in rgtdata/hg19.
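Both workarounds above can be sketched in a few lines of Python. This is only an illustration, not RGT's actual code: the helper names `resolve_genome_path` and `link_genome` are hypothetical, the link name `genome_hg19.fa` is an assumed file name, and the sketch assumes the config value is combined with the data directory via `os.path.join`. If that assumption holds, absolute paths already work, because `os.path.join` discards the earlier components when a later one is absolute.

```python
import configparser
import os


def resolve_genome_path(config_text, organism, data_dir):
    """Return the genome path for `organism` (hypothetical helper, not RGT code).

    Assumes the configured value is joined onto `data_dir`. os.path.join
    drops `data_dir` when the value is absolute, so an absolute path in
    the config is returned unchanged.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    value = parser.get(organism, "genome")
    return os.path.join(data_dir, value)


def link_genome(source_fasta, data_dir, organism, link_name):
    """Symlink an existing FASTA into <data_dir>/<organism>/<link_name>.

    `link_name` is whatever file name the tool expects; the caller must
    supply it. Raises FileExistsError if the link path already exists.
    """
    organism_dir = os.path.join(data_dir, organism)
    os.makedirs(organism_dir, exist_ok=True)
    link_path = os.path.join(organism_dir, link_name)
    os.symlink(os.path.abspath(source_fasta), link_path)
    return link_path
```

The symlink variant is the equivalent of running ln -s by hand; it avoids the duplicate copy but still requires a writable rgtdata directory.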
Thanks for the quick response! I appreciate that what you have now works for you and it's considerate of you to try to protect the user from making mistakes, but if I could make a couple more arguments for my idea:
- Having arguments passed at the command line (vs assuming a config file/default directory) is an easy way of self-documenting. It's self-documenting for you the programmer, as it lets the user know what pieces of information your program uses. It's also self-documenting for the user because all the inputs go in to the shell's command history, which makes the results easier to reproduce.
- It's more "polite". When I do pip install RGT I am not asked if I want a rgtdata/ directory added to my machine, and none of the argparse usage statement mentions this directory existing (or not existing). There are lots of HPC environments where user-level directory storage is limited.
Having arguments passed at the command line (vs assuming a config file/default directory) is an easy way of self-documenting. It's self-documenting for you the programmer, as it lets the user know what pieces of information your program uses. It's also self-documenting for the user because all the inputs go in to the shell's command history, which makes the results easier to reproduce.
I tend to agree, and if I were to write a tool ex novo, I would probably use command line arguments rather than a central configuration file for such big (and versioned) data. But it's a difficult refactoring to justify for a toolbox that went in the opposite direction a long time ago :)
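A middle ground between the two positions could look like the sketch below: the central config stays the default, and an optional command-line flag overrides it. The `--genome` flag, the `rgt-example` program name and the `select_genome` helper are hypothetical and not part of RGT; only `--organism` is mentioned in this thread. Keeping `--organism` required is one way to reduce the inconsistency risk raised earlier, since the annotations and chromosome sizes are still chosen explicitly.

```python
import argparse


def build_parser():
    """Build an example parser; --genome is a hypothetical override flag."""
    parser = argparse.ArgumentParser(prog="rgt-example")
    parser.add_argument("--organism", required=True,
                        help="organism name, e.g. hg19")
    parser.add_argument("--genome", default=None,
                        help="optional FASTA path overriding the configured genome")
    return parser


def select_genome(args, configured):
    """Pick the genome path: command-line value first, then the config.

    `configured` maps organism names to genome paths, standing in for
    whatever the central config file provides.
    """
    if args.genome is not None:
        return args.genome
    if args.organism not in configured:
        raise SystemExit(f"no genome configured for organism {args.organism!r}")
    return configured[args.organism]
```

With this shape, every input that differs from the defaults shows up in the shell history, while existing setups keep working without any extra flags.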
It's more "polite". When I do pip install RGT I am not asked if I want a rgtdata/ directory added to my machine, and none of the argparse usage statement mentions this directory existing (or not existing). There are lots of HPC environments where user-level directory storage is limited.
That is more akin to a bug, and can be solved by improving the existing installation process. We'll look into making this more polite (or at least, clear).
Just bumping this to add my voice in agreement with @CreRecombinase. Writing files to a hard-coded location without user permission is bad form and makes RGT difficult to use in HPC environments and in Docker/Singularity containers.
At a minimum I'd like to see the ability to set a different location for data.config; as it is now, RGT is unnecessarily difficult to use in HPC + container environments.
https://github.com/CostaLab/reg-gen/blob/66f5fbb33f199424d09272c628287955e383f7bd/rgt/Util.py#L47
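For illustration, a configurable location could be resolved with a simple precedence order: an explicit value first, then an environment variable, then the home-directory default. This is a sketch under assumptions, not the code at the link above: the function names are hypothetical, and the variable name `RGTDATA` is assumed here (the thread only says that an environment variable exists).

```python
import os


def resolve_data_dir(cli_value=None, environ=None):
    """Return the data directory: explicit value > env variable > ~/rgtdata.

    `cli_value` stands in for a hypothetical command-line option, and
    RGTDATA is an assumed environment variable name.
    """
    environ = os.environ if environ is None else environ
    if cli_value:
        return os.path.abspath(os.path.expanduser(cli_value))
    env_value = environ.get("RGTDATA")
    if env_value:
        return os.path.abspath(os.path.expanduser(env_value))
    home = environ.get("HOME", os.path.expanduser("~"))
    return os.path.join(home, "rgtdata")


def config_path(data_dir):
    """Return the expected location of data.config inside `data_dir`."""
    return os.path.join(data_dir, "data.config")
```

In a container, this would let the image point the tool at a directory baked into the image or bind-mounted at run time, instead of the invoking user's home directory.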
Dear @kelly-sovacool,
I want to create a Docker/Singularity container of RGT. Unfortunately I got this error regarding the data.config file:
FileNotFoundError: [Errno 2] No such file or directory: '/users/fernando.becerril/rgtdata/data.config'
Did you already make a container of this tool? If so, could you please suggest a way to solve the issue? That would be much appreciated.
@SalvadorGJ I was not able to create a container, I believe it is not possible to do so until the maintainers fix this issue.