reg-gen

Allow all input files to be specified on the command line

Open CreRecombinase opened this issue 5 years ago • 7 comments

Hi, this is a nice piece of software. I was wondering if it would be possible to let the user control where the input data comes from, instead of relying only on the rgtdata directory/config file/environment variable. I already have several copies of hg19 scattered across my machine, and it would be great if I didn't have to keep an (uncompressed) copy of the reference genome for every bioinformatics tool that I use.

Thanks again!

CreRecombinase • Jul 17 '20 19:07

Hi, thanks for the suggestion. It would have very low priority, and it's also a bit tricky: rgtdata guarantees that the genome, annotations, chromosome sizes and so on are all consistent for the chosen organism. Passing the genome file as an option but forgetting to set --organism accordingly would cause inconsistencies with some (all?) of the RGT tools.

A couple of easy solutions, for users with your requirements, are the following:

  • instead of downloading the genomic data, add a symlink to the genome, so you don't have to have duplicate copies. RGT won't know the difference. We could even add an option to setupGenomicData to let you "import"/link a genome file, but it's really trivial to use ln in the terminal yourself.

  • if data.config.user can process absolute paths (I will check if that's the case; if not, I think it's an easy fix), then the problem is already solved. You simply make a genome section in data.config.user with something like:

[hg19]
genome: /mnt/data/genomes/hg19.fasta

and then you can remove the one in rgtdata/hg19.
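
The two workarounds above can be sketched roughly as follows. This is an illustration only, not RGT's code: the function names, the link name `genome.fa`, and the rule "absolute paths are used as-is, relative paths are joined to the data directory" are all assumptions, and whether data.config.user actually accepts absolute paths is exactly what still needs checking.

```python
import configparser
import os
import tempfile


def link_genome(existing_fasta, data_dir, organism, link_name="genome.fa"):
    """Symlink an existing genome into <data_dir>/<organism>/ instead of copying it.

    link_name is a hypothetical file name; the real one is whatever the
    tool's configuration expects for that organism.
    """
    target_dir = os.path.join(data_dir, organism)
    os.makedirs(target_dir, exist_ok=True)
    link_path = os.path.join(target_dir, link_name)
    # lexists() is also true for a dangling link, so we never overwrite one.
    if not os.path.lexists(link_path):
        os.symlink(os.path.abspath(existing_fasta), link_path)
    return link_path


def resolve_genome(config_text, organism, data_dir):
    """Return the genome path for an organism from INI-style config text.

    Assumed behaviour: an absolute path is returned unchanged, a relative
    path is interpreted relative to data_dir.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    value = parser.get(organism, "genome")
    if os.path.isabs(value):
        return value
    return os.path.join(data_dir, value)
```

With a section such as the `[hg19]` one shown above, `resolve_genome` would hand back `/mnt/data/genomes/hg19.fasta` directly, so no copy under rgtdata would be needed.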

fabio-t • Jul 17 '20 19:07

Thanks for the quick response! I appreciate that what you have now works for you and it's considerate of you to try to protect the user from making mistakes, but if I could make a couple more arguments for my idea:

  • Having arguments passed at the command line (vs assuming a config file/default directory) is an easy way of self-documenting. It's self-documenting for you the programmer, as it lets the user know what pieces of information your program uses. It's also self-documenting for the user because all the inputs go into the shell's command history, which makes the results easier to reproduce.
  • It's more "polite". When I do pip install RGT I am not asked if I want an rgtdata/ directory added to my machine, and none of the argparse usage statements mention this directory existing (or not existing). There are lots of HPC environments where user-level directory storage is limited.
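
To make the first point concrete, here is a minimal sketch of explicit inputs with a fallback to the central configuration. It assumes a hypothetical `--genome` flag and a hypothetical program name; only `--organism` is mentioned in this thread, and none of this is an existing RGT interface.

```python
import argparse


def build_parser():
    # "rgt-tool-sketch" is a made-up program name for illustration.
    parser = argparse.ArgumentParser(prog="rgt-tool-sketch")
    parser.add_argument("--organism", required=True,
                        help="organism label, e.g. hg19")
    # Hypothetical flag: when omitted, the tool would fall back to rgtdata.
    parser.add_argument("--genome", default=None,
                        help="path to a genome FASTA (default: taken from data.config)")
    return parser


def describe_inputs(argv):
    """Report which genome source a given command line would use."""
    args = build_parser().parse_args(argv)
    if args.genome:
        source = args.genome
    else:
        source = f"<rgtdata>/{args.organism} (from data.config)"
    return {"organism": args.organism, "genome": source}
```

Because the path appears in the command itself, it ends up in the shell history, and the `--help` output lists every input the program reads.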

CreRecombinase • Jul 17 '20 20:07

> Having arguments passed at the command line (vs assuming a config file/default directory) is an easy way of self-documenting. It's self-documenting for you the programmer, as it lets the user know what pieces of information your program uses. It's also self-documenting for the user because all the inputs go into the shell's command history, which makes the results easier to reproduce.

I tend to agree, and if I were to write a tool ex novo, I would probably use command line arguments rather than a central configuration file for such big (and versioned) data. But it's a difficult refactoring to justify for a toolbox that went in the opposite direction a long time ago :)

> It's more "polite". When I do pip install RGT I am not asked if I want an rgtdata/ directory added to my machine, and none of the argparse usage statements mention this directory existing (or not existing). There are lots of HPC environments where user-level directory storage is limited.

That is more akin to a bug, and can be solved by improving the existing installation process. We'll look into making this more polite (or at least, clear).

fabio-t • Jul 17 '20 20:07

Just bumping this to add my voice in agreement with @CreRecombinase. Writing files to a hard-coded location without user permission is bad form and makes RGT difficult to use in HPC environments and in Docker/Singularity containers.

kelly-sovacool • Mar 07 '24 15:03

At a minimum, I'd like to see the ability to set a different location for data.config. As it is now, RGT is unnecessarily difficult to use in HPC + container environments.

https://github.com/CostaLab/reg-gen/blob/66f5fbb33f199424d09272c628287955e383f7bd/rgt/Util.py#L47
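
As a rough sketch of what a configurable location could look like (not what the linked line does): an explicit path wins, then an environment variable, then the home-directory default. The variable name `RGT_DATA_CONFIG` and the function name are hypothetical, chosen only for this example.

```python
import os


def find_data_config(environ=None, cli_path=None):
    """Pick the data.config location: explicit path, then env var, then ~/rgtdata."""
    environ = os.environ if environ is None else environ
    if cli_path:
        return os.path.abspath(os.path.expanduser(cli_path))
    # Hypothetical variable name, not one RGT is known to read.
    env_path = environ.get("RGT_DATA_CONFIG")
    if env_path:
        return os.path.abspath(os.path.expanduser(env_path))
    # Fallback mirrors the per-user default described in this thread.
    return os.path.join(os.path.expanduser("~"), "rgtdata", "data.config")
```

In a container, the variable could then point at a read-only path baked into the image, so nothing would need to be written under the user's home directory.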

kelly-sovacool • Mar 07 '24 15:03

Dear @kelly-sovacool,

I want to create a Docker/Singularity container of RGT. Unfortunately I got this error regarding the data.config file:

FileNotFoundError: [Errno 2] No such file or directory: '/users/fernando.becerril/rgtdata/data.config'

Did you already make a container of this tool? If so, could you please suggest a way to solve the issue? That would be much appreciated.

SalvadorGJ • Jun 10 '24 08:06

@SalvadorGJ I was not able to create a container. I believe it is not possible to do so until the maintainers fix this issue.

kelly-sovacool • Jun 10 '24 15:06