snpeff.config: Split into config, database of genomes/sources and codon table

Open J-Moravec opened this issue 1 year ago • 0 comments

Story

I am going through the tutorial on building a custom database, since I am working with a non-standard genome, and found multiple issues with snpeff.config. snpeff is installed locally and run through a bash wrapper in $HOME/bin. After reading the documentation on how snpeff handles its config, there seem to be several problems.

The issue: a 10 MB config file with 381k lines is not the way to go. Something like that shouldn't be expected to be edited by hand (as when building your own database), nor should the user have to point snpeff to it explicitly. That is a database.

Going through all 381 000 lines, I see that snpeff.config is just doing too much.

Expected Behaviour

  1. There should be no need to point to snpeff.config when running from a different folder.
  2. snpeff.config should be small enough to edit by hand: no more than a few hundred lines.
  3. Ideally, local configs should add to, or optionally entirely replace, the global config (for better parametrization and testing).
  4. There should be no test.genome or test.Case.genome in a production config. Those belong in tests only.
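
Point 3 can be sketched with the standard `java.util.Properties` defaults mechanism. This is only an illustration under assumptions: `LayeredConfig` is a hypothetical helper, not existing snpeff code, and the keys and values are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper (not part of snpeff): layers a local config over a global one.
final class LayeredConfig {

    /**
     * Parses "key = value" text. If defaults is non-null, keys missing from
     * the text fall back to defaults; if it is null, the text stands alone.
     */
    static Properties load(String text, Properties defaults) {
        Properties props = (defaults == null) ? new Properties() : new Properties(defaults);
        try (StringReader reader = new StringReader(text)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Illustrative keys and values only.
        String globalText = "data.dir = ./data/\nexample.threshold = 0.95\n";
        String localText = "data.dir = /scratch/snpeff/data/\n";

        Properties global = load(globalText, null);
        Properties added = load(localText, global);   // local adds to / overrides global
        Properties replaced = load(localText, null);  // local entirely replaces global

        System.out.println("added:    data.dir = " + added.getProperty("data.dir"));
        System.out.println("added:    example.threshold = " + added.getProperty("example.threshold"));
        System.out.println("replaced: example.threshold = " + replaced.getProperty("example.threshold"));
    }
}
```

Whether the local file adds to or replaces the global one is then a single choice at load time, which is also convenient for tests that want a minimal config.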

Solutions

  1. It should be possible to get the path where snpeff was installed and read the config from there. I haven't done this in Java, but I have in several other programming languages.

  2. snpeff.config needs to be split into 3 different files: the config itself; a database of genomes, mappings and names; and the codon table. There may be something else as well, since I was only able to look at the 50 000 lines of local cache from GenBank genomes, but tail suggests that everything past the first 200 lines is the same kind of entry.

  3. Improve snpeff's argument parsing. An incorrect option currently seems to throw a stack trace instead of failing gracefully and telling me which option was wrong. After looking at the code and experimenting, this might be because snpeff first tries to read the config and only then looks at the passed arguments.
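
For solution 1, one common approach in Java is to ask the class loader where a class was loaded from. The sketch below is not taken from the snpeff source: `ConfigLocator` and its methods are hypothetical names, and the file name is simply the one used in this issue. The lookup order (explicit path first, then the listed directories) is also an assumption.

```java
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.CodeSource;
import java.util.Optional;

// Hypothetical helper (not part of snpeff): finds a config without the user pointing to it.
final class ConfigLocator {

    // File name as written in this issue; treat it as an assumption.
    static final String CONFIG_NAME = "snpeff.config";

    /** Directory holding the jar (or class directory) that anchor was loaded from. */
    static Path installDir(Class<?> anchor) {
        CodeSource source = anchor.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            throw new IllegalStateException("No code source for " + anchor.getName());
        }
        try {
            Path location = Paths.get(source.getLocation().toURI());
            // A jar yields a file path, so step up to its directory.
            return Files.isDirectory(location) ? location : location.getParent();
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    /** An explicit path wins; otherwise the first search directory containing CONFIG_NAME. */
    static Optional<Path> resolve(Path explicit, Path... searchDirs) {
        if (explicit != null) {
            return Optional.of(explicit);
        }
        for (Path dir : searchDirs) {
            Path candidate = dir.resolve(CONFIG_NAME);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Path workingDir = Paths.get("").toAbsolutePath();
        Path install = installDir(ConfigLocator.class);
        // Working directory first, then the install directory.
        System.out.println(resolve(null, workingDir, install)
                .map(Path::toString)
                .orElse("no " + CONFIG_NAME + " found"));
    }
}
```

Returning an empty `Optional` rather than throwing leaves it to the caller to print a short error message, which fits the request in solution 3 for failing gracefully.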

Possible issues

A separate database would add another layer of complexity. But after that, a lot of code would be substantially simpler and cleaner, since snpeff.config.parser, snpeff.database.parser etc. (hypothetical names) would each deal with a single problem. After these changes, testing should also be much easier, as each case can be tested separately with minimal test data. And none of this has to end up in production.
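
To make the proposed split concrete, here is a rough sketch of sorting existing entries into the three groups by key. `ConfigSplitter` is a hypothetical name, and the key patterns (a `codon.` prefix, and `.genome`, `.reference` and `.codonTable` suffixes) are assumptions about how entries are named, so they would need checking against the real file.

```java
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of snpeff): sorts config entries into three groups.
final class ConfigSplitter {

    enum Section { CORE, GENOMES, CODON_TABLES }

    /** Classifies a key by its name; the patterns are assumptions, not snpeff rules. */
    static Section classify(String key) {
        if (key.startsWith("codon.")) {
            return Section.CODON_TABLES;
        }
        if (key.endsWith(".genome") || key.endsWith(".reference") || key.endsWith(".codonTable")) {
            return Section.GENOMES;
        }
        return Section.CORE;
    }

    /** Groups entries by section, keeping their original order within each group. */
    static Map<Section, Map<String, String>> split(Map<String, String> entries) {
        Map<Section, Map<String, String>> out = new EnumMap<>(Section.class);
        for (Section s : Section.values()) {
            out.put(s, new LinkedHashMap<>());
        }
        for (Map.Entry<String, String> e : entries.entrySet()) {
            out.get(classify(e.getKey())).put(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("data.dir", "./data/");          // illustrative values only
        entries.put("test.genome", "TestCase");
        entries.put("codon.Standard", "TTT/F, TTC/F");
        split(entries).forEach((section, group) ->
                System.out.println(section + ": " + group.keySet()));
    }
}
```

Each group could then be written to its own file and read by its own parser, so a test only needs the handful of entries relevant to it.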

J-Moravec • Mar 13 '24 23:03