
NCBIgff2chrom -> gxf2chrom

Open alejandrogzi opened this issue 3 months ago • 2 comments

Hi @conchoecia!

I developed gxf2chrom while thinking about your NCBIgff2chrom.py script!

In short, it is a CLI tool written in Rust that does essentially the same thing as your script, with some additional features:

  • Can accept GTF files and GTF.gz files
  • Ensures that no proteins with length < 1 are written to the output file
  • Instead of sending the output to stdout, directly writes everything to a file the user specifies
  • Supports "custom" GTF/GFF files, not only GENCODE/Ensembl ones. Instead of looking only for "protein_id", one can use --feature to name the attribute to parse (e.g., -f proteinName)
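
To illustrate the last point, here is a minimal sketch of how a user-chosen attribute could be pulled out of column 9 of a GFF3 or GTF line. This is not taken from the gxf2chrom source: the function name `extract_attribute` is hypothetical, and the sketch assumes that attributes are separated by `;` and that values contain no embedded `;`.

```rust
/// Hypothetical helper (not from gxf2chrom): find `key` in the attribute
/// column of a GFF3 (`key=value;`) or GTF (`key "value";`) record.
/// Assumption: values never contain a ';' character.
fn extract_attribute(attributes: &str, key: &str) -> Option<String> {
    for field in attributes.split(';') {
        let field = field.trim();
        // The field must start with the requested key.
        let rest = match field.strip_prefix(key) {
            Some(rest) => rest,
            None => continue,
        };
        // GFF3 separates key and value with '=', GTF with whitespace.
        // Anything else means we only matched a longer key's prefix.
        let value = if let Some(v) = rest.strip_prefix('=') {
            v
        } else if rest.starts_with(char::is_whitespace) {
            rest.trim_start()
        } else {
            continue;
        };
        // GTF values are quoted; GFF3 values are not.
        return Some(value.trim().trim_matches('"').to_string());
    }
    None
}
```

Calling `extract_attribute(column9, "proteinName")` returns `None` when the attribute is absent, so a caller can skip records that lack the requested attribute.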

Here is a quick benchmark:

Format     odp              gxf2chrom        fold
gff3       4.30 +/- 0.03    1.88 +/- 0.01    x2.29
gff3.gz    6.27 +/- 0.18    2.05 +/- 0.01    x3.06
gtf        ---              1.83 +/- 0.01    ---
gtf.gz     ---              1.94 +/- 0.01    ---

The main advantage of this new tool is its speed, which should become noticeable at large scale. On top of that, the tool does not depend on external packages, so the only thing needed is Rust itself. This makes it easier to attach to any pipeline or tool through a configuration step or script.

Please let me know what you think!

Best, Alejandro

alejandrogzi · Mar 04 '24 04:03