
PanSN is incompatible with gff2gbSmallDNA.pl

Open: kdm9 opened this issue 1 year ago • 1 comment

Hello,

In a pangenome project in A. thaliana we have adopted @ekg's PanSN contig naming scheme. Specifically, we name contigs like ${INDIVIDUAL}#${HAPLOTYPE_N}#${CONTIG_OR_CHROM}, e.g. at9900#1#chr1.
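For readers unfamiliar with the scheme, here is a minimal Python sketch of composing and splitting names of this shape. The helper names `make_pansn` and `split_pansn` are hypothetical and are not part of PanSN, Augustus or BUSCO; the only thing taken from the post is the `${INDIVIDUAL}#${HAPLOTYPE_N}#${CONTIG_OR_CHROM}` layout.

```python
def make_pansn(individual: str, haplotype: int, contig: str, sep: str = "#") -> str:
    """Join the three parts into a PanSN-style name (hypothetical helper)."""
    return sep.join([individual, str(haplotype), contig])


def split_pansn(name: str, sep: str = "#") -> tuple[str, int, str]:
    """Split a PanSN-style name back into its three parts.

    maxsplit=2 keeps any further separators inside the contig part.
    """
    individual, haplotype, contig = name.split(sep, 2)
    return individual, int(haplotype), contig


name = make_pansn("at9900", 1, "chr1")
print(name)               # at9900#1#chr1
print(split_pansn(name))  # ('at9900', 1, 'chr1')
```

The point is simply that `#` appears twice inside every sequence name, so any tool that treats `#` specially will see it in column 1 of every GFF line.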

However, gff2gbSmallDNA.pl line 106 removes anything following the first # in a GFF line, which in our case leaves just the individual name (and completely breaks both Augustus and BUSCO).
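To make the failure concrete, here is a small Python sketch that mimics the behaviour described above. It is not the Perl code from gff2gbSmallDNA.pl; the function `strip_after_first_hash` and the example GFF line are made up for illustration, assuming only that everything from the first `#` onwards is discarded.

```python
import re


def strip_after_first_hash(line: str) -> str:
    """Drop everything from the first '#' onwards (assumed behaviour, for illustration)."""
    return re.sub(r"#.*", "", line)


# A made-up GFF line whose sequence name follows the PanSN scheme.
gff_line = "at9900#1#chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1"

print(repr(strip_after_first_hash(gff_line)))  # 'at9900'
```

All nine columns collapse to the individual name, which is why nothing downstream can parse the record.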

One could argue that # is an eclectic choice for a field separator, but all the sensible ones are already taken, e.g. in the names of individuals, hence (I think) why PanSN recommends the # delimiter. Given that I'm pretty sure GFF files never have in-line comments, would you consider changing the line linked above so that it only removes comments that begin at the start of a line?

I'm tagging @ekg here as perhaps changing the recommended PanSN delimiter (e.g. to ~ or !) would be a more complete solution?

In any case for now I've changed line 106 to s/^#//;, i.e. only remove comments from the start of the line, which has fixed the issue for me.

Best, Kevin

kdm9, Jul 18 '22 13:07

There is software, like the parser https://agat.readthedocs.io/en/latest/gxf.html, that allows comments elsewhere than at the start of the line. The Sanger GFF spec also allows such in-line comments: https://web.archive.org/web/20010208224442/http://www.sanger.ac.uk:80/Software/formats/GFF/GFF_Spec.shtml

Your solution might break inputs from programs that use such comments, but it appears that there could be a compromise where one considers # as introducing a comment only if there were already 7 tabs before it in the line.

MarioStanke, Jul 19 '22 15:07
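
The compromise suggested in the last comment could look roughly like the following Python sketch. This is only an illustration of the idea, not code from the Augustus repository: the function name `strip_comment_tab_aware` and the `min_tabs` parameter are hypothetical, the threshold of 7 tabs is taken from the comment, and treating a line that starts with # as a whole-line comment is an added assumption.

```python
def strip_comment_tab_aware(line: str, min_tabs: int = 7) -> str:
    """Remove a comment only where it cannot be part of the sequence name.

    A '#' starts a comment if it is the first character of the line, or if
    at least `min_tabs` tab characters precede it on the line.
    """
    if line.startswith("#"):
        return ""  # whole-line comment (assumption: these are always dropped)
    tabs_seen = 0
    for i, ch in enumerate(line):
        if ch == "\t":
            tabs_seen += 1
        elif ch == "#" and tabs_seen >= min_tabs:
            # Cut at the '#' and drop whitespace left in front of it.
            return line[:i].rstrip()
    return line  # no comment found; line is returned unchanged


record = "at9900#1#chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1 # a note"
print(repr(strip_comment_tab_aware(record)))
```

With this rule the two # characters in the PanSN name survive because no tab precedes them, while the trailing in-line comment is still removed. One limitation of the sketch is that a literal # inside the attributes column would also be treated as the start of a comment.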