Augustus
PanSN is incompatible with gff2gbSmallDNA.pl
Hello,
In a pangenome project in A. thaliana we have adopted @ekg's PanSN contig naming scheme. Specifically, we name contigs like ${INDIVIDUAL}#${HAPLOTYPE_N}#${CONTIG_OR_CHROM}, e.g. at9900#1#chr1.
However, gff2gbSmallDNA.pl line 106 removes anything following the first # in a GFF line, which in our case leaves just the individual name (and completely breaks both Augustus and BUSCO).
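To make the failure concrete, here is a minimal Python sketch of the behaviour described above. It is not the script's actual Perl code: the regular expression is an assumption about what line 106 does, and the GFF line and the function name `strip_after_first_hash` are made up for illustration.

```python
import re


def strip_after_first_hash(line: str) -> str:
    """Assumed behaviour of line 106: drop everything from the first '#' onward."""
    return re.sub(r"#.*", "", line)


# Hypothetical GFF line whose sequence name follows the PanSN scheme.
gff_line = "at9900#1#chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1"

print(strip_after_first_hash(gff_line))  # prints "at9900"
```

Only the individual name survives, so the line no longer has the tab-separated columns that downstream steps expect.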
One could argue that # is an eclectic choice for a field separator, but all the sensible ones are already taken, e.g. in the names of individuals, hence (I think) why PanSN recommends the # delimiter. Given that I'm pretty sure GFF files never have in-line comments, would you consider changing the line linked above to remove only comments that begin at the start of a line?
I'm tagging @ekg here as perhaps changing the recommended PanSN delimiter (e.g. to ~ or !) would be a more complete solution?
In any case, for now I've changed line 106 to s/^#//; (i.e. only remove comments from the start of the line), which has fixed the issue for me.
Best, Kevin
There is software, like the parser https://agat.readthedocs.io/en/latest/gxf.html, that allows comments elsewhere than at the start of the line. The Sanger GFF spec also allows such in-line comments: https://web.archive.org/web/20010208224442/http://www.sanger.ac.uk:80/Software/formats/GFF/GFF_Spec.shtml
Your solution might break inputs from programs that use such comments, but it appears that there could be a compromise where one considers # as introducing a comment only if there were already 7 tabs before it in the line.
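A rough Python sketch of that compromise, to show the idea rather than a finished patch. The function name `strip_gff_comment` and the `min_tabs` parameter are hypothetical, and treating a line that starts with # as a whole-line comment is an assumption carried over from the discussion above.

```python
def strip_gff_comment(line: str, min_tabs: int = 7) -> str:
    """Remove a comment from one GFF line.

    A leading '#' marks a whole-line comment. Otherwise a '#' only starts
    a comment if at least `min_tabs` tab characters precede it.
    """
    line = line.rstrip("\n")
    if line.startswith("#"):
        return ""

    # Walk to the position of the min_tabs-th tab.
    pos = -1
    for _ in range(min_tabs):
        pos = line.find("\t", pos + 1)
        if pos == -1:
            return line  # too few columns: leave the line untouched

    # Look for '#' only after that tab.
    hash_pos = line.find("#", pos + 1)
    if hash_pos == -1:
        return line
    return line[:hash_pos].rstrip()  # also drops whitespace before the comment


# Hypothetical line: PanSN name in column 1, in-line comment at the end.
example = "at9900#1#chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1 # note"
print(strip_gff_comment(example))
```

With this rule the PanSN name in the first column is preserved while the trailing comment is dropped. One limitation of the sketch: a literal # inside the last columns (for example in an attribute value) would still be treated as the start of a comment.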