galaxy
gff2bed: show that mixed 6/12 col bed is produced
It seems that the converter produces a mix of 6-column and 12-column BED lines, which makes downstream tools struggle.
The test data is temporarily stored in test-data, only for demonstration. I guess this does not work in the test.
Should we just fill in 6 empty columns in the 6-column case?
Main question: is this really a bug, or a feature?
I agree that this seems to be problematic. I read on http://genome.ucsc.edu/FAQ/FAQformat#format1 that "The number of fields per line must be consistent throughout any single set of data in an annotation track."
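For illustration, a quick way to check a BED file against that consistency rule is to count the fields per line. This is only a minimal sketch: the function name `bed_column_counts` and the example records are made up here, and it assumes tab-separated BED with optional `#`, `track` and `browser` header lines.

```python
from collections import Counter


def bed_column_counts(lines):
    """Count how many data lines have each number of tab-separated fields."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, comments and UCSC header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        counts[len(line.split("\t"))] += 1
    return counts


# Made-up records: one 12-column transcript line and two 6-column exon lines.
example = [
    "chr1\t100\t500\ttx1\t0\t+\t100\t500\t0\t2\t100,100\t0,300\n",
    "chr1\t100\t200\texon1\t0\t+\n",
    "chr1\t400\t500\texon2\t0\t+\n",
]
counts = bed_column_counts(example)
print(dict(counts))
print("consistent" if len(counts) <= 1 else "mixed column counts")
```

More than one key in the result means the file mixes column counts, which is the situation described in this PR.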
Do you have an example of a downstream tool that is affected?
Yep. rseqc read distribution, for instance. It expects 12-column BED (another issue with this is here: https://github.com/galaxyproject/galaxy/issues/9604).
I discovered this when using gencode annotations (gtf) and feeding them to the rseqc tool. The problem seems to be that transcript annotations generate 12 columns while exon annotations generate 6 columns.
My idea was to change the converter so that 12 columns are always generated, potentially with the last 6 empty. If you agree I can prepare this.
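As a rough sketch of the "always 12 columns" idea: instead of leaving the last 6 columns empty, a 6-column record could be extended as a single-block feature. That choice is an assumption made here for illustration, not what the converter currently does, and `bed6_to_bed12` is a hypothetical helper name.

```python
def bed6_to_bed12(fields):
    """Extend a 6-field BED record to 12 fields as a single-block feature.

    Assumption for this sketch: thickStart/thickEnd repeat the feature
    bounds, itemRgb is "0", and the feature is one block covering it all.
    """
    if len(fields) == 12:
        return list(fields)
    if len(fields) != 6:
        raise ValueError("expected 6 or 12 fields, got %d" % len(fields))
    chrom, start, end, name, score, strand = fields
    size = int(end) - int(start)
    return [
        chrom, start, end, name, score, strand,
        start,      # thickStart
        end,        # thickEnd
        "0",        # itemRgb
        "1",        # blockCount
        str(size),  # blockSizes
        "0",        # blockStarts (relative to start)
    ]


print("\t".join(bed6_to_bed12(["chr1", "100", "200", "exon1", "0", "+"])))
```

Whether filled-in values or empty columns are better probably depends on what the downstream tools accept.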
Thanks. I've spent some time looking at other GFF/GTF-to-BED conversion tools, and there seem to be two types (see https://www.biostars.org/p/321562/#426456):
- line-by-line (gtf2bed and gff2bed from bedops), which simply convert every feature without grouping exons;
- only transcripts (UCSC gtf2ToGenePred, gff3ToGenePred and genePredToBed), which group the exons of a transcript together in one line, discarding the other lines. This is used in https://github.com/galaxyproject/tools-iuc/blob/master/tools/gtfToBed12/gtfToBed12.xml
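To make the second type concrete, here is a minimal sketch of grouping the exons of a transcript into one BED12 line. It does not reproduce the UCSC tools (which go through genePred); `exons_to_bed12` is a hypothetical name, exon coordinates are assumed to be already 0-based and half-open, and thickStart/thickEnd are simply set to the transcript bounds.

```python
from collections import defaultdict


def exons_to_bed12(exons):
    """Group exons by transcript and emit one BED12 record per transcript.

    exons: iterable of (chrom, start, end, strand, transcript_id) tuples
    with 0-based, half-open integer coordinates (an assumption here).
    """
    groups = defaultdict(list)
    for chrom, start, end, strand, tx in exons:
        groups[(chrom, strand, tx)].append((start, end))

    records = []
    for (chrom, strand, tx), blocks in groups.items():
        blocks.sort()
        tx_start = blocks[0][0]
        tx_end = max(end for _, end in blocks)
        sizes = ",".join(str(end - start) for start, end in blocks)
        # Block starts are relative to the transcript start.
        starts = ",".join(str(start - tx_start) for start, _ in blocks)
        records.append([
            chrom, str(tx_start), str(tx_end), tx, "0", strand,
            str(tx_start), str(tx_end), "0",
            str(len(blocks)), sizes, starts,
        ])
    return records


# Made-up exons, deliberately given out of order.
exons = [
    ("chr1", 400, 500, "+", "tx1"),
    ("chr1", 100, 200, "+", "tx1"),
]
for record in exons_to_bed12(exons):
    print("\t".join(record))
```

A line-by-line converter would instead emit one record per input feature, without the grouping step.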
There is also another Galaxy GFF-to-BED converter in lib/galaxy/datatypes/converters/gff_to_bed_converter.xml which is simpler than tools/filters/gff2bed.xml, and always produces BED6 files.
A possible solution would be to:
- modify lib/galaxy/datatypes/converters/gff_to_bed_converter.xml to set bed6 as output format
- move tools/filters/gff2bed.xml to lib/galaxy/datatypes/converters/ and modify it to become a GFF-to-BED12 converter.
I can help with that if it sounds like a good plan.
Agreed. Not sure about "move tools/filters/gff2bed.xml".
OK, I've added a commit that for now just merges the 2 gff_to_bed_converter.py scripts, and left a comment in gff2bed.xml.
Planemo tests pass for both converters now. If my changes look good to you, I think the only thing left is to move the new test data to https://github.com/galaxyproject/galaxy-test-data/
I will check asap. Reminds me of https://github.com/galaxyproject/galaxy/pull/8369 which also needs test data over there and is slightly related.
I've pushed this to 20.09 due to Nicola's comments, but as a bug we can always backport this.
Happy to get this in as a bugfix. I think we're almost there?
Rebased to fix a conflict.