gff2bed: show that mixed 6/12 col bed is produced

Open bernt-matthias opened this issue 5 years ago • 9 comments

It seems that the converter produces BED output that mixes 6-column and 12-column lines, which makes downstream tools struggle.

The test data is temporarily stored in test-data, only for demonstration. I guess this will not work in the actual test.

Should we just pad with 6 empty columns in the 6-column case?

Main question: is this really a bug, or a feature?

bernt-matthias avatar Apr 10 '20 11:04 bernt-matthias

I agree that this seems to be problematic. According to http://genome.ucsc.edu/FAQ/FAQformat#format1, "The number of fields per line must be consistent throughout any single set of data in an annotation track."

Do you have an example of a downstream tool that is affected?
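
For anyone who wants to check a converter output against the UCSC rule quoted above, here is a minimal sketch. It is not part of the PR; the function names are made up for illustration, and it assumes tab-separated BED with optional `#`, `track` or `browser` header lines.

```python
from collections import Counter


def bed_field_counts(lines):
    """Count how many data lines have each number of tab-separated fields."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the header lines a BED file may carry
        # (assumption: no chromosome name starts with "track" or "browser").
        if not line or line.startswith(("#", "track", "browser")):
            continue
        counts[len(line.split("\t"))] += 1
    return counts


def is_consistent(lines):
    """True if every data line has the same number of fields."""
    return len(bed_field_counts(lines)) <= 1


# Hypothetical two-line sample: one 12-column line and one 6-column line.
mixed = [
    "chr1\t100\t200\ttx1\t0\t+\t100\t200\t0\t1\t100,\t0,\n",
    "chr1\t100\t200\texon1\t0\t+\n",
]
print(dict(bed_field_counts(mixed)))
print(is_consistent(mixed))
```

A file for which `bed_field_counts` returns more than one key is the mixed case described in this issue.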

nsoranzo avatar Apr 23 '20 15:04 nsoranzo

Yep. The rseqc read distribution tool, for instance. It expects 12-column BED (another issue with this is described at https://github.com/galaxyproject/galaxy/issues/9604).

I discovered this when I took GENCODE annotations (GTF), converted them, and fed the result to the rseqc tool. The problem seems to be that transcript annotations generate 12 columns, while exon annotations generate 6 columns.

My idea was to change the converter so that 12 columns are always generated, with the last 6 potentially empty. If you agree, I can prepare this.
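
To illustrate the "always 12 columns" idea, here is a rough sketch; it is not the code in this PR. Instead of leaving the last 6 columns literally empty, it fills them with values describing a single block spanning the whole feature (thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts). That choice, and the name `pad_bed6_to_bed12`, are assumptions of this example.

```python
def pad_bed6_to_bed12(fields):
    """Return a 12-field BED record; a 6-field record becomes one block."""
    if len(fields) == 12:
        return list(fields)
    if len(fields) != 6:
        raise ValueError(f"expected 6 or 12 fields, got {len(fields)}")
    chrom, start, end, name, score, strand = fields
    size = int(end) - int(start)
    return [
        chrom, start, end, name, score, strand,
        start,        # thickStart: assumed equal to chromStart
        end,          # thickEnd: assumed equal to chromEnd
        "0",          # itemRgb
        "1",          # blockCount: the feature is a single block
        f"{size},",   # blockSizes
        "0,",         # blockStarts, relative to chromStart
    ]


print("\t".join(pad_bed6_to_bed12(["chr1", "100", "250", "exon1", "0", "+"])))
```

Whether downstream tools such as rseqc would accept padded exon lines alongside transcript lines is a separate question that this sketch does not answer.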

bernt-matthias avatar Apr 23 '20 17:04 bernt-matthias

Thanks. I've spent some time looking at other GFF/GTF-to-BED conversion tools, and there seem to be 2 types (see https://www.biostars.org/p/321562/#426456):

  • line-by-line (gtf2bed and gff2bed from bedops), which simply convert every feature without grouping exons;
  • only transcripts (UCSC gtf2ToGenePred, gff3ToGenePred and genePredToBed), which group the exons of a transcript together in one line, discarding the other lines. This is used in https://github.com/galaxyproject/tools-iuc/blob/master/tools/gtfToBed12/gtfToBed12.xml
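
The difference between the two styles can be sketched as follows. This is a simplified illustration only: it does not reproduce the actual output layout of bedops or the UCSC tools, it assumes well-formed 9-column GTF lines with a `transcript_id` attribute and no comment lines, and all function names are hypothetical.

```python
import re
from collections import defaultdict


def parse_gtf_line(line):
    """Extract the fields needed here, converting to BED coordinates."""
    chrom, _source, feature, start, end, _score, strand, _frame, attrs = (
        line.rstrip("\n").split("\t")
    )
    match = re.search(r'transcript_id "([^"]+)"', attrs)
    return {
        "chrom": chrom,
        "feature": feature,
        "start": int(start) - 1,  # GTF is 1-based inclusive, BED is 0-based
        "end": int(end),          # BED end is exclusive, so no change needed
        "strand": strand,
        "transcript_id": match.group(1) if match else None,
    }


def line_by_line(gtf_lines):
    """Style 1: one 6-column record per feature, exons are not grouped."""
    out = []
    for rec in map(parse_gtf_line, gtf_lines):
        name = rec["transcript_id"] or rec["feature"]
        out.append([rec["chrom"], rec["start"], rec["end"], name, "0", rec["strand"]])
    return out


def transcripts_only(gtf_lines):
    """Style 2: one 12-column record per transcript, built from its exons."""
    exons = defaultdict(list)
    for rec in map(parse_gtf_line, gtf_lines):
        # Every non-exon line (gene, transcript, CDS, ...) is discarded.
        if rec["feature"] == "exon" and rec["transcript_id"]:
            exons[rec["transcript_id"]].append(rec)
    out = []
    for tid, recs in exons.items():
        recs.sort(key=lambda r: r["start"])
        start = recs[0]["start"]
        end = max(r["end"] for r in recs)
        sizes = ",".join(str(r["end"] - r["start"]) for r in recs) + ","
        starts = ",".join(str(r["start"] - start) for r in recs) + ","
        out.append([recs[0]["chrom"], start, end, tid, "0", recs[0]["strand"],
                    start, end, "0", len(recs), sizes, starts])
    return out
```

A converter that emits style 2 records for transcript lines and style 1 records for exon lines in the same file would produce the mixed 6/12-column output reported here.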

There is also another Galaxy GFF-to-BED converter in lib/galaxy/datatypes/converters/gff_to_bed_converter.xml which is simpler than tools/filters/gff2bed.xml, and always produces BED6 files.

A possible solution would be to:

  • modify lib/galaxy/datatypes/converters/gff_to_bed_converter.xml to set bed6 as output format
  • move tools/filters/gff2bed.xml to lib/galaxy/datatypes/converters/ and modify it to become a GFF-to-BED12 converter.

I can help with that if it sounds like a good plan.

nsoranzo avatar Apr 23 '20 18:04 nsoranzo

Agreed. Not sure about "move tools/filters/gff2bed.xml".

bernt-matthias avatar Apr 24 '20 10:04 bernt-matthias

> Agreed. Not sure about "move tools/filters/gff2bed.xml".

OK, I've added a commit that for now just merges the 2 gff_to_bed_converter.py scripts, and I've left a comment in gff2bed.xml.

Planemo tests pass for both converters now. If my changes look good to you, I think the only thing left is to move the new test data to https://github.com/galaxyproject/galaxy-test-data/

nsoranzo avatar Apr 25 '20 01:04 nsoranzo

I will check asap. Reminds me of https://github.com/galaxyproject/galaxy/pull/8369 which also needs test data over there and is slightly related.

bernt-matthias avatar Apr 25 '20 10:04 bernt-matthias

I've pushed this to 20.09 due to Nicola's comments, but since it is a bug fix, we can always backport it.

mvdbeek avatar May 04 '20 15:05 mvdbeek

Happy to get this in as a bugfix. I think we're almost there?

mvdbeek avatar Sep 12 '20 07:09 mvdbeek

Rebased to fix a conflict.

nsoranzo avatar Sep 24 '20 22:09 nsoranzo