Stop using PathMetadata::parse_sense() to try and detect haplotype vs. reference from name alone
Changelog Entry
To be copied to the draft changelog by merger:
- Fix GFA haplotype sniffing for GFAs with P-lines
- Use graph metadata and not path name to determine reference/haplotype status for paths in
vg callandvg deconstruct. - Loading transcript files will now produce a human-readable error message when there are duplicate transcripts with the same ID on different paths.
Description
This will hopefully help with #4566. I've stopped trying to use PathMetadata::parse_sense() to determine if a path is reference or haplotype from just its name. This fixes GFA haplotype sniffing in vg autoindex for GFAs with P-lines. Probably we should deprecate that function, or just make it tell generic from non-generic.
This doesn't fully solve #4566 though. Now when I run:
vg autoindex --workflow rpvg --gfa few-ecoli-genomes.fasta.gz.bf3285f.11fba48.42e55e5.smooth.final.gfa --tx-gff final.gtf --hap-tx-gff final.gtf --prefix ecoli-small --threads 4 --verbosity 2
I get an assertion failure:
[IndexRegistry]: Constructing haplotype-transcript GBWT and spliced graph from GBZ-format graph.
Assertion failed: (transcript->is_reverse == is_reverse), function parse_transcripts, file transcriptome.cpp, line 680.
I've changed the problem I get with that input data to an error message: ERROR: There are multiple distinct transcripts with ID unassigned_transcript_198. We have already seen a transcript unassigned_transcript_198 on strand 1 of contig GCF_000005845#1#NC_000913.3 length 4294967295 but on line 18815 there is a transcript unassigned_transcript_198 on strand 0 of contig GCF_000008865#1#NC_002695.2 length 4294967295. Fix the input transcript set to not re-use transcript IDs.
@faithokamoto do you have insight into what they did in their GTF file? I guess we expect them to uniquely name transcripts across all input haplotypes?
Yeah, I'm pretty sure that we name the transcript paths by the names in the GTF files, so having duplicate names is an issue. Possible fix would be to have an option to prefix transcripts by their contig name. I'm pretty sure that contig names shouldn't repeat, so that should make them unique.
Of course that prefixing could also be done by the user outside of vg, which is probably better?
@faithokamoto It looks like the last two commits here made this start failing dome of the autoindex tests:
t/52_vg_autoindex.t (Wstat: 0 Tests: 55 Failed: 17)
Failed tests: 10-12, 14-27
Files=56, Tests=1220, 1126 wallclock secs ( 0.70 usr 0.33 sys + 2212.17 cusr 512.32 csys = 2725.52 CPU)
Result: FAIL
Do you want to fix them, or should I do that?
I'll poke it
Got rid of one of the changes, hopefully that was the issue? The graph is still made as desired.