vg icon indicating copy to clipboard operation
vg copied to clipboard

Stop using PathMetadata::parse_sense() to try and detect haplotype vs. reference from name alone

Open adamnovak opened this issue 6 months ago • 3 comments

Changelog Entry

To be copied to the draft changelog by merger:

  • Fix GFA haplotype sniffing for GFAs with P-lines
  • Use graph metadata and not path name to determine reference/haplotype status for paths in vg call and vg deconstruct.
  • Loading transcript files will now produce a human-readable error message when there are duplicate transcripts with the same ID on different paths.

Description

This will hopefully help with #4566. I've stopped trying to use PathMetadata::parse_sense() to determine if a path is reference or haplotype from just its name. This fixes GFA haplotype sniffing in vg autoindex for GFAs with P-lines. Probably we should deprecate that function, or just make it tell generic from non-generic.

This doesn't fully solve #4566 though. Now when I run:

vg autoindex --workflow rpvg --gfa few-ecoli-genomes.fasta.gz.bf3285f.11fba48.42e55e5.smooth.final.gfa --tx-gff final.gtf --hap-tx-gff final.gtf --prefix ecoli-small --threads 4 --verbosity 2

I get an assertion failure:

[IndexRegistry]: Constructing haplotype-transcript GBWT and spliced graph from GBZ-format graph.
Assertion failed: (transcript->is_reverse == is_reverse), function parse_transcripts, file transcriptome.cpp, line 680.

adamnovak avatar May 30 '25 14:05 adamnovak

I've changed the problem I get with that input data to an error message: ERROR: There are multiple distinct transcripts with ID unassigned_transcript_198. We have already seen a transcript unassigned_transcript_198 on strand 1 of contig GCF_000005845#1#NC_000913.3 length 4294967295 but on line 18815 there is a transcript unassigned_transcript_198 on strand 0 of contig GCF_000008865#1#NC_002695.2 length 4294967295. Fix the input transcript set to not re-use transcript IDs.

@faithokamoto do you have insight into what they did in their GTF file? I guess we expect them to uniquely name transcripts across all input haplotypes?

adamnovak avatar May 30 '25 15:05 adamnovak

Yeah, I'm pretty sure that we name the transcript paths by the names in the GTF files, so having duplicate names is an issue. Possible fix would be to have an option to prefix transcripts by their contig name. I'm pretty sure that contig names shouldn't repeat, so that should make them unique.

faithokamoto avatar May 30 '25 15:05 faithokamoto

Of course that prefixing could also be done by the user outside of vg, which is probably better?

faithokamoto avatar May 30 '25 17:05 faithokamoto

@faithokamoto It looks like the last two commits here made this start failing dome of the autoindex tests:

t/52_vg_autoindex.t (Wstat: 0 Tests: 55 Failed: 17)
  Failed tests:  10-12, 14-27
Files=56, Tests=1220, 1126 wallclock secs ( 0.70 usr  0.33 sys + 2212.17 cusr 512.32 csys = 2725.52 CPU)
Result: FAIL

Do you want to fix them, or should I do that?

adamnovak avatar Jun 26 '25 17:06 adamnovak

I'll poke it

faithokamoto avatar Jun 26 '25 17:06 faithokamoto

Got rid of one of the changes, hopefully that was the issue? The graph is still made as desired.

faithokamoto avatar Jun 26 '25 18:06 faithokamoto