vg autoindex does not create pantranscriptome wth gfa, fasta and gtf
1. What were you trying to do? Create a pantranscriptome from pansified FASTA and GTF
2. What did you want to happen? Pantranscriptome to be created, or get helpful error messages that indicate why it can't be
3. What actually happened? program stops with obtuse messages
4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:
Place stacktrace here.
5. What data and command can the vg dev team use to make the problem happen?
vg autoindex --workflow rpvg --gfa /data/ecoli-small.pggb/few-ecoli-genomes.smooth.final.gfa --ref-fasta /data/few-ecoli-genomes.fasta --tx-gff /data/final.gtf --hap-tx-gff /data/final.gtf --prefix /data/ecoli-small --threads 4
[IndexRegistry]: Checking for haplotype lines in GFA.
[vg autoindex] Guessing that /data/ecoli-small.spliced.xg is Spliced XG
error:[vg autoindex] Input is not sufficient to create indexes
Inputs
GTF/GFF
Haplotype GTF/GFF
Reference FASTA
Reference GFA
Spliced XG
I created the gfa from 10 ecoli assemblies with contig names pansnified (only 1 contig per sample). I combined the accompanying GTFs using a script that only retains the contigs in the FASTA, pansnifies the contig names, replaaces blank transcript names with gene name and ensures transcript record fields are tab separated (last two were needed to make vg autoindex --workflow mpmap work).
6. What does running vg version say?
1.64.0
It looks like you're missing the last line of the error message, where it reports what kind of index it couldn't figure out how to make:
https://github.com/vgteam/vg/blob/1d109aa7bec5034cc92cd8fdbec8644b352e1b35/src/index_registry.cpp#L5710-L5711
I think your problem is probably that the rpvg workflow needs haplotypes. Those should come from the GFA file, given how you say you built it from 10 assemblies. But autoindex is seeing it as a "Reference GFA" and not a "Reference GFA w/ Haplotypes", which is what it would need to see in order to use haplotypes from it for pantranscriptime generation.
I think this is because it looked for haplotypes in it and didn't find any:
https://github.com/vgteam/vg/blob/1d109aa7bec5034cc92cd8fdbec8644b352e1b35/src/subcommand/autoindex_main.cpp#L346-L351
That in turn can happen because your P line names aren't sample#hap num#contig PanSN names (maybe they're just sample#contig?), or because you don't actually have any P or W lines, or because all the samples in your P and W lines are tagged as references in the RS tag on the GFA header line. Are any of those things true about your GFA?
Maybe the documentation/message change here should be that it needs to explain why it is categorizing each input file as whatever it thinks it is, because I think that's where it's going wrong here.
We also might need to think about whether making a distinction between "having haplotypes" and not really makes sense, or whether we ought to count any GFA with multiple samples in it (or maybe just a GFA with a single sample for the degenerate case?) as having the information we need. But we need to balance that against not telling people we've made them a pantranscriptome when they were supposed to give us a phased VCF or a GBZ file and they actually forgot to provide the haplotypes they were thinking of, or e.g. they handed us a VCF we didn't actually end up using alongside a graph with just a reference.
Oh yes, you're right. The complete message is this:
vg autoindex --workflow mpmap --workflow rpvg --gfa /data/ecoli-small.pggb/few-ecoli-genomes.fasta.gz.bf3285f.11fba48.42e55e5.smooth.final.gfa --ref-fasta /data/few-ecoli-genomes.fasta --tx-gff /data/final.gtf --hap-tx-gff /data/final.gtf --prefix /data/ecoli-small --threads 4
[IndexRegistry]: Checking for haplotype lines in GFA.
[vg autoindex] Guessing that /data/ecoli-small.spliced.dist is Spliced Distance Index
[vg autoindex] Guessing that /data/ecoli-small.spliced.gcsa is Spliced GCSA
[vg autoindex] Guessing that /data/ecoli-small.spliced.gcsa.lcp is Spliced LCP
[vg autoindex] Guessing that /data/ecoli-small.spliced.xg is Spliced XG
error:[vg autoindex] Input is not sufficient to create indexes
Inputs
GTF/GFF
Haplotype GTF/GFF
Reference FASTA
Reference GFA
Spliced Distance Index
Spliced GCSA
Spliced LCP
Spliced XG
are insufficient to create target index Haplotype-Transcript GBWT
These are the contigs in GTF:
$ cut -f1 final.gtf | uniq
#gtf-version 2.2
#!genome-build ASM584v2
#!genome-build-accession NCBI_Assembly:GCF_000005845.2
GCF_000005845#1#NC_000913.3
GCF_000008865#1#NC_002695.2
GCF_019265485#1#NZ_CP078557.1
GCF_019269375#1#NZ_CP078580.1
GCF_048002555#1#NZ_CP181460.1
GCF_900096805#1#NZ_LT615375.1
GCF_004138625#1#NZ_CP035317.1
GCF_003194265#1#NZ_CP023258.1
GCF_030142785#1#NZ_CP126294.1
GCF_030012705#1#NZ_CP122895.1
Same as the contigs in FASTA:
$ grep '>' few-ecoli-genomes.fasta | cut -f1 -d ' '
>GCF_000005845#1#NC_000913.3
>GCF_000008865#1#NC_002695.2
>GCF_019265485#1#NZ_CP078557.1
>GCF_019269375#1#NZ_CP078580.1
>GCF_048002555#1#NZ_CP181460.1
>GCF_900096805#1#NZ_LT615375.1
>GCF_004138625#1#NZ_CP035317.1
>GCF_003194265#1#NZ_CP023258.1
>GCF_030142785#1#NZ_CP126294.1
>GCF_030012705#1#NZ_CP122895.1
same as the contig names in the GFA:
$ grep '^P' ecoli-small.pggb/few-ecoli-genomes.fasta.gz.bf3285f.11fba48.42e55e5.smooth.final.gfa | cut -f2
GCF_000005845#1#NC_000913.3
GCF_000008865#1#NC_002695.2
GCF_019265485#1#NZ_CP078557.1
GCF_019269375#1#NZ_CP078580.1
GCF_048002555#1#NZ_CP181460.1
GCF_900096805#1#NZ_LT615375.1
GCF_004138625#1#NZ_CP035317.1
GCF_003194265#1#NZ_CP023258.1
GCF_030142785#1#NZ_CP126294.1
GCF_030012705#1#NZ_CP122895.1
This is the command I used to makes the GFA from FASTA:
pggb -i /data/few-ecoli-genomes.fasta.gz -o /data/ecoli-small.pggb -p 90 -n 10 -t 4
Happy to look into the files deeper or run any commands required to get to the bottom of this.
What's the H line from your GFA? I'm not sure if PGGB just tags everybody as a reference or not.
You could also try vg paths -x ecoli-small.pggb/few-ecoli-genomes.fasta.gz.bf3285f.11fba48.42e55e5.smooth.final.gfa --metadata to see how vg parses the paths in the GFA and whether it thinks they're reference or haplotype sense paths (when fully parsing it, which the haplotype sniffing doesn't do...).
Output of both commands (it's the same GFA in both cases, 2nd one looks different 'cos alias for docker run):
$ /bin/grep '^H' ecoli-small.pggb/few-ecoli-genomes.fasta.gz.bf3285f.11fba48.42e55e5.smooth.final.gfa | cut -f2
VN:Z:1.0
$ dvg paths -x /data/ecoli-small.pggb/few-ecoli-genomes.fasta.gz.bf3285f.11fba48.42e55e5.smooth.final.gfa --metadata
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
GCF_048002555#1#NZ_CP181460.1#0 HAPLOTYPE GCF_048002555 1 NZ_CP181460.1 0 NO_SUBRANGE
GCF_000005845#1#NC_000913.3#0 HAPLOTYPE GCF_000005845 1 NC_000913.3 0 NO_SUBRANGE
GCF_003194265#1#NZ_CP023258.1#0 HAPLOTYPE GCF_003194265 1 NZ_CP023258.1 0 NO_SUBRANGE
GCF_900096805#1#NZ_LT615375.1#0 HAPLOTYPE GCF_900096805 1 NZ_LT615375.1 0 NO_SUBRANGE
GCF_000008865#1#NC_002695.2#0 HAPLOTYPE GCF_000008865 1 NC_002695.2 0 NO_SUBRANGE
GCF_004138625#1#NZ_CP035317.1#0 HAPLOTYPE GCF_004138625 1 NZ_CP035317.1 0 NO_SUBRANGE
GCF_019269375#1#NZ_CP078580.1#0 HAPLOTYPE GCF_019269375 1 NZ_CP078580.1 0 NO_SUBRANGE
GCF_019265485#1#NZ_CP078557.1#0 HAPLOTYPE GCF_019265485 1 NZ_CP078557.1 0 NO_SUBRANGE
GCF_030142785#1#NZ_CP126294.1#0 HAPLOTYPE GCF_030142785 1 NZ_CP126294.1 0 NO_SUBRANGE
GCF_030012705#1#NZ_CP122895.1#0 HAPLOTYPE GCF_030012705 1 NZ_CP122895.1 0 NO_SUBRANGE
OK, this looks like a bug in gfa_has_haplotypes(), because it thinks your GFA doesn't have haplotypes and it does. We'll have to try and reproduce and fix it.
@faithokamoto Did you run into anything like this or do you have a workaround you know?
Though cut -f2 is throwing away any other header tags that may be there. But the full GFA parse isn't seeing anything to mark these samples reference, so it's probably not there.
Output without cutting:
$ /bin/grep '^H' ecoli-small.pggb/few-ecoli-genomes.fasta.gz.bf3285f.11fba48.42e55e5.smooth.final.gfa
H VN:Z:1.0
I am happy to upload the FASTA if it would be helpful ... though it is just e. coli reference + 9 chromosomal assemblies picked more or less at random ...
@ohell Can you provide your input files for this command?
vg autoindex --workflow mpmap --workflow rpvg --gfa /data/ecoli-small.pggb/few-ecoli-genomes.fasta.gz.bf3285f.11fba48.42e55e5.smooth.final.gfa --ref-fasta /data/few-ecoli-genomes.fasta --tx-gff /data/final.gtf --hap-tx-gff /data/final.gtf --prefix /data/ecoli-small --threads 4
@adamnovak I have put the files in this shared folder. I have also included the various ecoli-small.* files created by the autoindexer before it failed.
Thanks again for your help.
I can try debugging this using the input files
Looks like the issue is sample#haplotype#contig is parsed as a diploid reference path. That makes autoindex assume all of the paths it sees are reference paths, instead of haploid haplotypes. vg paths uses the magic get_path_name() (which requires a handle) that somehow knows to add a #0 phase block to the end of the contig names. Having that fourth component means the names get parsed as haplotypes.
There are only two ways for vg autoindex to mark the GFA as having haplotypes: either it sees a P line where the path name has the fourth component, or it sees a W line with a sample which doesn't match the samples in the header. However this file doesn't seem to have a ref sense tag in the header so that latter way is out the window.
I dunno what magic allows vg paths/get_path_name() to conjure up the #0 when it doesn't appear in the GFA, but whatever it is, that's what the haplotype sniffing needs.
We put the #0 on the end when we think the path is haplotype-sense in the internal path database.
We shouldn't need a reference samples tag in the header; if there isn't one we should interpret all W lines as haplotypes, just like if we have an empty tag or a tag that doesn't mention their damples.
OK, I have a PR #4622 that will partly fix this; we were just looking at the P line names and not the header at all to decide if they were haplotypes or not. I've stopped doing that.
It also adds a real error message about re-used transcript IDs in the annotations file, instead of some assertion failures I was hitting. @ohell can you adjust your annotations to not re-use the same transcript ID (i.e. unassigned_transcript_198) across multiple haplotypes? Or is that something you think you really ought to be able to do, and we should change vg to support it?
I don't think we need to change the behavior of adding #0 to the haplotype path names when we load them up. We have a system of "base" path names and path fragments that should bridge that gap, so when the user provides e.g. transcripts on the path name without the trailing #0, we look for and use the appropriate overlapping fragment paths with trailing fragment data to actually interpret the input. So you don't need to change the transcript files to add #0 to all the path names.
can you adjust your annotations to not re-use the same transcript ID (i.e. unassigned_transcript_198) across multiple haplotypes? Or is that something you think you really ought to be able to do, and we should change vg to support it?
the issue is that for a pangenome we will usually be combining assembly-specific GTF files (as I did to make final.gtf, and in that parocess ensuring unique id's, while possible, would be hassle. Especially as all pantransciptome makers would need to do it, so it makes sense for indexing process to be robust to duplicates, imo.
So you don't need to change the transcript files to add #0 to all the path names.
but I do need to make this transformation to records in (e,g.) GCF_000005845_ASM584v2.gtf?
NC_000913.3 RefSeq gene 190 255 . ...
to
CF_000005845#1#NC_000913.3 RefSeq gene 190 255 . ...
`
You do need to add the sample name and haplotype index to the accession in the contig field in the GTF. We only sometimes use accessions in path names in the graph, and we need to distinguish HG002#1#chr1 and HG005#1#chr1, and vg doesn't know that NC_000913.3 is a unique name and not the name of a chromosome where different individuals have different instances.
Also, that's what the names are in your input graph. If you got an input graph with just the accessions in the P lines, we could try and add some overrides to let you force it to use those generic paths instead of complaining about not having haplotype paths. But that's not the shape of graph we usually use.
@faithokamoto mentioned today that when she was using these tools she was using human annotations where we already had unique transcript IDs for different individuals/source contigs. If doing the transformation to ensure that yourself is too hard, we could try and add a feature to vg to rename the transcripts on the fly when loading them. Do you need to look things up by the original transcript names at any point?
deduplicating transcript IDs before passing to vg is certainly possible.
but my point is that almost every user attempting to create a pantranscriptome from a set of assemblies and corresponding annotations would need to do this because uniqueness of IDs across individual assemblies can never be assumed. That's why it makes sense for vg autoindex to provide the disambiguation function.
You can mention the ID disambiguation strategy you use (e.g. appending ~nrecord) so users have a simple way of relating names with pantranscriptome to the input transcript IDs.
Any particular disambiguation strategy you think would be best? I can probably add it in. (This got dropped for a bit, sorry)
I wrote a simple awk script to maintain a hash table of seen counts for ids, and append __count to the id if seen multiple times. Guess you can follow the same strategy in code? Here's the logic (for GFF3 format annotations)
Hi, sorry to bother you. I'm also trying to use a smooth.final.gfa from PGGB as input to vg autoindex, and it's been running for over four days now without finishing.
vg autoindex -p chr1 -t 50 -w giraffe -g all_chr1.fa.bf3285f.ce127b4.e7c04b4.smooth.final.gfa
vg view --threads 20 -F -v all_chr1.fa.bf3285f.ce127b4.e7c04b4.smooth.final.gfa > chr1.vg
vg stats -p 20 -z -N -E -l -L chr1.vg
nodes 19231195
edges 26526127
19231195
26526127
length 355343019
self-loops 102
This is my first time using vg autoindex, so I'm not sure whether this runtime is expected or if something might be going wrong.
Any insights would be greatly appreciated. Thank you!
@douya1107 that seems like an entirely unrelated issue - please don't mix up issue threads with each other; open a new issue if you have a different problem
I changed the transcript-parsing code to automatically de-duplicate transcript IDs by appending _# to the end of duplicates. However, this only revealed another issue: while I can successfully run vg rna on a GBZ I manually create from the GFA, using autoindex causes chromosome-not-found errors. That is,
vg autoindex --workflow mpmap --workflow rpvg --gfa few-ecoli-genomes.fasta.gz.bf3285f.11fba48.42e55e5.smooth.final.gfa --ref-fasta few-ecoli-genomes.fasta.gz --tx-gff final.gtf --hap-tx-gff final.gtf --prefix ecoli-small --threads 4
eventually leads to
ERROR: Chromosome path "GCF_004138625#1#NZ_CP035317.1" not found in graph or haplotypes index (line 157).
due to the GTF file using chromosome names like GCF_004138625#1#NZ_CP035317.1 but the graph having automatically appended #0 to the end of haplotype paths.
Notably, we do actually manage to parse the GTF once. ("Constructing haplotype-transcript GBWT and spliced graph from GBZ-format graph.") autoindex first constructs a GBZ from the GFA and then uses that GBZ and the GTF to created a spliced graph. A similar sequence of events outside autoindex is:
GFA=few-ecoli-genomes.fasta.gz.bf3285f.11fba48.42e55e5.smooth.final.gfa
vg convert --gfa-in $GFA --xg-out > few-ecoli-genomes.xg
vg gbwt -o paths.gbwt --index-paths -x few-ecoli-genomes.xg
vg gbwt --gbz-format -x few-ecoli-genomes.xg paths.gbwt -g few-ecoli-genomes.gbz
vg rna --progress -n final.gtf --add-ref-paths --use-hap-ref \
--gbz-format few-ecoli-genomes.gbz > few-ecoli-genomes.spliced.vg
But then after we've made that spliced graph, autoindex then makes another spliced graph ("Constructing spliced VG graph from FASTA input.") and fails that time
due to the GTF file using chromosome names like GCF_004138625#1#NZ_CP035317.1 but the graph having automatically appended #0 to the end of haplotype paths.
We have machinery used in vg chunk and other tools that take user-facing coordinates and ranges to internally deal with finding base path ranges when the graph contains subpaths (like #0, but also handling the fragmentary paths for e.g. GRCh38 in the CHM13-based HPRC graphs).
Probably we have to engage that machinery here. To do a coordinate lookup, instead of looking for a path with that exact name, you would fill in a Region object describing the path range you want:
https://github.com/vgteam/vg/blob/47ef5b83d8e07c11b5adb031b759ffcd1705a72d/src/region.hpp#L12-L19
And you would call find_containing_subpath() to find the path handle for the graph path that contains that coordinate range, if one exists. If one doesn't exist, you'd need to handle that somehow (maybe the chromosome name is wrong, or maybe there's just annotations on a part of that chromosome that isn't in this graph).
https://github.com/vgteam/vg/blob/47ef5b83d8e07c11b5adb031b759ffcd1705a72d/src/path.hpp#L387-L413
I can get the Region fine, but the Transcriptome's _graph actually isn't a PathPositionHandleGraph, it's a MutablePathDeletableHandleGraph (or a unique_ptr to one, to be exact). There is a bdsg::PositionOverlay available as well. I can't quickly see how to use a PathHandleGraph to call find_containing_subpath() or a similar function?
(function ref: https://github.com/vgteam/vg/blob/47ef5b83d8e07c11b5adb031b759ffcd1705a72d/src/transcriptome.cpp#L539)
OK, so the PositionOverlay that's available there implements PathPositionHandleGraph, so you could call find_containing_subpath() with the PositionOverlay and get a path handle in the overlay to the relevant subpath.
The current transcriptome.cpp code seems to mix path handles in the overlay and path handles in the base graph, like here:
https://github.com/vgteam/vg/blob/47ef5b83d8e07c11b5adb031b759ffcd1705a72d/src/transcriptome.cpp#L531
That upsets me and I'd think we'd need a conversion method on the overlay (like we have for getting the base graph handle), or a note in the docs for the overlay that you can do this. But it seems to work. So I guess you could add more code that does it, and just use the path handle you get out of the overlay with the base graph.
The whole thing might be better if it took one graph that implemented everything (which could be fulfilled with a MutablePositionOverlay). We might need to define a new name for that interface in the handle graph type system if we wanted to make that change, or directly take a MutablePositionOverlay since we're already directly taking overlays.
Well, I was silly and that wasn't the issue at all. It couldn't find the right contig because it can't read FASTA files. Also see #4567. The reason I wasn't seeing transcript paths wasn't due to dropping wrong-contig genes, as I mistakenly assumed, but because we had failed to embed the paths into the graph. Also see #4584. This should work now:
vg autoindex --workflow mpmap --workflow rpvg \
--gfa [few-ecoli-genomes.fasta.gz](http://few-ecoli-genomes.fasta.gz/).bf3285f.11fba48.42e55e5.smooth.final.gfa \
--ref-fasta few-ecoli-genomes.fasta \
--tx-gff final.gtf --hap-tx-gff final.gtf --prefix ecoli-small --threads 4
Hey, did the above resolve your problem?