vg autoindex does not work with compressed FASTA files
1. What were you trying to do?
Run vg autoindex with bgzipped and indexed FASTA file
2. What did you want to happen?
It should work
3. What actually happened?
It did not work
[vg autoindex] Executing command: vg autoindex --workflow mpmap --gfa /data/few-ecoli-genomes.gfa --ref-fasta /data/few-ecoli-genomes.fasta.gz --tx-gff /data/few-ecoli-genomes.gtf --prefix /data/ecoli-small --threads 4
led to the error below, while runningthat command after inflating and indexing the same fasta succeeded
4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:
[IndexRegistry]: Checking for haplotype lines in GFA.
[IndexRegistry]: Chunking inputs for parallelism.
[IndexRegistry]: Chunking FASTA(s).
[IndexRegistry]: Chunking VCF(s).
[IndexRegistry]: Chunking GTF/GFF(s).
[IndexRegistry]: Constructing spliced VG graph from FASTA input.
warning:[vg::Constructor] Lowercase characters found in GCF_000008865#1#NC_002695.2; coercing to uppercase.
ERROR: Chromosome path "GCF_004138625#1#NZ_CP035317.1" not found in graph or haplotypes index (line 157).
warning:[vg::Constructor] Lowercase characters found in GCF_019265485#1#NZ_CP078557.1; coercing to uppercase.
warning:[vg::Constructor] Lowercase characters found in GCF_000005845#1#NC_000913.3; coercing to uppercase.
warning:[vg::Constructor] Unsupported IUPAC ambiguity codes found in GCF_000008865#1#NC_002695.2; coercing to N.
error:[vg::Constructor] unacceptable character "�" found in GCF_000008865#1#NC_002695.2 at index 1.
5. What data and command can the vg dev team use to make the problem happen?
see above
6. What does running vg version say?
1.64.0
This appears to be happening because the contig names in the GFA do not match the contig names in the VCF. This may be due to the Pan-SN style names in your GTF, which append extra info about the sample and haplotype to the contig identifier. A quick check to see the contig names in the GFA would be:
grep -E "^[PW]" /data/few-ecoli-genomes.gfa | less -S
Hello, thanks for the tip. But the command runs if I replace the FASTA with the uncompressed version. Also, I compared the contigs just now, looks OK:
GCF_000005845#1#NC_000913.3
GCF_000008865#1#NC_002695.2
GCF_019265485#1#NZ_CP078557.1
GCF_019269375#1#NZ_CP078580.1
GCF_048002555#1#NZ_CP181460.1
vs
$ /bin/grep GCF_ final.gtf | cut -f1 | uniq | head -n6
#!genome-build-accession NCBI_Assembly:GCF_000005845.2
GCF_000005845#1#NC_000913.3
GCF_000008865#1#NC_002695.2
GCF_019265485#1#NZ_CP078557.1
GCF_019269375#1#NZ_CP078580.1
GCF_048002555#1#NZ_CP181460.1
I'm not sure if we can accept gzipped fasta files. Unzipping files is annoying to do in C++; I remember because I tried poking around accepting gzipped GFF3 files some time back and eventually gave up. Would changing the code to give an explicit "no gzipped files allowed" error be sufficient? (see https://github.com/vgteam/vg/pull/4489)
Looking at this again, I don't think it's even necessary to provide the FASTA input. The FASTA is typically used when constructing a graph from VCF input. You already have a graph constructed as a GFA, which presumably includes this FASTA sequence inside it.
I'm not sure if we can accept gzipped fasta files. Unzipping files is annoying to do in C++; I remember because I tried poking around accepting gzipped GFF3 files some time back and eventually gave up. Would changing the code to give an explicit "no gzipped files allowed" error be sufficient? (see #4489)
@faithokamoto ideal would be to accept bgzipped FASTA files, just like HTSlib is able to (there are several solutions possible, I know boost:: iostreams have a filter for gzip streams, for example). But until/unless this is implemented, it would be good to improve the error message so users know what is going wrong.
Especially important for vg autoindex because it is posited as the high level utility to simplify pangenome construction for specific use cases, but it is actually an inconvenience if users have to spend effort trying to figure out what is wrong with the inputs they provided. Please see issue #4566 that is also about unhelpful error message.
Thank you
I'm pretty sure I see where I could change the autoindex code to catch gzipped fasta or gff3/gtf files. I don't think there are others we need to worry about? I think we're handling gzipped VCFs somehow, though I genuinely cannot see where. [I do not understand this particular part of vg.] Not sure if this needs to be handled for other subcommands as well...
We read possibly-gzipped files transparently (uncompressing them on the fly) through htslib's BGZF code. In a lot of places we use https://github.com/vgteam/libvgio/blob/c418502acb8799150d0d011ed6dd9f7bfaafec02/include/vg/io/blocked_gzip_input_stream.hpp which is a Protobuf-style stream wrapping htslib's reader, since we're usually using this to read Protobuf-based formats like GAM. One thing we don't have is an std::istream implementation on BGZF files.
We usually read FASTA files with FastaReference from the fastahack library: https://github.com/vgteam/fastahack/blob/75f12d25df9416b9d49b84c70dcc58406afce11a/Fasta.h#L60, but fastahack doesn't link in htslib and can't read BGZF files.
There are htslib FASTA functions like faidx_fetch_seq64 that know how to use a BGZF FASTA file (I think with a second .gzi index next to the .fai?), but we'd probably have to write a FastaReference-shaped C++-style wrapper around them to manage things like converting malloc'd memory blocks to C++ strings.
We probably shouldn't write our own istream over GBZF for formats we parse: it looks like one exists already that we could maybe use for GFF3/GTF.
@ekg Has anyone ever wanted .fa.gz support from fastahack before? Do you have a recommended solution for that?
hey Adam, no, but I was just monkey patching samtools so the .fa.gz reading is thread safe. Would that be helpful?
Oh no it's not thread safe already? We'd definitely want it to be if we wanted to put a fastahack-shaped facade over it. We could do a lot of locking and caching but that would be a pain and having the library do it would be better.
#4622 ended up adding an explicit error when you pass a gzipped fasta - are you good to close this issue now?
We might want to keep this open, since it would be nice if we could work with compressed FASTA files.