ncbi-cxx-toolkit-public
ncbi-cxx-toolkit-public copied to clipboard
blastdbcmd: concatenated/malformed header in FASTA output (default) for redundant sequences (nt DB)
When looking for very short sequences in a recent BLAST nt database retrieved using blastdbcmd
, I have found that some of these FASTA sequences (and many others not so short) have a very long header that seems the concatenation of many headers of similar type. Below is an example:
$ blastdbcmd -db nt -entry X57170.1
>X57170.1 B.taurus 5S rRNA gene >4W1Z_7 Chain 7, 5S ribosomal RNA >4W21_7 Chain 7, 5S ribosomal RNA >4W24_7 Chain 7, 5S ribosomal RNA >4W26_7 Chain 7, 5S ribosomal RNA >6FRK_7 Chain 7, 5S ribosomal RNA >3J7O_7 Chain 7, 5S ribosomal RNA >3J7Q_7 Chain 7, 5S ribosomal RNA >3J92_7 Chain 7, 5S rRNA >6LQM_5 Chain 5, 5S rRNA >6LSR_5 Chain 5, 5S rRNA >6LSS_5 Chain 5, 5S rRNA >6LU8_5 Chain 5, 5S rRNA >6FTG_v Chain v, 5S ribosomal RNA >6FTI_v Chain v, 5S ribosomal RNA >6FTJ_v Chain v, 5S ribosomal RNA >7BHP_L7 Chain L7, 5S ribosomal RNA >7A01_d2 Chain d2, 5S RIBOSOMAL RNA >6EK0_L7 Chain L7, 5S ribosomal RNA >6HCF_72 Chain 72, 5S ribosomal RNA >6HCJ_71 Chain 71, 5S ribosomal RNA >6HCM_72 Chain 72, 5S ribosomal RNA >6HCQ_71 Chain 71, 5S ribosomal RNA >3J7P_7 Chain 7, 5S ribosomal RNA >3J7R_7 Chain 7, 5S ribosomal RNA >3JAG_7 Chain 7, 5S ribosomal RNA >3JAH_7 Chain 7, 5S ribosomal RNA >3JAI_7 Chain 7, 5S ribosomal RNA >3JAJ_7 Chain 7, 5S ribosomal RNA >3JAN_7 Chain 7, 5S ribosomal RNA >5LZS_7 Chain 7, 5S ribosomal RNA >5LZT_7 Chain 7, 5S ribosomal RNA >5LZU_7 Chain 7, 5S ribosomal RNA >5LZV_7 Chain 7, 5S ribosomal RNA >5LZW_7 Chain 7, 5S ribosomal RNA >5LZX_7 Chain 7, 5S ribosomal RNA >5LZY_7 Chain 7, 5S ribosomal RNA >5LZZ_7 Chain 7, 5S ribosomal RNA >6MTB_7 Chain 7, 5S rRNA >6MTC_7 Chain 7, 5S rRNA >6MTD_7 Chain 7, 5S rRNA >6MTE_7 Chain 7, 5S rRNA >6QZP_L7 Chain L7, 5S rRNA (120-MER) >6R5Q_7 Chain 7, 5S rRNA >6R6G_7 Chain 7, 5S ribosomal RNA >6R6P_7 Chain 7, 5S ribosomal RNA >6R7Q_7 Chain 7, 5S ribosomal RNA >6SGC_74 Chain 74, 5S ribosomal RNA >6T59_74 Chain 74, 5S ribosomal RNA >6Y0G_L7 Chain L7, 5S rRNA >6Y2L_L7 Chain L7, 5S ribosomal RNA >6Y57_L7 Chain L7, 5S ribosomal RNA >6ZVK_d2 Chain d2, 5S RIBOSOMAL RNA >7NWG_71 Chain 71, 5S Ribosomal RNA >7NWH_7 Chain 7, 5S Ribosomal RNA >7NWI_7 Chain 7, 5S ribosomal RNA >7OBR_7 Chain 7, 5S ribosomal RNA >7MDZ_7 Chain 7, 5S rRNA >7CPU_L7 Chain L7, Mus musculus 5S ribosomal RNA >7CPV_L7 Chain L7, Mus musculus 5S ribosomal RNA >7QWR_7 Chain 7, 5S ribosomal RNA >7QWS_7 Chain 7, 5S rRNA >7QWQ_7 Chain 7, 5S ribosomal RNA >7TOQ_A5S Chain A5S, 5S rRNA >7TOR_A5S Chain A5S, 5S rRNA >7UCK_7 Chain 7, 5S rRNA >7UCJ_7 Chain 7, 5S rRNA >7OYD_K Chain K, 5S rRNA >7TM3_u Chain u, 5S ribosomal RNA >7TUT_u Chain u, 5S ribosomal RNA >8B6C_7 Chain 7, 5S rRNA >8B5L_7 Chain 7, 5S rRNA >8G5Y_L7 Chain L7, 5S rRNA >8GLP_L7 Chain L7, 5S rRNA >8BTK_B7 Chain B7, 5S rRNA >8BPO_B1 Chain B1, 5S ribosomal RNA >8P2K_B7 Chain B7, 5S rRNA >8IDT_5 Chain 5, 5S rRNA >8IDY_5 Chain 5, 5S rRNA >8IE3_5 Chain 5, 5S rRNA >8INE_5 Chain 5, 5S rRNA >8INF_5 Chain 5, 5S rRNA >8IR1_W Chain W, 5S RNA >8IR3_5 Chain 5, 5S RNA
GTCTACGGCCATACCACCCTGAACGCGCCCGATCTCGTCTGATCTCGGAAGCTAAGCAGGGTCGGGCCTGGTTAGTACTT
GGATGGGAGACCGCCTGGGAATACCGGGTGCTGTAGGCTT
Trying other entry that is part of that long header leads to exactly the same result:
$ blastdbcmd -db nt -entry 7UCK_7
>X57170.1 B.taurus 5S rRNA gene >4W1Z_7 Chain 7, 5S ribosomal RNA >4W21_7 Chain 7, 5S ribosomal RNA >4W24_7 Chain 7, 5S ribosomal RNA >4W26_7 Chain 7, 5S ribosomal RNA >6FRK_7 Chain 7, 5S ribosomal RNA >3J7O_7 Chain 7, 5S ribosomal RNA >3J7Q_7 Chain 7, 5S ribosomal RNA >3J92_7 Chain 7, 5S rRNA >6LQM_5 Chain 5, 5S rRNA >6LSR_5 Chain 5, 5S rRNA >6LSS_5 Chain 5, 5S rRNA >6LU8_5 Chain 5, 5S rRNA >6FTG_v Chain v, 5S ribosomal RNA >6FTI_v Chain v, 5S ribosomal RNA >6FTJ_v Chain v, 5S ribosomal RNA >7BHP_L7 Chain L7, 5S ribosomal RNA >7A01_d2 Chain d2, 5S RIBOSOMAL RNA >6EK0_L7 Chain L7, 5S ribosomal RNA >6HCF_72 Chain 72, 5S ribosomal RNA >6HCJ_71 Chain 71, 5S ribosomal RNA >6HCM_72 Chain 72, 5S ribosomal RNA >6HCQ_71 Chain 71, 5S ribosomal RNA >3J7P_7 Chain 7, 5S ribosomal RNA >3J7R_7 Chain 7, 5S ribosomal RNA >3JAG_7 Chain 7, 5S ribosomal RNA >3JAH_7 Chain 7, 5S ribosomal RNA >3JAI_7 Chain 7, 5S ribosomal RNA >3JAJ_7 Chain 7, 5S ribosomal RNA >3JAN_7 Chain 7, 5S ribosomal RNA >5LZS_7 Chain 7, 5S ribosomal RNA >5LZT_7 Chain 7, 5S ribosomal RNA >5LZU_7 Chain 7, 5S ribosomal RNA >5LZV_7 Chain 7, 5S ribosomal RNA >5LZW_7 Chain 7, 5S ribosomal RNA >5LZX_7 Chain 7, 5S ribosomal RNA >5LZY_7 Chain 7, 5S ribosomal RNA >5LZZ_7 Chain 7, 5S ribosomal RNA >6MTB_7 Chain 7, 5S rRNA >6MTC_7 Chain 7, 5S rRNA >6MTD_7 Chain 7, 5S rRNA >6MTE_7 Chain 7, 5S rRNA >6QZP_L7 Chain L7, 5S rRNA (120-MER) >6R5Q_7 Chain 7, 5S rRNA >6R6G_7 Chain 7, 5S ribosomal RNA >6R6P_7 Chain 7, 5S ribosomal RNA >6R7Q_7 Chain 7, 5S ribosomal RNA >6SGC_74 Chain 74, 5S ribosomal RNA >6T59_74 Chain 74, 5S ribosomal RNA >6Y0G_L7 Chain L7, 5S rRNA >6Y2L_L7 Chain L7, 5S ribosomal RNA >6Y57_L7 Chain L7, 5S ribosomal RNA >6ZVK_d2 Chain d2, 5S RIBOSOMAL RNA >7NWG_71 Chain 71, 5S Ribosomal RNA >7NWH_7 Chain 7, 5S Ribosomal RNA >7NWI_7 Chain 7, 5S ribosomal RNA >7OBR_7 Chain 7, 5S ribosomal RNA >7MDZ_7 Chain 7, 5S rRNA >7CPU_L7 Chain L7, Mus musculus 5S ribosomal RNA >7CPV_L7 Chain L7, Mus musculus 5S ribosomal RNA >7QWR_7 Chain 7, 5S ribosomal RNA >7QWS_7 Chain 7, 5S rRNA >7QWQ_7 Chain 7, 5S ribosomal RNA >7TOQ_A5S Chain A5S, 5S rRNA >7TOR_A5S Chain A5S, 5S rRNA >7UCK_7 Chain 7, 5S rRNA >7UCJ_7 Chain 7, 5S rRNA >7OYD_K Chain K, 5S rRNA >7TM3_u Chain u, 5S ribosomal RNA >7TUT_u Chain u, 5S ribosomal RNA >8B6C_7 Chain 7, 5S rRNA >8B5L_7 Chain 7, 5S rRNA >8G5Y_L7 Chain L7, 5S rRNA >8GLP_L7 Chain L7, 5S rRNA >8BTK_B7 Chain B7, 5S rRNA >8BPO_B1 Chain B1, 5S ribosomal RNA >8P2K_B7 Chain B7, 5S rRNA >8IDT_5 Chain 5, 5S rRNA >8IDY_5 Chain 5, 5S rRNA >8IE3_5 Chain 5, 5S rRNA >8INE_5 Chain 5, 5S rRNA >8INF_5 Chain 5, 5S rRNA >8IR1_W Chain W, 5S RNA >8IR3_5 Chain 5, 5S RNA
GTCTACGGCCATACCACCCTGAACGCGCCCGATCTCGTCTGATCTCGGAAGCTAAGCAGGGTCGGGCCTGGTTAGTACTT
GGATGGGAGACCGCCTGGGAATACCGGGTGCTGTAGGCTT
It seems blastdbcmd
is concatenating all the headers of the entries with the same sequence when outputting each one of the redundant entries in FASTA format (default outfmt
). The concatenated header is malformed as it contains the reserved character for the start of the header (>
) repeated times.
When obtaining the entire BLAST nt database (Oct 13, 2023 version) in fasta format with blastdbcmd -db nt -entry all
this problem arises too: the issue many be present in more than 2.6 million sequences of the approx. 100 million sequences (2.69% of sequences potentially affected). I am using blastdbcmd 2.13.0+
.
One of the worst cases appears when using entry XR_008229698.1
(or any other entry with the same sequence), as it seems there are more than 2400 entries (with exactly the same sequence), all of which end up packed into the same header.
(I forwarded this issue to the BLAST team.)
Hi @khyox ,
Does the addition of the -target_only
command line option address your needs?
$ blastdbcmd -db nt -entry X57170.1 -target_only
>X57170.1 B.taurus 5S rRNA gene
GTCTACGGCCATACCACCCTGAACGCGCCCGATCTCGTCTGATCTCGGAAGCTAAGCAGGGTCGGGCCTGGTTAGTACTT
GGATGGGAGACCGCCTGGGAATACCGGGTGCTGTAGGCTT
This is not the right place to discuss BLAST problems I believe, the right place is specified here https://support.nlm.nih.gov/knowledgebase/article/KA-05184/en-us
@gouriano - it's not a problem I reckon to discuss it wherever it's most convenient/expedient for the users/devs to communicate... so, unless there are any particular reason why it's specifically bad to discuss it in this media (LMK if there is) it's okay.
Thanks all, @vakatov, @christiam, @gouriano. I found blastdbcmd.cpp
in this NCBI public repo (ncbi-cxx-toolkit-public
) at path /src/app/[blastdb/
. Without any relevant indication on the README preventing bug reports using GitHub Issues, I proceeded to report it here. From the conversation, I assume I can keep communicating using this media about this, but LMK otherwise.
@christiam, thanks. My use case is retrieving the entire nt DB with blastdbcmd -db nt -entry all
for downstream processing. Unfortunately, the workaround you mention does not work for this use case.
In case some dates may help, I have checked previous versions of the entire nt database retrieved in this same way, and I can trace the problem back at least until June of 2022. The fraction of sequences affected is similar in all cases. The last version of nt that I retrieved this way but is free from this problem is a version I got in November of 2021. I know this may be of limited utility as I cannot provide specifics about the version numbers, but maybe it gives an idea about when this appeared for first time and hopefully why. Thanks.
@khyox , a possible workaround would be to extract all the sequence IDs from nt, then extract the FASTA for each sequence using the the -target_only
command line option. An outline follows in the pseudo-code below (parallelization is left as an exercise for the reader ;) ). The resulting FASTA will end up having redundant sequences, but won't have multiple sequences in the title. I hope this helps!
blastdbcmd -db nt -dbtype nucl -outfmt %g -out nt-gis.out
for each gi in nt-gis.out {
blastdbcmd -entry $gi -db nt -dbtype nucl -target_only >> nt.fsa
}
Thanks @christiam! I'll do that in parallel (as exercise ;) ) for the next nt version that I'll retrieve soon if the bug is still there. Now, I have several downstream versions (masked, decontaminated, etc.) that I am processing with a python script to "decompress" the multiple concatenated headers into multiple redundant sequences —that takes way less than reprocessing everything. Are you targeting a specific upcoming version of BLAST+ and date for this bug to be solved? 2.16.0 if not 2.15.1, I guess. Thanks.
@khyox, could you please share what specification for the FASTA format you are using? For my own understanding, what file(s) or output formats would be more helpful to you? Thanks!
Many thanks @christiam.
Sorry, should I understand that this was intentionally introduced as a feature? I believed that the practice of using multiple headers separated by ^A
was deprecated. I think the old FASTA format from Lipman&Pearson allowed that, but I don't think many current software is prepared to understand files like that. I even cannot find references for that in the current version of that software (fasta-36.3.8i, May 2023; see section 2.2 in their fasta_guide.pdf). My assumption was that the NCBI standard was not considering multiple headers —for instance, see Genbank's FASTA format, where there is no mention to that. Are you using a different, specific standard for the FASTA output of blastdbcmd
? I so, I can check that and add a parser to all the pipelines using its output.
Anyway, if that "multi-header" was introduced as a feature in blastdbcmd
, I don't think it is a good idea to have that as the default behavior, because:
- It seems deprecated and current software reading FASTA is not prepared for that.
- Previous versions of
blastdbcmd
were not doing that. - It's a potential source of problems downstream since that change of format may go silent for a while —this happened to me as I discovered the problem after more than a year of the first time that it was present in the nt DB in FASTA format as mentioned in an earlier message.
Happy to discuss further!
A colleague suggested a 4th point to the list above: the potential for buffer overflows. For instance, a code reading the entry for XR_008229698.1
in the nt fasta file may not be prepared for the huge header made by the concatenation of more than 2400 individual headers.
@khyox - looks like you got your answer, so I am formally closing this issue. Feel free to re-open if/as needed.