
blastdbcmd: concatenated/malformed header in FASTA output (default) for redundant sequences (nt DB)

Open khyox opened this issue 7 months ago • 10 comments

When looking for very short sequences in a recent BLAST nt database using blastdbcmd, I found that some of the retrieved FASTA records (and many others that are not so short) have a very long header that seems to be the concatenation of many headers of a similar type. Below is an example:

$ blastdbcmd -db nt -entry X57170.1
>X57170.1 B.taurus 5S rRNA gene >4W1Z_7 Chain 7, 5S ribosomal RNA >4W21_7 Chain 7, 5S ribosomal RNA >4W24_7 Chain 7, 5S ribosomal RNA >4W26_7 Chain 7, 5S ribosomal RNA >6FRK_7 Chain 7, 5S ribosomal RNA >3J7O_7 Chain 7, 5S ribosomal RNA >3J7Q_7 Chain 7, 5S ribosomal RNA >3J92_7 Chain 7, 5S rRNA >6LQM_5 Chain 5, 5S rRNA >6LSR_5 Chain 5, 5S rRNA >6LSS_5 Chain 5, 5S rRNA >6LU8_5 Chain 5, 5S rRNA >6FTG_v Chain v, 5S ribosomal RNA >6FTI_v Chain v, 5S ribosomal RNA >6FTJ_v Chain v, 5S ribosomal RNA >7BHP_L7 Chain L7, 5S ribosomal RNA >7A01_d2 Chain d2, 5S RIBOSOMAL RNA >6EK0_L7 Chain L7, 5S ribosomal RNA >6HCF_72 Chain 72, 5S ribosomal RNA >6HCJ_71 Chain 71, 5S ribosomal RNA >6HCM_72 Chain 72, 5S ribosomal RNA >6HCQ_71 Chain 71, 5S ribosomal RNA >3J7P_7 Chain 7, 5S ribosomal RNA >3J7R_7 Chain 7, 5S ribosomal RNA >3JAG_7 Chain 7, 5S ribosomal RNA >3JAH_7 Chain 7, 5S ribosomal RNA >3JAI_7 Chain 7, 5S ribosomal RNA >3JAJ_7 Chain 7, 5S ribosomal RNA >3JAN_7 Chain 7, 5S ribosomal RNA >5LZS_7 Chain 7, 5S ribosomal RNA >5LZT_7 Chain 7, 5S ribosomal RNA >5LZU_7 Chain 7, 5S ribosomal RNA >5LZV_7 Chain 7, 5S ribosomal RNA >5LZW_7 Chain 7, 5S ribosomal RNA >5LZX_7 Chain 7, 5S ribosomal RNA >5LZY_7 Chain 7, 5S ribosomal RNA >5LZZ_7 Chain 7, 5S ribosomal RNA >6MTB_7 Chain 7, 5S rRNA >6MTC_7 Chain 7, 5S rRNA >6MTD_7 Chain 7, 5S rRNA >6MTE_7 Chain 7, 5S rRNA >6QZP_L7 Chain L7, 5S rRNA (120-MER) >6R5Q_7 Chain 7, 5S rRNA >6R6G_7 Chain 7, 5S ribosomal RNA >6R6P_7 Chain 7, 5S ribosomal RNA >6R7Q_7 Chain 7, 5S ribosomal RNA >6SGC_74 Chain 74, 5S ribosomal RNA >6T59_74 Chain 74, 5S ribosomal RNA >6Y0G_L7 Chain L7, 5S rRNA >6Y2L_L7 Chain L7, 5S ribosomal RNA >6Y57_L7 Chain L7, 5S ribosomal RNA >6ZVK_d2 Chain d2, 5S RIBOSOMAL RNA >7NWG_71 Chain 71, 5S Ribosomal RNA >7NWH_7 Chain 7, 5S Ribosomal RNA >7NWI_7 Chain 7, 5S ribosomal RNA >7OBR_7 Chain 7, 5S ribosomal RNA >7MDZ_7 Chain 7, 5S rRNA >7CPU_L7 Chain L7, Mus musculus 5S ribosomal RNA >7CPV_L7 Chain L7, Mus musculus 5S ribosomal RNA >7QWR_7 
Chain 7, 5S ribosomal RNA >7QWS_7 Chain 7, 5S rRNA >7QWQ_7 Chain 7, 5S ribosomal RNA >7TOQ_A5S Chain A5S, 5S rRNA >7TOR_A5S Chain A5S, 5S rRNA >7UCK_7 Chain 7, 5S rRNA >7UCJ_7 Chain 7, 5S rRNA >7OYD_K Chain K, 5S rRNA >7TM3_u Chain u, 5S ribosomal RNA >7TUT_u Chain u, 5S ribosomal RNA >8B6C_7 Chain 7, 5S rRNA >8B5L_7 Chain 7, 5S rRNA >8G5Y_L7 Chain L7, 5S rRNA >8GLP_L7 Chain L7, 5S rRNA >8BTK_B7 Chain B7, 5S rRNA >8BPO_B1 Chain B1, 5S ribosomal RNA >8P2K_B7 Chain B7, 5S rRNA >8IDT_5 Chain 5, 5S rRNA >8IDY_5 Chain 5, 5S rRNA >8IE3_5 Chain 5, 5S rRNA >8INE_5 Chain 5, 5S rRNA >8INF_5 Chain 5, 5S rRNA >8IR1_W Chain W, 5S RNA >8IR3_5 Chain 5, 5S RNA
GTCTACGGCCATACCACCCTGAACGCGCCCGATCTCGTCTGATCTCGGAAGCTAAGCAGGGTCGGGCCTGGTTAGTACTT
GGATGGGAGACCGCCTGGGAATACCGGGTGCTGTAGGCTT

Querying another entry that appears in that long header gives exactly the same result:

$ blastdbcmd -db nt -entry 7UCK_7
>X57170.1 B.taurus 5S rRNA gene >4W1Z_7 Chain 7, 5S ribosomal RNA >4W21_7 Chain 7, 5S ribosomal RNA >4W24_7 Chain 7, 5S ribosomal RNA >4W26_7 Chain 7, 5S ribosomal RNA >6FRK_7 Chain 7, 5S ribosomal RNA >3J7O_7 Chain 7, 5S ribosomal RNA >3J7Q_7 Chain 7, 5S ribosomal RNA >3J92_7 Chain 7, 5S rRNA >6LQM_5 Chain 5, 5S rRNA >6LSR_5 Chain 5, 5S rRNA >6LSS_5 Chain 5, 5S rRNA >6LU8_5 Chain 5, 5S rRNA >6FTG_v Chain v, 5S ribosomal RNA >6FTI_v Chain v, 5S ribosomal RNA >6FTJ_v Chain v, 5S ribosomal RNA >7BHP_L7 Chain L7, 5S ribosomal RNA >7A01_d2 Chain d2, 5S RIBOSOMAL RNA >6EK0_L7 Chain L7, 5S ribosomal RNA >6HCF_72 Chain 72, 5S ribosomal RNA >6HCJ_71 Chain 71, 5S ribosomal RNA >6HCM_72 Chain 72, 5S ribosomal RNA >6HCQ_71 Chain 71, 5S ribosomal RNA >3J7P_7 Chain 7, 5S ribosomal RNA >3J7R_7 Chain 7, 5S ribosomal RNA >3JAG_7 Chain 7, 5S ribosomal RNA >3JAH_7 Chain 7, 5S ribosomal RNA >3JAI_7 Chain 7, 5S ribosomal RNA >3JAJ_7 Chain 7, 5S ribosomal RNA >3JAN_7 Chain 7, 5S ribosomal RNA >5LZS_7 Chain 7, 5S ribosomal RNA >5LZT_7 Chain 7, 5S ribosomal RNA >5LZU_7 Chain 7, 5S ribosomal RNA >5LZV_7 Chain 7, 5S ribosomal RNA >5LZW_7 Chain 7, 5S ribosomal RNA >5LZX_7 Chain 7, 5S ribosomal RNA >5LZY_7 Chain 7, 5S ribosomal RNA >5LZZ_7 Chain 7, 5S ribosomal RNA >6MTB_7 Chain 7, 5S rRNA >6MTC_7 Chain 7, 5S rRNA >6MTD_7 Chain 7, 5S rRNA >6MTE_7 Chain 7, 5S rRNA >6QZP_L7 Chain L7, 5S rRNA (120-MER) >6R5Q_7 Chain 7, 5S rRNA >6R6G_7 Chain 7, 5S ribosomal RNA >6R6P_7 Chain 7, 5S ribosomal RNA >6R7Q_7 Chain 7, 5S ribosomal RNA >6SGC_74 Chain 74, 5S ribosomal RNA >6T59_74 Chain 74, 5S ribosomal RNA >6Y0G_L7 Chain L7, 5S rRNA >6Y2L_L7 Chain L7, 5S ribosomal RNA >6Y57_L7 Chain L7, 5S ribosomal RNA >6ZVK_d2 Chain d2, 5S RIBOSOMAL RNA >7NWG_71 Chain 71, 5S Ribosomal RNA >7NWH_7 Chain 7, 5S Ribosomal RNA >7NWI_7 Chain 7, 5S ribosomal RNA >7OBR_7 Chain 7, 5S ribosomal RNA >7MDZ_7 Chain 7, 5S rRNA >7CPU_L7 Chain L7, Mus musculus 5S ribosomal RNA >7CPV_L7 Chain L7, Mus musculus 5S ribosomal RNA >7QWR_7 
Chain 7, 5S ribosomal RNA >7QWS_7 Chain 7, 5S rRNA >7QWQ_7 Chain 7, 5S ribosomal RNA >7TOQ_A5S Chain A5S, 5S rRNA >7TOR_A5S Chain A5S, 5S rRNA >7UCK_7 Chain 7, 5S rRNA >7UCJ_7 Chain 7, 5S rRNA >7OYD_K Chain K, 5S rRNA >7TM3_u Chain u, 5S ribosomal RNA >7TUT_u Chain u, 5S ribosomal RNA >8B6C_7 Chain 7, 5S rRNA >8B5L_7 Chain 7, 5S rRNA >8G5Y_L7 Chain L7, 5S rRNA >8GLP_L7 Chain L7, 5S rRNA >8BTK_B7 Chain B7, 5S rRNA >8BPO_B1 Chain B1, 5S ribosomal RNA >8P2K_B7 Chain B7, 5S rRNA >8IDT_5 Chain 5, 5S rRNA >8IDY_5 Chain 5, 5S rRNA >8IE3_5 Chain 5, 5S rRNA >8INE_5 Chain 5, 5S rRNA >8INF_5 Chain 5, 5S rRNA >8IR1_W Chain W, 5S RNA >8IR3_5 Chain 5, 5S RNA
GTCTACGGCCATACCACCCTGAACGCGCCCGATCTCGTCTGATCTCGGAAGCTAAGCAGGGTCGGGCCTGGTTAGTACTT
GGATGGGAGACCGCCTGGGAATACCGGGTGCTGTAGGCTT

It seems that blastdbcmd concatenates the headers of all entries that share the same sequence whenever it outputs any one of those redundant entries in FASTA format (the default outfmt). The concatenated header is malformed, as it contains the reserved header-start character (>) many times.
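
To illustrate the structure of such a header, here is a minimal Python sketch that splits one merged defline back into individual entries. It assumes that the entries are joined by a space followed by ">" and that no title contains that two-character sequence itself; the function name split_merged_defline is only illustrative and is not part of BLAST+.

```python
def split_merged_defline(defline: str) -> list[tuple[str, str]]:
    """Split a merged defline into (identifier, title) pairs.

    Assumption: entries are joined by " >" and titles never contain " >".
    """
    text = defline.strip()
    if text.startswith(">"):
        text = text[1:]  # drop the leading header marker
    entries = []
    for part in text.split(" >"):
        part = part.strip()
        if not part:
            continue
        # The identifier is the first whitespace-delimited token.
        seq_id, _, title = part.partition(" ")
        entries.append((seq_id, title.strip()))
    return entries


# Shortened version of the header shown above.
example = (
    ">X57170.1 B.taurus 5S rRNA gene "
    ">4W1Z_7 Chain 7, 5S ribosomal RNA "
    ">4W21_7 Chain 7, 5S ribosomal RNA"
)
for seq_id, title in split_merged_defline(example):
    print(seq_id, title, sep="\t")
```

Applied to the full header above, this would yield one pair per redundant entry, all of which belong to the same 120-nt sequence.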

When dumping the entire BLAST nt database (Oct 13, 2023 version) in FASTA format with blastdbcmd -db nt -entry all, the problem arises too: the issue may be present in more than 2.6 million of the approx. 100 million sequences (2.69% of sequences potentially affected). I am using blastdbcmd 2.13.0+.
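
A count like this can be approximated with a short script. The sketch below is not the exact procedure used for the figures above; it simply counts header lines in a FASTA stream that contain an additional " >" after the leading ">". The function name count_merged_headers and the tiny in-memory sample are illustrative assumptions.

```python
import io
from typing import Iterable


def count_merged_headers(lines: Iterable[str]) -> tuple[int, int]:
    """Return (total_records, records_with_merged_header).

    Assumption: a header is "merged" if it contains " >" after the
    leading ">" that starts the record.
    """
    total = merged = 0
    for line in lines:
        if line.startswith(">"):
            total += 1
            if " >" in line[1:]:
                merged += 1
    return total, merged


# Tiny in-memory sample standing in for a real FASTA dump.
sample = io.StringIO(
    ">A1 first title >B2 second title\n"
    "ACGT\n"
    ">C3 lone title\n"
    "GGCC\n"
)
total, merged = count_merged_headers(sample)
print(f"{merged} of {total} records ({100 * merged / total:.2f}%) have a merged header")
```

For a real dump, the same function can be fed an open file handle (for example a hypothetical nt.fasta written by the command above), since it reads line by line and does not load the file into memory.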

One of the worst cases appears with entry XR_008229698.1 (or any other entry with the same sequence): there seem to be more than 2400 entries with exactly the same sequence, and all of them end up packed into a single header.

khyox, Dec 05 '23 01:12