Different output formats report different numbers of HSPs
This is probably a dumb question, but I don't understand why when I switch output format from 6 (blast-style) to 102 (taxonomy), the number of pairwise alignments changes. I made a small sample FASTA and queried it against uniref100, and the number of alignments drops from 1525 to 52. Here are the two command lines:
diamond blastp --db ../2022-07-26metaproteomics/uniref100.db.dmnd --query sample.fasta --out outfmt6.diamond.txt --outfmt 6 qseqid stitle pident evalue mismatch --min-orf 1 --log
diamond blastp --db ../2022-07-26metaproteomics/uniref100.db.dmnd --query sample.fasta --out outfmt102.diamond.txt --outfmt 102 --min-orf 1 --log
Log files and sample fasta are attached. outfmt6.diamond.log.txt outfmt102.diamond.log.txt sample.fasta.txt
The taxonomic format does not output alignments. It merges all alignments for one query into one classification using the LCA algorithm.
OK, that makes sense. So in my example, there are four sequences that have at least one associated alignment:
% awk '{print $1}' outfmt6.diamond.txt | sort -u
scan111
scan139
scan178
scan205
But there are no taxonomic assignments at all. Why is that? I would have expected to see non-zero entries for those four sequences.
It's possible that the targets they align against have no taxonomy assignment, you can check for that.
As far as I can tell from the annotations in uniref100, they do have taxonomy assignments. For example, here are the entries for the first sequence:
scan111 UniRef100_A0A2M7VQ01 RelA/SpoT family protein n=2 Tax=Bacteria TaxID=2 RepID=A0A2M7VQ01_9FLAO 92.6 2.78e-05 2
scan111 UniRef100_A0A1J4TFK1 RelA/SpoT family protein n=1 Tax=Flavobacteriaceae bacterium CG1_02_35_72 TaxID=1805158 RepID=A0A1J4TFK1_9FLAO 92.6 2.78e-05 2
scan111 UniRef100_A0A3R7T962 Bifunctional (P)ppGpp synthetase/guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase n=1 Tax=Flavobacteriaceae bacterium TMED81 TaxID=1986719 RepID=A0A3R7T962_9FLAO 88.9 5.20e-05 3
scan111 UniRef100_A0A2G6F384 RelA/SpoT family protein n=1 Tax=Flavobacteriia bacterium TaxID=2044941 RepID=A0A2G6F384_9BACT 88.9 7.12e-05 3
scan111 UniRef100_UPI001CD3543A RelA/SpoT family protein n=1 Tax=Lutimonas saemankumensis TaxID=483016 RepID=UPI001CD3543A 85.2 9.74e-05 4
scan111 UniRef100_A0A2E8SD77 RelA/SpoT family protein n=1 Tax=Flavobacteriaceae bacterium TaxID=1871037 RepID=A0A2E8SD77_9FLAO 85.2 9.74e-05 4
scan111 UniRef100_A0A2D6Y7W9 RelA/SpoT family protein n=1 Tax=Flavobacteriaceae bacterium TaxID=1871037 RepID=A0A2D6Y7W9_9FLAO 85.2 9.74e-05 4
scan111 UniRef100_A0A7R8X324 Hypothetical protein (Fragment) n=1 Tax=Cyprideis torosa TaxID=163714 RepID=A0A7R8X324_9CRUS 85.2 1.12e-04 4
scan111 UniRef100_A0A432IBK3 RelA/SpoT family protein n=1 Tax=Flavobacteriia bacterium TaxID=2044941 RepID=A0A432IBK3_9BACT 85.2 1.33e-04 4
scan111 UniRef100_A0A7C6IMR6 Bifunctional (P)ppGpp synthetase/guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase n=1 Tax=Bacteroidales bacterium TaxID=2030927 RepID=A0A7C6IMR6_9BACT 85.2 1.82e-04 4
scan111 UniRef100_A0A838ZTQ7 Bifunctional (P)ppGpp synthetase/guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase n=1 Tax=Moheibacter lacus TaxID=2745851 RepID=A0A838ZTQ7_9FLAO 85.2 1.82e-04 4
scan111 UniRef100_A0A3D6CIB6 RelA/SpoT family protein n=1 Tax=Polaribacter sp. TaxID=1920175 RepID=A0A3D6CIB6_9FLAO 88.9 1.82e-04 3
scan111 UniRef100_A0A2I2MBN6 MFS transporter n=2 Tax=Tenacibaculum finnmarkense TaxID=2781243 RepID=A0A2I2MBN6_9FLAO 85.2 2.49e-04 4
scan111 UniRef100_A0A2H1XY37 MFS transporter n=2 Tax=Tenacibaculum finnmarkense TaxID=2781243 RepID=A0A2H1XY37_9FLAO 85.2 2.49e-04 4
scan111 UniRef100_A0A2H1YCR3 MFS transporter n=2 Tax=Tenacibaculum finnmarkense TaxID=2781243 RepID=A0A2H1YCR3_9FLAO 85.2 2.49e-04 4
scan111 UniRef100_A0A2H1YH65 MFS transporter n=1 Tax=Tenacibaculum piscium TaxID=1458515 RepID=A0A2H1YH65_9FLAO 85.2 2.49e-04 4
scan111 UniRef100_UPI001F227220 RelA/SpoT family protein n=1 Tax=Tenacibaculum piscium TaxID=1458515 RepID=UPI001F227220 85.2 2.49e-04 4
scan111 UniRef100_UPI001EFAD267 RelA/SpoT family protein n=1 Tax=Tenacibaculum piscium TaxID=1458515 RepID=UPI001EFAD267 85.2 2.49e-04 4
scan111 UniRef100_A0A2H1XT05 MFS transporter n=1 Tax=Tenacibaculum dicentrarchi TaxID=669041 RepID=A0A2H1XT05_9FLAO 85.2 2.49e-04 4
scan111 UniRef100_A0A839AMJ4 Bifunctional (P)ppGpp synthetase/guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase n=1 Tax=Tenacibaculum pelagium TaxID=2759527 RepID=A0A839AMJ4_9FLAO 85.2 2.49e-04 4
scan111 UniRef100_A0A0U3K3R5 MFS transporter n=1 Tax=Tenacibaculum dicentrarchi TaxID=669041 RepID=A0A0U3K3R5_9FLAO 85.2 2.49e-04 4
scan111 UniRef100_A0A3Q8RRV4 Bifunctional (P)ppGpp synthetase/guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase n=2 Tax=Tenacibaculum TaxID=104267 RepID=A0A3Q8RRV4_9FLAO 85.2 2.49e-04 4
scan111 UniRef100_A0A2G1BXG4 RelA/SpoT family protein n=2 Tax=Tenacibaculum TaxID=104267 RepID=A0A2G1BXG4_9FLAO 85.2 2.49e-04 4
scan111 UniRef100_A0A1M5GLM6 GTP pyrophosphokinase n=1 Tax=Tenacibaculum mesophilum TaxID=104268 RepID=A0A1M5GLM6_9FLAO 85.2 2.49e-04 4
scan111 UniRef100_A0A4R6THM8 GTP pyrophosphokinase n=1 Tax=Tenacibaculum caenipelagi TaxID=1325435 RepID=A0A4R6THM8_9FLAO 85.2 2.49e-04 4
Shouldn't these be picked up by the NCBI taxonomy files?
No, the NCBI files do not map Uniprot accessions. I'll have to look into providing taxonomy mapping for uniprot.
Is there a different database you recommend? I'd like to get something non-redundant that is as extensive as possible but only with sequences that have taxonomic assignments.
The nr database would be the most obvious choice for an extensive database, and I think most sequences should have a taxonomic assignment. AnnoTree could also be an option, see here: https://journals.asm.org/doi/full/10.1128/msystems.01408-21
Thanks, I will try those. I agree with you that adding support for uniprot accessions would probably be good, but you don't need to prioritize it on my account.
OK, I downloaded the NR database from NCBI, but when I map a bunch of human peptides I still get only a tiny fraction that are assigned a taxonomic unit. Here is the command line for building the DB:
diamond makedb --in /net/noble/vol1/data/ncbi/pub/blast/db/FASTA/nr.gz --db nr.db --taxonmap /net/noble/vol1/data/ncbi/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz --taxonnodes /net/noble/vol1/data/ncbi/pub/taxonomy/nodes.dmp --taxonnames /net/noble/vol1/data/ncbi/pub/taxonomy/names.dmp --log
And here is the command line for the query:
diamond blastp --db nr.db --query human.fasta --out human.diamond.txt --outfmt 102 --log
The fasta file contains human peptide sequences, but only 8 out of 11,688 get matched. The input file, log files, and output file are attached. I must be doing something wrong, but I'm not sure what it is.
human.diamond.log.txt human.diamond.txt human.fasta.txt nr.db.log.txt
Diamond was not designed for such short sequences. With some tuning of the parameters that may work but also be pretty slow if you use such a big database as the NR. Is BLAST too slow for you or what's your reason for using Diamond?
I was following the lead of this article, which used DIAMOND for taxonomic assignments for metaproteomics data: https://pubs.acs.org/doi/10.1021/acs.jproteome.2c00334
But yes, I suppose I could just use BLAST.
It occurs to me that maybe the problem is that I didn't unzip the nr.gz file before doing the makedb command.
I was following the lead of this article, which used DIAMOND for taxonomic assignments for metaproteomics data: https://pubs.acs.org/doi/10.1021/acs.jproteome.2c00334
They used Diamond for this sort of data, but surely must have modified the parameters. Let me know in case it doesn't work with BLAST, and I'll take a look.
It occurs to me that maybe the problem is that I didn't unzip the nr.gz file before doing the makedb command.
This is no problem, Diamond can read gzipped files.
I had already contacted the authors of that JPR paper, and here is the command line they used:
diamond blastp -d uniref100 --min-score 1 -q kaiko.fasta -o kaiko.dmd -f 6 qseqid stitle pident evalue mismatch
They then post-processed the results to consider only exact matches. They told me that they fail to map many of the sequences at all. If you have any suggestions for how to nudge DIAMOND toward handling short sequences, I'd love to hear them.
FWIW, I tried this exact mapping tool, but to index NR (which is 254 G), you need a machine with >254 G of RAM.
Just circling back on this. Do you have any suggestions for parameter settings that would improve DIAMOND's handling of short sequences?
These parameters seem to work ok for me for your data:
-c1 --ultra-sensitive -s1 --id2 1 --short-query-ungapped-bitscore 1 -e10000 --algo 0 --masking 0 --gapped-filter-evalue 0
To use all of them, you need to compile from source using the cmake option -DEXTRA=ON. The number of shapes -s and the evalue could also be increased for higher sensitivity.
Thanks! This is super helpful. Now it succeeds in matching 29% of the peptides.