amptk icon indicating copy to clipboard operation
amptk copied to clipboard

add SILVA/greengenes DB to 16S amptk taxonomy

Open nextgenusfs opened this issue 7 years ago • 11 comments

problem is reformatting the taxonomy information in format needed by UTAX/SINTAX. I've previously looked at SILVA taxonomy -- the taxonomy appeared to be a hot mess (I'm not a bacteriologist).... So the challenge will be convert the taxonomy strings to proper format

example taxonomy strings:

>BOLD:ACI6695;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Coleoptera,f:Elateridae,g:Nipponoelater,s:Nipponoelater babai
>S004604051;tax=k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Hymenochaetales,f:Hymenochaetaceae,g:Inonotus,s:Sanghuangporus zonatus
>S004127186;tax=k:Fungi,p:Ascomycota
>S004061552;tax=k:Fungi,p:Ascomycota,c:Eurotiomycetes,s:Pyrenula sanguinea

nextgenusfs avatar Oct 03 '18 17:10 nextgenusfs

update: I've tried this, but the size of the database is large and searches are incredibly slow, need to figure out why before doing this. There seems to be a lot of redundancy in the database, one solution would be to dereplicate and then LCA, but not sure that will then be any better.

nextgenusfs avatar Feb 22 '19 13:02 nextgenusfs

Failed to create database UTAX and SINTAX but usearch created. my command:

#!/bin/bash SILVA="/media/bioadmin/Think/MiseqDATA/SILVA_Database"

download the SILVA_132_SSURef_NR99 database and extract the V4 region

mkdir $SILVA && cd $SILVA RELEASE=132 URL="https://www.arb-silva.de/fileadmin/silva_databases/release_${RELEASE}/Exports" INPUT="SILVA_${RELEASE}_SSURef_Nr99_tax_silva.fasta.gz"

Download and check

wget -c ${URL}/${INPUT}{,.md5} && md5sum -c ${INPUT}.md5

Define variables and output files

Primers V3-V4 region (~465 bp) 341F & 805R

OUTPUT="${INPUT/.fasta.gz/_341F_805R.fasta}" LOG="${INPUT/.fasta.gz/_341F_805.log}" PRIMER_F="CCTACGGGNGGCWGCAG" PRIMER_R="GACTACHVGGGTATCTAATCC" ANTI_PRIMER_R="GGATTAGATACCCBDGTAGTC" MIN_LENGTH=32 MIN_F=$(( ${#PRIMER_F} * 2 / 3 )) MIN_R=$(( ${#PRIMER_R} * 2 / 3 ))

Extract target region using forward & reverse primers

zcat "${INPUT}" | sed '/^>/ ! s/U/T/g' |
cutadapt --discard-untrimmed --minimum-length ${MIN_LENGTH} -g "${PRIMER_F}" -O "${MIN_F}" - 2> "${LOG}" |
cutadapt --discard-untrimmed --minimum-length ${MIN_LENGTH} -a "${ANTI_PRIMER_R}" -O "${MIN_R}" - 2>> "${LOG}" |
sed '/^>/ s/;/|/g ; /^>/ s/ //g ; /^>/ s// /1' > "${OUTPUT}"

Format extracted SILVA database

sed 's/://g' $SILVA/${OUTPUT} |
sed 's/-/
/g' | sed 's/(//g' | sed 's/)//g' |
sed 's/ Bacteria/;tax=d:Bacteria,/g' |
sed 's/|/p:"/1' | sed 's/|/",c:/1' | sed 's/|/,o:/1' |
sed 's/|/,f:/1' | sed 's/|/,g:/1' | sed 's/|/,s:/1' > $SILVA/AMPtk_${OUTPUT}

amptk database = This is my output:

amptk database -i $SILVA/AMPtk_${OUTPUT} -o 16S --format off --create_db usearch
--skip_trimming --install --primer_required none --derep_fulllength

[12:50:36 PM]: OS: Ubuntu 14.04, 8 cores, ~ 8 GB RAM. Python: 3.4.3 [12:50:36 PM]: AMPtk v1.4.0, USEARCH v9.2.64, VSEARCH v2.9.0 [12:50:36 PM]: Base name set to: /usr/local/lib/python3.4/dist-packages/amptk/DB/16S [12:50:36 PM]: Working on file: /media/bioadmin/Think/MiseqDATA/SILVA_Database/AMPtk_SILVA_132_SSURef_Nr99_tax_silva_341F_805R.fasta [12:50:49 PM]: 594,027 records loaded [12:50:49 PM]: Using 8 cpus to process data [12:51:29 PM]: 593,752 records passed (99.95%) [12:51:29 PM]: Errors: 0 no taxonomy info, 0 no ID, 0 length out of range, 275 too many ambiguous bases, 0 no primers found [12:51:29 PM]: Now dereplicating sequences (collapsing identical sequences) [12:51:40 PM]: 359,733 records passed (60.56%) [12:51:41 PM]: Creating USEARCH Database (VSEARCH) [12:52:08 PM]: Database /usr/local/lib/python3.4/dist-packages/amptk/DB/16S.udb created successfully

amptk database -i $SILVA/AMPtk_${OUTPUT} -o 16S_SINTAX --format off --create_db sintax
--skip_trimming --install --primer_required none --derep_fulllength

[12:52:46 PM]: OS: Ubuntu 14.04, 8 cores, ~ 8 GB RAM. Python: 3.4.3 [12:52:46 PM]: AMPtk v1.4.0, USEARCH v9.2.64, VSEARCH v2.9.0 [12:52:46 PM]: Base name set to: /usr/local/lib/python3.4/dist-packages/amptk/DB/16S_SINTAX [12:52:46 PM]: Working on file: /media/bioadmin/Think/MiseqDATA/SILVA_Database/AMPtk_SILVA_132_SSURef_Nr99_tax_silva_341F_805R.fasta [12:52:59 PM]: 594,027 records loaded [12:52:59 PM]: Using 8 cpus to process data [12:53:36 PM]: 593,752 records passed (99.95%) [12:53:36 PM]: Errors: 0 no taxonomy info, 0 no ID, 0 length out of range, 275 too many ambiguous bases, 0 no primers found [12:53:36 PM]: Now dereplicating sequences (collapsing identical sequences) [12:53:47 PM]: 359,733 records passed (60.56%) [12:53:47 PM]: Creating SINTAX Database, this may take awhile Traceback (most recent call last): File "/usr/local/bin/amptk", line 735, in main() File "/usr/local/bin/amptk", line 726, in main mod.main(arguments) File "/usr/local/lib/python3.4/dist-packages/amptk/extract_region.py", line 623, in main makeDB(OutName, args=args) File "/usr/local/lib/python3.4/dist-packages/amptk/extract_region.py", line 418, in makeDB amptklib.log.error("There was a problem creating the DB, check the UTAX log file %s" % utax_log) UnboundLocalError: local variable 'utax_log' referenced before assignment

amptk database -i $SILVA/AMPtk_${OUTPUT} -o 16S_UTAX --format off --create_db sintax
--skip_trimming --install --primer_required none --derep_fulllength

[12:55:22 PM]: OS: Ubuntu 14.04, 8 cores, ~ 8 GB RAM. Python: 3.4.3 [12:55:22 PM]: AMPtk v1.4.0, USEARCH v9.2.64, VSEARCH v2.9.0 [12:55:22 PM]: Base name set to: /usr/local/lib/python3.4/dist-packages/amptk/DB/16S_UTAX [12:55:22 PM]: Working on file: /media/bioadmin/Think/MiseqDATA/SILVA_Database/AMPtk_SILVA_132_SSURef_Nr99_tax_silva_341F_805R.fasta [12:55:35 PM]: 594,027 records loaded [12:55:35 PM]: Using 8 cpus to process data [12:56:12 PM]: 593,752 records passed (99.95%) [12:56:12 PM]: Errors: 0 no taxonomy info, 0 no ID, 0 length out of range, 275 too many ambiguous bases, 0 no primers found [12:56:12 PM]: Now dereplicating sequences (collapsing identical sequences) [12:56:22 PM]: 359,733 records passed (60.56%) [12:56:22 PM]: Creating SINTAX Database, this may take awhile Traceback (most recent call last): File "/usr/local/bin/amptk", line 735, in main() File "/usr/local/bin/amptk", line 726, in main mod.main(arguments) File "/usr/local/lib/python3.4/dist-packages/amptk/extract_region.py", line 623, in main makeDB(OutName, args=args) File "/usr/local/lib/python3.4/dist-packages/amptk/extract_region.py", line 418, in makeDB amptklib.log.error("There was a problem creating the DB, check the UTAX log file %s" % utax_log) UnboundLocalError: local variable 'utax_log' referenced before assignment

AlexGaithuma avatar Apr 12 '19 03:04 AlexGaithuma

Likely running out of memory or the taxonomy labels not formatted correctly. Check the logfile to see what usearch died with.

nextgenusfs avatar Apr 12 '19 06:04 nextgenusfs

usearch v9.2.64_i86linux32, 4.0Gb RAM (1040Gb total), 160 cores (C) Copyright 2013-16 Robert C. Edgar, all rights reserved. http://drive5.com/usearch

00:00 37Mb 0.1% Reading 16S_SINTAX.extracted.fa 00:01 84Mb 21.5% Reading 16S_SINTAX.extracted.fa 00:02 174Mb 65.5% Reading 16S_SINTAX.extracted.fa 00:02 247Mb 100.0% Reading 16S_SINTAX.extracted.fa 00:02 214Mb 0.1% Converting to upper case
00:03 214Mb 65.2% Converting to upper case 00:03 214Mb 100.0% Converting to upper case 00:03 215Mb 0.1% Word stats
00:04 215Mb 8.9% Word stats 00:05 215Mb 22.3% Word stats 00:06 215Mb 35.9% Word stats 00:07 215Mb 47.4% Word stats 00:08 215Mb 59.2% Word stats 00:09 215Mb 70.6% Word stats 00:10 215Mb 81.7% Word stats 00:11 215Mb 93.5% Word stats 00:11 215Mb 100.0% Word stats 00:11 215Mb 100.0% Alloc rows 00:11 798Mb 0.1% Build index 00:12 798Mb 2.0% Build index 00:13 798Mb 10.2% Build index 00:14 798Mb 17.5% Build index 00:15 798Mb 24.9% Build index 00:16 798Mb 32.7% Build index 00:17 798Mb 41.0% Build index 00:18 798Mb 49.3% Build index 00:19 798Mb 57.7% Build index 00:20 798Mb 65.7% Build index 00:21 798Mb 73.7% Build index 00:22 798Mb 81.9% Build index 00:23 798Mb 88.4% Build index 00:24 798Mb 96.4% Build index 00:24 798Mb 100.0% Build index 00:24 801Mb 0.0% Initialize taxonomy data 00:25 805Mb 13.9% Initialize taxonomy data 00:26 810Mb 33.0% Initialize taxonomy data 00:27 814Mb 54.0% Initialize taxonomy data 00:28 820Mb 75.6% Initialize taxonomy data 00:29 823Mb 94.2% Initialize taxonomy data 00:29 824Mb 100.0% Initialize taxonomy data 00:29 824Mb 0.0% Building name table
00:29 826Mb 100.0% Building name table 00:29 826Mb 58553 names, tax levels min 0, avg 6.5, max 7

WARNING: 1535 taxonomy nodes have >1 parent

usearch9 -makeudb_sintax 16S_SINTAX.extracted.fa -output 16S_SINTAX.udb -notrunclabels

---Fatal error--- ../utaxdata.cpp(1176) assert failed: Size > 0 && Ids != 0

AlexGaithuma avatar Apr 12 '19 06:04 AlexGaithuma

Hopefully that error makes sense - you have sequences (at least one) that doesn’t have any taxonomy information.

nextgenusfs avatar Apr 12 '19 12:04 nextgenusfs

I am commenting here because it is related, I am trying to build a tailored V4 database out of SILVA NR 132. I guess I've managed to format the headers properly, the fasta has now around 200 Mb. However, after running:

amptk database -i '/media/filipe/84CAA4DCCAA4CC2C/databases/SILVA_fasta/amptk_SILVA_132' -o V4_NR132 --format off --create_db utax --skip_trimming --install --primer_required none --derep_fulllength

I got: [06:30:23 PM]: OS: debian buster/sid, 16 cores, ~ 66 GB RAM. Python: 3.7.3 [06:30:23 PM]: AMPtk v1.4.0, USEARCH v9.2.64, VSEARCH v2.13.6 [06:30:23 PM]: Base name set to: /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132 [06:30:23 PM]: Working on file: /media/filipe/84CAA4DCCAA4CC2C/databases/SILVA_fasta/amptk_SILVA_132 [06:30:33 PM]: 556,586 records loaded [06:30:33 PM]: Using 16 cpus to process data [06:30:46 PM]: 556,389 records passed (99.96%) [06:30:46 PM]: Errors: 0 no taxonomy info, 0 no ID, 1 length out of range, 196 too many ambiguous bases, 0 no primers found [06:30:46 PM]: Now dereplicating sequences (collapsing identical sequences) [06:30:50 PM]: 255,268 records passed (45.86%) [06:30:50 PM]: Creating UTAX Database, this may take awhile [07:03:52 PM]: There was a problem creating the DB, check the UTAX log file /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.utax.log

The utax logfile says:

00:00 37Mb 0.1% Reading /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.extracted.fa^M00:01 112Mb 7 0.6% Reading /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.extracted.fa^M00:01 142Mb 100.0% Reading /ho me/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.extracted.fa 00:01 108Mb 0.1% Converting to upper case ^M00:01 108Mb 10 0.0% Converting to upper case 00:01 109Mb 0.1% Word stats ^M00:02 109Mb 55.9% Word stats^M00:02 109Mb 100.0% Word stats 00:02 109Mb 100.0% Alloc rows 00:02 358Mb 0.1% Build index^M00:03 358Mb 19.3% Build index^M00:04 358Mb 66.9% Build index^M00:04 358Mb 100.0% Build index 00:04 360Mb 0.0% Initialize taxonomy data^M00:05 364Mb 17.6% Initialize taxonomy data^M00:06 373Mb 75.9% Initialize taxonomy data ^M00:06 377Mb 100.0% Initialize taxonomy data 00:06 377Mb 0.0% Building name table ^M00:06 378Mb 100.0% Building name table 00:06 378Mb 41947 names, tax levels min 2, avg 6.6, max 7

WARNING: 1247 taxonomy nodes have >1 parent 00:06 491Mb 0.1% Word stats^M00:07 491Mb 33.8% Word stats^M00:07 491Mb 100.0% Word stats 00:07 491Mb 100.0% Alloc rows 00:07 740Mb 0.1% Build index^M00:08 740Mb 7.0% Build index^M00:09 740Mb ...... 27.6% Distance matrix/usort^M33:02 4.3Gb 27.6% Distance matrix/usort ---Fatal error--- Memory limit of 32-bit process exceeded, 64-bit build required

I guess this is related to the 32bit version of usearch being limited to 4 Gb, unfortunately, 64bit version is expensive. My system has 64 Gb.

Is there any way to create this database using vsearch? Is it possible to split the fasta, loop in the database creation then join files at the end? Any suggestions would be good.

Best,

FilipeMatteoli avatar Nov 01 '19 13:11 FilipeMatteoli

UTAX is specific to usearch. The memory limit doesn’t seem to always be consistent, technically it should be 4 GB but that doesn’t mean it can work on files equal to 4 GB. Anyway the current hybrid method will use the entire database with vsearch for global alignment. So you can randomly sub sample the input data to generate the UTAX and sintax databases whole keeping the whole thing for the usearch database.

nextgenusfs avatar Nov 01 '19 13:11 nextgenusfs

Not sure if this will help now but just in case anyone else wants to convert silva to sintax fasta format, I wrote this.. it won't give the proper taxonomic levels to eukaryotes in silva132 but it will work with -makeudb_sintax. Theo

#!/usr/bin/env python

#reformats SILVA formatted taxonomy in the header of fasta files to be compatible with usearch and sintax dbs.
#useage:
#reformat_silva.py infile.fasta outfile.fasta

import sys

f= open(sys.argv[1],'r')
h= open(sys.argv[2],'w')
tax=['d','p','c','o','f','g','s']


for i in f:
	out=""
	if i[0]==">":
		k=i.split(";")
		if len(k)>7:
			k=k[-7:]
		acc=k[0].split(" ")[0]
		k[-1]=k[-1].rstrip("\n")
		if " " in k[0]:
			k[0]=k[0].split(" ")[1]
		c=-1
		for x in k:
			c=c+1
			out=out+tax[c]+":"+x+","
		
		out=out.rstrip(",")
		h.write(acc+";tax="+out+"\n")
			
		
	else:
		i=i.replace('U','T')
		h.write(i)

I am trying to make the SILVA LSU fasta file into a new LSU database in amptk. I think I have the taxonomy fixed (partially using the script from theo-allnutt-bioinformatics above), but now when I try to create the database, amptk gives me errors. In the command line it says

[Jan 31 03:57 PM]: Creating SINTAX Database, this may take awhile
Traceback (most recent call last):
  File "/Users/smithlab/miniconda/envs/amptk/bin/amptk", line 736, in <module>
    main()
  File "/Users/smithlab/miniconda/envs/amptk/bin/amptk", line 727, in main
    mod.main(arguments)
  File "/Users/smithlab/miniconda/envs/amptk/lib/python3.7/site-packages/amptk/extract_region.py", line 623, in main
    makeDB(OutName, args=args)
  File "/Users/smithlab/miniconda/envs/amptk/lib/python3.7/site-packages/amptk/extract_region.py", line 418, in makeDB
    amptklib.log.error("There was a problem creating the DB, check the UTAX log file %s" % utax_log)
UnboundLocalError: local variable 'utax_log' referenced before assignment

in the log file it says

usearch9(16829,0xa98521c0) malloc: *** mach_vm_map(size=2963820544) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug


usearch9 -makeudb_sintax /Users/smithlab/miniconda/envs/amptk/lib/python3.7/site-packages/amptk/DB/LSU_SINTAX.extracted.fa -output /Users/smithlab/miniconda/envs/amptk/lib/python3.7/site-packages/amptk/DB/LSU_SINTAX.udb -notrunclabels

---Fatal error---
Memory limit of 32-bit process exceeded, 64-bit build required

I tried Googling malloc_error_break and the results I found suggested that the script needs to be debugged to see where the error is occurring and then insert the breakpoint. I do not know how to do this, so do you have any suggestions?

nicolereynolds1 avatar Jan 31 '20 21:01 nicolereynolds1

Malloc errors are memory related, so the free version of usearch is 32 but and thus the file you are trying to make a database out of is too large and it errors. So need to reduce the size to build the sintax and utax databases. The global alignment or usearch uses vsearch which doesn’t have the memory limit. Look at the readthedocs manual on how I subsampled the data for other databases to build sintax and utax.

nextgenusfs avatar Jan 31 '20 22:01 nextgenusfs

Going back to the Green Genes question, I have made an AMPtk formatted Green Genes database (too big to upload here, so I've put it here (gg_13_5_replace), which is also where the 28S database I made (RDP_SILVA_LSU_update) is stored). I added a few endosymbiont sequences (for the specific project I'm working on) and some fungal 18S sequences because the primers we've been using sometimes amplify fungal sequences.

I was able to successfully install it into AMPtk.

nicolereynolds2 avatar Jul 18 '23 15:07 nicolereynolds2