galaxy_blast
makeblastdb -max_file_sz
makeblastdb has a -max_file_sz, set by default to 1GB. Can we increase this limit by default to 10GB or should we offer this as a parameter?
If you make databases that size, will it start to split them? If so, we may need to update the BLAST datatype to look for the alternative filenames for each chunk and the alias file (*.nal or *.pal)?
I need to figure this out.
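For reference, you can see the chunking behaviour by forcing a split at the command line (the names and output shown here are just illustrative):

$ makeblastdb -in big.fasta -dbtype nucl -out bigdb -max_file_sz '1GB'
$ ls bigdb*
bigdb.00.nhr  bigdb.00.nin  bigdb.00.nsq  bigdb.01.nhr  ...  bigdb.nal
$ cat bigdb.nal
TITLE big.fasta
DBLIST bigdb.00 bigdb.01

The *.nal alias file just lists the volume base names making up the database.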
This is still failing with BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID|18349221.
One of our users is also now running into this. @bgruening do you recall if increasing -max_file_sz fixed this? I'm trying 10GB now, but this isn't exactly a quick process.
Update: -max_file_sz '10GB' causes the same error to occur later on. Oddly, none of the files are bigger than ~2GB. I'll try 100GB and see what happens, but I suspect this is just a blast bug.
I think this fixed it for me, yes.
It looks like setting -max_file_sz >2GB is either ignored or otherwise capped at 2GB. Either way, going up to 100GB still produces this error on the dataset in question here. I'm trying 2.6.0+ to see if the issue is resolved there (in that version, -max_file_sz produces an error message if you input something greater than 2GB, which is an improvement over the older behavior).
Anyway, unless this is already fixed in 2.6.0+, I guess this is just a BLAST issue and not related to the wrapper.
Final update from me: this still happens in 2.6.0+. It looks like this happens whenever the .nhr file is huge enough that multiple volume files get written, along with a .nal alias file. I'll try to track down where to report BLAST bugs and report this there.
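If anyone wants to check the apparent 2GB cap themselves, a rough sketch (paths are placeholders):

$ makeblastdb -in big.fasta -dbtype nucl -out bigdb -max_file_sz '100GB'
$ ls -lh bigdb.*.nhr
# if the volumes still top out around 2GB, the requested value is being ignored or capped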
I am facing the same issue with 9.6 GB of FASTA... Did anyone manage to fix it?
I am using protein sequences on BLAST Galaxy Version 0.3.0
My error is "BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID:3299542"
@anilthanki your error is different: the wrapper checks for duplicate sequence IDs and aborts if it finds any. BLAST+ itself copes fine, but with many of the output formats (including the tabular default we use) duplicates become very difficult to distinguish and will most likely break your analysis.
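As a quick pre-check, you can list duplicated identifiers at the command line (assuming the ID is the first whitespace-separated word of each > header line):

$ grep '^>' input.fasta | cut -d ' ' -f 1 | sort | uniq -d
# any IDs printed here appear more than once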
The way the BLAST databases are defined in Galaxy as composite data types assumes a single file (no .nal or .pal alias pointing at chunks).
This indirectly limits the DB size, since large databases get split into chunks.
Fixing this would be hard (and complicated to deploy now that the datatypes live in Galaxy itself - not sure what would happen if the Tool Shed-defined datatype was different).
The workaround is to define the DB outside Galaxy and add it to the *.loc file instead.
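For example, a blastdb.loc entry (blastdb_p.loc for protein) has three tab-separated columns - a unique ID, a display caption, and the database base name path (the values below are made up):

nt_example	NCBI nt (example)	/data/blast/nt/nt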
@peterjc I checked again and there are no duplicates. I think it's something to do with the size of the input.
Strange, but possible. I didn't check the wording matched my script's error message.
Can you reproduce the error calling makeblastdb at the command line outside of Galaxy?
I cannot reproduce the error on the command line on my local machine; I tried with and without the -max_file_sz parameter.
Any tips on creating the database on the command line without indexing, so that it creates only one file which I can upload to Galaxy for the rest of the analysis?
The BLAST databases datatype in Galaxy does not support upload into Galaxy - the expectation is you upload the FASTA file and run makeblastdb within Galaxy. Or, that the Galaxy admin adds the database to the *.loc file.
Reproducing this outside Galaxy would be really instructive - does the failing command line Galaxy used (you can read this via the failed makeblastdb history entry) fail in the same way outside Galaxy?
Yes, I tried running the same command as Galaxy on my local machine and it was failing because of the -hash_index parameter. So I tried without indexing, and it worked fine both on the command line and in Galaxy.
So makeblastdb ... -hash_index was causing the "BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID:3299542" error? If so, that is good to clarify, but does seem to be unrelated to the original -max_file_sz problem.
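In other words, something along these lines (reconstructed from the description above, with placeholder paths):

$ makeblastdb -in input.fasta -dbtype prot -hash_index -out mydb
BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID:3299542
$ makeblastdb -in input.fasta -dbtype prot -out mydb
# completes without error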
Quoting the command line help:
$ makeblastdb -help
...
 -max_file_sz <String>
   Maximum file size for BLAST database files
   Default = `1GB'
...
Given the discussion above, it sounds like using a larger value here would be useful (since in the Galaxy context we don't currently cope with chunked databases).
It looks like the makeblastdb -max_file_sz limit was increased to 4GB in BLAST+ 2.8.0:
- The 2GB output file size limit for makeblastdb has been increased to 4 GB.
Has anyone found a solution to this? I've tried everything written here and nothing seems to work. I'm getting the same error <BLAST Database creation error: Error: Duplicate seq_ids are found: DBJ|LC456629.1> with a 16GB FASTA file... anyone... someone...
@KinogaMIchael Does explicitly deduplicating your FASTA file first help?
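For example, a minimal sketch keeping only the first record seen for each ID (treating the first word of the header as the ID):

$ awk '/^>/ {keep = !seen[$1]++} keep' input.fasta > dedup.fasta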
It sounds like our check via https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/check_no_duplicates.py thinks the file is OK, only for BLAST itself to complain about a duplicate (the error message wording is different).
The discussion on this issue was about changing -max_file_sz which may or may not be related. The wrapper https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/ncbi_makeblastdb.xml does not currently set this value. Are you able to try editing the wrapper to add this to the command line?
Or better, are you able to try running the same makeblastdb command at the terminal? And adding -max_file_sz?
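i.e. copy the exact command from the failed history entry and try something like (names here are placeholders):

$ makeblastdb -in input.fasta -dbtype nucl -out mydb
$ makeblastdb -in input.fasta -dbtype nucl -out mydb -max_file_sz '2GB'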
Dear Peterjc,
I have the same problem, using a FASTA file of around 14GB. I did try this in the terminal, but adding -max_file_sz '20GB' gives me the following: "BLAST options error: max_file_sz must be < 4 GiB". Adding this to the XML wrapper will thus not make a difference. I'm curious about other solutions. Thanks in advance!
Cheers, Annabel
@annabeldekker and are there duplicated identifiers in your input files?
@peterjc Hi, I checked and there aren't!
Yet you still get a message like BLAST Database creation error: Error: Duplicate seq_ids are found: ... when doing this at the command line, calling makeblastdb outside of Galaxy?
Exactly, when I call this outside of Galaxy I get the same Duplicate seq_ids error, even though there are no duplicates in the file. I'm using version 2.10.1 btw.
OK, good. So it isn't my fault 😉
Please email the NCBI team at blast-help (at) ncbi.nlm.nih.gov with a reproducible example (and, to avoid confusing them, I suggest not mentioning Galaxy). If you get a reference number, it would be useful to log it here.
Thanks, @peterjc we will keep you guys updated
We circumvented the error by using the -parse_seqids option, which feels a bit odd. It still seems like a BLAST bug, but at least now it runs without issues. @KinogaMIchael maybe you could try that as well!
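For reference, the invocation was something like this (paths simplified):

$ makeblastdb -in input.fasta -dbtype nucl -parse_seqids -out mydb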
That does still sound like a BLAST bug, and worth reporting including the use of -parse_seqids as a possible workaround.
@peterjc deduplicating the FASTA file doesn't help... nothing helped. @annabeldekker I think it's a BLAST bug. I tried this in my terminal: makeblastdb -in /home/Virusdb/viruses.fa -parse_seqids -blastdb_version 5 -title "virusdb" -dbtype nucl -max_file_sz 4GB and still got the same error.
@KinogaMIchael this does sound like a BLAST bug, please do report it to the email address requested.
Hello! I also met this problem when running makeblastdb.
I have tried different versions of BLAST to make the database, and hit these problems:
- BLAST Database creation error: Multi-letters chain PDB id is not supported in v4 BLAST DB
- Error: mdb_env_open: Function not implemented
And finally, with blast-2.5.0+, the makeblastdb command could run, but a new problem came up:
Command:
$ nohup /newlustre/home/xiongqian/software/ncbi-blast-2.5.0+/bin/makeblastdb -in nt -dbtype nucl -out nt -parse_seqids -max_file_sz 2GB &

Error:
file: /newlustre/home/xiongqian/database/NT/nt.44.nog
file: /newlustre/home/xiongqian/database/NT/nt.45.nin
file: /newlustre/home/xiongqian/database/NT/nt.45.nhr
file: /newlustre/home/xiongqian/database/NT/nt.45.nsq
file: /newlustre/home/xiongqian/database/NT/nt.45.nsi
file: /newlustre/home/xiongqian/database/NT/nt.45.nsd
file: /newlustre/home/xiongqian/database/NT/nt.45.nog
file: /newlustre/home/xiongqian/database/NT/nt.nal
BLAST Database creation error: Error: Duplicate seq_ids are found: LCL|6O9K_A
Has the Duplicate seq_ids error been solved?