
makeblastdb -max_file_sz

Open bgruening opened this issue 10 years ago • 30 comments

makeblastdb has a -max_file_sz option, which defaults to 1GB. Can we increase this default to 10GB, or should we offer it as a parameter?

bgruening avatar Jun 09 '15 21:06 bgruening
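
For reference, a minimal sketch of how -max_file_sz could be passed explicitly at the command line (the input filename and database name here are hypothetical):

$ makeblastdb -in genomes.fasta -dbtype nucl -title "genomes" -max_file_sz '10GB' -out genomes_db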

If you make databases that size, will it start to split them? If so, we may need to update the BLAST datatype to look for the alternative filenames for each chunk and the alias file (*.nal or *.pal).

peterjc avatar Jun 10 '15 07:06 peterjc
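
For context, when a database exceeds -max_file_sz, makeblastdb writes numbered volumes plus an alias file (.nal for nucleotide, .pal for protein). An illustrative sketch of the resulting files, not taken from this dataset:

$ ls mydb*
mydb.00.nhr  mydb.00.nin  mydb.00.nsq
mydb.01.nhr  mydb.01.nin  mydb.01.nsq
mydb.nal
$ cat mydb.nal
TITLE mydb
DBLIST mydb.00 mydb.01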

I need to figure this out. This is still failing with BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID|18349221.

bgruening avatar Jun 10 '15 11:06 bgruening

One of our users is also now running into this. @bgruening do you recall if increasing -max_file_sz fixed this? I'm trying 10GB now, but this isn't exactly a quick process.

Update: -max_file_sz '10GB' causes the same error to occur later on. Oddly, none of the files are bigger than ~2GB. I'll try 100GB and see what happens, but I suspect this is just a blast bug.

dpryan79 avatar Jan 10 '17 10:01 dpryan79

I think this fixed it for me, yes.

bgruening avatar Jan 10 '17 12:01 bgruening

It looks like setting -max_file_sz >2GB is either ignored or otherwise capped at 2GB. Either way, going up to 100GB still produces this error on the dataset in question here. I'm trying 2.6.0+ to see if the issue is resolved there (in that version, -max_file_sz produces an error message if you input something greater than 2GB, which is an improvement over the older behavior).

Anyway, unless this is already fixed in 2.6.0+, I guess this is just a BLAST issue and not related to the wrapper.

dpryan79 avatar Jan 10 '17 12:01 dpryan79

Final update from me: this still happens in 2.6.0+. It looks like this occurs whenever the .nhr file is huge, forcing multiple files to be written along with a .nal alias file. I'll try to track down where to report BLAST bugs and file this there.

dpryan79 avatar Jan 10 '17 14:01 dpryan79

I am facing the same issue with a 9.6 GB FASTA file. Did anyone manage to fix it?

I am using protein sequences on BLAST Galaxy Version 0.3.0

My error is "BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID:3299542"

anilthanki avatar Nov 28 '18 15:11 anilthanki

@anilthanki your error is different: the wrapper checks for duplicate sequence IDs and aborts if it finds any. BLAST+ itself copes fine, but with many of the output formats, including the tabular default we use, duplicates become very difficult to distinguish and will most likely break your analysis.

peterjc avatar Nov 29 '18 08:11 peterjc
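
A quick shell check for duplicate IDs in a FASTA file (assuming the ID is the first whitespace-delimited token of each header line):

$ grep '^>' input.fasta | awk '{print substr($1, 2)}' | sort | uniq -d

Any IDs printed appear more than once; no output means no duplicates by this definition.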

The way the BLAST databases are defined in Galaxy as composite datatypes assumes a single file (no .nal or .pal alias pointing at chunks).

This indirectly limits the database size, since chunks are used for large databases.

Fixing this would be hard (and complicated to deploy now that the datatypes live in Galaxy itself; I'm not sure what would happen if the Tool Shed defined datatype was different).

The workaround is to build the database outside Galaxy and add it to the *.loc file instead.

peterjc avatar Nov 29 '18 08:11 peterjc
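
For Galaxy admins taking that route, the workaround amounts to adding a tab-separated line to blastdb.loc (or blastdb_p.loc for protein databases); the ID, caption, and path below are hypothetical:

#<unique_id>	<caption shown to users>	<database base path>
nt_2018	NCBI nt (Nov 2018)	/data/blastdb/nt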

@peterjc I checked again and there are no duplicates. I think it's something to do with the size of the input.

anilthanki avatar Nov 29 '18 12:11 anilthanki

Strange, but possible. I hadn't checked whether the wording matched my script's error message.

Can you reproduce the error calling makeblastdb at the command line outside of Galaxy?

peterjc avatar Nov 29 '18 13:11 peterjc

I cannot reproduce the error at the command line on my local machine; I tried it both with and without the -max_file_sz parameter.

Any tips on creating the database at the command line without indexing, so it creates only one file that I can upload to Galaxy for the rest of the analysis?

anilthanki avatar Nov 29 '18 14:11 anilthanki

The BLAST database datatype in Galaxy does not support upload into Galaxy - the expectation is that you upload the FASTA file and run makeblastdb within Galaxy, or that the Galaxy admin adds the database to the *.loc file.

Reproducing this outside Galaxy would be really instructive - does the failing command line Galaxy used (read this via the failed makeblastdb history entry) fail in the same way outside Galaxy?

peterjc avatar Nov 29 '18 20:11 peterjc

Yes, I tried running the same command as Galaxy on my local machine, and it was failing because of the -hash_index parameter. So I tried without indexing, and it worked fine at the command line and in Galaxy.

anilthanki avatar Nov 30 '18 10:11 anilthanki
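
As a sketch, that amounts to running the same build without the indexing flag (filenames hypothetical):

$ makeblastdb -in proteins.fasta -dbtype prot -title "proteins" -out proteins_db

i.e. the command Galaxy generated, minus -hash_index.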

So makeblastdb ... -hash_index was causing the "BLAST Database creation error: Error: Duplicate seq_ids are found: GNL|BL_ORD_ID:3299542" error? If so, that is good to clarify, but does seem to be unrelated to the original -max_file_sz problem.

Quoting the command line help:

$ makeblastdb -help
...
-max_file_sz <String>
  Maximum file size for BLAST database files
  Default = `1GB'
...

Given the discussion above, it sounds like using a larger value here would be useful (since in the Galaxy context we don't currently cope with chunked databases).

peterjc avatar Nov 30 '18 12:11 peterjc

It looks like the makeblastdb -max_file_sz limit was increased to 4GB in BLAST+ 2.8.0:

  • The 2GB output file size limit for makeblastdb has been increased to 4 GB.

nathanweeks avatar Apr 15 '19 17:04 nathanweeks
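
With BLAST+ 2.8.0 or later, a value up to that cap can therefore be requested explicitly; a sketch with hypothetical filenames:

$ makeblastdb -in big.fasta -dbtype nucl -max_file_sz '4GB' -out big_db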

Has anyone found a solution to this? I've tried everything written here and nothing seems to work. I'm getting the same error <BLAST Database creation error: Error: Duplicate seq_ids are found: DBJ|LC456629.1> with a 16GB FASTA file. Anyone? Someone?

KinogaMIchael avatar May 07 '21 10:05 KinogaMIchael

@KinogaMIchael Does explicitly deduplicating your FASTA file first help?

It sounds like our check via https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/check_no_duplicates.py thinks the file is OK, only for BLAST itself to complain about a duplicate (the error message wording is different).

The discussion on this issue was about changing -max_file_sz which may or may not be related. The wrapper https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/ncbi_makeblastdb.xml does not currently set this value. Are you able to try editing the wrapper to add this to the command line?

Or better, are you able to try running the same makeblastdb command at the terminal? And adding -max_file_sz?

peterjc avatar May 07 '21 10:05 peterjc
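
As a concrete sketch of the deduplication suggested above, keeping the first record seen for each ID (awk-based; filenames hypothetical):

$ awk '/^>/ {id = substr($1, 2); keep = !(id in seen); seen[id] = 1} keep' input.fasta > dedup.fasta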

Dear Peterjc,

I have the same problem, using a FASTA file of around 14GB. I did try this in the terminal, but adding -max_file_sz '20GB' gives me the following: "BLAST options error: max_file_sz must be < 4 GiB". Adding this to the XML wrapper will thus not make a difference. I'm curious about other solutions. Thanks in advance!

Cheers, Annabel

annabeldekker avatar May 11 '21 08:05 annabeldekker

@annabeldekker and are there duplicated identifiers in your input files?

peterjc avatar May 11 '21 09:05 peterjc

@peterjc Hi, I checked and there aren't!

annabeldekker avatar May 11 '21 09:05 annabeldekker

Yet you still get a message like BLAST Database creation error: Error: Duplicate seq_ids are found: ... when calling makeblastdb at the command line outside of Galaxy?

peterjc avatar May 11 '21 10:05 peterjc

Exactly, when I call this outside of Galaxy I get the same Duplicate seq_ids error, even though there are no duplicates in the file. I'm using version 2.10.1, by the way.

annabeldekker avatar May 11 '21 11:05 annabeldekker

OK, good. So it isn't my fault 😉

Please email the NCBI team at blast-help (at) ncbi.nlm.nih.gov with a reproducible example (and, to avoid confusing them, I suggest not mentioning Galaxy). If you get a reference number, it would be useful to log it here.

peterjc avatar May 11 '21 11:05 peterjc

Thanks, @peterjc we will keep you guys updated

annabeldekker avatar May 11 '21 11:05 annabeldekker

We circumvented the error by using the -parse_seqids option, which feels a bit odd. It still seems like a BLAST bug, but at least now it runs without issues. @KinogaMIchael maybe you could try that as well!

annabeldekker avatar May 11 '21 14:05 annabeldekker

That does still sound like a BLAST bug, and worth reporting including the use of -parse_seqids as a possible workaround.

peterjc avatar May 11 '21 14:05 peterjc
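
For anyone else hitting this, the reported workaround as a command line sketch (filenames hypothetical):

$ makeblastdb -in input.fasta -dbtype nucl -parse_seqids -out mydb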

@peterjc deduplicating the FASTA file doesn't help; nothing helped. @annabeldekker I think it's a BLAST bug. I tried this in my terminal: makeblastdb -in /home/Virusdb/viruses.fa -parse_seqids -blastdb_version 5 -title "virusdb" -dbtype nucl -max_file_sz 4GB and still got the same error.

KinogaMIchael avatar May 18 '21 14:05 KinogaMIchael

@KinogaMIchael this does sound like a BLAST bug, please do report it to the email address requested.

peterjc avatar May 18 '21 15:05 peterjc

Hello! I also hit this problem when building a database with makeblastdb.
I have tried different versions of BLAST to make the database, and ran into these problems:

  1. BLAST Database creation error: Multi-letters chain PDB id is not supported in v4 BLAST DB
  2. Error: mdb_env_open: Function not implemented

Finally, with blast-2.5.0+, the makeblastdb command would run, but a new problem came up:

Command:

  nohup /newlustre/home/xiongqian/software/ncbi-blast-2.5.0+/bin/makeblastdb -in nt -dbtype nucl -out nt -parse_seqids -max_file_sz 2GB &

Error:

  file: /newlustre/home/xiongqian/database/NT/nt.44.nog
  file: /newlustre/home/xiongqian/database/NT/nt.45.nin
  file: /newlustre/home/xiongqian/database/NT/nt.45.nhr
  file: /newlustre/home/xiongqian/database/NT/nt.45.nsq
  file: /newlustre/home/xiongqian/database/NT/nt.45.nsi
  file: /newlustre/home/xiongqian/database/NT/nt.45.nsd
  file: /newlustre/home/xiongqian/database/NT/nt.45.nog
  file: /newlustre/home/xiongqian/database/NT/nt.nal

  BLAST Database creation error: Error: Duplicate seq_ids are found: LCL|6O9K_A

Has the Duplicate seq_ids error been solved?

xiongqian123456789 avatar Nov 04 '22 07:11 xiongqian123456789