fastANI is not working
Hi Dr. Olm,
I am using version 2.6.2. When running dRep compare with --S_algorithm fastANI, there is an error:
Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
3355 primary clusters made
Step 3. Perform secondary clustering
Running 8999390 fastANI comparisons- should take ~ 1200.5 min
Traceback (most recent call last):
File "/install/software/anaconda3.6.b/bin/dRep", line 33, in
May I ask whether I should install fastANI separately? If yes, how can I make sure dRep can call it? We already have FastANI 1.1 installed.
Thanks
Yes, you need to install it separately and make sure it's in your system PATH. dRep bonus test --check_dependencies will tell you which programs dRep is able to confirm are working correctly.
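In case it helps, here is a minimal sketch of what "in your system PATH" means in practice. The directory and stub binary below are stand-ins so the example is self-contained; in reality you would point PATH at the directory that actually contains your fastANI binary.

```shell
# Stand-in install directory with a stub fastANI; substitute your real one
# (e.g. the /install/software/fastani/1.32 directory seen later in this thread).
demo_dir=/tmp/fastani_path_demo
mkdir -p "$demo_dir"
printf '#!/bin/sh\necho fastANI-stub\n' > "$demo_dir/fastANI"
chmod +x "$demo_dir/fastANI"

# Prepend that directory to PATH so the shell (and therefore dRep) can resolve it.
export PATH="$demo_dir:$PATH"
command -v fastANI   # prints the resolved location that dRep will use
```

Once your real binary resolves this way, dRep bonus test --check_dependencies should report fastANI as "all good".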
Hi Dr. Olm, thanks for the reply. I just ran the command and see:
mash.................................... all good (location = /install/software/anaconda3.6.b/bin/mash)
nucmer.................................. all good (location = /install/software/anaconda3.6.b/bin/nucmer)
checkm.................................. all good (location = /install/software/anaconda3.6.b/bin/checkm)
ANIcalculator........................... !!! ERROR !!! (location = None)
prodigal................................ all good (location = /install/software/anaconda3.6.b/bin/prodigal)
centrifuge.............................. all good (location = /install/software/anaconda3.6.b/bin/centrifuge)
nsimscan................................ !!! ERROR !!! (location = None)
fastANI................................. !!! ERROR !!! (location = None)
May I ask: if I want to use fastANI as the S_algorithm with dRep compare or dereplicate, do all three "ERROR" tools need to be installed? fastANI, for sure; but what about the other two? If yes, are the following links correct for downloading them? For ANIcalculator, is it here: https://ani.jgi.doe.gov/html/download.php? For nsimscan, is it here: https://github.com/abadona/qsimscan?
Many thanks for your reply.
Hello,
Nope, to use fastANI you only need to fix the fastANI entry. The other two are used for other S_algorithm options.
Hi Dr. Olm, it is great to know that. I have fixed only the fastANI entry, and it now looks good:
mash.................................... all good (location = /install/software/anaconda3.6.b/bin/mash)
nucmer.................................. all good (location = /install/software/anaconda3.6.b/bin/nucmer)
checkm.................................. all good (location = /install/software/anaconda3.6.b/bin/checkm)
ANIcalculator........................... !!! ERROR !!! (location = None)
prodigal................................ all good (location = /install/software/anaconda3.6.b/bin/prodigal)
centrifuge.............................. all good (location = /install/software/anaconda3.6.b/bin/centrifuge)
nsimscan................................ !!! ERROR !!! (location = None)
fastANI................................. all good (location = /install/software/fastani/1.32/fastANI)
I have run dRep compare with fastANI and noticed that the fastANI flag '--minFraction' is 0, but the default is 0.2. May I ask whether '--minFraction' is indeed meant to be set to 0, or do I misunderstand something?
11-09 17:51 DEBUG running cluster 1
11-09 17:51 DEBUG /install/software/fastani/1.32/fastANI --ql /data/Food/analysis/R0898_HYDROFish/allbins_ncbi/random_1_drep_20k/data/fastANI_files/tmp/genomeList --rl /data/Food/analysis/R0898_HYDROFish/allbins_ncbi/random_1_drep_20k/data/fastANI_files/tmp/genomeList -o /data/Food/analysis/R0898_HYDROFish/allbins_ncbi/random_1_drep_20k/data/fastANI_files/fastANI_out_tqcrxtgsrr --matrix -t 50 --minFraction 0 tqcrxtgsrr
Thanks
Hello,
Yes; the --minFraction filter is not applied at the stage of the program shown in that log. It's applied at a later stage, when the clustering happens. The default is indeed 0.2, not 0.
Hi, thanks for the reply. It is good that the default value is used; I will check when the run finishes. Thanks again.
Hi Dr. Olm, I have run dRep compare with fastANI using 50 processors, and the log says ~1200.5 min is needed, as below:
11-09 17:51 INFO 3355 primary clusters made
11-09 17:51 INFO Step 3. Perform secondary clustering
11-09 17:51 INFO Running 8999390 fastANI comparisons- should take ~ 1200.5 min
However, it has already been running for 45 h (2700 min), and I can see the job is still going. May I ask whether the time indicated in the log is just an estimate, and the real time may be longer than that?
Thanks
Hello,
Yes, it is indeed just an estimate. Depending on factors like genome size, the specifics of the genomes being compared, and the number of cores being used, the actual time can vary quite a bit. The parallelization also doesn't scale perfectly: 50 cores will not be twice as fast as 25 cores, but the simple formula used to estimate how long the job will take doesn't account for this.
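For intuition, the estimate is the kind of back-of-envelope calculation sketched below. This is illustrative only, not dRep's actual formula; the per-comparison constant and the core count are made-up numbers chosen to roughly reproduce the log's estimate.

```python
def naive_eta_minutes(n_comparisons, minutes_per_comparison, n_cores):
    # Assumes perfect linear scaling across cores; real runs (45 h actual
    # vs the ~20 h estimated above) fall short of this ideal.
    return n_comparisons * minutes_per_comparison / n_cores

# With an assumed 0.00667 min/comparison and 50 cores, this lands near the
# ~1200.5 min figure from the log; both inputs here are hypothetical.
print(naive_eta_minutes(8_999_390, 0.00667, 50))
```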
Best, Matt
Hi Dr. Olm, good news: I have finished my run with 20,000 genomes, using -pa 0.9 -sa 0.95 -nc 0.30 -cm larger.
- However, there is an error for one cluster:
Running 8999390 fastANI comparisons- should take ~ 1200.5 min
CRITICAL ERROR WITH SECONDARY CLUSTERING CODE hdzdcskeha; SKIPPING
CRITICAL ERROR WITH PRIMARY CLUSTER 592; TRYING AGAIN
CRITICAL ERROR WITH SECONDARY CLUSTERING CODE geyhqnsofz; SKIPPING
DOUBLE CRITICAL ERROR AGAIN WITH PRIMARY CLUSTER 592; SKIPPING
In Cdb.csv, there is no record for cluster 592. May I ask why?
- For the dRep compare Step 4 (Analyze) part, there are many lines like the following in my error log:
Plotting MDS plot
/install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:130: RuntimeWarning: divide by zero encountered in double_scalars
  old_stress = stress / dis
/install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:125: RuntimeWarning: invalid value encountered in double_scalars
  if(old_stress - stress / dis) < eps:
/install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:130: RuntimeWarning: invalid value encountered in double_scalars
  old_stress = stress / dis
...
May I ask why? I think these warnings have no impact on the clustering, and the files in data_tables are totally fine for subsequent analysis, right?
- Is the value (0.30) I set with -nc applied to --minFraction in fastANI? I need the alignment fraction (AF) for subsequent analysis. Where can I check the value used for --minFraction in fastANI?
Sorry about the many questions, and thanks for your time.
Hello,
- This is because FastANI is a somewhat finicky program, and it didn't like something about one of those genomes. Try running fastANI on those genomes outside of dRep to see what the problem is. dRep runs FastANI as below, where genomeList is a text file listing the genome locations.
$ fastANI --ql genomeList --rl genomeList -o fastANI_out_pqbxqtvbuy --matrix -t 8 --minFraction 0 pqbxqtvbuy
Alternatively, you could run dRep again in debug mode (add -d), which will store all the logs that FastANI creates to show what the error is (stored in log/cmd_logs/). I must warn you though: fastANI often crashes without an easily interpretable error, which makes it difficult to know what the problem is.
- Yeah, don't worry about those; it's just scikit-learn warning about things that might be problems in general but really aren't here.
- No; -nc is applied during the clustering step. You can see the AF values for all comparisons in the Ndb.csv file in the data_tables folder. During the clustering step, if the AF is less than -nc, dRep will replace the actual ANI value with a 0. This 0 isn't stored in Ndb.csv (so that you can still see what the reported ANI is), but when the clustering happens the comparison is treated as a 0.
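The rule can be sketched in a few lines. This is illustrative, not dRep's actual source; it assumes ANI and AF expressed as fractions and the default -nc of 0.2.

```python
def effective_ani(ani, af, nc=0.2):
    # Comparisons whose alignment fraction (AF) falls below -nc have their
    # ANI treated as 0 during clustering; Ndb.csv still stores the raw ANI.
    return ani if af >= nc else 0.0

print(effective_ani(0.97, 0.45))  # AF passes -nc: ANI kept (0.97)
print(effective_ani(0.97, 0.10))  # AF below -nc: treated as 0.0
```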
Best, Matt
Hi Dr. Olm,
Thanks for the reply. May I ask a few follow-up questions based on your answers?
- Where can I find the genomes that are listed in cluster 592? I only see one genomeList in the tmp folder. Then I can run fastANI independently to see the problem.
- That's great, thanks.
- Based on my understanding, the AF used for secondary clustering comes from -nc, right? When running fastANI, --minFraction is set to 0, and the AF is then calculated from the fastANI output for each comparison. Then dRep uses -nc to do the secondary clustering based on the calculated AF, right?
Thanks.
Hello,
- It should just be the genomes that are in the data table Bdb.csv (which contains all filtered genomes) but are not in Cdb.csv (which has the cluster assignments of all genomes).
- That is mostly correct. You are right that the AF is reported and calculated by fastANI with --minFraction set to 0. However, dRep uses ANI, not AF, to do the secondary clustering. All genomes with an AF below the -nc threshold have their ANI set to 0 by dRep, though.
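To list the genomes from the first point (in Bdb.csv but not in Cdb.csv), something like this sketch works. It assumes both tables have a "genome" column (Cdb.csv does, per the header shown later in this thread); the inlined strings stand in for opening the real files from data_tables.

```python
import csv
import io

# Stand-ins for open("data_tables/Bdb.csv") and open("data_tables/Cdb.csv");
# the rows here are made up for illustration.
bdb = "genome,location\nbinA.fna,/x/binA.fna\nbinB.fna,/x/binB.fna\n"
cdb = "genome,secondary_cluster,primary_cluster\nbinA.fna,1_0,1\n"

bdb_genomes = {r["genome"] for r in csv.DictReader(io.StringIO(bdb))}
cdb_genomes = {r["genome"] for r in csv.DictReader(io.StringIO(cdb))}
missing = sorted(bdb_genomes - cdb_genomes)
print(missing)  # the genomes that never made it into Cdb.csv
```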
-Matt
Hi Dr. Olm,
Thank you for the reply.
- Yes, but all the genomes are in the data table Bdb.csv, right? How can I know which genomes belong to cluster 592? I think I only need to re-run fastANI on this cluster to check where the problem is; am I right?
- Thanks, it makes sense to me now.
Sorry, may I introduce a new question:
3.1 Is the final number of clusters given by the "secondary_cluster" column in Cdb? The number of unique clusters in "secondary_cluster" is 4546, and in "primary_cluster" it is 3354. A higher number of secondary clusters than primary clusters is expected, right?
3.2 I ran the same genomes with -pa 0.95 -sa 0.95 -nc 0.30 -cm larger, curious how many clusters Mash alone would produce at 0.95; this run gave 4081 clusters. I understand the problem with Mash is that it underestimates the similarity between incomplete genomes and splits same-species genomes into multiple clusters, which should mean a higher number of clusters from Mash itself. So why is the number of clusters I got with -pa 0.95 from Mash (the first step of dRep) lower than the number of clusters from dRep with -sa 0.95? Or can I not compare them directly like this?
Thanks
Hello,
- The genomes that ARE in Bdb.csv and ARE NOT in Cdb.csv will be the ones where fastANI failed.
3.1) Yes, your understanding is correct.
3.2) I'm a little confused by the question, but I can explain a few things. The purpose of primary clustering (done with Mash, which is fast) is to reduce the number of comparisons that have to be done with fastANI (which is slow). The only genomes that are ever compared with FastANI are those in the same primary cluster from Mash; genomes in different primary clusters will never end up in the same secondary cluster. Mash is not very accurate, however (especially with incomplete genomes), so by default dRep casts a wide net with primary clustering and then makes secondary clusters with fastANI.
If you'd like to compare Mash and fastANI directly, you'll need to run dRep once with -pa 0.95 --SkipSecondary (Mash only) and once with -sa 0.95 --SkipMash (fastANI only). I'll warn you though, the --SkipMash run will be very slow (because it will not have primary clustering speeding things up).
A bit more detail on this can be found here: https://drep.readthedocs.io/en/latest/choosing_parameters.html
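As a quick check of the 3.1 counts, you can tally unique clusters straight from Cdb.csv. The rows below are made up (one primary cluster splitting into two secondary clusters) to show why the secondary count can only be greater than or equal to the primary count.

```python
import csv
import io

# Hypothetical Cdb-style rows; the real file has the columns shown
# elsewhere in this thread.
cdb = """genome,secondary_cluster,primary_cluster
binA.fna,4_0,4
binB.fna,4_1,4
binC.fna,5_0,5
"""
rows = list(csv.DictReader(io.StringIO(cdb)))
print(len({r["primary_cluster"] for r in rows}))    # 2 primary clusters
print(len({r["secondary_cluster"] for r in rows}))  # 3 secondary clusters
```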
-Matt
Hi Dr. Olm,
- Sorry, I understand now. I found that cluster 592 has only one genome, and ran fastANI as below:
fastANI --ql genomeList --rl genomeList -o fastANI_out_592 --matrix -t 50 --minFraction 0
It ran without error:
Reference = [bin.7.fna]
Query = [bin.7.fna]
Kmer size = 16
Fragment length = 3000
Threads = 50
ANI output file = fastANI_out_592

INFO [thread 0], skch::main, Count of threads executing parallel_for : 50
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling = 24
INFO [thread 18], skch::main, ready to exit the loop
... (one "ready to exit the loop" line per thread) ...
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 118228
INFO [thread 0], skch::Sketch::index, unique minimizers = 115202
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 112667) ... (14, 1)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 0.161255 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.0048895 sec
INFO [thread 0], skch::main, Time spent post mapping : 1.052e-06 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished
However, there is nothing in the fastANI_out_592 output. I think there should be one comparison inside.
I ran another cluster that also has only one genome and that was fine with dRep; there is one comparison in its fastANI_out, and the log is the same as above. I am confused: is it due to the bin file itself? Could you have a look at where the problem is?
Thanks
...interesting. Are you sure that cluster 592 has only a single genome in it? Is Bdb.csv (measured with the command cat Bdb.csv | wc -l) really only 1 line longer than Cdb.csv (measured with cat Cdb.csv | wc -l)? The reason I ask is that secondary comparisons are not usually run on primary clusters with only a single genome, so this surprises me a bit.
Thank you, Matt
Hi Dr. Olm,
Yes, there is only a one-genome difference between Bdb.csv (20,001 lines) and Cdb.csv (20,000 lines). I ran 20,000 genomes together; with the header line, that's why Bdb.csv has 20,001 lines. Indeed, I have many fastANI outputs with only one genome in the /data/fastANI_files/ folder.
The top 10 lines from Cdb.csv:
genome | secondary_cluster | threshold | cluster_method | comparison_algorithm | primary_cluster
-- | -- | -- | -- | -- | --
GCF_000711905.1_ASM71190v1_genomic.fna | 1_0 | 0.05 | average | fastANI | 1
GCF_002002665.1_ASM200266v1_genomic.fna | 2_0 | 0.05 | average | fastANI | 2
GCF_902158755.1_SYNGOMJ08_genomic.fna | 3_0 | 0.05 | average | fastANI | 3
GCF_000008665.1_ASM866v1_genomic.fna | 4_1 | 0.05 | average | fastANI | 4
GCF_000734035.1_ASM73403v1_genomic.fna | 4_1 | 0.05 | average | fastANI | 4
GCF_001610875.1_ASM161087v1_genomic.fna | 5_0 | 0.05 | average | fastANI | 5
GCF_001654775.1_ASM165477v1_genomic.fna | 6_0 | 0.05 | average | fastANI | 6
GCF_002212725.1_ASM221272v1_genomic.fna | 7_0 | 0.05 | average | fastANI | 7
GCF_006740685.1_ASM674068v1_genomic.fna | 8_0 | 0.05 | average | fastANI | 8
GCF_014078535.1_ASM1407853v1_genomic.fna | 9_0 | 0.05 | average | fastANI | 9
GCF_013407165.1_ASM1340716v1_genomic.fna | 10_0 | 0.05 | average | fastANI | 10
The fastANI version I used is 1.32 and the dRep version is 2.6.2. Running dRep bonus test --check_dependencies gives:
Loading work directory
Checking dependencies
mash.................................... all good (location = /install/software/anaconda3.6.b/bin/mash)
nucmer.................................. all good (location = /install/software/anaconda3.6.b/bin/nucmer)
checkm.................................. all good (location = /install/software/anaconda3.6.b/bin/checkm)
ANIcalculator........................... !!! ERROR !!! (location = None)
prodigal................................ all good (location = /install/software/anaconda3.6.b/bin/prodigal)
centrifuge.............................. all good (location = /install/software/anaconda3.6.b/bin/centrifuge)
nsimscan................................ !!! ERROR !!! (location = None)
fastANI................................. all good (location = /install/software/fastani/1.32/fastANI)
Sorry about the new problem, and thanks a million for your time.
Hello,
OK, interesting; thank you for pointing this out to me. You are indeed correct, and apologies for not understanding how my own program works.
Unfortunately though I don't think I'll be able to help with this problem, since it seems to be an issue with FastANI. You can try and post an issue on the FastANI page (https://github.com/ParBLiSS/FastANI/issues), but they don't seem to be answered very often.
However, in future versions of dRep I will have it assign genomes that are in a primary cluster by themselves to a cluster without needing to run fastANI.
Best, Matt
Hi Dr. Olm,
- So do you mean that secondary comparisons are indeed run on primary clusters that have only a single genome, and that what I have run and its output are correct, right?
- As for the problem I met with cluster 592, I should ask on the FastANI page, right? I will do that. However, I think that if only one genome is in this cluster, I actually do not need to run the secondary comparison and can just keep it as cluster 592. Am I right?
Thanks
Hello,
Yes, you are correct on both points. FastANI failing doesn't matter in this case, because no matter what, that genome would be cluster 592.
-Matt
Hi Dr. Olm,
Thank you very much for the fast reply; it makes sense to me now.
Thanks again.
Hi Dr. Olm,
FYI, I just searched for the answer to my fastANI problem and found it here: https://github.com/ParBLiSS/FastANI/issues/21. After I changed the fragLen to something smaller like 200, this genome worked with fastANI.
A small question about the fastANI command used by dRep, like below: the last 10 letters are just to distinguish the fastANI outputs in the fastANI_files folder, and have no effect on the fastANI process itself, right?
$ fastANI --ql genomeList --rl genomeList -o fastANI_out_pqbxqtvbuy --matrix -t 8 --minFraction 0 pqbxqtvbuy
Thanks
Interesting. Thank you for pointing this out to me.
Yes, the last 10 letters are randomly generated and just used to distinguish output files.
-Matt
Hi Dr. Olm,
Thank you for the reply. Anyway, this is not a problem in the case of one genome in a cluster. That's great; I will now run all my genomes.
Thanks again.
Hi Dr. Olm,
One question about the output of running 30,000 genomes together, with
dRep compare output -p 50 -pa 0.90 -sa 0.95 -nc 0.30 -cm larger --S_algorithm fastANI -g genome_path.txt
There is no error reported for Step 1. Cluster
***************************************************
..:: dRep compare Step 1. Cluster ::..
***************************************************
Loading genomes from a list
Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
4362 primary clusters made
Step 3. Perform secondary clustering
Running 21641432 fastANI comparisons- should take ~ 2887.0 min
Step 4. Return output
***************************************************
..:: dRep compare Step 2. Bonus ::..
***************************************************
However, in Cdb one genome is missing compared to Bdb (which contains all 30,000 genomes). I also checked the number of unique primary clusters in Cdb; it is correct at 4362. I also checked the number of comparisons for this genome in Mdb; it is correct at 59,999. However, there is no record for this genome in Ndb.
- Do you know why? Maybe it is still the FastANI issue discussed above. In addition, is it possible to check which cluster this genome is from?
- If I leave this genome out, is the current clustering output in Cdb still reliable?
Thanks Wang
Hi Dr. Olm,
I updated my question; could you look at it when you are available? Thanks
Hello,
This is an interesting problem and not something that I've encountered before. I don't see any way that the problem genome could impact anything else, so I believe the rest of the clustering output is still correct.
I think this is probably an issue with FastANI, though it's not clear to me what the issue is. The only way I can think of to tell which primary cluster the missing genome is in is to re-run the program with the --debug flag. This will create an additional file called CdbF.csv in the data_tables folder, which contains the clustering information for all genomes based on primary clustering only. You can also add the --SkipSecondary flag to make this extra run much faster if you'd like.
Best, Matt
Hi Matt,
Thanks for the reply. It is good that all the rest of the clusters are fine.
Best, Wang
Hi Matt,
When I run ~50,000 genomes, there is an error:
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/pandas/core/reshape/reshape.py", line 124, in __init__
raise ValueError('Unstacked DataFrame is too big, '
ValueError: Unstacked DataFrame is too big, causing int32 overflow
I think this is due to the large number of genomes, right? Is there any way to solve this problem without reducing the number of genomes? In addition, I could shorten my genome names to half their length, if that would help.
Thanks Wang
Unfortunately this error cannot be overcome with the public version of dRep. The only option is to run it in two chunks of ~25,000 genomes, and then de-replicate the results of those two runs.
I am working on the next version of dRep that will fix the problem. It's not quite finished yet, but if you'd like to give it a shot you can install it with the command pip install drep --upgrade --pre. The latest development version is v3.0.0.dev6.
Once on the development version, you can add the flag --multiround_primary_clustering, which will make it run much faster and work much better overall for large genome sets like yours.
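For intuition about why chunks of ~25,000 genomes avoid the error: unstacking the all-vs-all comparison table produces an n x n matrix, and this pandas version indexes its cells with a signed 32-bit integer. A quick sketch of the arithmetic:

```python
# "Unstacked DataFrame is too big, causing int32 overflow" fires when the
# pivoted n x n matrix has more cells than a signed 32-bit index can hold.
INT32_MAX = 2**31 - 1  # 2,147,483,647

def unstack_fits_int32(n_genomes):
    return n_genomes * n_genomes <= INT32_MAX

print(unstack_fits_int32(50_000))  # False: 2.5e9 cells overflow int32
print(unstack_fits_int32(25_000))  # True: each ~25k chunk fits
```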