SqueezeMeta icon indicating copy to clipboard operation
SqueezeMeta copied to clipboard

estimating gene abundances from each metagenome

Open mars188 opened this issue 2 years ago • 18 comments

Hello,

I already have several bins/metagenomes that I obtained from other software. I only want to estimate gene abundances from these bins. I know squeezemeta can estimate gene abundance from individual sample but not sure how can I do this with genomes.

Can you please direct me to the script that shows how to estimate genes abundance from each genome/bin (not sample)? Second part of the question is can I use genomes obtained from other software or I need to use squeezemeta from A-Z for this purpose?

Many thanks in advance!

mars188 avatar Jun 14 '22 05:06 mars188

Hello! SqueezeMeta calculates abundance measures for bins by default. You can do what you ask for, but it requieres running several parts of the pipeline. I will assume that you are using version 1.5 of SqueezeMeta. Otherwise I encourage you to update it. Supposing you already have an assembly with contigs, as it seems, run SqueezeMeta using the -extassembly option and specifying also -test 3 to stop after step 3 (gene prediction). You need the abundances of the contigs, so you must run step 10: perl 10.mapsamples.pl <project> Then, in the results directory, create a directory named "bins". Then, copy your bins, in fasta format, to that directory And then, run step 18 of the pipeline: perl 18.getbins.pl <project>

I am not sure if this will work fine. For sure you will be missing some information, for instance the taxonomy of the bins and especially the values for completeness and contamination.

A safer approach would be to run the full SqueezeMeta pipeline until the end (again using the -extassembly option to use your contigs), then substitute the bins in rhe results/bins directory for your own, and restart the pipeline from step 16: perl restart.pl <project> -step 16. That will work for sure. Best, Javier

jtamames avatar Jun 14 '22 09:06 jtamames

Thank you so much for your quick response. What you said totally makes sense so I would go with the second option that you mentioned. I am going to run the pipeline and will get back to you if I need further help.

Many thanks again for your time and support!

mars188 avatar Jun 14 '22 17:06 mars188

Hello jtamames,

As you suggested, I completed the full squeezemeta pipeline until the end using -extassembly. Here is the command: SqueezeMeta.pl -m coassembly -p mangrove -s envi.samples -f clean_reads -extassembly megahit/final_assembly.fasta -t 124

After that, I replaced the results/bins with my own 17 bins. And restarted the analysis from step 16 as below: restart.pl -step 16 mangrove

However, it produced the following error shortly after the run started.

[0 seconds]: STEP16 -> BIN TAX ASSIGNMENT: 16.addtax2.pl [19 seconds]: STEP17 -> CHECKING BINS: 17.checkM_batch.pl Evaluating bins with CheckM (Parks et al 2015, Genome Res 25, 1043-55)

Reading /scratch/gencore/conda3/envs/SqueezeMeta/SqueezeMeta/data/alltaxlist.txt Looking for DAS bins in /scratch/gencore/ma5877/metaphlan/environ_analysis/original/metawrap/squeezemeta/mangrove/results/bins 17 bins found

Bin 1/17: bin10.fa.tax Bin 2/17: bin54.fa.tax Bin 3/17: bin41.fa.tax Bin 4/17: bin72.fa.tax Bin 5/17: bin48.fa.tax Bin 6/17: bin40.fa.tax Bin 7/17: bin39.fa.tax Bin 8/17: bin7.fa.tax Bin 9/17: bin17.fa.tax Bin 10/17: bin76.fa.tax Bin 11/17: bin49.fa.tax Bin 12/17: bin26.fa.tax Bin 13/17: bin24.fa.tax Bin 14/17: bin79.fa.tax Bin 15/17: bin44.fa.tax Bin 16/17: bin73.fa.tax Bin 17/17: bin55.fa.tax

Storing results for DAS in /scratch/gencore/ma5877/metaphlan/environ_analysis/original/metawrap/squeezemeta/mangrove/intermediate/17.mangrove.checkM wc: /scratch/gencore/ma5877/metaphlan/environ_analysis/original/metawrap/squeezemeta/mangrove/intermediate/17.mangrove.checkM: No such file or directory Can't find /scratch/gencore/ma5877/metaphlan/environ_analysis/original/metawrap/squeezemeta/mangrove/intermediate/17.mangrove.checkM Stopping in STEP18 -> 17.checkM_batch.pl Died at /scratch/gencore/conda3/envs/SqueezeMeta/bin/restart.pl line 444.

It probably can not find 17.mangrove.checkM that should have been generated in the previous step. Although it did not encounter this error in the first full run. Only happened when I replaced my bins and restarted from step 16.

Could you please take a look at it and what could be causing this error?

Many thanks,

mars188 avatar Jun 17 '22 03:06 mars188

Hello Could you show me the last lines of any of the .tax files, for instance bin10.fa.tax? Also, can you send me the syslog file? Best, J

jtamames avatar Jun 17 '22 05:06 jtamames

Last two lines look like this: Consensus: Total size: 692458 Disparity: 0.000 List of taxa (abundance >= 1%): k:;p:;c:;o:;f:;g:;s:;

Please see attached the syslog file. Seems like this is from the previous (full) run. For the second run (restarting from step 16), this file was not generated as I can't find it.

syslog.txt

Many thanks for your time and support!

mars188 avatar Jun 17 '22 06:06 mars188

Ok, the problem is that checkM did not run because no taxonomic annotation was produced in step 16. This is probably due to different contigs names between the assemby (megahit/final_assembly.fasta, I guess this is coming from a previous SqueezeMeta run) and the bins. Did you use this very same assembly for doing the external binning?

jtamames avatar Jun 17 '22 07:06 jtamames

Yes, I performed the external binning using this same megahit/final_assembly.fasta assembly. Only difference is that I obtained about 70 bins in total with external binning approach and provided only 17 selected bins to SqeezeMeta pipeline with -extassembly megahit/final_assembly.fasta option.

Should I provide all the 70 externally produced bins to SqueezeMeta? or what could be other options for me to move on with this?

Many thanks again for your help!

mars188 avatar Jun 19 '22 07:06 mars188

Hello No, it does not matter that you selected just a few bins. The problem likely lies in an inconsistency between names of contigs in the assembly and contigs in bins. Could you please do the following:

grep ">" 01.mangrove.fasta | head (in the results directory) head bin10.fa.tax (in the results/bin directory) head 09.mangrove.contiglog (in the intermediate directory)

And paste the results here?

jtamames avatar Jun 20 '22 08:06 jtamames

Here it is: Command: grep ">" 01.mangrove.fasta | head (in the results directory) Output: `>megahit_1

megahit_2 megahit_3 megahit_4 megahit_5 megahit_6 megahit_7 megahit_8 megahit_9 megahit_10`

Command: head bin10.fa.tax (in the results/bin directory) Output: `>NODE_1_length_57210_cov_3.299622

NODE_2_length_24250_cov_3.854681 NODE_3_length_23963_cov_4.522921 NODE_4_length_22853_cov_4.036144 NODE_5_length_17694_cov_3.412212 NODE_6_length_17196_cov_3.974037 NODE_7_length_17079_cov_4.588170 NODE_8_length_12721_cov_2.552582 NODE_9_length_11537_cov_3.686988 NODE_10_length_11492_cov_3.274786`

Command: head 09.mangrove.contiglog (in the intermediate directory) Output: #- Created by /scratch/gencore/conda3/envs/SqueezeMeta/SqueezeMeta/scripts/09.summarycontigs3.pl with data from /scratch/gencore/ma5877/metaphlan/environ_analysis/original/metawrap/squeezemeta/mangrove/results/06.mangrove.fun3.tax.wranks, mingenes=1, minconsperc_asig=0.7, minconsperc_total=0.5, euknofilter=0,Wed Jun 15 03:34:40 2022 megahit_1 k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Vibrionales;f_Vibrionaceae;g_Photobacterium g Disparity: 0.335 Genes: 213 megahit_2 k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Vibrionales;f_Vibrionaceae f Disparity: 0.018 Genes: 175 megahit_3 k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Vibrionales;f_Vibrionaceae;g_Photobacterium g Disparity: 0.298 Genes: 159 megahit_4 k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Vibrionales;f_Vibrionaceae;g_Photobacterium g Disparity: 0.044 Genes: 153 megahit_5 k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Vibrionales;f_Vibrionaceae;g_Vibrio g Disparity: 0.000 Genes: 129 megahit_6 k_Eukaryota;n_Viridiplantae;p_Streptophyta p Disparity: 0.042 Genes: 94 megahit_7 k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Vibrionales;f_Vibrionaceae;g_Photobacterium g Disparity: 0.019 Genes: 132 megahit_8 k_Archaea;p_Euryarchaeota;c_Theionarchaea c Disparity: 0.021 Genes: 127 megahit_9 k_Bacteria;p_Proteobacteria;c_Gammaproteobacteria;o_Vibrionales;f_Vibrionaceae;g_Photobacterium g Disparity: 0.265 Genes: 121

mars188 avatar Jun 20 '22 12:06 mars188

Names of contigs in the assembly and contigs in bins do look different, as you predicted. How can we fix this please?

mars188 avatar Jun 20 '22 12:06 mars188

Yep. Do you notice that names of the contigs are different in the file 01.mangrove.fasta and in bin10.fa.tax? That is the problem. What you need to do is to do your binning using the 01.mangrove.fasta file, so that the contigs have the same name. Or, alternatively, run SqueezeMeta with --extassembly and --norename to preserve the original contig names. Best, J

jtamames avatar Jun 20 '22 12:06 jtamames

Ok, thank you sooo much. Now running it again and added --norename to the existing script. Will update you on this after run finishes.

Many thanks again!

mars188 avatar Jun 20 '22 12:06 mars188

Ok. You can check right away the 01.mangrove.fasta file to be sure the names of the contigs are correct. Best, J

jtamames avatar Jun 20 '22 12:06 jtamames

Yes, the above command (grep ">" results/01.mangrove.fasta | head) produces these headers:

k119_1895242_length_219358_cov_33.0684 k119_284560_length_174923_cov_34.3021 k119_264955_length_165023_cov_36.8075 k119_1124181_length_158402_cov_32.8365 k119_93162_length_151822_cov_33.5871 k119_74209_length_150713_cov_22.9741 k119_704680_length_148697_cov_241.8619 k119_1806072_length_141893_cov_26.9927 k119_334843_length_138599_cov_32.9593 k119_877772_length_137089_cov_18.0030

That means it's not changing names so it should work this time. I will to wait until the run completes and then I will run restart.pl starting from step 16. Let's see how it goes.

Thank youuuuu so much!

mars188 avatar Jun 20 '22 14:06 mars188

Ok, the run has completed and then I submitted the restart.pl -step 16 mangrove and that has also completed successfully. Now I have all the required tables, I believe.

Next, I want visualize the the following:

  • Estimate the gene abundances in each genome (not samples)
  • Estimate the KEGG pathways abundance in each genome (not samples)

So, if I run the output in R with SQM, I should be able to visualize it or I need to perform any extra analysis? I saw in your "Using R to analyze your SQM results" tutorial where results have been plotted against samples (not genomes). However, I want to plot gene and KEGG pathway abundances across my 17 genomes (I don't care about samples).

Please let me know. You have been incredibly helpful for me!

Many many thanks again.

mars188 avatar Jun 22 '22 06:06 mars188

Dear @jtamames

Hope you're doing well. Could you please let me know how can I calculate the gene abundance in each genome?

Next, what I did is that I ran this command: SqueezeMeta.pl -m coassembly -p mangrove -s envi.samples -f metagenomes -extassembly megahit/final_assembly.fasta -t 64

Where I provided my 17 metagenomes in "metagenomes" directory instead of raw reads as samples -f metagenomes

I ran the whole pipelines and obtained the taxonomy, KEGG abundances etc as samples represent genomes now. That's good and that's what I want. Do you think that's the right way?

Now, I only need to find gene abundances in each metagenome. I couldn't find this information from R tutorial on your github page.

I really appreciate your help!

mars188 avatar Jun 27 '22 13:06 mars188

I believe I can get the genes annotated with sqm_annot.pl script but then how can I estimate gene abundance in each genome?

mars188 avatar Jun 27 '22 14:06 mars188

Where I provided my 17 metagenomes in "metagenomes" directory instead of raw reads as samples -f metagenomes

This is not the intended way to do it.

If you have complete genomes (or bins) and you want to annotate them and estimate their abundance you should instead pass them with -extassembly, so SqueezeMeta skips assembly and uses them as a reference. You must also provide the metagenomic reads wiht -f as usual. These reads will be then mapped back to the input genome/s in order to estimate the abundance of each gene and contig.

fpusan avatar Aug 01 '22 11:08 fpusan

Closing due to lack of activity, feel free to reopen

fpusan avatar Oct 26 '22 11:10 fpusan