
are there any shared OTUs between all ecosystems surveyed

Open gregcaporaso opened this issue 11 years ago • 13 comments

gregcaporaso avatar Jul 19 '12 00:07 gregcaporaso

I would do this as follows:

1 - Define a function that takes a biom-table object and returns a list of the OTU ids that have a count of at least n (where n is a parameter to that function) in all samples. This would be similar to qiime.filter.filter_otus_from_otu_table, where you define a filter function that gets passed to table.filter_observations.

2 - Iterate over the list of OTU tables (see issue #25 for why that is necessary), parsing each BIOM table with biom_format.parse.parse_biom_table, passing the table to the function defined in step 1, and storing the list of OTUs that is returned.

3 - Take the intersection of the results of step 2.

I wouldn't be surprised if the answer was that no OTUs are shared across all of the samples. In that case it may be worth investigating whether there are OTUs that are present in at least 99% of the samples (or some other threshold), making the percentage parameterizable. You'd achieve this by building a different filter function that gets used in step 1.
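The three steps above, plus the relaxed-threshold variant, can be sketched in plain Python. This is only a minimal sketch and not the QIIME or biom-format API: each parsed table is stood in for by a hypothetical dict of the form {otu_id: {sample_id: count}}, and the names `otus_present`, `shared_otus`, and `min_fraction` are assumptions made for illustration. Note that with `min_fraction` below 1.0 the threshold is applied within each table before intersecting, which is one possible reading of "at least 99% of the samples".

```python
from functools import reduce


def otus_present(table, sample_ids, n=1, min_fraction=1.0):
    """Step 1: return the set of OTU ids with a count >= n in at least
    min_fraction of the samples listed in sample_ids.

    `table` is a hypothetical stand-in for a parsed BIOM table:
    {otu_id: {sample_id: count}}; a missing sample counts as 0.
    """
    if not sample_ids:
        return set()
    required = min_fraction * len(sample_ids)
    kept = set()
    for otu_id, counts in table.items():
        # Number of samples in which this OTU reaches the count threshold.
        hits = sum(1 for s in sample_ids if counts.get(s, 0) >= n)
        if hits >= required:
            kept.add(otu_id)
    return kept


def shared_otus(tables, n=1, min_fraction=1.0):
    """Steps 2 and 3: apply otus_present to each (table, sample_ids) pair
    and intersect the resulting sets."""
    per_table = [otus_present(t, s, n, min_fraction) for t, s in tables]
    return reduce(set.intersection, per_table) if per_table else set()
```

Calling `shared_otus(tables)` with the default `min_fraction=1.0` corresponds to the strict "present in all samples" case; lowering it swaps in the looser filter without changing steps 2 and 3.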

gregcaporaso avatar Aug 10 '12 19:08 gregcaporaso

@lkursell, let me know if you need anything as you work up the test code today.

gregcaporaso avatar Aug 14 '12 16:08 gregcaporaso

@gregcaporaso, the biom tables in the dropbox per_study folder all have "metadata": null, am I looking at the wrong tables?

lukeursell ~/Desktop/code_emp/isme14/per_study_otu_tables $grep -c taxonomy *.biom
otu_table_mc2_1031.biom:0
otu_table_mc2_1034.biom:0
otu_table_mc2_1035.biom:0
otu_table_mc2_1036.biom:0
otu_table_mc2_1037.biom:0
otu_table_mc2_1038.biom:0
otu_table_mc2_1039.biom:0
otu_table_mc2_1222.biom:0
otu_table_mc2_1235.biom:0
otu_table_mc2_1240.biom:0
otu_table_mc2_1242.biom:0
otu_table_mc2_1288.biom:0
otu_table_mc2_1289.biom:0
otu_table_mc2_1453.biom:0
otu_table_mc2_1526.biom:0
otu_table_mc2_550.biom:0
otu_table_mc2_632.biom:0
otu_table_mc2_638.biom:0
otu_table_mc2_659.biom:0
otu_table_mc2_662.biom:0
otu_table_mc2_678.biom:0
otu_table_mc2_722.biom:0
otu_table_mc2_723.biom:0
otu_table_mc2_808.biom:0
otu_table_mc2_809.biom:0
otu_table_mc2_810.biom:0
otu_table_mc2_925.biom:0
otu_table_mc2_933.biom:0

lkursell avatar Aug 14 '12 18:08 lkursell

No, taxonomy assignment hasn't completed, but I don't think you need it for this - do you? Your analysis should be at the OTU level.

gregcaporaso avatar Aug 14 '12 18:08 gregcaporaso

OK, I'll just report these: emp.isme14..14.CleanUp.ReferenceOTU368968

lkursell avatar Aug 14 '12 18:08 lkursell

Thanks. Report the sequences as well and we can classify those directly (which will obviously go much faster than running it for all of them). One easy way to do that would be to have your code output a file where each line contains tab-separated text. The first field should be the OTU id, with subsequent fields holding information such as how many samples/biomes it showed up in. You can then call filter_fasta.py, passing that file as -s and new_refseqs.fna.gz as -f, to get the corresponding sequences, and classify the resulting fasta file with assign_taxonomy.py (retraining against Greengenes).
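A minimal sketch of writing such a tab-separated file, assuming the per-OTU tallies have already been computed. The function name `write_otu_report` and the shape of `otu_stats` are hypothetical, chosen only for illustration; the one fixed requirement from the comment above is that the OTU id comes first on each line.

```python
import csv


def write_otu_report(path, otu_stats):
    """Write one tab-separated line per OTU: id, n_samples, n_biomes.

    `otu_stats` is a hypothetical mapping {otu_id: (n_samples, n_biomes)}.
    The OTU id is always the first field so the file can double as an
    id list for downstream filtering.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for otu_id in sorted(otu_stats):
            n_samples, n_biomes = otu_stats[otu_id]
            writer.writerow([otu_id, n_samples, n_biomes])
```

The file written this way is what would then be passed as -s to filter_fasta.py, as described above.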

Will that work? Sorry for the round-about way of getting these data!

You can find new_refseqs.fna.gz here: https://github.com/EarthMicrobiomeProject/isme14/blob/master/new_refseqs.fna.gz?raw=true

gregcaporaso avatar Aug 14 '12 18:08 gregcaporaso

I'm trying to get the new_refseqs file, but I get an 'Error: blob is too big' when clicking the link in Firefox or Chrome, or when using curl from the terminal...

lkursell avatar Aug 17 '12 17:08 lkursell

If you do a clone of the repo you'll have access to it:

git clone https://github.com/EarthMicrobiomeProject/isme14.git


jairideout avatar Aug 17 '12 17:08 jairideout

Hey Guys,

Do we have a list for this yet?

gilbertjack avatar Aug 19 '12 13:08 gilbertjack

@mortonjt Are you interested in taking this one on?

cuttlefishh avatar Jun 14 '16 16:06 cuttlefishh

Also want to assign @amnona but it's not letting me.

cuttlefishh avatar Jul 15 '16 16:07 cuttlefishh

Any update on this?

mariaasierra avatar Apr 06 '19 13:04 mariaasierra

Hi @alehsierra, there was some discussion over on @biocore/american-gut-devs about doing this for the American Gut project (https://github.com/biocore/American-Gut), but that wouldn't address a diversity of environments. It would be easy to do with the EMP BIOM table and mapping file. In our analysis, we did look at the most prevalent sequences across the dataset (https://media.nature.com/original/nature-assets/nature/journal/v551/n7681/extref/nature24621-s4.xlsx). For this issue, it would simply be a matter of cross-referencing the BIOM table with the environment types (e.g. empo_3) in the mapping file and seeing which sequences are found at least once in at least one sample of every sample type.
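That cross-reference could look roughly like the sketch below. It is only an illustration under stated assumptions: the BIOM table is stood in for by a hypothetical dict {sequence_id: {sample_id: count}}, the mapping file by a dict {sample_id: environment_type} (for example, the empo_3 column), and the name `sequences_in_every_type` is invented here. Samples missing from the mapping are ignored.

```python
def sequences_in_every_type(table, sample_to_type, min_count=1):
    """Return the set of sequence ids found (count >= min_count) in at
    least one sample of every environment type in sample_to_type.

    `table` is a hypothetical stand-in for the BIOM table:
    {sequence_id: {sample_id: count}}.
    `sample_to_type` stands in for one mapping-file column,
    e.g. {sample_id: empo_3 label}.
    """
    all_types = set(sample_to_type.values())
    if not all_types:
        return set()
    shared = set()
    for seq_id, counts in table.items():
        # Environment types in which this sequence was observed.
        types_seen = {
            sample_to_type[sample_id]
            for sample_id, count in counts.items()
            if count >= min_count and sample_id in sample_to_type
        }
        if types_seen == all_types:
            shared.add(seq_id)
    return shared
```

Raising `min_count` gives a stricter notion of "found" without changing the cross-reference itself.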

cuttlefishh avatar Apr 08 '19 19:04 cuttlefishh