100 most wanted list

Open gregcaporaso opened this issue 12 years ago • 31 comments

The OTUs that are abundant across many environment types and distant from sequences in Greengenes/NCBI. We'll have to develop a sorting scheme for this, but it would be a way to provide a list of the "most wanted" OTUs, i.e. the high-abundance cosmopolitan organisms that are not well characterized.

gregcaporaso avatar Aug 06 '12 17:08 gregcaporaso

Greg and I discussed this and decided on a sorting scheme. The most wanted list will only include "new" OTUs (i.e. ones that were created de novo, not from greengenes).

Sorting priorities:

  1. Sort by the number of environments the OTU is found in.
  2. Sort by the total count across all environments.
  3. Sort by % dissimilarity to greengenes.
  4. Sort by % dissimilarity to NCBI nr database.
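A minimal sketch of the multi-level sort above, assuming every key is sorted in descending order (the thread does not state a direction) and that earlier keys take priority, with later keys only breaking ties. The record type, field names, and OTU IDs are hypothetical.

```python
from typing import NamedTuple


class OtuSummary(NamedTuple):
    otu_id: str
    num_environments: int      # priority 1
    total_count: int           # priority 2
    gg_dissimilarity: float    # priority 3, percent vs. Greengenes
    nr_dissimilarity: float    # priority 4, percent vs. NCBI nr


def sort_most_wanted(otus):
    """Sort by a tuple key so later fields only break ties in earlier ones."""
    return sorted(
        otus,
        key=lambda o: (o.num_environments, o.total_count,
                       o.gg_dissimilarity, o.nr_dissimilarity),
        reverse=True,  # assumption: larger values rank first for every key
    )


example = [
    OtuSummary("New.0", 3, 900, 5.0, 4.0),
    OtuSummary("New.1", 5, 120, 22.0, 10.0),
    OtuSummary("New.2", 5, 120, 25.0, 3.0),
]
ranked_ids = [o.otu_id for o in sort_most_wanted(example)]
```

Here New.1 and New.2 tie on the first two keys, so the Greengenes dissimilarity decides their order.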

Output should include a tab-separated table containing the sorted most wanted OTU IDs, sequence, greengenes assigned taxonomy, and NCBI closest sequence link.
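A rough sketch of the tab-separated output using the standard `csv` module. The column labels, dictionary keys, and the placeholder link are assumptions for illustration only; the thread names the four columns but not their exact headers.

```python
import csv
import io

# Hypothetical header labels for the four columns named above.
HEADER = ["OTU ID", "Sequence", "Greengenes taxonomy", "NCBI closest sequence link"]


def write_most_wanted_tsv(rows, out):
    """Write already-sorted OTU rows to `out` as a tab-separated table."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        writer.writerow([row["otu_id"], row["sequence"],
                         row["taxonomy"], row["ncbi_link"]])


buf = io.StringIO()
write_most_wanted_tsv(
    [{"otu_id": "New.1", "sequence": "ACGT", "taxonomy": "k__Bacteria",
      "ncbi_link": "https://example.org/placeholder"}],
    buf,
)
tsv_text = buf.getvalue()
```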

Additional output should be an HTML table (for easy integration into the EMP website) that contains the information above plus a piechart showing the abundance of the OTU in each environment.
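For the piechart, each OTU's per-environment counts would need to be turned into shares. A small sketch of that step, assuming the counts are available as a plain mapping from environment name to count (the environment names here are made up):

```python
def environment_percentages(counts_by_environment):
    """Convert raw per-environment counts for one OTU into percentages."""
    total = sum(counts_by_environment.values())
    if total == 0:
        # Avoid dividing by zero for an OTU with no observations.
        return {env: 0.0 for env in counts_by_environment}
    return {env: 100.0 * n / total
            for env, n in counts_by_environment.items()}


shares = environment_percentages({"soil": 30, "marine": 10, "gut": 60})
```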

jairideout avatar Aug 06 '12 21:08 jairideout

It looks like this approach probably won't work because the top N results will just be abundant OTUs found in many environments that will (most likely) be very similar to either gg or the nt database. The alternative is to first sort by % dissimilarity to the databases, but this could be really expensive (i.e. take a long time to complete, time we don't have).

A filter-based approach (as opposed to multiple levels of sorting) will probably work better. Greg previously did something similar and it seemed to work okay (though some of the parameters may need to be played with to get a good list).

  1. Filter to only include novel OTUs.
  2. Filter to only include high abundance OTUs, within a specified range (e.g. 100 < OTU count < 500).
  3. Filter to only include OTUs that are in at least N environments/sample types.
  4. Filter to only include OTUs that are at least some percent dissimilar from Greengenes (e.g. 20% dissimilar).
  5. BLAST the rest against nt and sort by % dissimilarity.
  6. Pick the top N from those.
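The steps above can be sketched roughly as follows. The point of the ordering is that the cheap filters (steps 1-4) run first, so the expensive BLAST step only sees the survivors. Everything here is a hypothetical stand-in: the record keys, the thresholds, and `blast_dissimilarity`, which represents whatever call returns percent dissimilarity to nt for a sequence.

```python
def most_wanted(otus, blast_dissimilarity, *, min_count, max_count,
                min_environments, min_gg_dissimilarity, top_n):
    """Filter cheaply first, then score only the survivors against nt."""
    survivors = [
        o for o in otus
        if o["is_novel"]                                     # step 1
        and min_count < o["count"] < max_count               # step 2
        and len(o["environments"]) >= min_environments       # step 3
        and o["gg_dissimilarity"] >= min_gg_dissimilarity    # step 4
    ]
    # Step 5: the expensive lookup runs once per surviving OTU only.
    scored = [(blast_dissimilarity(o["sequence"]), o) for o in survivors]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [o for _, o in scored[:top_n]]                    # step 6


blast_calls = []


def fake_blast(sequence):
    """Stand-in for a real BLAST run; records which sequences were queried."""
    blast_calls.append(sequence)
    return {"AAAA": 12.0, "CCCC": 30.0}[sequence]


four_envs = {"soil", "marine", "gut", "air"}
otus = [
    {"otu_id": "New.1", "sequence": "AAAA", "is_novel": True, "count": 150,
     "environments": four_envs, "gg_dissimilarity": 21.0},
    {"otu_id": "New.2", "sequence": "CCCC", "is_novel": True, "count": 300,
     "environments": four_envs, "gg_dissimilarity": 25.0},
    {"otu_id": "New.3", "sequence": "GGGG", "is_novel": True, "count": 5000,
     "environments": four_envs, "gg_dissimilarity": 30.0},   # too abundant
    {"otu_id": "Ref.1", "sequence": "TTTT", "is_novel": False, "count": 200,
     "environments": four_envs, "gg_dissimilarity": 1.0},    # not novel
]
top = most_wanted(otus, fake_blast, min_count=100, max_count=500,
                  min_environments=4, min_gg_dissimilarity=20.0, top_n=10)
```

In this toy run only New.1 and New.2 reach the BLAST stand-in, which is the cost saving the filter-based approach is after.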

We'll see how this works...

jairideout avatar Aug 08 '12 17:08 jairideout

We could just look at the ones that were new clusters (i.e. don't have gg ids because they failed ref picking), right?

rob-knight avatar Aug 08 '12 18:08 rob-knight

Yes, that will be the first step in the process, but I think we'll need to do additional filtering (steps 2-5) to get a good list, because many of these novel OTUs might be very similar to either gg seqs or nt seqs.

jairideout avatar Aug 08 '12 19:08 jairideout

@meganap, would you be able to help @jrrideout with some css magic to make the html table that he's putting together for this look a little nicer?

gregcaporaso avatar Aug 13 '12 23:08 gregcaporaso

sure no prob

meganap avatar Aug 14 '12 02:08 meganap

@meganap awesome, thanks! I'm finishing up some changes tonight and will have the table in the repo sometime tomorrow. Will let you know when it is ready.

jairideout avatar Aug 14 '12 05:08 jairideout

Once @meganap takes a crack at it, it'd be best to include her css in the html generation code for future runs.

gregcaporaso avatar Aug 14 '12 13:08 gregcaporaso

@meganap, the table is in the repo now under isme14/most_wanted_otus/most_wanted_otus.html. To view it, open it up in a web browser (I've tried out Chrome and Firefox) and it should find all of the other files it needs (they are all under that same directory).

I tried to keep styling to a minimum. The table has the id 'most_wanted_otus_table' and each of the subtables for the piechart legends has the class 'most_wanted_otus_legend'. If there's anything else I can do from my end to make this HTML easier to style, please let me know.

I think the goal was to add this table to one of the EMP webpages. Thus, I'm not sure if we should directly add the CSS to the table-generating code as @gregcaporaso suggested, because it may be better to just use the EMP CSS stylesheets that are already in use on the website. You may need to get in touch with @douginator2000 to get access to those if you don't have them already. If we go this route, the table-generating code will be able to create generic tables which can then be styled according to whatever website scheme they might be dropped into (thinking of additional uses for this table besides the EMP website).

Thanks again for your help with this, and please let me know if you come across any issues.

@gregcaporaso, this most wanted table does not include OTU tables 1288, 933, and 550 because they were too big to filter on an m2.4xlarge EC2 instance. You mentioned offline that there might be a way to get access to a node with more memory (>69GB). Do you still want to go this route, or just use the table that we have?

jairideout avatar Aug 14 '12 22:08 jairideout

@meganap, I forgot to mention that the second column in the HTML table needs to keep its contents formatted as-is (I'm using pre tags currently, maybe there is a better way to do this though). We just need to keep it formatted with fixed-width font and have those linebreaks respected.
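A small sketch of how a fixed-width, line-broken cell like this could be produced. The helper name and the wrap width are assumptions, not the actual table-generating code; the only thing taken from the thread is the use of pre tags.

```python
import html
import textwrap


def sequence_cell(sequence, width=60):
    """Return a table cell whose line breaks and fixed-width font are kept."""
    # textwrap.wrap splits a long unbroken string every `width` characters.
    wrapped = "\n".join(textwrap.wrap(sequence, width))
    # Escape the content so it is always safe to embed in HTML.
    return "<td><pre>{}</pre></td>".format(html.escape(wrapped))


cell = sequence_cell("ACGT" * 5, width=8)
```

An alternative to pre tags is a CSS rule on the cell such as `white-space: pre; font-family: monospace;`, which keeps the markup generic and leaves presentation to the stylesheet.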

jairideout avatar Aug 14 '12 22:08 jairideout

@jrrideout cool, I'll take a crack at this tomorrow

meganap avatar Aug 14 '12 22:08 meganap

Thanks guys!

@gregcaporaso, this most wanted table does not include OTU tables 1288, 933, and 550 because they were too big to filter on an m2.4xlarge EC2 instance.

I think we just have to go with this for right now, but for the paper we'll get this running on a system with more memory.

gregcaporaso avatar Aug 14 '12 22:08 gregcaporaso

@douginator2000, when this is ready could you add another collapsible section on the EMP login page (same place as the summary statistics, etc)?

gregcaporaso avatar Aug 16 '12 13:08 gregcaporaso

@gregcaporaso @jrrideout Sorry I didn't get a chance to work on this yet since I was working on figures for other isme stuff, but is there still time for this?

meganap avatar Aug 17 '12 20:08 meganap

I think there is (though I'm not 100% sure what the deadline was for this). I'm available to help however I can from my end of things.

jairideout avatar Aug 17 '12 20:08 jairideout

Yes still useful, deadline sunday

rob-knight avatar Aug 17 '12 20:08 rob-knight

hey @jrrideout I noticed that there aren't any html headers for the file and that it just starts off with divs. Is there a reason for this? Adding css styling is only possible if we have html headers.

meganap avatar Aug 17 '12 21:08 meganap

@meganap, @gregcaporaso requested that I only output the HTML table so that it could be easily dropped into a webpage. Please feel free to modify/add to the HTML as needed to style it (this table will ultimately need to be added to the EMP login page).

jairideout avatar Aug 17 '12 22:08 jairideout

@jrrideout I've edited the script that writes the html so it writes some stuff in a different way, can you send me the full command you used to run that script (like where the test files are?) so that I can rerun it?

meganap avatar Aug 17 '12 23:08 meganap

@meganap I'll have to rerun it because it requires the entire nt database, and everything is already set up for this in an EC2 instance. Can you please update the accompanying unit tests and check in your changes? Once they're in, I'll rerun it and commit the latest results to the repo. It won't take long to run.

jairideout avatar Aug 18 '12 01:08 jairideout

@meganap The changes are in; please let me know if you run into any issues.

jairideout avatar Aug 18 '12 02:08 jairideout

@douginator2000 this is all ready to go. All relevant files are under isme14/most_wanted_otus/. The only file that you can exclude from there is 'analysis_notes.txt'. Thanks!

@meganap thanks for your help in spicing up the table- it looks really good!

jairideout avatar Aug 18 '12 04:08 jairideout

Hey guys, This is awesome, thanks! Doug, could you get this accessible via the EMP site?

In the meantime I posted here to make it easier for everyone else to see: https://dl.dropbox.com/u/2868868/most_wanted_otus/most_wanted_otus.html

One thing we'll want to do is include the number of samples for each of the metadata categories in addition to the percentage, but I think that can wait. (Thanks for the suggestion Daniel!)

Greg

gregcaporaso avatar Aug 19 '12 09:08 gregcaporaso

Yes this is spectacular -- thanks for putting together! Could we get a tree showing where in phylogeny the 100 most wanted are?

rob-knight avatar Aug 19 '12 13:08 rob-knight

Am I right to think that the criteria for this are those that @jrrideout came up with:

  1. Filter to only include novel OTUs.
  2. Filter to only include high abundance OTUs, within a specified range (e.g. 100 < OTU count < 500).
  3. Filter to only include OTUs that are in at least N environments/sample types.
  4. Filter to only include OTUs that are at least some percent dissimilar from Greengenes (e.g. 20% dissimilar).
  5. BLAST the rest against nt and sort by % dissimilarity.

gilbertjack avatar Aug 19 '12 13:08 gilbertjack

Yes, that's right. @jrrideout, correct us if we're wrong here.

gregcaporaso avatar Aug 19 '12 13:08 gregcaporaso

ok but what were the N's for these two filters:

  3. Filter to only include OTUs that are in at least N environments/sample types.
  4. Filter to only include OTUs that are at least some percent dissimilar from Greengenes (e.g. 20% dissimilar).

gilbertjack avatar Aug 19 '12 14:08 gilbertjack

@gilbertjack The steps 1-5 listed above are what I used. Here are the parameters I ended up using:

  1. filtered out against gg 97
  2. abundance: 100 < OTU count < 500
  3. at least 4 environments
  4. included only OTUs that were at least 20% dissimilar (according to uclust) from gg 97
  5. only included OTUs that were 97% similar or less compared to the NCBI nt database (according to blastall)
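Note that these parameters mix two units: step 4 is stated as percent dissimilarity, step 5 as percent similarity. A sketch that collects the reported values in one place and makes the conversion explicit, assuming dissimilarity is simply 100 minus percent similarity (the class and function names are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MostWantedParams:
    min_count: int = 100                 # exclusive lower bound on OTU count
    max_count: int = 500                 # exclusive upper bound on OTU count
    min_environments: int = 4
    min_gg_dissimilarity: float = 20.0   # percent, vs. gg 97 (uclust)
    max_nt_similarity: float = 97.0      # percent, vs. NCBI nt (blastall)


def passes(params, count, num_environments, gg_similarity, nt_similarity):
    """Check one novel OTU against the thresholds from steps 2-5."""
    # Assumption: dissimilarity = 100 - percent similarity.
    gg_dissimilarity = 100.0 - gg_similarity
    return (params.min_count < count < params.max_count
            and num_environments >= params.min_environments
            and gg_dissimilarity >= params.min_gg_dissimilarity
            and nt_similarity <= params.max_nt_similarity)


params = MostWantedParams()
kept = passes(params, count=250, num_environments=5,
              gg_similarity=78.0, nt_similarity=95.0)
```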

So we only ended up with 45 OTUs that were left over after all of that filtering. Please let me know if you have any additional questions regarding how this list was generated.

@gregcaporaso @rob-knight I think these feature requests sound great, though I will not have time to work on them to meet the deadline today.

jairideout avatar Aug 19 '12 17:08 jairideout

Thanks a lot!

gregcaporaso avatar Aug 19 '12 17:08 gregcaporaso

AWESOME, thanks

gilbertjack avatar Aug 19 '12 17:08 gilbertjack