Metacoder observation table names, calculation functions, and dplyr style additions to the package?

Open grabear opened this issue 6 years ago • 3 comments

Introduction

I'm not sure if this is relevant for all of your workflows, but I was thinking about the naming conventions used in metacoder and phyloseq.

For the project I'm working on, we decided to go with a different naming convention for the observation data (aka names(mc_obj$data)). Our project imports the data using phyloseq and then converts it with parse_phyloseq, so the context may be limited to this function:

# The metacoder_obj$data will contain the following tables.
# The keys here are the old phyloseq table names, and the values are the new table names
...
otu_table: "otu_abundance"
tax_data: "otu_annotations"
sample_data: "sample_data"
phy_tree: "phy_tree"
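As a rough illustration of the mapping above, here is a minimal base R sketch that renames the entries of a plain named list. The helper `rename_tables` and the toy `tables` list are hypothetical stand-ins, not part of metacoder; a real metacoder object keeps its tables in `obj$data`.

```r
# Old phyloseq-style names mapped to the proposed names (taken from the list above).
new_names <- c(
  otu_table   = "otu_abundance",
  tax_data    = "otu_annotations",
  sample_data = "sample_data",
  phy_tree    = "phy_tree"
)

# Hypothetical helper: rename list entries found in `mapping`, leave others untouched.
rename_tables <- function(tables, mapping) {
  old <- names(tables)
  known <- old %in% names(mapping)
  names(tables)[known] <- unname(mapping[old[known]])
  tables
}

# Toy stand-in for the tables of a metacoder object.
tables <- list(otu_table = data.frame(), tax_data = data.frame(), extra = 1)
renamed <- rename_tables(tables, new_names)
names(renamed)
```

Tables that are not in the mapping (here `extra`) keep their names, so the helper is safe to call on objects that hold additional user tables.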

(Note: The "otu_annotations" table makes more sense with respect to https://github.com/grunwaldlab/metacoder/pull/253.) For the tables created by calculations, I decided to use the names below to match what we had above; they seem more intuitive.

# These new tables are created by metacoder's calculation functions
...
calc_taxon_abund(otu_abundance): "taxa_abundance"
calc_obs_props(otu_abundance): "otu_proportions"
calc_obs_props(otu_proportions): "taxa_proportions"
...

Questions

1. Do you like these ideas?

  • I think it would make more sense to the end user, especially those who are newer to microbiome work.
    • We could make them shorter too:
    otu_table: 
        option1: "otu_abund"
        option2: "otu_counts"
    tax_data: 
        option1: "otu_anno"
        option2: "otu_taxa"
    sample_data: 
        option1: "sample_data"
        option2: "metadata"
    phy_tree: "phy_tree"
    

2. Would you consider manipulating the way the calc_* functions work?

  • I don't want to change the underlying functionality, I just want to add an additional way to direct the output:

        # Allowing the user to do this:
        > mc_obj$data$new_table <- calc_*(mc_obj, ...)
        # And this:
        > calc_*(mc_obj, new="new_table")
        # The table is added to mc_obj$data implicitly unless directed
    
    

3. Would you consider adding functionality to the calc_* functions so that they generate default observation table names based on verified data types (e.g. phyloseq)?

  • Below are my ideas with shortened naming conventions:
...
# if data = "otu_abund", then table_name = "taxa_abund"
calc_taxon_abund(otu_abund): "taxa_abund"
# if data = "otu_abund", then table_name = "otu_prop"
calc_obs_props(otu_abund): "otu_prop"
# if data = "otu_prop", then table_name = "taxa_prop"
calc_obs_props(otu_prop): "taxa_prop"
...
  • This would probably be hard locked to specific circumstances like parse_phyloseq in the short term.
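One way to picture the default-name idea is a small lookup table keyed by function name and input table name. The sketch below is base R only; `default_names` and `default_table_name` are hypothetical names made up for illustration, and the entries simply copy the examples listed above.

```r
# Hypothetical lookup: for each calc function, input table name -> default output name.
default_names <- list(
  calc_taxon_abund = c(otu_abund = "taxa_abund"),
  calc_obs_props   = c(otu_abund = "otu_prop", otu_prop = "taxa_prop")
)

# Return the default output name, or fail loudly when the input is not recognised.
default_table_name <- function(fun_name, data) {
  known <- default_names[[fun_name]]
  if (is.null(known) || !(data %in% names(known))) {
    stop("No default table name for ", fun_name, "(", data, "); please supply one.")
  }
  unname(known[data])
}

default_table_name("calc_obs_props", "otu_abund")
```

Failing for unrecognised inputs keeps the behaviour tied to known cases, such as tables produced by parse_phyloseq, instead of guessing a name.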

Final

I can work on most of these items on my fork.

grabear, Nov 16 '18 19:11

Thanks @grabear, I will try to look at this more closely this weekend and get back to you.

zachary-foster, Nov 17 '18 00:11

Follow Up

I'm working on some of this in a private repository for one of my projects. Maybe when we finish the package I can present to you what I've done and we can work from there?

The package will also include a workflow that uses otu_ids (https://github.com/grunwaldlab/metacoder/pull/253), correlation plots and an agglomeration function from https://github.com/grunwaldlab/metacoder/issues/234, and other phyloseq-style data filtering functions for metacoder objects:

  • filter by the coefficient of variation
  • prevalence filtering (taxonomic and OTU-based)
  • filter OTUs by overall proportion (master threshold)
  • filter samples with zero abundance
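To make a few of these filters concrete, here is a minimal base R sketch on a plain count matrix (rows are OTUs, columns are samples). The function names, thresholds, and toy numbers are all assumptions for illustration; they are not the functions from the package mentioned above, and the direction of the coefficient-of-variation filter (keeping the more variable OTUs) is a guess.

```r
# Toy OTU count matrix with made-up numbers.
counts <- matrix(
  c(10, 0,  5,  0,
     0, 0,  1,  0,
    20, 0, 30, 25),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("otu_1", "otu_2", "otu_3"), c("s1", "s2", "s3", "s4"))
)

# Drop samples (columns) whose total abundance is zero.
drop_empty_samples <- function(x) x[, colSums(x) > 0, drop = FALSE]

# Keep OTUs observed (count > 0) in at least `min_prevalence` of the samples.
filter_prevalence <- function(x, min_prevalence) {
  prevalence <- rowMeans(x > 0)
  x[prevalence >= min_prevalence, , drop = FALSE]
}

# Keep OTUs whose coefficient of variation (sd / mean) exceeds `min_cv`.
# Rows with an undefined CV (for example, all zeros) are dropped.
filter_cv <- function(x, min_cv) {
  cv <- apply(x, 1, sd) / rowMeans(x)
  keep <- !is.na(cv) & cv > min_cv
  x[keep, , drop = FALSE]
}

x <- drop_empty_samples(counts)       # removes sample s2
x <- filter_prevalence(x, 0.5)        # removes otu_2 (seen in 1 of 3 samples)
x <- filter_cv(x, 0.5)                # removes otu_3 (CV of 0.2)
rownames(x)
```

Using `drop = FALSE` throughout keeps the result a matrix even when only one row or column survives a filter.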

I've also created some functions based on my suggestions above:

  • format_metacoder_object renames tables, and creates the proportions and taxa tables
  • validate_metacoder_object validates that the metacoder object has been formatted
    • this is needed for the other filtering functions to work properly.

New Items

4. Would you consider letting me add some calc_*/dplyr-style functions that take a function as a parameter and allow you to transform the table based on that function?

  • I'm thinking in the same vein as phyloseq::transform_sample_counts, phyloseq::filter_taxa, and metacoder::calc_obs_props, but as a taxa::transform_obs or metacoder::calc_obs_trans function that lets you transform row-wise (OTUs/taxon_id) or column-wise (samples).
    • This could be broken up into row/column transformation functions.
    • This could be
  • Something like metacoder::calc_stat for rowwise/columnwise calculations.
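A minimal sketch of the function-as-parameter idea in base R might look like the following. `transform_obs_sketch` is a hypothetical name and works on a plain matrix, not on a metacoder object; it assumes the supplied function returns as many values as it receives.

```r
# Hypothetical helper: apply `fun` to every column (samples) or every row (OTUs).
# `fun` must return a vector of the same length as its input.
transform_obs_sketch <- function(x, fun, by = c("col", "row")) {
  by <- match.arg(by)
  out <- x
  if (by == "col") {
    for (j in seq_len(ncol(x))) out[, j] <- fun(x[, j])
  } else {
    for (i in seq_len(nrow(x))) out[i, ] <- fun(x[i, ])
  }
  out
}

counts <- matrix(c(1, 3, 2, 2), nrow = 2,
                 dimnames = list(c("otu_1", "otu_2"), c("s1", "s2")))

# Column-wise: counts to per-sample proportions.
props <- transform_obs_sketch(counts, function(v) v / sum(v), by = "col")

# Row-wise: centre each OTU on its mean across samples.
centred <- transform_obs_sketch(counts, function(v) v - mean(v), by = "row")
```

A single function with a `by` argument and two separate row/column functions would behave the same; the split version mainly makes call sites easier to read.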

5. Would you consider adding obj$otu_id() or obj$alt_id() to taxa::all_names()?

  • This way you can filter by alternate IDs, including OTU, GenBank, NCBI, etc.

grabear, Nov 19 '18 19:11

1) renaming parse_phyloseq output

"otu_table", "sample_data", "tax_table", and "phy_tree" were chosen because those are the names used in a phyloseq object. I agree that "otu_abund" is better than "otu_table", but the rest seem fine as they are. I don't use phyloseq much, but I assumed people would be less confused if the names stayed the same?

2) Would you consider manipulating the way the calc_* functions work?

Yea, the obj$data$my_table <- thing gets old. Sure, calc_*(..., out = "my_table") sounds good. It still should return the created table though, but with invisible(), so you don't see it printed on the screen, since I would like the return type to be consistent.
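The behaviour described here could look roughly like the base R sketch below. Everything in it is hypothetical: `make_obj` uses an environment as a stand-in for a taxmap-like object (so that changes made inside a function are visible to the caller, as with R6 objects), and `calc_props_sketch` stands in for a calc_* function with an `out` argument.

```r
# Stand-in for a taxmap-like object with reference semantics.
make_obj <- function(tables) {
  obj <- new.env()
  obj$data <- tables
  obj
}

# Hypothetical calc_*-style function: converts counts to per-column proportions.
calc_props_sketch <- function(obj, data, out = NULL) {
  counts <- obj$data[[data]]
  result <- sweep(counts, 2, colSums(counts), "/")
  if (is.null(out)) {
    return(result)            # no `out`: just return the table
  }
  obj$data[[out]] <- result   # `out` given: also store the table under that name
  invisible(result)           # same return type, but not printed automatically
}

obj <- make_obj(list(otu_abund = matrix(c(1, 3, 2, 2), nrow = 2)))

props <- calc_props_sketch(obj, "otu_abund")          # returned, not stored
calc_props_sketch(obj, "otu_abund", out = "otu_prop") # stored, returned invisibly
names(obj$data)
```

Both call styles return the same table, so code that assigns the result keeps working whether or not `out` is supplied.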

3)

This relates to 2 as well. If we added an option like that above, then there could not be a default for table name, since the default would be to not add the table, but return it, like is currently done. I kind of like forcing the user to come up with their own name in this case. If we added an R6 method, there could be a default table name:

obj$calc_*()

taxa has R6 variants of all its functions for modifying data without needing to use the returned value, but I think that might confuse the average R user. See https://adv-r.hadley.nz/r6.html#adding-methods-after-creation. If there was a default table name, I would like it to be either always the same or add a consistent suffix to the input table name. Ideally the user should be choosing the name anyway, so they make something that makes sense to them.
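If a default name were derived by adding a consistent suffix to the input table name, the rule could be as small as the sketch below. `default_out_name` and the suffix values are hypothetical and only illustrate the idea.

```r
# Hypothetical rule: default output name = input table name + "_" + fixed suffix.
default_out_name <- function(data, suffix) {
  stopifnot(is.character(data), length(data) == 1, nzchar(suffix))
  paste(data, suffix, sep = "_")
}

default_out_name("otu_abund", "props")
default_out_name("otu_abund", "taxon_abund")
```

Because the suffix depends only on the function being called, the output name is predictable from the call alone.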

> I'm working on some of this in a private repository for one of my projects. Maybe when we finish the package I can present to you what I've done and we can work from there?

You have them in another R package? That sounds good. Yea, once you have them ready, let me know and I will look at them. We can then either add them to metacoder, or leave them in another package.

> I've also created some functions based on my suggestions above:

Those sound useful, but perhaps too workflow-specific. I am trying to keep metacoder like a tool kit and not hold the user's hand too much. Perhaps those and other such functions could go in a microbiomeWorkflows package that focuses on quickly making workflows using metacoder and phyloseq for microbiome projects? I would be interested in helping with something like that, but I probably won't make it myself any time soon, since I am more interested in making tools than workflows.

4) metacoder::calc_stat?

Turns out I already have a function called metacoder::calc_group_stat, which does per-row calculations with optional grouping, given a function as an argument. I do not have a function that does per-column transformations, like calc_obs_props but more abstract and taking a function as an argument. How about this:

  • calc_row_stat (currently calc_group_stat): operates on rows, possibly grouped by column attributes. The user-supplied function takes multiple values from a single row and returns a single value. Example usage: sum OTU counts by sample type.
  • calc_col_stat: operates on columns, possibly grouped by row attributes. The user-supplied function takes multiple values from a single column and returns a single value. Example usage: Get the number of OTUs in each family with non-zero abundance (using the family rank name as a per-row/OTU attribute).
  • calc_row_trans: operates on rows, possibly grouped by column attributes. The user-supplied function takes multiple values from a single row and returns the same number of values. Example usage: transform proportions into standard deviations from a sample mean (columns grouped by sample).
  • calc_col_trans: operates on columns, possibly grouped by row attributes. The user-supplied function takes multiple values from a single column and returns the same number of values. Example usage: calculate proportions from counts when two OTU tables have been rbinded together with a column identifying the source OTU table (grouped by OTU table source).
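To illustrate the difference between the "stat" and "trans" variants, here is a base R sketch of two of the four proposed functions, working on a plain matrix. The `_sketch` names, arguments, and toy data are assumptions for illustration and do not reflect how metacoder implements calc_group_stat or any future function.

```r
# "stat" on rows: `fun` reduces the values of one row (within a column group) to one value.
row_stat_sketch <- function(x, fun, groups = NULL) {
  if (is.null(groups)) groups <- rep("all", ncol(x))
  stopifnot(length(groups) == ncol(x))
  ids <- unique(as.character(groups))
  out <- matrix(NA_real_, nrow = nrow(x), ncol = length(ids),
                dimnames = list(rownames(x), ids))
  for (g in ids) {
    out[, g] <- apply(x[, groups == g, drop = FALSE], 1, fun)
  }
  out
}

# "trans" on columns: `fun` maps the values of one column (within a row group)
# to the same number of values.
col_trans_sketch <- function(x, fun, groups = NULL) {
  if (is.null(groups)) groups <- rep("all", nrow(x))
  stopifnot(length(groups) == nrow(x))
  out <- x
  for (g in unique(as.character(groups))) {
    rows <- groups == g
    for (j in seq_len(ncol(x))) out[rows, j] <- fun(x[rows, j])
  }
  out
}

counts <- matrix(
  c(1,  2, 3,
    3,  2, 1,
    5,  0, 5,
    5, 10, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("otu_", 1:4), c("s1", "s2", "s3"))
)
sample_type <- c("leaf", "root", "leaf")  # one entry per column
source_table <- c("a", "a", "b", "b")     # one entry per row

# Sum OTU counts by sample type.
by_type <- row_stat_sketch(counts, sum, groups = sample_type)

# Proportions computed separately within each source OTU table.
props <- col_trans_sketch(counts, function(v) v / sum(v), groups = source_table)
```

The "stat" result has one column per group, while the "trans" result keeps the shape of the input, which matches the distinction drawn in the list above.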

5)

You should be able to get that info from all_names already, if those are columns in a table. All column names in all tables are in all_names(). Does this work for you or is there a bug somewhere?

Thanks!

zachary-foster, Nov 19 '18 22:11