F1000_workflow
Identifying outliers
Hello,
I have a question about Figure 8: how did you identify the outliers by their sample IDs? Also, why did you choose the 12th column for the rel_abund data frame?
Also, when I tried to install Shiny-phyloseq because I read that I can detect outliers using the app, I encountered an issue. I've attempted to install it both automatically and manually. The error was produced using the manual method.
Hi @Ninahuadiy. I think once we noticed the outliers in the figure, we looked those coordinates up in the original out.wuf.log object, ordering rows by their value on Axis 2, for example. Once we knew those samples, we could look at the original data, which is what that 12th column is. I can check in more detail if you'd like, but I think this is what's happening.
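A rough sketch of that lookup, in case it helps. The coordinates and sample names below are made up, and the data frame is only a stand-in for the sample scores you would pull out of the real ordination object (how you extract them depends on how the ordination was computed, so treat that step as an assumption).

```r
# Hypothetical sample coordinates (NOT the workflow's data):
# one row per sample, one column per ordination axis.
coords <- data.frame(
  Axis.1 = c(0.02, -0.05, 0.01, 0.04, -0.03),
  Axis.2 = c(0.01, -0.02, 0.45, 0.03, -0.01),
  row.names = c("S1", "S2", "S3", "S4", "S5")
)

# Order rows by their value on Axis 2, largest first, so that
# samples sitting far from the rest rise to the top.
ranked <- coords[order(coords$Axis.2, decreasing = TRUE), ]
print(head(ranked, 3))

# The row name of the most extreme sample is its sample ID.
outlier_id <- rownames(ranked)[1]
```

If the outliers are at the negative end of the axis instead, drop decreasing = TRUE or look at the bottom rows.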
Hm, for the Bioconductor installation, were you running source("http://bioconductor.org/biocLite.R")? Might it have been an internet connection error?
And the columns are the genetic code for the RSVs? So, did you look through all of the RSV columns and count the samples that fell under each RSV to determine which RSV dominated the outliers?
And the shiny app installation worked! I haven’t explored it fully yet, but I’ll let you know if I encounter another issue.
Mhm, each column is an RSV. But it's not actually hard to see which RSV dominates samples. For example, once you know sample F15D165 is an outlier, you can identify the top 10 RSVs in that sample using
order(otu_table(ps)["F15D165", ], decreasing = TRUE)[1:10]
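Note that order() gives back column positions rather than RSV names. Here is a small base R sketch of turning those positions into names, using a toy matrix with invented counts and hypothetical RSV labels (samples in rows, RSVs in columns, as in the call above):

```r
# Toy abundance matrix with invented counts; RSV_a..RSV_d are placeholders.
abund <- matrix(
  c(5, 120, 3, 40,
    60,  2, 80,  1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("F15D165", "other_sample"),
                  c("RSV_a", "RSV_b", "RSV_c", "RSV_d"))
)

# order() returns the column positions, largest count first.
top_idx <- order(abund["F15D165", ], decreasing = TRUE)[1:3]

# Index the column names to see which RSVs those positions refer to.
top_rsvs <- colnames(abund)[top_idx]

# Fraction of the sample's total count taken by each of the top RSVs.
top_share <- abund["F15D165", top_idx] / sum(abund["F15D165", ])
print(round(top_share, 3))
```

A single RSV with a very large share is what "dominates the sample" would look like here.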
Also, I took a look at the preprocessing.R script and remembered that @jfukuyama had written a function (select_ggplot) that lets you identify the sample IDs associated with points in the figure, so it's more straightforward than what I wrote earlier. Just save the plot_ordination result for Figure 8 into an object (say, called p), and run select_ggplot(p).
I ended up using the "label" argument to plot the sample IDs onto the chart since I'm working with a small subset of my data for now. I used the select_ggplot function, but I received the following error when I selected a point and hit "Done":
For the ccpna plot portion, I was hoping to use facet_grid to split the plots by the following emotion categories: good, stressed, and tired. However, it didn't split the plot, and I'm left with NA as the title.
I did change the "ccpna-join-data" section because I couldn't join the species and tax tables. I created an otu_id column for the tax table and did a left join using that column.
Hm, the use for select_ggplot might not be entirely obvious: did you brush over points and click "add points" before clicking done? Otherwise, it will print an empty data frame (like what you showed).
For the second part of your question, I can only give general ideas, because your data are different. I would check that the emotion column is not all NA (it's possible that column wasn't joined correctly?), and also that the facet_grid call looks something like facet_grid(emotion ~ .).
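Something along these lines is what I have in mind. This is only a sketch with made-up data: sites, meta, sample_id, CCA1 and CCA2 are placeholder names, not the ones from the workflow or from your data.

```r
# Hypothetical sample coordinates and sample metadata.
sites <- data.frame(sample_id = c("S1", "S2", "S3"),
                    CCA1 = c(0.1, -0.4, 0.3),
                    CCA2 = c(0.2, 0.1, -0.5))
meta <- data.frame(sample_id = c("S1", "S2", "S3"),
                   emotion = c("good", "stressed", "tired"))

# Left join on the shared key; all.x = TRUE keeps every row of sites.
joined <- merge(sites, meta, by = "sample_id", all.x = TRUE)

# If this is TRUE, emotion did not come across in the join,
# and the facets would collapse into a single NA panel.
emotion_all_na <- all(is.na(joined$emotion))
print(table(joined$emotion, useNA = "ifany"))

# Build the faceted plot only if ggplot2 is installed:
# one panel row per emotion level.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(joined, aes(x = CCA1, y = CCA2)) +
    geom_point() +
    facet_grid(emotion ~ .)
}
```

If the table shows only NA, the keys in the two data frames probably don't match (different column names, or IDs formatted differently), so that is the first thing to compare.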
It identified it for one of the points, but I think I'll test it out again when I receive the rest of my sequenced data.
And thank you for pointing out the column! The sites data frame didn't end up receiving the column for emotion, so I just replaced it.