F1000_workflow

Identifying outliers

Open Ninahuadiy opened this issue 8 years ago • 6 comments

Hello,

I have a question about how you identified the outliers in Figure 8 by their sample IDs. Also, why did you choose the 12th column for the rel_abund data frame?

Also, I read that the Shiny-phyloseq app can be used to detect outliers, but I ran into an issue when I tried to install it. I attempted both the automatic and the manual installation; the error shown here came from the manual method. [Screenshot: screen shot 2017-01-16 at 9 45 59 pm]

Ninahuadiy • Jan 15 '17 23:01

Hi @Ninahuadiy. I think once we noticed the outliers in the figure, we looked those coordinates up in the original out.wuf.log object, ordering rows by their value on Axis 2, for example. Once we knew those samples, we could look at the original data, which is what that 12th column is. I can check in more detail if you'd like, but I think this is what's happening.
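A minimal sketch of the lookup described above, using a toy matrix in place of the real ordination scores. It assumes the scores are stored with samples as rows and axis columns named Axis.1 and Axis.2, which is an assumption about the layout rather than something confirmed in this thread; the sample IDs and values are made up for illustration.

```r
# Toy stand-in for the ordination scores (samples in rows, axes in columns).
# All IDs and values are hypothetical.
scores <- matrix(
  c(0.10, -0.05, 0.02,  0.08, -0.12,   # Axis.1
    0.01,  0.03, 0.95, -0.02,  0.88),  # Axis.2
  ncol = 2,
  dimnames = list(c("S01", "S02", "S03", "S04", "S05"),
                  c("Axis.1", "Axis.2"))
)

# Order rows by their value on Axis 2, largest first.
by_axis2 <- scores[order(scores[, "Axis.2"], decreasing = TRUE), ]

# The samples at the extreme of that axis are the candidate outliers.
outlier_ids <- rownames(by_axis2)[1:2]
print(outlier_ids)
```

Once the IDs are known, the same row names can be used to pull those samples out of the original abundance table.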

Hm, for the Bioconductor installation, were you running source("http://bioconductor.org/biocLite.R")? Might it have been an internet connection error?

krisrs1128 • Jan 21 '17 03:01

And the columns are the genetic code for the RSVs? So, did you look through all of the RSV columns to see how many samples fell under one specific RSV, in order to determine which RSV dominated the outliers?

And the shiny app installation worked! I haven’t explored it fully yet, but I’ll let you know if I encounter another issue.

ninaxhua • Jan 21 '17 20:01

Mhm, each column is an RSV. But it's not actually hard to see which RSV dominates samples. For example, once you know sample F15D165 is an outlier, you can identify the top 10 RSVs in that sample using

order(otu_table(ps)["F15D165", ], decreasing = TRUE)[1:10]
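Note that order() returns column positions rather than RSV names. Here is a small sketch of how those positions could be mapped back to names, using a plain matrix as a stand-in for otu_table(ps) and assuming samples are rows and RSVs are columns. The counts, the second sample, and the RSV names are invented for illustration.

```r
# Toy stand-in for otu_table(ps): samples in rows, RSVs in columns.
# Counts and RSV names are hypothetical.
counts <- matrix(
  c(5, 120, 0, 40,
    9,   3, 7,  1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("F15D165", "OTHER"),
                  c("rsv_a", "rsv_b", "rsv_c", "rsv_d"))
)

# Positions of the most abundant RSVs in the sample of interest.
top_idx <- order(counts["F15D165", ], decreasing = TRUE)[1:3]

# Map the positions back to RSV names, and show their counts.
top_rsvs <- colnames(counts)[top_idx]
print(counts["F15D165", top_idx])
```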

Also, I took a look at the preprocessing.R script, and remembered that @jfukuyama had written a function (select_ggplot) that lets you identify the sample IDs associated with points in the figure (so, it's more straightforward than what I wrote earlier). Just save the plot_ordination output for Figure 8 into an object (say, called p), and run select_ggplot(p).

krisrs1128 • Jan 21 '17 20:01

I ended up using the "label" argument to plot the sample IDs onto the chart, since I'm working with a small subset of my data for now. I also tried the select_ggplot function, but I received the following error when I selected a point and hit "Done": [Screenshot: screen shot 2017-01-21 at 3 22 04 pm]

For the ccpna plot portion, I was hoping to use facet_grid to split the plots by the following emotion categories: good, stressed, and tired. However, it didn't split the plot, and I'm left with NA as the title. I did change the "ccpna-join-data" section because I couldn't join the species and tax tables: I created an otu_id column for the tax table and did a left join using that column. [Image: ccpna]

ninaxhua • Jan 21 '17 21:01

Hm, the use for select_ggplot might not be entirely obvious -- did you brush over points and click "add points" before clicking done? Otherwise, it will print an empty data frame (like what you showed).

For the second part of your question, I can only give general ideas, because your data are different. I would check that the emotion column is not all NA (it's possible that column wasn't joined correctly?), and also that the facet_grid call looks something like facet_grid(emotion ~ .).
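A sketch of that check in base R, with made-up data: a left join on a key that does not match leaves the joined column entirely NA, which is one way a facet variable could end up as NA. The data frames, column names (sample_id, CCA1, CCA2), and IDs are all hypothetical stand-ins, not the objects from the workflow.

```r
# Hypothetical site scores and sample metadata; note the mismatched ID case.
sites <- data.frame(sample_id = c("S01", "S02", "S03"),
                    CCA1 = c(0.2, -0.1, 0.4),
                    CCA2 = c(0.0, 0.3, -0.2))
meta_bad  <- data.frame(sample_id = c("s01", "s02", "s03"),
                        emotion = c("good", "stressed", "tired"))
meta_good <- transform(meta_bad, sample_id = toupper(sample_id))

# Left join: keep every row of `sites`, even when there is no match.
joined_bad  <- merge(sites, meta_bad,  by = "sample_id", all.x = TRUE)
joined_good <- merge(sites, meta_good, by = "sample_id", all.x = TRUE)

# TRUE here means the join did not bring the emotion column over.
bad_all_na  <- all(is.na(joined_bad$emotion))
good_all_na <- all(is.na(joined_good$emotion))
print(c(bad = bad_all_na, good = good_all_na))

# With a populated column, faceting by row could look like this
# (only runs if ggplot2 is installed).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(joined_good, ggplot2::aes(x = CCA1, y = CCA2)) +
    ggplot2::geom_point() +
    ggplot2::facet_grid(emotion ~ .)
}
```

Checking the joined column before plotting separates a data problem (the join) from a plotting problem (the facet_grid call).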

krisrs1128 • Jan 21 '17 23:01

It identified it for one of the points, but I think I'll test it out again when I receive the rest of my sequenced data.

And thank you for pointing out the column! The sites data frame didn't end up receiving the column for emotion, so I just replaced it.

ninaxhua • Jan 22 '17 09:01