
Produce robust clr matrix as output.

davised opened this issue 4 years ago • 6 comments

Some users would benefit from being able to access the rclr transformed values, including myself.

I wrote this quick addition to return the rclr matrix and write it out as a TSV; a short sketch of the usage is at the end of this comment.

See issue #51 as an example.

I've been unhappy with the clr transforms implemented in R, but so far I'm happy with this one.

Thanks for putting in the time to implement the matrix completion and rclr functions.
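
For reference, a minimal sketch of the usage this enables (the three-value return from auto_rpca is this PR's addition, not the released API, and the file names are placeholders):

import pandas as pd
from biom import load_table
from deicode.rpca import auto_rpca

# run RPCA; robust_clr is the rclr-transformed matrix this PR exposes
table = load_table('table.biom')
ordination, distance, robust_clr = auto_rpca(table)

# write the rclr matrix out as a TSV
pd.DataFrame(robust_clr).to_csv('rclr_matrix.tsv', sep='\t')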

davised avatar May 12 '20 18:05 davised

Thanks for this addition, very useful. Using the rclr output matrix, I compared imputed vs. non-imputed data for a feature with no missing values. For such features, I would expect no change after imputation. However, all values seem to be different after imputation, with no correlation to the original data. Is this expected behavior, and if yes, why? Or could feature IDs get swapped along the way? Many thanks in advance for any clarifications!

[image: scatter plot of imputed vs. non-imputed values for a feature with no missing values]

madeleineernst avatar Feb 22 '21 16:02 madeleineernst

Hi Madeleine, could you include a subset of the data to give us a better idea of what we are looking at?

If it helps, RCLR makes a missing-at-random assumption, so it is possible that there are microbes/metabolites at high abundance that aren't detected due to some random process. Depending on how you are simulating / stratifying your data, you could be seeing scenarios where this assumption is clearly violated.

I should also mention that RCLR performs a log-ratio transformation, which is why there are some negative values here.
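
A toy example (made-up numbers, not from this data) of what rclr does for a single sample: it takes the log of only the observed (non-zero) values and centers them by their mean, leaving zeros as missing, so anything below the geometric mean of the observed values comes out negative:

import numpy as np

counts = np.array([10.0, 100.0, 0.0, 1.0])  # one sample; the 0 is treated as missing
logs = np.log(counts[counts > 0])           # log of observed values only
print(logs - logs.mean())                   # centered log-ratios; note the negative entry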


mortonjt avatar Feb 22 '21 16:02 mortonjt

Hi Jamie,

Many thanks for your quick reply. Unfortunately, I won't be able to post a subset of the actual data here due to data privacy issues. However, I have created a random subset of features and blank samples that illustrates my question. For example, if I plot feature '1' of the attached data table before and after imputation, I would expect perfect correlation; in other words, I don't expect any change in the data structure, as feature '1' did not contain any 0 values. I have also tried plotting the rclr-transformed original table (without imputation) against the imputed values, but the same pattern persists. It seems that imputation changes the data structure even for features where no imputation is needed?

Best, Madeleine

Attachment: Blanks.txt

import numpy as np
import matplotlib.pyplot as plt
from biom import load_table
from deicode.rpca import auto_rpca

# run RPCA and keep the rclr matrix returned by this PR
table = load_table('Blanks.biom')
ordination, distance, robust_clr = auto_rpca(table)
table = table.to_dataframe().T

# feature '1' before and after
x = robust_clr['1']
y = table['1']

print(np.corrcoef(x, y))

plt.scatter(x, y)
plt.title('Imputed vs non-imputed data')
plt.xlabel('Imputed')
plt.ylabel('Non-imputed')
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color='red')
plt.show()

madeleineernst avatar Feb 23 '21 15:02 madeleineernst

Hi @madeleineernst,

There are a few things going on here, in addition to @mortonjt's valid points. This is a great question.

1. The data is transformed before RPCA is run. To transform only the non-zero values we use an altered clr transform; we go into depth on how that transformation is done in the paper (here), in the section "Preprocessing with rclr". So we need to compare only the observed (non-zero) values between the unimputed rclr and the output from RPCA.
2. The RPCA function was not really meant to be used for imputation, so we will need to alter this PR a bit to expose this as a standalone function. The correlation is likely being distorted in the re-centering step after running the MatrixCompletion function (see here).

If you wanted to compare the observed (non-zero) values 1:1 as an imputation, you would need to do the following:

import numpy as np
import pandas as pd
from biom import load_table
from deicode.preprocessing import rclr
from deicode.matrix_completion import MatrixCompletion
import matplotlib.pyplot as plt


# import table
table = load_table('Blanks.biom')

# Robust-clr (rclr) preprocessing and OptSpace (i.e. RPCA)
# As you increase the rank (n_components) the correlation will get tighter (and vice versa).
# ^ See the low-rank assumption for why. 
opt = MatrixCompletion(n_components=10).fit(rclr(table.to_dataframe().T)) 
# reconstruct imputed data
robust_clr_imputed = opt.sample_weights @ opt.s @ opt.feature_weights.T

# do pre-processing step within RPCA
table = table.to_dataframe().T
robust_clr_unimputed = rclr(table)

# use this to mask any "imputed" values
zero_mask = (table == 0.0)

# re-mask imputed values for comparison
robust_clr_imputed_masked = robust_clr_imputed.copy()
robust_clr_imputed_masked[zero_mask] = np.nan
 
# flatten and keep only observed
x = robust_clr_unimputed.ravel()
y = robust_clr_imputed_masked.ravel()
x = x[np.isfinite(x)]
y = y[np.isfinite(y)]

# print the correlation between values
print(np.corrcoef(x, y))

# plot the observed values
plt.scatter(x, y)
plt.title('Observed values from imputed vs. non-imputed data')
plt.xlabel('Observed Imputed')
plt.ylabel('Observed Non-imputed')
plt.plot(np.unique(x),
         np.poly1d(np.polyfit(x, y, 1))(np.unique(x)),
         color='red')
plt.show()

giving

[[1.         0.96985866]
 [0.96985866 1.        ]]

[image: scatter of observed values from imputed vs. non-imputed data, with fitted line]

cameronmartino avatar Feb 24 '21 02:02 cameronmartino

Hi @cameronmartino, 

That was very useful, many thanks for the thorough clarifications and code!

My original data matrix consists of approx. 2,000-3,000 features and 600 samples. It turns out that if I increase the n_components parameter to 100, I can reach a correlation between imputed and non-imputed non-zero values of up to 0.88, while at the same time being able to impute more than 65% of the sparse data table. If I use an n_components of 10, the correlation is only 0.7.

Would you recommend optimizing the n_components parameter until a satisfactory correlation is achieved while still imputing a reasonable amount of data, even if n_components gets as high as 200-300 or more? I understand that the rank can be at most 600 for my 600 × 3000 matrix (for an m × n matrix A, rank(A) ≤ min(m, n)). In the documentation you suggest n_components be between 1 and 10.

madeleineernst avatar Feb 24 '21 15:02 madeleineernst

Glad that helped @madeleineernst. Just like all dimensionality reduction methods (e.g. PCoA, PCA, etc.), the OptSpace method used here depends on a low-rank assumption. That means we assume there are some latent categories; for example, in the microbiome a latent category might be sample type, or in the case of movies, genre. You are correct that the maximum rank (n_components) is the smallest dimension of the input matrix. The low-rank assumption is essential for making the problem tractable at scale: increasing the rank will dramatically increase the runtime. That is why we suggest keeping it low (below 10), but your estimates may get better as you increase the rank. All of this is why we really focused on this tool as a dimensionality reduction rather than an imputation method.
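
For reference, a rough sketch of scanning n_components and checking the correlation on observed values only (illustrative only: the candidate ranks and file name are placeholders, and the masking reuses the comparison approach from the code above):

import numpy as np
from biom import load_table
from deicode.preprocessing import rclr
from deicode.matrix_completion import MatrixCompletion

# pre-process once; rclr leaves unobserved (zero) entries as NaN
table = load_table('Blanks.biom').to_dataframe().T
unimputed = np.asarray(rclr(table), dtype=float)
observed = np.isfinite(unimputed)  # originally observed (non-zero) entries

for k in (3, 10, 25, 50):  # candidate ranks; runtime grows quickly with k
    opt = MatrixCompletion(n_components=k).fit(rclr(table))
    imputed = opt.sample_weights @ opt.s @ opt.feature_weights.T
    r = np.corrcoef(unimputed[observed], np.asarray(imputed)[observed])[0, 1]
    print('n_components=%d: observed-value correlation=%.3f' % (k, r))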

cameronmartino avatar Mar 01 '21 16:03 cameronmartino