HiCExplorer icon indicating copy to clipboard operation
HiCExplorer copied to clipboard

hicPCA and compartmentalization. Setting interaction range.

Open kimj50 opened this issue 4 years ago • 7 comments

Hi, Thank you for the amazing tool. I was wondering if it is possible to compute hicPCA or hicCompartmentalization on matrix with limited interaction range (ex. use interactions between loci less than 5mb). This largely comes from problem that is specific our system: the segments of the compartments are super tiny, and the checker-board pattern falls off quickly with increasing genomic distance. So we have to use fairly small bin size (no greater than 10kb) for PCA to work, but this also increases noise in long-range interactions. Heres a snap shot of 2mb region. image We've dealt with this issue by combination of sequencing more reads and subseting matrix using hicAdjustMatrix, but we don't like that we are computing PCA for different regions of chromosomes separately. So, I was hoping if you guys had some insights on this. Thank you, Jun

kimj50 avatar Dec 11 '20 16:12 kimj50

Hi Jun,

so you basically need a --maximalRegion parameter that considers only the interactions for this genomic distance?

I need to check the source and the mathematical properties to see if this is possible.

Best,

Joachim

joachimwolff avatar Dec 14 '20 08:12 joachimwolff

Yes! thank you - Jun

kimj50 avatar Dec 14 '20 14:12 kimj50

I had the same issue here: https://www.cell.com/cell/pdf/S0092-8674(20)30621-8.pdf (early drosophila embryos with low compartmentalization). We ended-up computing PCA for submatrices of pre-defined size (near the diagonal, tried different sizes) with a sliding window approach and stitched the resulting eigen vectors. The advantage of sliding window is that you can compare if the eigenvector "signal" is similar or not between overlapping regions, and they most likely are very similar, so it's safe to stitch the eigenvector values of adjacent windows. The method is described in the supplementary of that paper. I might have a jupyternotebook about this lying around.

gtrichard avatar Jan 05 '21 06:01 gtrichard

Oh man. This is exactly what we were looking for! Finding the proper values for sliding window (a,a+w) will be important for our system, c elegans. Conceptually I can understand the method section, but I come from biology background, so I will definitely have a lot of trouble executing the method section. Could you share with us if you find that notebook? Thanks! - Jun

kimj50 avatar Jan 05 '21 17:01 kimj50

This is a very, very bad idea.

PCA is not numerical addition/multiplication. It is a matrix transformation, so when you do a subset of your data, you change the space that you are operating in, and therefore the decomposition (which is what the PCA is, essentially) get modified in unpredictable ways. A Compartment analysis of a subset of a chromosome does not necessarily retain the Compartment done on the whole chromosome arm.

There are algorithm to approximate PCAs for large matrix. This is not the way at all. See for yourself. I did a 8x8 vs 4x4 example.

import random
import numpy as np
from sklearn.decomposition import PCA

#set seeds for reproducible results
random.seed(16447)
numbers = []
#size of full dim matrix
n=8
for x in range(0,n*n):
    numbers.append(random.gauss(0,3))

y = np.array(numbers).reshape((n,n))
# for simplicity, do Y^T * Y for a symmetric matrix
full_mat = np.dot(y.T, y)

# chose the first 4 column and rows 
partial_mat = full_mat[0:4,0:4]

#fit PCA for both matrices
pca_partial = PCA(n_components=4)
pca_full = PCA(n_components=n)

pca_partial.fit(partial_mat)
#check variance, singular value, PC
print(pca_partial.explained_variance_)
print(pca_partial.explained_variance_ratio_)
print(pca_partial.singular_values_)
print(pca_partial.components_)

#none of them match
pca_full.fit(full_mat)
print(pca_full.explained_variance_)
print(pca_full.explained_variance_ratio_)
print(pca_full.singular_values_)

mxw010 avatar Dec 13 '21 00:12 mxw010

I am not sure if this about our tool or the latest comment on this issue. If you are talking about how we handle hicPCA after hicAdjustmatrix , this is not about the size of matrix but about dealing with the biological question we are trying to answer. The reason to remove a region of chromosome is to remove a bias from the not very well assembled parts or very repetitive centromeric regions. If those regions well be kept, we will introduce variation which are nothing but artifact. Of course is up to the user to be careful about what part of genome they are removing.

LeilyR avatar Dec 13 '21 08:12 LeilyR

Thanks for your contribution, but I don't get your point. A different input will lead to a different result, that's quite clear. However, even if the result is not 1:1 comparable to a PCA on the full matrix, and also if the result is something else in mathematical terms, it might still be a valid result in terms of biological interpretation. It's then just a different algorithm. You always need to verify your results with orthogonal data, and if that shows it's valid, the approach can be used. Therefore I am not sure what kind of issue you are raising here.

joachimwolff avatar Dec 13 '21 13:12 joachimwolff