clust Subclustering: inconsistent co-abundance pattern

Hello,

I am testing a subclustering approach, as described below:

1. Run Clust on dataset X (1 tabular file).
2. Get the genes from an interesting cluster Y.
3. Create a new dataset Z from dataset X that contains only the genes from cluster Y.
4. Run Clust on dataset Z with increased tightness (-t ~50) to try to get genes with a more refined behavior from cluster Y.
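
The subsetting step (building Z from X) can be sketched with pandas. This is only a minimal illustration and not part of Clust: the toy matrix `X`, the gene IDs, and the list `cluster_y_genes` are hypothetical stand-ins for the real input file and for the cluster membership taken from Clust's output.

```python
import pandas as pd

# Hypothetical toy version of dataset X: genes as rows, time points as columns.
X = pd.DataFrame(
    {"t1": [1.0, 5.0, 2.0, 9.0],
     "t2": [2.0, 4.0, 3.0, 9.5],
     "t3": [3.0, 3.0, 4.0, 8.0]},
    index=pd.Index(["geneA", "geneB", "geneC", "geneD"], name="Genes"),
)

# Hypothetical list of gene IDs assigned to cluster Y.
cluster_y_genes = ["geneC", "geneA"]

# Dataset Z: the rows of X whose gene ID is in cluster Y (row order of X is kept).
Z = X[X.index.isin(cluster_y_genes)]

print(Z)
# Z could then be written to a tab-delimited file for the second run, e.g.
# Z.to_csv("Z.tsv", sep="\t")  # hypothetical file name
```

Because Z holds the original values of X, any normalization applied in the second run is computed on the subset only and may differ from what was applied to the full dataset.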

The idea is to isolate genes that, for example, increase at every step of a time course.

Curiously, I ran into a situation in which the newly generated sub-cluster contains a series of genes whose behavior is basically the opposite of the one shown by the original cluster. I am attaching images of the 'original' cluster and the 'sub-cluster'.

Original cluster (163 genes). The colored lines represent the genes that appear in the new sub-cluster below.
[image: plotcluster8]

Sub-cluster (18 genes).
[image: sample_to_clust_issue2]

The behavior is more or less opposite depending on which cluster you look at, but the genes and the original dataset are the same.

Am I missing something here in terms of interpretation? Do you have any idea what could be happening here?

Maybe this is a more theoretical question regarding statistics and the way the data is being normalized and how the clusters are being created. I would be grateful for any input regarding these doubts.

This is an exploratory study for selecting candidates for a targeted approach later on, so I would not want to rely on candidates whose behavior changes depending on the data analysis approach.

Many thanks in advance for any support!

Best, Miguel

Mar 31 '20 13:03 MiguelCos

Hi Miguel,

I might not be able to help with your question, but can you check the Processed_data file to see what happened to your input data after normalization by clust? Does that explain anything that could have happened? See also my issue on GitHub, in which the data normalized by clust also show strange behavior compared to the imported data. However, I do not see "opposite" patterns as you show here.

Jun 20 '20 12:06 incle440

@MiguelCos @incle440 I too experienced inconsistent results between the cluster profiles shown in the output PDF and the profiles I viewed manually for entities in an assigned cluster. My normalization codes were 101, 4 (quantile norm, z-score norm).

I then ran without QN and just the z-norm and got consistent results. Thus on a cursory glance it seems the QN is the culprit. @MiguelCos could you confirm if you also had code 101?

Nov 05 '20 23:11 ijhoskins

The issue (see #error Processed_Data file (normalized data)) remains for me as well.

The clustered profiles obtained from the output PDF and the normalized data from the Processed_Data file are inconsistent and show opposite expression patterns.

I used 101 31 4 for normalization.

Could @BaselAbujamous help please?

Nov 06 '20 10:11 incle440

Hello,

I am sorry for taking so long to answer. At the time of posting this issue, I was performing several exploratory analyses that were more or less abandoned some time afterwards, and now I cannot find the exact output that produced these results, so I can't confirm 100%. I can say, though, that at that moment I wasn't playing with the normalization options. For these first tests I just kept the normalization on default mode and modified the -t input.

One thing that 'solved' my problem with this apparent inconsistency was that, for the subsequent subclustering, I used the normalized abundance values as output by the first Clust run rather than the original data, and this time I ran with no normalization. This avoided a second normalization, so the quantitative values stayed the same at this sub-step and the co-expression classification remained consistent.

Nov 06 '20 15:11 MiguelCos

Hi, I just got a similar issue using clust v1.12.0. I got a cluster like this:

Then I wanted to extract the genes in C3 for further analysis, but found that 4 of the 95 genes show the opposite trend, like this one.

_processed.csv:
Genes   t1    t2     t3
Tagln2  1.38  -0.41  -0.96

input tpm:
ID      t1_1    t1_2    t1_3    t1_4    t2_1     t2_2     t2_3     t2_4     t3_1     t3_2     t3_3     t3_4
Tagln2  614.08  674.89  684.78  639.18  1012.77  1098.41  1188.32  1112.88  1114.07  1138.98  1019.55  1194.58
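
As a quick sanity check on the Tagln2 numbers above, the sketch below averages the four replicates per time point, takes log2, and z-scores the result. Averaging the replicates and this order of operations are assumptions made for illustration, not a statement of what Clust does internally.

```python
import numpy as np

# Tagln2 input TPM values from the post: 3 time points x 4 replicates.
tpm = np.array([
    [614.08, 674.89, 684.78, 639.18],      # t1
    [1012.77, 1098.41, 1188.32, 1112.88],  # t2
    [1114.07, 1138.98, 1019.55, 1194.58],  # t3
])

# Assumption: replicates are summarised by their mean.
means = tpm.mean(axis=1)

# log2 followed by a z-score across the three time points.
logged = np.log2(means)
z = (logged - logged.mean()) / logged.std()

print("replicate means:", means)
print("z-scores:", z)
```

Under these assumptions the z-scored profile rises from t1 to t3, whereas the reported processed values (1.38, -0.41, -0.96) fall, which matches the opposite trend described in the post.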

I'm now exploring the code to find out what happened ...

Jul 05 '21 10:07 Ruismart

Hi @Ruismart
thanks for your message. This confirms my observations; an opposite trend between the pattern in the plot and the normalized data. It would be great to have this figured out as I don't have the skills to check the code and solve this issue.

Kind regards, Inge

Jul 05 '21 16:07 incle440

@incle440

It seems to be caused by the '101' in '-n 101 31 4', so it might not be a real bug.

I have checked a few .py files under 'scripts' hoping to find out if some code would change the expression value.

'preprocess_data.py' defines 'normaliseSampleFeatureMat(X, type)' to normalize the input data.
'101' means quantile normalization, which sorts the gene values within each sample, takes the mean across samples at each rank, and then puts those means back in the original gene order, so that all samples end up with the same distribution. Genes in regions of sharp changes can end up with very different values.
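
To illustrate how quantile normalization can flip a gene's trend, here is a minimal NumPy sketch of the textbook procedure described above. It is not Clust's actual implementation; the function name `quantile_normalize` and the toy matrix are hypothetical, and ties are broken arbitrarily.

```python
import numpy as np

def quantile_normalize(X):
    """Textbook quantile normalization of a genes x samples matrix."""
    X = np.asarray(X, dtype=float)
    # Rank of every value within its own sample (column), 0 = smallest.
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    # Mean across samples of the values that share the same rank.
    rank_means = np.sort(X, axis=0).mean(axis=1)
    # Put the rank means back in the original gene order.
    return rank_means[ranks]

# Hypothetical toy data: rows are genes A, B, C; columns are two samples.
X = np.array([
    [1.0, 10.0],   # gene A
    [2.0, 5.0],    # gene B: increases from sample 1 to sample 2
    [3.0, 20.0],   # gene C
])

Q = quantile_normalize(X)
print(Q)
```

In this toy case gene B increases in the raw data (2 to 5) but decreases after normalization (6 to 3), because its rank within the sample drops even though its value rises.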

In my case, the input data is an already processed and filtered TPM expression matrix, so I don't want quantile normalization, but '3' (log2) and '4' (z-score transform) are OK. Simply running without '101' avoids the inconsistency.

Just as ijhoskins commented ...

Jul 08 '21 04:07 Ruismart
