Question about gamma in scale_ELBOW
Hi, I am trying to apply this to the Pancancer dataset (sequencing-based gene expression) from the ICGC portal, but the result was not good: data points from donors with different cancers entangle with each other.
As I went through your code, I noticed
# log p(z|c)
logpzc = -0.5*torch.sum(gamma*torch.sum(math.log(2*math.pi) +
                                        torch.log(var_c) +
                                        torch.exp(logvar_expand)/var_c +
                                        (mu_expand-mu_c)**2/var_c, dim=1), dim=1)
# log p(c)
logpc = torch.sum(gamma*torch.log(pi), 1)
# log q(z|x) or q entropy
qentropy = -0.5*torch.sum(1+logvar+math.log(2*math.pi), 1)
# log q(c|x)
logqcx = torch.sum(gamma*torch.log(gamma), 1)
and I know that gamma = q(c|x) = p(c|z). Could you show me how to go from here to logpc and logqcx? I would not have expected gamma to appear in either term.
SCALE is designed for sparse scATAC-seq datasets, of course it can run scRNA-seq data if you add option "--log_transform" in your command. But I highly recommend you use SCALEX (https://github.com/jsxlei/SCALEX), an updated version especially for heterogenous scRNA-seq. If these won't solve your problem, could you provide more details about your data, e.g. shape, batch, and how you run the program?
As for gamma: gamma is p(c|z), so logpc is short for the expectation of log p(c) under q(z,c|x), and likewise logqcx is the expectation of log q(c|x) under q(c|x). e.g.

E_{q(z,c|x)}[log p(c)] = sum_c gamma_c * log pi_c

while

E_{q(c|x)}[log q(c|x)] = sum_c gamma_c * log gamma_c

which is exactly why gamma appears in both terms. Hope this helps!
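The logpc and logqcx lines can be checked numerically. Below is a minimal NumPy sketch with toy values (the numbers are illustrative, not from the thread or from SCALE), cross-checking the vectorized computation against an explicit sum over components:

```python
import numpy as np

# Toy values (not from the thread): 2 cells, 3 mixture components
gamma = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])   # q(c|x), rows sum to 1
pi = np.array([0.5, 0.3, 0.2])        # mixture prior p(c)

# E_{q(z,c|x)}[log p(c)] = sum_c gamma_c * log pi_c, per cell
logpc = np.sum(gamma * np.log(pi), axis=1)

# E_{q(c|x)}[log q(c|x)] = sum_c gamma_c * log gamma_c, per cell
logqcx = np.sum(gamma * np.log(gamma), axis=1)

# Cross-check the first cell against an explicit sum over c
expected = sum(gamma[0, c] * np.log(pi[c]) for c in range(3))
assert np.isclose(logpc[0], expected)
```

The `torch.sum(..., 1)` calls in the loss code do exactly this per-cell reduction over the component axis.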
Thank you for your reply!
The dataset I use is a subset of the bulk RNA-seq data from the ICGC portal (6 different cancer types). I run the program with n_centroids=6. Eventually, when I visualize the data encoded in the latent space, I find no obvious clusters for patients with different cancers, even though a vanilla autoencoder clusters the points nicely.
Here is my experimental setup
binary = False
latent = 128
encode_dim = [2048, 512]
decode_dim = [512, 2048]
input_dim = 20502
PS. I understand the equation now, but I think the two math-derivation pictures you posted are identical.
I think the reason is that the latent dimension you chose is quite large. Just use the default parameters plus --log_transform; it should work. At the same time, you can also try SCALEX with its default parameters. Let me know whether it solves your problem.
Sorry for putting up two identical pictures. The equation for the first term is:

E_{q(z,c|x)}[log p(z|c)] = -0.5 * sum_c gamma_c * sum_d [ log(2*pi) + log var_{c,d} + var_d / var_{c,d} + (mu_d - mu_{c,d})^2 / var_{c,d} ]

which is what the logpzc line computes.
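This closed form for the expectation can be sanity-checked against a Monte-Carlo estimate. A NumPy sketch with a toy 1-D latent and two components (all values illustrative, not SCALE output; the closed-form line mirrors the logpzc line in the loss code quoted earlier):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D latent with 2 mixture components (illustrative values only)
mu, var = 0.3, 0.5            # q(z|x) = N(mu, var) from the encoder
mu_c = np.array([0.0, 1.0])   # component means
var_c = np.array([1.0, 2.0])  # component variances
gamma = np.array([0.6, 0.4])  # q(c|x)

# Closed form, mirroring the logpzc line in the loss code
closed = -0.5 * np.sum(gamma * (np.log(2*np.pi) + np.log(var_c)
                                + var/var_c + (mu - mu_c)**2/var_c))

# Monte-Carlo estimate of E_{q(z,c|x)}[log p(z|c)]:
# sample z ~ q(z|x), evaluate sum_c gamma_c * log N(z; mu_c, var_c), average
z = rng.normal(mu, np.sqrt(var), size=200_000)
log_norm = -0.5 * (np.log(2*np.pi) + np.log(var_c)[None, :]
                   + (z[:, None] - mu_c[None, :])**2 / var_c[None, :])
mc = np.mean(log_norm @ gamma)

print(closed, mc)  # the two estimates should agree closely
```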
I tried to reduce the latent dimension, but no luck: the clustering result still looks pretty bad. Do you suggest I try the default dimensions in the code (i.e. latent = 10, encode_dim = [1024, 128], decode_dim = [])?
Another concern is that the KL loss is negative during training, as shown above. I use MSE for the reconstruction loss.
In addition, I think the log_transform argument is never actually used in the code in __init__.py. I manually log-transform my data with X = np.log(X+1), but then the recon_loss and kl_loss go to inf/nan after a couple of epochs.
Why is the recon_loss 0? That does not make sense; I think you should check your data first. SCALE takes a count matrix without any normalization as input. Providing the additional information that SCALE prints would also help debugging.
Thank you for your quick and kind reply!
I will provide more info here. Since I am using bulk sequencing data, I cloned this repo and extracted the few methods that I need.
This is a glimpse of my data; all values are raw counts and there are no NaNs.
Here is a detailed version of my code. Now the recon_loss and kl_loss go to inf/nan quickly. Could you shed some light on how to resolve this issue?
I just saw that the max count in your data can be 540254 while the smallest is 2 or 0; the numerical range is big. You had better do a log1p transformation to scale the numbers, since SCALE assumes the input is binary. Another thing: make sure the sample-feature matrix is oriented correctly, so each sample is one input rather than each feature.
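Both points can be sketched in a few lines of NumPy (the counts below are toy values echoing the range mentioned above, not the actual dataset):

```python
import numpy as np

# Illustrative counts echoing the range mentioned above (0 up to 540254)
counts = np.array([[0, 2, 37, 540254],
                   [5, 0, 812, 120000]], dtype=float)

# log1p compresses the dynamic range: the max drops from 540254 to about 13.2
X = np.log1p(counts)
print(X.max())

# Orientation check: each row should be one sample (donor), each column
# one gene, so the model sees one sample per input vector
n_samples, n_genes = X.shape
assert n_samples == 2 and n_genes == 4
```

If your matrix is genes x samples instead, transpose it with `X.T` before feeding it to the model.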