
Retina-generate_kmeans

Open 996237275 opened this issue 2 years ago • 10 comments

Hello! I finished training the encoder on the retina dataset, but I ran into a problem when I tried to use generate_kmeans and generate_diffusion. I tried the tips you mentioned about the 'deadlock', but it still does not work. The only script that works is generate_baseline.py.

996237275 avatar Jun 30 '23 09:06 996237275

Hi.

So far, the known problem is that kmeans and diffusion require a decent amount of RAM (e.g., on SLURM, I would need to set --mem=20G for them to run successfully). If you are using a scheduler like SLURM, you may need to request more memory.
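For reference, here is a minimal sketch of a SLURM submission script; the job name, GPU request, and script arguments are illustrative assumptions, and only the --mem=20G line reflects the setting I actually tested:

```bash
#!/bin/bash
#SBATCH --job-name=generate_kmeans   # illustrative job name
#SBATCH --mem=20G                    # the key line: request 20GB of RAM
#SBATCH --gres=gpu:1                 # GPU request; adjust to your cluster

# Script arguments omitted; use whatever you pass on your setup.
python generate_kmeans.py
```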

If that is not the issue,

  1. Did you try using the latest code?
  2. Can you give more details (perhaps screenshots) of the error?

ChenLiu-1996 avatar Jun 30 '23 13:06 ChenLiu-1996

I use a single A100 for the experiments. Maybe that is not the problem? I do not find --mem=20G in generate_xx.py.

  1. Yes, I already updated the code.
  2. Here is the error:

[screenshot of the error]

996237275 avatar Jun 30 '23 14:06 996237275

Oh, I might have said something confusing. --mem=20G is the setting for running a SLURM job on a server: it means the job requests 20GB of RAM in order to run the script successfully. Not having enough RAM may be a problem, but in most cases, if you are running on a server without job allocation, you should have more than enough RAM.

A single A100 should be more than enough.

Regarding your screenshot: what if you do not use the --rerun argument? I recently found that to be more helpful.

ChenLiu-1996 avatar Jun 30 '23 15:06 ChenLiu-1996

Actually, I already tried removing --rerun. However, it still does not work.

[screenshot of the error]

996237275 avatar Jul 03 '23 02:07 996237275

When using generate_diffusion.py, it looks like it gets into a deadlock as well?

[screenshot of the terminal output]

996237275 avatar Jul 03 '23 02:07 996237275

Unfortunately I don't really understand the root cause of the problem.

So far the setting that works on my end is:

  • Do NOT use the --rerun flag.
  • Make sure you have around 20GB of RAM. (I have not tested the exact limit, but <=10GB of RAM does not work on my server.)

ChenLiu-1996 avatar Jul 03 '23 03:07 ChenLiu-1996

If this still does not work, the following resources might be helpful.

  • You may try running export MKL_THREADING_LAYER=GNU before running generate_diffusion.py or generate_kmeans.py (see the sketch after this list).
  • Additional resources at https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
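As a rough sketch (assuming a bash shell; script arguments omitted), the variable needs to be set in the same shell session, before Python starts:

```bash
# Force MKL to use the GNU OpenMP runtime. Loading two OpenMP runtimes
# (libiomp and libgomp) into one process is a known cause of hangs in
# joblib-based k-means; see the threadpoolctl link above.
export MKL_THREADING_LAYER=GNU
python generate_kmeans.py   # or generate_diffusion.py
```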

ChenLiu-1996 avatar Jul 03 '23 03:07 ChenLiu-1996

I set export MKL_THREADING_LAYER=GNU on Linux, but it still remains in the 'deadlock'.

[screenshot of the terminal output]

996237275 avatar Jul 03 '23 03:07 996237275

Thanks for trying these out. At this moment I am basically clueless. Sorry for not being able to be more helpful.

I still suspect it is a RAM problem, but I don't have solid proof.
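If you want to test the RAM hypothesis, one simple (unverified for this script) approach is to watch memory usage in a second terminal while the script runs, and see whether it climbs toward your machine's limit right before the hang:

```bash
# Refresh the overall memory report every 2 seconds (Linux).
watch -n 2 free -h
```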

ChenLiu-1996 avatar Jul 05 '23 07:07 ChenLiu-1996

I have updated the code for generating kmeans and diffusion condensation. I believe it may be good now?

ChenLiu-1996 avatar Feb 22 '24 22:02 ChenLiu-1996