
Job is stuck at the beginning forever while the GPUs are fully loaded

rui--zhang opened this issue 2 years ago • 5 comments

Hi, I am trying the latest version of deepEMhancer. The installation was successful. However, the job seems stuck at the beginning forever (see the screenshots below), while all 4 GPUs (Nvidia RTX 3090) are at 100% load. My CUDA version is cuda_11.2.r11.2/compiler.29618528_0. Any suggestion is greatly appreciated! Thanks!

[Screenshot 2023-09-09 at 12:23:34 PM]

[Screenshot 2023-09-09 at 12:23:16 PM]

rui--zhang avatar Sep 09 '23 17:09 rui--zhang

Hi,

Can you first try using only one GPU (e.g., -g 1)?

  • In addition, what is your OS?
  • Can you paste the content of pip freeze here so I can check versions?
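For reference, a single-GPU run could look like the sketch below (the input and output file names are hypothetical; check `deepemhancer -h` for the exact flags on your install). Exporting `CUDA_VISIBLE_DEVICES` additionally hides the other GPUs from TensorFlow at the driver level, which rules out the multi-GPU code path entirely:

```shell
# Make only the first GPU visible to TensorFlow, then run on it.
export CUDA_VISIBLE_DEVICES=0
command -v deepemhancer >/dev/null \
  && deepemhancer -i inputMap.mrc -o enhancedMap.mrc -g 0 \
  || echo "deepemhancer not on PATH; activate its conda environment first"
```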

rsanchezgarc avatar Sep 11 '23 09:09 rsanchezgarc

Hi, my OS is Ubuntu 20.04.5 LTS. When using only one GPU, the symptom is exactly the same (no progress, GPU load ~100%).

Here is the result of pip freeze inside deepEMhancer_env environment:

(deepEMhancer_env) ruiz@panda:/data/deepEM_dir$ pip freeze
absl-py==1.4.0
astunparse==1.6.3
beautifulsoup4 @ file:///home/conda/feedstock_root/build_artifacts/beautifulsoup4_1680888073205/work
boltons @ file:///home/conda/feedstock_root/build_artifacts/boltons_1677499911949/work
Brotli @ file:///home/conda/feedstock_root/build_artifacts/brotli-split_1693583441880/work
cached-property @ file:///home/conda/feedstock_root/build_artifacts/cached_property_1615209429212/work
cachetools==5.3.1
certifi==2023.7.22
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1671179360775/work
chardet @ file:///home/conda/feedstock_root/build_artifacts/chardet_1692221558316/work
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1688813409104/work
click @ file:///home/conda/feedstock_root/build_artifacts/click_1692311806742/work
colorama @ file:///home/conda/feedstock_root/build_artifacts/colorama_1666700638685/work
conda @ file:///home/conda/feedstock_root/build_artifacts/conda_1692727112122/work
conda-build @ file:///home/conda/feedstock_root/build_artifacts/conda-build_1685027997092/work
conda-package-handling @ file:///home/conda/feedstock_root/build_artifacts/conda-package-handling_1691048088238/work
conda_index @ file:///home/conda/feedstock_root/build_artifacts/conda-index_1670248776663/work
conda_package_streaming @ file:///home/conda/feedstock_root/build_artifacts/conda-package-streaming_1691009212940/work
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography-split_1691444168825/work
deepEMhancer @ file:///data/bin2/deepEMhancer
filelock @ file:///home/conda/feedstock_root/build_artifacts/filelock_1693242237773/work
flatbuffers==23.5.26
gast==0.4.0
glob2==0.7
google-auth==2.21.0
google-auth-oauthlib==1.0.0
google-pasta==0.2.0
grpcio==1.56.0
h5py @ file:///home/conda/feedstock_root/build_artifacts/h5py_1692668123877/work
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1663625384323/work
imageio==2.31.3
importlib-metadata==6.8.0
jax==0.4.13
Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1654302431367/work
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1691577114857/work
jsonpatch @ file:///home/conda/feedstock_root/build_artifacts/jsonpatch_1632759296524/work
jsonpointer==2.0
keras==2.12.0
lazy_loader==0.3
libarchive-c @ file:///home/conda/feedstock_root/build_artifacts/python-libarchive-c_1689699461518/work
libclang==16.0.0
Markdown==3.4.3
MarkupSafe @ file:///home/conda/feedstock_root/build_artifacts/markupsafe_1685769048265/work
ml-dtypes==0.2.0
more-itertools @ file:///home/conda/feedstock_root/build_artifacts/more-itertools_1691086935839/work
mrcfile==1.4.0
networkx==3.1
numpy==1.23.0
nvidia-cublas-cu11==2022.4.8
nvidia-cublas-cu117==11.10.1.25
nvidia-cudnn-cu11==8.6.0.163
oauthlib==3.2.2
opt-einsum==3.3.0
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1681337016113/work
Pillow==10.0.0
pkginfo @ file:///home/conda/feedstock_root/build_artifacts/pkginfo_1673281726124/work
pluggy @ file:///home/conda/feedstock_root/build_artifacts/pluggy_1693086607691/work
protobuf==4.23.3
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1681775019467/work
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycosat @ file:///home/conda/feedstock_root/build_artifacts/pycosat_1666836642684/work
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1636257122734/work
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1685514481738/work
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1693930252784/work
PyWavelets==1.4.1
PyYAML @ file:///home/conda/feedstock_root/build_artifacts/pyyaml_1692737146376/work
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1684774241324/work
requests-oauthlib==1.3.1
rsa==4.9
ruamel.yaml @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml_1686993888032/work
ruamel.yaml.clib @ file:///home/conda/feedstock_root/build_artifacts/ruamel.yaml.clib_1670412733608/work
scikit-image==0.20.0
scipy==1.9.0
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
soupsieve @ file:///home/conda/feedstock_root/build_artifacts/soupsieve_1693929250441/work
tensorboard==2.12.3
tensorboard-data-server==0.7.1
tensorflow==2.12.0
tensorflow-estimator==2.12.0
tensorflow-io-gcs-filesystem==0.32.0
termcolor==2.3.0
tifffile==2023.8.30
tomli @ file:///home/conda/feedstock_root/build_artifacts/tomli_1644342247877/work
toolz @ file:///home/conda/feedstock_root/build_artifacts/toolz_1657485559105/work
tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1691580802211/work
typing_extensions==4.7.0
urllib3==1.26.16
Werkzeug==2.3.7
wrapt==1.14.1
zipp==3.16.2
zstandard @ file:///home/conda/feedstock_root/build_artifacts/zstandard_1667296091122/work

rui--zhang avatar Sep 11 '23 15:09 rui--zhang

Hi,

I just reinstalled it from scratch on Ubuntu 20.04 with one RTX 3000 (which should be the same generation) and I don't see any issues. It is true that the first iteration takes substantially longer than the others (about 1 min on my GPU for a 500x500x500 volume; the larger the volume, the slower). How long did you wait for the first iteration when using one single GPU? It is critical to understand whether this is a problem affecting the multi-GPU mode (which has been reported in some cases) or something else.

If it is not working at all, it is probably a TensorFlow-related issue, so we would need to ensure TensorFlow itself is working properly.
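A quick TensorFlow sanity check could look like the following (a sketch, not part of deepEMhancer; the function name is made up, and the matmul sizes are arbitrary). If TensorFlow sees the GPU but the small matmul hangs, the problem is in the CUDA/TensorFlow setup rather than in deepEMhancer:

```python
def tf_sanity_check():
    """Return a short report on whether TensorFlow can see and use a GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow not installed"
    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        return f"TF {tf.__version__}: no GPU visible (CPU only)"
    with tf.device("/GPU:0"):
        x = tf.random.normal((512, 512))
        y = tf.matmul(x, x)  # a hang here points at the CUDA/TF setup
    return f"TF {tf.__version__}: {len(gpus)} GPU(s), test matmul OK {tuple(y.shape)}"

print(tf_sanity_check())
```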

rsanchezgarc avatar Sep 12 '23 10:09 rsanchezgarc

Sorry for the delay (I just came back from a conference). I waited for 15 mins but the progress bar didn't move at all. Same thing if I use a single GPU (GPU 0).

rui--zhang avatar Sep 25 '23 15:09 rui--zhang

Hi, I am sorry to hear that it does not work. I would need to reproduce your conditions to understand what is going on, because my local installation seems to work. But before we do that, can you try your local installation on a different map? Or if you don't have a different map, just crop your map a bit so that the box size is different. If there is a weird bug related to resizing or padding, it should work as soon as the box size is different.

If it turns out that it works for other box sizes, can you tell me the box size and sampling rate of the map that is not working?
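A cropped test volume can be produced with a short script like this (a sketch using numpy; the function and file names are hypothetical, and in practice the map would be read and written with the mrcfile package):

```python
import numpy as np

def center_crop(vol, new_box):
    """Crop a cubic 3D volume to new_box voxels per side, centered."""
    starts = [(s - new_box) // 2 for s in vol.shape]
    return vol[starts[0]:starts[0] + new_box,
               starts[1]:starts[1] + new_box,
               starts[2]:starts[2] + new_box]

# In practice: vol = mrcfile.read("myMap.mrc"); here a dummy array stands in.
vol = np.zeros((256, 256, 256), dtype=np.float32)
cropped = center_crop(vol, 250)
print(cropped.shape)  # (250, 250, 250)
```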

Finally, could you create a conda environment file and send it to me so that I can reproduce your environment?

conda activate deepemhancer_env
conda env export > environment.yml # This is a small text file
conda pack -n <your_environment_name> -o <output_file_name.tar.gz> # This is a large file; requires the conda-pack package (conda install -c conda-forge conda-pack)

rsanchezgarc avatar Sep 27 '23 21:09 rsanchezgarc