program frozen
Dear authors
Your work is very exciting and I want to try out your code. I followed the instructions in the README, and am trying to run this example:
CUDA_VISIBLE_DEVICES=0 python fixmatch.py --filters=32 --dataset=cifar10.3@40-1 --train_dir ./experiments/fixmatch
However, my program gets stuck at self.train_step forever...
I did install the required environment as you pointed out in the README.
Do you have any idea what's going on?

After I keyboard-interrupt it, it gives the following message:
KeyboardInterrupt
Traceback (most recent call last):
  File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/queues.py", line 351, in get
    with self._rlock:
  File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
^CProcess ForkPoolWorker-868:
Traceback (most recent call last):
  File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/queues.py", line 351, in get
    with self._rlock:
  File "xxxxxx/miniconda2/envs/fixmatch_env/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
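In case it helps with diagnosis, here is a minimal standard-library snippet (my own suggestion, not something from the repo) that I could add near the top of fixmatch.py so that sending SIGUSR1 to the hung process prints every thread's Python stack:

import faulthandler
import signal

# Dump the Python stack of every thread when the process receives SIGUSR1,
# e.g. via `kill -USR1 <pid>` from another terminal. This should show whether
# the hang is inside self.train_step / sess.run or in the input pipeline.
faulthandler.register(signal.SIGUSR1, all_threads=True)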
Is the GPU active when you're running the training? Or is that stalling too?
The GPU is stalling too... It's a Tesla V100, the CUDA version is 11.0, and the driver version is 450.80.02. (By the way, I installed the specified requirements using conda; could this be causing the issue?) Thank you.
I just wanted to follow up to see if you worked anything out. I don't know if I have any ideas for what could be causing this with our code, but maybe you found a problem?
Hi Carlini
Thank you for following up on the issue. I'm really having trouble figuring out what is preventing me from using the code.
My first suspicion is that it is some environment issue; here is the environment I'm using (I think I matched your specified requirements? I also tried the code on both a Tesla A100 and a P100, and it still doesn't work). I would really appreciate it if you are able to help me with this.
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=main
_tflow_select=2.1.0=gpu
absl-py=0.11.0=pyhd3eb1b0_1
astor=0.8.1=py37h06a4308_0
blas=1.0=mkl
c-ares=1.17.1=h27cfd23_0
ca-certificates=2021.1.19=h06a4308_0
certifi=2020.12.5=py37h06a4308_0
cudatoolkit=10.1.243=h6bb024c_0
cudnn=7.6.5=cuda10.1_0
cupti=10.1.168=0
cycler=0.10.0=pypi_0
cython=0.29.21=py37h2531618_0
easydict=1.9=pypi_0
gast=0.4.0=py_0
google-pasta=0.2.0=py_0
grpcio=1.31.0=py37hf8bcb03_0
h5py=2.10.0=py37hd6299e0_1
hdf5=1.10.6=hb1b8bf9_0
importlib-metadata=2.0.0=py_1
intel-openmp=2020.2=254
keras-applications=1.0.8=py_1
keras-preprocessing=1.1.2=pyhd3eb1b0_0
kiwisolver=1.3.1=pypi_0
ld_impl_linux-64=2.33.1=h53a641e_7
libedit=3.1.20191231=h14c3975_1
libffi=3.3=he6710b0_2
libgcc-ng=9.1.0=hdf63c60_0
libgfortran-ng=7.3.0=hdf63c60_0
libprotobuf=3.14.0=h8c45485_0
libstdcxx-ng=9.1.0=hdf63c60_0
markdown=3.3.3=py37h06a4308_0
matplotlib=3.3.4=pypi_0
mkl=2020.2=256
mkl-service=2.3.0=py37he8ac12f_0
mkl_fft=1.2.0=py37h23d657b_0
mkl_random=1.1.1=py37h0573a6f_0
ncurses=6.2=he6710b0_1
numpy=1.19.2=py37h54aff64_0
numpy-base=1.19.2=py37hfa32c7d_0
opencv-python=4.5.1.48=pypi_0
openssl=1.1.1i=h27cfd23_0
pandas=1.2.1=py37ha9443f7_0
pillow=8.1.0=pypi_0
pip=20.3.3=py37h06a4308_0
protobuf=3.14.0=py37h2531618_1
pyparsing=2.4.7=pypi_0
python=3.7.9=h7579374_0
python-dateutil=2.8.1=pyhd3eb1b0_0
pytz=2021.1=pyhd3eb1b0_0
readline=8.1=h27cfd23_0
scipy=1.6.0=py37h91f5cce_0
setuptools=52.0.0=py37h06a4308_0
six=1.15.0=py37h06a4308_0
sqlite=3.33.0=h62c20be_0
tensorboard=1.14.0=py37hf484d3e_0
tensorflow=1.14.0=gpu_py37h74c33d7_0
tensorflow-base=1.14.0=gpu_py37he45bfe2_0
tensorflow-estimator=1.14.0=py_0
tensorflow-gpu=1.14.0=h0d30ee6_0
termcolor=1.1.0=py37_1
tk=8.6.10=hbc83047_0
tqdm=4.56.2=pypi_0
werkzeug=1.0.1=pyhd3eb1b0_0
wheel=0.36.2=pyhd3eb1b0_0
wrapt=1.12.1=py37h7b6447c_1
xz=5.2.5=h7b6447c_0
zipp=3.4.0=pyhd3eb1b0_0
zlib=1.2.11=h7b6447c_3
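For what it's worth, a small sanity check like the following (standard TF 1.x calls only, nothing taken from the FixMatch code) should show whether this environment can see the GPU at all:

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)                      # expect 1.14.0
print(tf.test.is_gpu_available())          # True if cudatoolkit/cudnn are usable
print([d.name for d in device_lib.list_local_devices()])  # should include /device:GPU:0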
Huh. Two ideas maybe:
- Does CPU-only training work? It would be really slow, but it should at least not stall. (See the example command after this list.)
- If you try to train a MixMatch model with the MixMatch codebase (https://github.com/google-research/mixmatch), does that work? It's very similar code, and this might help isolate whether it's a problem with fixmatch or with the general environment.
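For the CPU-only check, something like the following should work; it just hides every GPU from TensorFlow, there is no dedicated flag for it in the code:

CUDA_VISIBLE_DEVICES= python fixmatch.py --filters=32 --dataset=cifar10.3@40-1 --train_dir ./experiments/fixmatch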
I followed your suggestions, which provided me with some useful insights. So:
- The FixMatch code does work with CPU-only training (using another environment that has the CPU version of TensorFlow).
- I also tried the MixMatch model you pointed to. MixMatch works on both the Tesla P100 and V100, but not the A100 (using the GPU TensorFlow environment I showed you above).
So I guess:
- There might be something unique to FixMatch that makes it fail on my P100 and V100 (while MixMatch works).
- There must be something wrong with my A100s.
For now, I would be very happy if I could just get FixMatch to work on my P100 or V100. I really don't understand what could make MixMatch able to run but FixMatch not. Do you have any thoughts?
That's very interesting. This fixmatch codebase has an implementation of mixmatch. Does that also run properly?
If that works, then maybe try running fixmatch with something like --uratio=1 and see if that helps. Maybe it's the batch size that's the problem?
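To give a rough sense of why uratio matters (assuming the usual defaults of --batch=64 and --uratio=7; double-check against the flags you actually pass), the per-step image count works out to roughly:

# Each step processes the labeled batch plus uratio * batch unlabeled images,
# and each unlabeled image is pushed through both a weak and a strong augmentation.
batch, uratio = 64, 7
print(batch + 2 * uratio * batch)   # 960 images per step with the defaults
print(batch + 2 * 1 * batch)        # 192 images per step with --uratio=1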
Hi Carlini
That's very weird: I cannot run the implementation of MixMatch from the FixMatch codebase (while I'm able to run the original MixMatch codebase). May I ask what GPU you run the FixMatch codebase on? Could it be that the new code is not compatible with some GPUs? (I have very limited knowledge about this, so I hope that's not a ridiculous guess...)
That is very strange. We've never seen any issues with different GPUs in the past, and the two codebases are very similar.
Maybe @david-berthelot has some insight that I'm missing.
In your requirements.txt, there's no specific requirement for the Python version, or for the cudatoolkit, cuDNN, or CUDA versions. Do I need to install any specific version of cudatoolkit, cuDNN, or CUDA?