simrdwn
Trained yolt stopped working?!
I'm scratching my head with this one. Back in March I successfully trained YOLT on the COWC data and got some good test results on a separate dataset.
Coming back a month later, I've tried to re-run the same config and can't get the same results! Probabilities are very low (<0.01). The only thing that changed was a swap of the graphics card, upgrading to a Titan. Could this make a difference?
I was wondering if this was in any way related to #26.
Some additional info: running on the COWC test data I get the same low-probability result, but if I set the threshold to 0.01 I see the following: all the "detections" seem to be in rows at the bottom of each 544-pixel slice.
I had experienced a similar problem. I checked lines 9-14 in /simrdwn/yolt/Makefile. My GPU did not match any of them, so I added one. I also changed the versions of CUDA and TensorFlow in /simrdwn/docker/Dockerfile and reinstalled SIMRDWN. You can find more information at https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/.
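If you're not sure which SM architecture your card has, you can match the card name against the table on that page. A quick way to list the installed GPUs (the `compute_cap` query is an extra that only newer drivers support):

```bash
# List the GPUs in this machine; match the card name against the SM
# architecture table at the arnon.dk link above.
nvidia-smi -L
# Newer drivers can also report the compute capability directly,
# e.g. "7.0" => SM70 / compute_70 (field absent on older drivers):
nvidia-smi --query-gpu=name,compute_cap --format=csv
```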
@ghghgh777 I ran into a similar problem and your answer is very helpful! Can you share more details on how you modified the /simrdwn/docker/Dockerfile? Thanks!
@wendyzzzw Sorry for the late reply. I checked my answer only against commit b275a35, so it may not work for the current commit.
Check "https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/". If you are using GPU matched with SM62, SM70, or SM75, you need to add a line like "-gencode arch=compute_62,code=[sm_62,compute_62]" after line 9-13 in /simrdwn/yolt2/Makefile and /simrdwn/yolt3/Makefile.
For /simrdwn/docker/Dockerfile, the code was updated, so it now uses CUDA 9.0. I think CUDA 10 would be required if your GPU matches SM75. If you need CUDA 10, you can change line 2 and lines 25-26. The current version of SIMRDWN uses tensorflow-gpu 1.13.1, which is built against CUDA 10.0, so I think it would be OK.
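As a rough sketch of the line-2 change (the exact base-image tag in the repo is an assumption here, so check the actual Dockerfile):

```dockerfile
# Hypothetical sketch: swap the CUDA 9.0 base image on line 2 of
# /simrdwn/docker/Dockerfile for a CUDA 10.0 one; the exact tags may
# differ in the repo. tensorflow-gpu 1.13.1 is built against CUDA 10.0.
# FROM nvidia/cuda:9.0-devel-ubuntu16.04
FROM nvidia/cuda:10.0-devel-ubuntu16.04
```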
After that, I reinstalled SIMRDWN starting from step "0-3. Build docker file".
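For reference, that step boils down to rebuilding the Docker image, something like the command below (the image name and path are illustrative):

```bash
# Rebuild the image from the repo's Dockerfile; --no-cache makes sure
# the CUDA/TensorFlow changes are actually picked up.
cd /path/to/simrdwn
docker build --no-cache -t simrdwn ./docker
```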
As noted by @ghghgh777, this seems to be a GPU architecture issue, and it has been observed in YOLO as well: https://github.com/pjreddie/darknet/issues/486. I'm still digging into it, but there may be a compatibility issue with weights trained on older versions of CUDA. As painful as it seems, retraining the model on the new hardware/drivers worked for me to get around this issue.