turicreate
tc.object_detector.create gets "Killed"
I am trying to train an object detection model on my data. There are roughly 36,000 training images right now (the count may grow later), and whenever I train the model it gets killed. To observe what was happening I monitored the output of htop
in parallel, and what I found was that whenever the code reaches tc.object_detector.create,
RAM starts filling up, then swap fills up, and then the process gets killed.
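Instead of eyeballing htop, the growth can also be logged from inside the Python process with the stdlib `resource` module. This is just a minimal monitoring sketch (the 50 MB `bytearray` allocation is a stand-in for whatever memory-hungry step you want to bracket; note that `ru_maxrss` is in kilobytes on Linux but bytes on macOS):

```python
import resource

def peak_rss_mb():
    """Return the peak resident set size of this process in MB.

    On Linux, ru_maxrss is reported in kilobytes; on macOS it is in
    bytes, so adjust the divisor if you run this elsewhere.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Example: log memory before and after a suspect call
before = peak_rss_mb()
data = [bytearray(1024 * 1024) for _ in range(50)]  # stand-in for the memory-hungry step
after = peak_rss_mb()
print(f"peak RSS grew by roughly {after - before:.0f} MB")
```

Sprinkling a couple of these around the `tc.object_detector.create` call makes it easy to attach concrete numbers to a bug report.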
My hunch was that memory is not being managed well somewhere. To validate this, I decreased my data size to 8,000 images and, as expected, training worked just fine. However, the problem I am trying to solve cannot be solved with just 8,000 images and needs more data for better accuracy. I then looked up a few Stack Overflow issues describing a similar problem and tried their suggestions one by one. Below are the solutions I tried:
- Decrease the batch size (or just make it 1): I tried many batch-size combinations and none of them worked.
- Resize the images to a smaller size: this trick did not work either; even with smaller image dimensions the data still fills up RAM and swap.
- Follow the suggestion in #1412: that issue describes something similar to mine, and it was suggested that the latest turicreate probably would not have this problem. I upgraded turicreate to the latest version (6.4.1) and it still did not fix my issue.
- Increase the swap memory: this gave relatively positive results. As I claimed earlier, this looks like a memory issue, so increasing swap should help, and it did. Below is a small analysis I did (keep in mind my original swap size was 2GB):
| Swap memory size | Max amount of data trained without error |
|---|---|
| 2 GB | 8k |
| 8 GB | 12-14k |
| 16 GB | 20k |
| 32 GB | 25k |
Theoretically, I can keep increasing the swap memory for however much data I want to train, but I think that is just a hack: it does not address the underlying issue, and sizing swap to the dataset is not the right thing to do for the system either. Given that SFrame
is not an in-memory data structure like many others, I did not expect this problem during training, and I think it is very important to address, as a good ML tool should work for any amount of data it is given.
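The out-of-core behaviour being described can be illustrated with a plain generator: only one batch of decoded items needs to live in RAM at a time, regardless of dataset size. This is a hypothetical sketch of the pattern, not turicreate's actual loader (`load_fn` would decode an image from disk; here it is a stub):

```python
def batched(paths, batch_size, load_fn):
    """Yield lists of at most batch_size loaded items; only one batch is
    held in memory at a time (the previous batch can be freed as soon as
    the consumer drops its reference)."""
    batch = []
    for p in paths:
        batch.append(load_fn(p))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch

# Hypothetical usage: load_fn would decode an image; here it is a stub.
paths = [f"img_{i}.jpg" for i in range(10)]
batches = list(batched(paths, 4, load_fn=lambda p: p.upper()))
print([len(b) for b in batches])  # -> [4, 4, 2]
```

If peak RSS grows linearly with the number of images rather than with the batch size, something in the pipeline is materialising the whole dataset instead of streaming it like this.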
Pragmatically speaking, it is possible this is caused by the infrastructure I am using, so here are the details of the system I am working on:
OS: Ubuntu 18.04
CUDA: 10.0
cuDNN: 7.4.2
TensorFlow: 2.0.0 (built from source to use the above-mentioned CUDA and cuDNN)
Turicreate: 6.4.1 (the latest version by far)
RAM: 16 GB
Environment: virtualenv (if it matters)
Data size: 15 GB (it will eventually increase even more)
Any feedback on this would be appreciated. Thank you!
I faced the same issue; my workaround was to update CUDA to 10.1 and cuDNN to 7.6. After that, install turicreate per the instructions and follow the GPU instructions as-is.
@waheed0332 But the recommended CUDA and cuDNN versions for tensorflow 2.0.0 are 10.0 and 7.4.x respectively.
Also, what is the size of the data you are dealing with? What is your tensorflow version? And did the RAM consumption decrease after this update?
Thanks!
Yeah, but I got the same error with the recommended versions. My data size is 500k images and I have 32GB of RAM.
Here is my requirements file:
absl-py==0.10.0
astor==0.8.1
astunparse==1.6.3
cachetools==4.1.1
certifi==2020.6.20
chardet==3.0.4
coremltools==3.3
decorator==4.4.2
gast==0.3.3
google-auth==1.22.1
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
grpcio==1.32.0
h5py==2.10.0
idna==2.10
importlib-metadata==2.0.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.2
llvmlite==0.33.0
Markdown==3.3.1
numba==0.50.1
numpy==1.18.5
oauthlib==3.1.0
opt-einsum==3.3.0
pandas==1.1.3
Pillow==8.0.0
pkg-resources==0.0.0
prettytable==0.7.2
protobuf==3.13.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
python-dateutil==2.8.1
pytz==2020.1
requests==2.24.0
requests-oauthlib==1.3.0
resampy==0.2.1
rsa==4.6
scipy==1.5.2
six==1.15.0
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow-estimator==2.3.0
tensorflow-gpu==2.3.1
termcolor==1.1.0
turicreate==6.4.1
urllib3==1.25.10
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.3.0
CUDA 10.1 and cuDNN 7.6 are compatible with tensorflow-gpu 2.3.1, so give it a try; things might work out for you too.
@waheed0332 Thanks a lot, will give it a try and report back.
@waheed0332 So, I did try what you suggested and below are my findings:
- I first set up CUDA 10.1 and cuDNN 7.6, built tensorflow (v2.3.1) from source, and then tried to install turicreate (v6.4.1, the latest version as of now), but turicreate would not install and threw an error saying
turicreate is only compatible with tensorflow>2.1 and tensorflow<=2.0
(this is not the actual error text, but it said something very similar).
- Then I tried to trick turicreate by uninstalling the tensorflow I had (2.3.1), installing a compatible version (2.0.0), and then installing turicreate. Finally, I uninstalled tensorflow 2.0.0 and installed tensorflow 2.3.1 again; this gave me a warning that it is not compatible with turicreate, but I did it anyway to test. Even after doing this I still ran into the same error I mentioned earlier. (I know this step sounds very silly, but I did it only to validate what @waheed0332 suggested.)
Even now the RAM gets full, then swap starts filling up, and once both are full the program gets killed. I strongly believe the data is not being managed properly somewhere.
The data size I am using is large (roughly 36k images; 10-15 GB), but I don't think that should cause the problem.
Please let me know what can be done, how we can debug this further, or which part could be causing the issue, and whether anyone facing a similar issue has found a workaround.
Thanks a lot!
Can you share your requirements file?
Here is the list of all packages with their respective versions
Package Version
absl-py 0.10.0
astor 0.8.1
astunparse 1.6.3
boto3 1.16.3
botocore 1.19.3
cachetools 4.1.1
certifi 2020.6.20
chardet 3.0.4
coremltools 3.3
decorator 4.4.2
gast 0.2.2
google-auth 1.22.1
google-auth-oauthlib 0.4.1
google-pasta 0.2.0
grpcio 1.33.1
h5py 2.10.0
idna 2.10
importlib-metadata 2.0.0
jmespath 0.10.0
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.2
llvmlite 0.33.0
Markdown 3.3.2
numba 0.50.1
numpy 1.18.5
oauthlib 3.1.0
opt-einsum 3.3.0
pandas 1.1.3
Pillow 8.0.1
pip 20.2.4
pkg-resources 0.0.0
prettytable 0.7.2
protobuf 3.13.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
python-dateutil 2.8.1
pytz 2020.1
requests 2.24.0
requests-oauthlib 1.3.0
resampy 0.2.1
rsa 4.6
s3transfer 0.3.3
scipy 1.4.1
setuptools 50.3.2
six 1.15.0
tensorboard 2.0.2
tensorboard-plugin-wit 1.7.0
tensorflow-estimator 2.0.1
tensorflow-gpu 2.0.0
termcolor 1.1.0
turicreate 6.4.1
urllib3 1.25.11
Werkzeug 1.0.1
wheel 0.35.1
wrapt 1.12.1
zipp 3.3.1
Set tensorflow-gpu==2.3.1 and try again. Also, images are being resized to 608x608, so you can pass in images only a little larger than that; this will make training faster too.
My tensorflow-gpu was in fact 2.3.1, and since that did not work I reverted to 2.0.0. My dataset has images of all sizes, but I don't think that causes any issue since, as you mentioned, they are resized anyway. I really don't think it's a version issue; I think it has to do with something else.
Well, the only workaround is to reduce the image size and resize the annotations accordingly. I recommend going no larger than 650x650.
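If you go this route, the bounding boxes have to be scaled by the same factor as the image. Assuming turicreate-style annotations (a list of dicts with a `label` and a `coordinates` dict holding the pixel centre `x`/`y` plus `width`/`height`), a rescaling sketch might look like:

```python
def scale_annotations(annotations, scale):
    """Scale turicreate-style bounding boxes by the same factor used to
    resize the image. Each annotation is assumed to be
    {'label': ..., 'coordinates': {'x', 'y', 'width', 'height'}} with
    x/y giving the box centre in pixels."""
    scaled = []
    for ann in annotations:
        c = ann["coordinates"]
        scaled.append({
            "label": ann["label"],
            "coordinates": {
                "x": c["x"] * scale,
                "y": c["y"] * scale,
                "width": c["width"] * scale,
                "height": c["height"] * scale,
            },
        })
    return scaled

# Example: shrinking a 1300px-wide image to 650px means scale = 0.5
anns = [{"label": "dog", "coordinates": {"x": 400, "y": 300, "width": 120, "height": 80}}]
print(scale_annotations(anns, 650 / 1300))
# -> [{'label': 'dog', 'coordinates': {'x': 200.0, 'y': 150.0, 'width': 60.0, 'height': 40.0}}]
```

If width and height are resized by different factors, scale x/width and y/height separately instead of using one factor.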
Try increasing the swap space... https://askubuntu.com/questions/178712/how-to-increase-swap-space
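For reference, a typical way to add a swap file on Ubuntu (run as root; the 16G size is only an example, pick what your disk allows):

```shell
# Create and enable a 16 GB swap file (size is an example)
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Verify, then add "/swapfile none swap sw 0 0" to /etc/fstab to persist
free -h
```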
I have the same issue with the one-shot object detector. I get usable models when I want to detect 10-15 objects, but I actually want to detect about 100, and the process fills up memory (64GB RAM; same result with 256GB swap), generates a few hundred GB of augmented data on disk, and gets "Killed" after finishing augmentation, just before training starts.
- I tried reducing the size of the backgrounds used for augmentation; if I do that, training goes through, but the loss on validation data stabilises after very few iterations and the resulting model is extremely bad (it sees nothing). It feels like all the training data is loaded into RAM at some point...
tc 6.4.1, Nvidia stuff 10.1
As a workaround, does anyone know how to export the generated synthetic data (images and annotations) in a format usable in another training system?
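If I remember the API correctly, the one-shot detector's augmented set can be materialised with `tc.one_shot_object_detector.util.preview_synthetic_training_data`, which returns an SFrame of images plus annotations that you can `save()` or iterate over. Once you have the rows, dumping them to JSON lines makes them easy to ingest elsewhere; here is a hedged sketch of just that conversion step (the row shape below is a hypothetical example mimicking turicreate's `{'label', 'coordinates'}` annotations, and `image_path`/`aug_00001.png` are made-up names):

```python
import json
import os
import tempfile

def annotations_to_jsonl(rows, path):
    """Dump per-image annotation records as JSON lines (one image per
    line), a format most training pipelines can ingest with a small
    adapter."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

# Hypothetical rows, shaped like turicreate's {'label', 'coordinates'} annotations
rows = [
    {"image_path": "aug_00001.png",
     "annotations": [{"label": "logo",
                      "coordinates": {"x": 100, "y": 80, "width": 40, "height": 30}}]},
]
out = os.path.join(tempfile.mkdtemp(), "synthetic.jsonl")
annotations_to_jsonl(rows, out)
with open(out) as f:
    print(f.readline().strip())
```

You would still need to export the image column separately (e.g. saving each `tc.Image` to PNG) and map the annotation fields to whatever your target training system expects.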