Basset
install_data.py requires more than 30 GiB of memory
It seems that install_data.py -- in particular https://github.com/davek44/Basset/blob/6ae86b88a8df8607c58590ee84a3702662ac33dc/install_data.py#L113 and https://github.com/davek44/Basset/blob/master/src/seq_hdf5.py#L73 -- requires a lot of memory: it deterministically runs out of memory (OOM) on a GCE instance with 30 GiB of RAM. After changing https://github.com/davek44/Basset/blob/6ae86b88a8df8607c58590ee84a3702662ac33dc/install_data.py#L113 to
seq_hdf5.py \
-c \
-t 71886 \
-v 70000 \
encode_roadmap.fa encode_roadmap_act.txt encode_roadmap.h5
install_data.py passes on the same machine. If the "-r" option is not required, then perhaps disable it by default and expose it as an option at the top of the script. Otherwise, please document the memory requirements. Thank you!
Additional details:
- GCE machine type: n1-standard-8 instance
- GCE image: c1-deeplearning-common-cu100-20200422
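
For a rough sense of scale, here is a back-of-envelope sketch of the one-hot sequence array that seq_hdf5.py builds in memory. The sequence count and length are assumptions chosen to be roughly the size of the ENCODE/Roadmap dataset, not values read from the script:

n_seqs = 2_000_000      # assumed sequence count, roughly the ENCODE/Roadmap dataset
seq_len = 600           # assumed Basset input length
alphabet = 4            # A, C, G, T
bytes_per_float16 = 2

footprint = n_seqs * seq_len * alphabet * bytes_per_float16
print(f"one-hot float16 array: {footprint / 2**30:.1f} GiB")   # ~8.9 GiB

A single copy is already close to 9 GiB under these assumptions; if the permutation and the train/valid/test split each materialize additional copies, peak usage could plausibly exceed 30 GiB.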
Thanks for pointing this out. Storing the DNA sequences as float16's is responsible for the high memory usage. I originally made that choice to allow storing N's as 0.25 vectors, but I haven't found that to be very helpful in the years since. I just pushed a new commit to store them as booleans, which cuts the memory usage in half. (Python unfortunately uses 8 bits for a boolean.)
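
As a minimal sketch (not the repository's code) of the trade-off described above, assuming a (4, seq_len) one-hot layout for a single sequence: float16 allows an unknown base to be stored as a 0.25 vector, while a boolean array can only mark a base as present or absent but needs half the storage per element.

import numpy as np

seq_len = 600                                       # assumed input length
one_hot_f16 = np.zeros((4, seq_len), dtype=np.float16)
one_hot_f16[:, 0] = 0.25                            # an 'N' at position 0 as a 0.25 vector

one_hot_bool = np.zeros((4, seq_len), dtype=bool)
one_hot_bool[0, 1] = True                           # an 'A' at position 1

print(one_hot_f16.itemsize, one_hot_bool.itemsize)  # 2 bytes vs 1 byte per element

The itemsize difference is where the roughly 2x memory reduction comes from.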
I also removed the "-r" option, which permuted the sequences before dividing them into train/valid/test according to the current order. I believe the script should always permute, since input BED files will often be nonrandomly ordered. I recommend regenerating your training dataset with the new commit.
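
For illustration only (this is not seq_hdf5.py itself), one way to make the split independent of the input BED order is to permute an index array once and slice it; the -t/-v counts below mirror the command in the issue, while the total count is an assumption:

import numpy as np

def split_indices(n_seqs, n_test, n_valid, seed=1):
    # Shuffle all indices once, then carve test/valid/train out of the shuffled order.
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_seqs)
    test = order[:n_test]
    valid = order[n_test:n_test + n_valid]
    train = order[n_test + n_valid:]
    return train, valid, test

train_idx, valid_idx, test_idx = split_indices(n_seqs=2_000_000, n_test=71886, n_valid=70000)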
Thank you, @davek44! Let me play around with the new commit.