
install_data.py requires more than 30 GiB of memory

ptn24 opened this issue 4 years ago · 2 comments

It seems that install_data.py -- in particular https://github.com/davek44/Basset/blob/6ae86b88a8df8607c58590ee84a3702662ac33dc/install_data.py#L113 and https://github.com/davek44/Basset/blob/master/src/seq_hdf5.py#L73 -- requires a lot of memory. It deterministically runs out of memory (OOM) on a GCE instance with 30 GiB of RAM. After changing https://github.com/davek44/Basset/blob/6ae86b88a8df8607c58590ee84a3702662ac33dc/install_data.py#L113 to

seq_hdf5.py \
  -c \
  -t 71886 \
  -v 70000 \
  encode_roadmap.fa encode_roadmap_act.txt encode_roadmap.h5

install_data.py passes on the same machine. If the "-r" option is not required, perhaps disable it by default and expose it as an option at the top of the script. Otherwise, please document the memory requirements. Thank you.
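For illustration, a rough sketch of what I mean by an option at the top (the PERMUTE_SEQUENCES name and the subprocess call are hypothetical, not install_data.py's actual code):

# Sketch only: expose the permutation flag as a toggle near the top of the
# script so it can be switched off on memory-constrained machines.
import subprocess

PERMUTE_SEQUENCES = False  # set True to pass "-r" (random permutation) to seq_hdf5.py

cmd = 'seq_hdf5.py -c %s-t 71886 -v 70000 encode_roadmap.fa encode_roadmap_act.txt encode_roadmap.h5' \
      % ('-r ' if PERMUTE_SEQUENCES else '')
subprocess.call(cmd, shell=True)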

Additional details:

  • GCE machine type: n1-standard-8 instance
  • GCE image: c1-deeplearning-common-cu100-20200422

ptn24 · Apr 30 '20 06:04

Thanks for pointing this out. The DNA sequences stored as float16's are responsible for the high memory usage. I originally made that choice to allow for storing N's as 0.25 vectors, but I haven't found that to be very helpful in the years since. I just pushed a new commit to store them as booleans, which cuts the memory usage in half. (Python unfortunately uses 8 bits for a boolean.)
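As a rough illustration (not the actual Basset code), the two encodings compare like this; the helper functions below are hypothetical and only show why booleans halve the footprint while float16 allowed the uniform 0.25 vector for N:

# Sketch: float16 one-hot encoding can represent an N base as
# [0.25, 0.25, 0.25, 0.25]; a boolean one-hot cannot, but each element
# takes 8 bits instead of 16.
import numpy as np

def one_hot_float16(seq):
    """One base per column; N becomes a uniform 0.25 vector."""
    code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    out = np.zeros((4, len(seq)), dtype=np.float16)
    for i, nt in enumerate(seq.upper()):
        if nt in code:
            out[code[nt], i] = 1.0
        else:
            out[:, i] = 0.25  # unknown base
    return out

def one_hot_bool(seq):
    """Same layout, but boolean; N is left all-False."""
    code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    out = np.zeros((4, len(seq)), dtype=bool)
    for i, nt in enumerate(seq.upper()):
        if nt in code:
            out[code[nt], i] = True
    return out

seq = 'ACGTN' * 120
print(one_hot_float16(seq).nbytes, one_hot_bool(seq).nbytes)  # 4800 vs 2400 bytes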

I also removed the "-r" option, which controlled whether the sequences are permuted before being divided into train/valid/test; without it, the split simply follows the current order. I believe the script should always permute, since input BED files are often nonrandomly ordered. I recommend regenerating your training dataset with the new commit.
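For illustration, a minimal sketch of the permute-then-split logic (the function and variable names are hypothetical, not seq_hdf5.py itself):

# Sketch: if the BED-derived rows arrive grouped (e.g. by chromosome or cell
# type), slicing in file order would bias train/valid/test, so shuffle an
# index array first and slice that.
import numpy as np

def split_indices(n_seqs, n_test, n_valid, seed=1):
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_seqs)            # always permute before splitting
    test_idx = order[:n_test]
    valid_idx = order[n_test:n_test + n_valid]
    train_idx = order[n_test + n_valid:]
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = split_indices(1000, 100, 100)
print(len(train_idx), len(valid_idx), len(test_idx))  # 800 100 100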

davek44 · May 02 '20 17:05

Thank you, @davek44! Let me play around with the new commit.

ptn24 · May 06 '20 08:05