
loading h5 embedding file in train.py is killed due to large memory requirement

jiyanbio opened this issue · 8 comments

In train.py:

    print(f"# Loading embeddings", file=output)
    tensors = {}
    all_proteins = set(train_n0).union(set(train_n1)).union(set(test_n0)).union(set(test_n1))
    for prot_name in tqdm(all_proteins):
        tensors[prot_name] = torch.from_numpy(h5fi[prot_name][:, :])

Could this be modified to use a PyTorch DataLoader instead of loading everything up front?
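For what it's worth, one way this could look is a PyTorch Dataset that keeps the HDF5 file open and reads each embedding only when a pair is requested, rather than filling `tensors` up front. This is just a minimal sketch, not the D-SCRIPT API: the class name, constructor arguments, and the pair/label lists are hypothetical, and variable-length sequences would still need batch_size=1 or a custom collate function.

```python
import h5py
import torch
from torch.utils.data import Dataset

class LazyH5PairDataset(Dataset):
    """Reads each protein embedding from the HDF5 file on demand."""

    def __init__(self, h5_path, pairs, labels):
        self.h5_path = h5_path
        self.pairs = pairs      # list of (prot_name0, prot_name1) tuples
        self.labels = labels
        self._h5 = None         # opened lazily so each DataLoader worker gets its own handle

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        if self._h5 is None:
            self._h5 = h5py.File(self.h5_path, "r")
        n0, n1 = self.pairs[idx]
        x0 = torch.from_numpy(self._h5[n0][:, :])
        x1 = torch.from_numpy(self._h5[n1][:, :])
        return x0, x1, self.labels[idx]
```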

jiyanbio avatar Mar 23 '21 14:03 jiyanbio

Hi Jiyan,

Feel free to modify the source code as you see fit to play around with it -- I find it's usually easier to pre-load the embeddings from a runtime point of view, but loading them on demand makes sense if you're running into memory constraints. One thing I've thought about adding is a flag for whether to pre-load the data or not. Let me know if this would be helpful!

With regard to the GPU memory issue, that's usually caused when one (or both) of the protein sequences is too long, so the computation graph doesn't fit within GPU memory constraints. In the paper, we limited training to sequences <=800 AA.
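If it helps, filtering the FASTA file by length before embedding is a quick way to stay under that limit. This is just a sketch assuming Biopython is installed, with placeholder file names:

```python
from Bio import SeqIO

MAX_LEN = 800  # residues, matching the limit used in the paper
kept = [rec for rec in SeqIO.parse("seqs.fasta", "fasta") if len(rec.seq) <= MAX_LEN]
SeqIO.write(kept, "seqs.le800.fasta", "fasta")
```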

samsledje avatar Mar 25 '21 14:03 samsledje


But after embedding the sequences, the h5 files come to about 2000 GB in total. That's too large to load into memory on any machine.

jiyanbio avatar Mar 25 '21 14:03 jiyanbio

How many sequences are you trying to embed? I've done up to 20,000 and it doesn't take more than a few hundred GB. I can definitely work on adding functionality to load or compute each embedding in place, but this will of course slow down the training time.
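As a rough sanity check on disk/memory usage, here is the back-of-envelope arithmetic. The numbers are assumptions: float32 storage, 6165 dimensions per residue (the Bepler & Berger embedding size, if I recall correctly), and an average sequence length of 500 residues.

```python
n_seqs = 20_000
avg_len = 500            # average residues per sequence (assumed)
dim = 6165               # per-residue embedding dimension (assumed)
bytes_total = n_seqs * avg_len * dim * 4   # 4 bytes per float32 value
print(f"{bytes_total / 1e9:.0f} GB")       # ~247 GB with these numbers
```

Ten times as many sequences pushes the same arithmetic into the low terabytes, which would line up with the 2000 GB reported above.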

samsledje avatar Mar 25 '21 15:03 samsledje

Hi Jiyan, Sam and I have discussed how to speed up the embedding access and, like Sam mentioned, we can make a change to load them one at a time or do them in batches. These choices will have different tradeoffs in terms of runtime and memory requirements. Just so we can understand your use-case, would you be ok sharing the max and avg length of your sequences, and the number of sequences?

thanks -r


rs239 avatar Mar 25 '21 15:03 rs239

  • The number of sequences is about 200,000, most of which are shorter than 800 residues.
  • I will exclude some of the longer sequences.
  • Moreover, even with 20,000 sequences as input, it ran out of memory on my machine. So if train.py could handle 5-10 h5 files, it would be much more convenient for users to test it on different datasets (a rough sketch of what that could look like is below).
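Just to illustrate the request: one very simple way to handle several embedding files would be a small read-only wrapper that searches each HDF5 chunk for a protein name. This is only a sketch with hypothetical file names, not the planned D-SCRIPT change:

```python
import h5py
import torch

class ChunkedH5Embeddings:
    """Look up a protein embedding across several HDF5 chunk files."""

    def __init__(self, h5_paths):
        self.files = [h5py.File(p, "r") for p in h5_paths]

    def __getitem__(self, prot_name):
        for f in self.files:
            if prot_name in f:  # h5py files support key membership tests
                return torch.from_numpy(f[prot_name][:, :])
        raise KeyError(prot_name)

# embeddings = ChunkedH5Embeddings(["embed_part1.h5", "embed_part2.h5"])
# x = embeddings["some_protein_id"]
```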

jiyanbio avatar Mar 26 '21 02:03 jiyanbio

Hi Jiyan, Yes, the idea of splitting training data into chunks makes sense. We'll think about how best to set this up and will keep you posted.

thanks -r


rs239 avatar Mar 26 '21 21:03 rs239

Thank you! Could you tell me when the updated version that handles multiple hdf5 files will be online?

jiyanbio avatar Apr 06 '21 22:04 jiyanbio

Hi Jiyan,

I don't think we'll have the bandwidth to fix this for the next few weeks, at least. If you'd like to submit a pull request we'd be happy to take a look at it.

samsledje avatar Apr 08 '21 20:04 samsledje