
batchConverter uses up a lot of RAM

joelmeili opened this issue · 10 comments

Is there a way to run batchConverter so that it doesn't use up so much RAM? When I try to run it, it uses all available RAM and then crashes. The issue seems to come from pad_sequence when there are a lot of proteins.

joelmeili avatar Jul 27 '23 13:07 joelmeili

Hi @joelmeili, can you share your example and the error message? In theory, batchConverter does not take up a lot of memory.

wangleiofficial avatar Jul 27 '23 16:07 wangleiofficial

Hi, thanks for responding! I would like to predict/calculate the embeddings for a list of amino acid sequences, which I extract from a .fasta file. The corresponding .fasta file can be downloaded from here. Attached you can find a Python file with the code I used. This is the error message that gets thrown: `RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 40255618000 bytes. Error code 12 (Cannot allocate memory)` May I also ask whether there is a preprint available explaining the algorithm?

main.zip

joelmeili avatar Jul 27 '23 18:07 joelmeili

Hi, I see that you have a large number of FASTA sequences, so the crash is likely caused by running out of memory. I recommend computing the protein sequence embeddings in small batches (e.g. 64 sequences at a time), or directly integrating ProtFlash into your PyTorch network (in our experiments this works best), so that you don't need to store so much sequence embedding information at once; in addition, the ProtFlash model can then be fine-tuned. Regarding our paper: the manuscript is under review and is expected to be released shortly. If you need to cite it now, we will consider putting the paper on a preprint server.
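
A minimal sketch of the small-batch approach, assuming the batchConverter / load_prot_flash_base API used elsewhere in this thread (the FASTA path and the mean-pooling are placeholders taken from the snippets below):

```python
import torch
from Bio import SeqIO
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = load_prot_flash_base().to(device)
model.eval()

# "train_sequences.fasta" is a placeholder path; any FASTA file with your sequences works
data = [(entry.id, str(entry.seq)) for entry in SeqIO.parse("train_sequences.fasta", "fasta")]

batch_size = 64
embeddings = {}

for start in range(0, len(data), batch_size):
    chunk = data[start:start + batch_size]
    ids, batch_token, lengths = batchConverter(chunk)

    with torch.no_grad():
        token_embedding = model(batch_token.to(device), lengths)

    # Mean-pool each sequence's residue embeddings and move the result back to the CPU,
    # so only the small pooled vectors are kept in memory
    for i, (protein_id, seq) in enumerate(chunk):
        embeddings[protein_id] = token_embedding[i, 0:len(seq) + 1].mean(0).cpu()
```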

wangleiofficial avatar Jul 28 '23 02:07 wangleiofficial

Hi again, yeah that makes sense. Where would you put the batchConverter pipeline? I've prepared it now in the following manner:

```python
from Bio import SeqIO
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter
from torch.utils.data import Dataset, DataLoader
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using device: {}".format(device))

model = load_prot_flash_base().to(device)
fasta_file = "train_sequences.fasta"


class ProteinSequenceDataset(Dataset):
    def __init__(self, fasta_file):
        fasta_parsed = SeqIO.parse(open(fasta_file), "fasta")
        data = [(entry.id, str(entry.seq)) for entry in fasta_parsed]
        self.id, self.seq = map(list, zip(*data))

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        protein_id, seq = self.id[idx], self.seq[idx]

        # Embed one sequence at a time
        data = [(protein_id, seq)]
        ids, batch_token, lengths = batchConverter(data)

        with torch.no_grad():
            token_embedding = model(batch_token.to(device), lengths)

        seq_representation = [token_embedding[i, 0:len(seq) + 1].mean(0) for i, (_, seq) in enumerate(data)]

        return ids[0], seq_representation[0]


train_set = ProteinSequenceDataset(fasta_file)
train_loader = DataLoader(train_set, batch_size=64, shuffle=False)

print(next(iter(train_loader)))
```

But I guess there might be a smarter way to go about it; for example, how can you apply the transformation in __getitem__ at the batch level instead of at the individual entry level? Best, Joël

joelmeili avatar Jul 28 '23 14:07 joelmeili

Okay, I think I found this workaround. Is this how you'd write it as well?

```python
from Bio import SeqIO
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter
from torch.utils.data import Dataset, DataLoader
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using device: {}".format(device))

model = load_prot_flash_base().to(device)
fasta_file = "/kaggle/input/cafa-5-protein-function-prediction/Train/train_sequences.fasta"


class ProteinSequenceDataset(Dataset):
    def __init__(self, fasta_file):
        fasta_parsed = SeqIO.parse(open(fasta_file), "fasta")
        data = [(entry.id, str(entry.seq)) for entry in fasta_parsed]
        self.id, self.seq = map(list, zip(*data))

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        protein_id, seq = self.id[idx], self.seq[idx]

        return protein_id, seq


def collate_fn(data):
    # data is a list of (protein_id, seq) tuples for the whole batch
    protein_ids, seqs = zip(*data)
    data = [(protein_ids[idx], seqs[idx]) for idx in range(len(protein_ids))]

    ids, batch_token, lengths = batchConverter(data)

    with torch.no_grad():
        token_embedding = model(batch_token.to(device), lengths)

    seq_representations = [token_embedding[i, 0:len(seq) + 1].mean(0) for i, (_, seq) in enumerate(data)]

    return ids, seq_representations


train_set = ProteinSequenceDataset(fasta_file)
train_loader = DataLoader(train_set, batch_size=2, shuffle=False, collate_fn=collate_fn)

print(next(iter(train_loader)))
```

joelmeili avatar Jul 28 '23 14:07 joelmeili

@joelmeili Yes, I think your code is reasonable, but I suggest fine-tuning the language model, which can bring large benefits.

Example:

```python
model = your_model()
flash_model = load_prot_flash_base()
# ...
optimizer = torch.optim.Adam(
    [
        {'params': model.parameters()},
        {'params': flash_model.parameters(), 'lr': 1e-5},
    ],
    lr=your_model_lr,
)
```
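
For the integration itself, a hedged sketch of what wrapping ProtFlash in a downstream model might look like (ProteinClassifier, embed_dim, num_labels, and the pooling are placeholders, not part of ProtFlash):

```python
import torch
import torch.nn as nn
from ProtFlash.pretrain import load_prot_flash_base


class ProteinClassifier(nn.Module):
    """Hypothetical downstream head on top of ProtFlash embeddings."""

    def __init__(self, embed_dim, num_labels):
        super().__init__()
        self.flash_model = load_prot_flash_base()
        self.head = nn.Linear(embed_dim, num_labels)

    def forward(self, batch_token, lengths, seqs):
        # No torch.no_grad() here, so gradients flow back into ProtFlash (fine-tuning)
        token_embedding = self.flash_model(batch_token, lengths)
        pooled = torch.stack(
            [token_embedding[i, 0:len(seq) + 1].mean(0) for i, seq in enumerate(seqs)]
        )
        return self.head(pooled)


# Separate learning rates: small for the pretrained ProtFlash, larger for the new head
model = ProteinClassifier(embed_dim=768, num_labels=10)  # embed_dim and num_labels are placeholders
optimizer = torch.optim.Adam(
    [
        {'params': model.head.parameters()},
        {'params': model.flash_model.parameters(), 'lr': 1e-5},
    ],
    lr=1e-3,
)
```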

wangleiofficial avatar Jul 28 '23 14:07 wangleiofficial

@wangleiofficial Thanks! I'll look into it when I get to that point. Thanks so far!

joelmeili avatar Jul 28 '23 14:07 joelmeili

hi @wangleiofficial,

So I tried to integrate the model into a different model that predicts protein functions. It seems to work on CPU, but when I try to use CUDA it no longer works. Essentially I get this error message: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)` I tried to track down where the tensors might still be attached to the CPU, but could not figure it out. Maybe you could have a quick look at my code and give me some ideas?

joelmeili avatar Jul 30 '23 00:07 joelmeili

Hello, the PyTorch Lightning framework does not place the batch_token produced by the batchConverter function on the GPU; you need to do it manually:

batch_token = batch_token.to(self.device)
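
For context, a minimal sketch of where that line goes inside a hypothetical LightningModule (the module name, batch format, and loss are placeholders):

```python
import torch
import pytorch_lightning as pl
from ProtFlash.pretrain import load_prot_flash_base
from ProtFlash.utils import batchConverter


class ProteinFunctionModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Registering ProtFlash as a submodule lets Lightning move it to the right device
        self.flash_model = load_prot_flash_base()

    def training_step(self, batch, batch_idx):
        ids, seqs = batch  # hypothetical batch of (protein_id, sequence) pairs
        data = list(zip(ids, seqs))
        ids, batch_token, lengths = batchConverter(data)

        # batchConverter returns CPU tensors, so move them to the module's device by hand
        batch_token = batch_token.to(self.device)
        token_embedding = self.flash_model(batch_token, lengths)

        # Placeholder loss; compute your real objective from token_embedding here
        loss = token_embedding.mean()
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-5)
```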

If you have any problems using ProtFlash, you can contact me and I will be happy to respond. I hope ProtFlash can help in the protein field!

wangleiofficial avatar Jul 30 '23 14:07 wangleiofficial

Cool, thanks! I did what you proposed, but I also had to move the flash model itself to self.device in the forward step.

joelmeili avatar Jul 30 '23 16:07 joelmeili