
lmdb stuck/hangs when using multiprocessing workers (PyTorch DataLoader).


Affected Operating Systems

  • Linux

Affected py-lmdb Version

lmdb=1.4.1

py-lmdb Installation Method

pip install lmdb

Using bundled or distribution-provided LMDB library?

Bundled

Distribution name and LMDB library version

(0, 9, 29)

Machine "free -m" output

$ free -m

              total        used        free      shared  buff/cache   available
Mem:         515461      181423        8654        3241      325382      329357
Swap:             0           0           0

Other important machine info

linux version: Linux version 3.10.0-1127.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Tue Mar 31 23:36:51 UTC 2020

os: Ubuntu 18.04.6 LTS

python version: 3.10.11; pytorch version: 1.13.0+cu116

Describe Your Problem

I train using a PyTorch DataLoader that reads from an LMDB file holding about 60 million images and captions (roughly 2.5 TB). When I train with num_workers > 0 (e.g. 4), training gets stuck at the start or at some intermediate step (e.g. step 1500). My code follows https://github.com/Lyken17/Efficient-PyTorch. The full training code is complex, but I can reliably reproduce the problem with the simplified code below; with the lazy initialization shown there, the hang is delayed to around step 5000 compared with opening the database in the __init__ function.

P.S.: calling _init_db from the __init__ function produces the same error.


import lmdb
import six,random
from PIL import Image
import time
import pyarrow as pa
import logging
import traceback
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate
from torchvision import transforms


def mycollate_fn(batch):
    # Drop samples that failed to load (returned None) before collating.
    batch = list(filter(lambda x: x is not None, batch))
    return default_collate(batch)

class Laion_load(Dataset):
    def __init__(self, ann_paths):
        
        self.env = None  # opened lazily, per worker process, in _init_db
        self.length = 62363814
        self.ann_paths = ann_paths
        
        self.totensor = transforms.Compose(
                [   
                    transforms.Resize((224, 224)),
                    transforms.ToTensor(),
                ]
            )
        
    
    def _init_db(self):
        # Opened lazily so that each DataLoader worker creates its own
        # environment handle after fork instead of inheriting the parent's.
        self.db_path = self.ann_paths[0]
        st = time.time()
        self.env = lmdb.open(self.db_path, subdir=False,
                             readonly=True, lock=False,
                             readahead=False, meminit=False, max_readers=128)
        with self.env.begin(write=False) as txn:
            self.length = pa.deserialize(txn.get(b'__len__'))
            # self.keys = pa.deserialize(txn.get(b'__keys__'))

        end = time.time()
        logging.info("load time: {}".format(end - st))
            
    def __getitem__(self, index):
        
        try:
            
            if self.env is None:
                self._init_db()
            
            encode_index = str(index).encode('ascii')
            
            with self.env.begin(write=False) as txn:
                byteflow = txn.get(encode_index)
                # byteflow = txn.get(self.keys[index])
            
            imagebuf, org_cap, gen_cap = pa.deserialize(byteflow)
            del byteflow
            
            buf = six.BytesIO()
            buf.write(imagebuf)
            buf.seek(0)
            img = Image.open(buf).convert('RGB')
            
            img = self.totensor(img)
            
            return dict(input=img,
                    org_cap = org_cap,
                    gen_cap = gen_cap)
            
        except Exception as e:
            logging.error('index: {} Exception: {}'.format(index, e))
            logging.error("error detail: {}".format(traceback.format_exc()))
            return None

    def __len__(self):
        return self.length

    def __repr__(self):
        # Use ann_paths[0]: self.db_path only exists after _init_db has run.
        return self.__class__.__name__ + ' (' + self.ann_paths[0] + ')'
 

     
if __name__ == "__main__":

    test_data = Laion_load(ann_paths=["./test.lmdb"])

    # Pass mycollate_fn so batches survive samples that returned None;
    # default_collate would otherwise crash on a None element.
    data_loader = DataLoader(test_data, batch_size=100, num_workers=8,
                             shuffle=True, collate_fn=mycollate_fn)
    for index, item in enumerate(data_loader):
        print('aa:', index)

    print("done")
    
    

The job is launched with multiple processes via DeepSpeed or torchrun, e.g.: deepspeed --num_gpus=8 --master_port 6666 build_lmdb_datasets.py --deepspeed ./config.json or torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 build_lmdb_datasets.py
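A minimal sketch of a fork-safety guard, under the assumption that the hang comes from an LMDB environment handle crossing a fork (the ForkSafeLMDB name and _pid attribute below are hypothetical, not part of the original code; the open() flags mirror the reproduction above):

import os

import lmdb


class ForkSafeLMDB:
    """Reopen the LMDB environment whenever the owning process changes."""

    def __init__(self, db_path):
        self.db_path = db_path
        self.env = None
        self._pid = None  # pid that opened self.env

    def _get_env(self):
        # After a fork (e.g. a DataLoader worker starting), os.getpid()
        # differs from the pid that opened the environment, so reopen it
        # rather than reuse a handle inherited from the parent.
        if self.env is None or self._pid != os.getpid():
            self.env = lmdb.open(self.db_path, subdir=False, readonly=True,
                                 lock=False, readahead=False, meminit=False,
                                 max_readers=128)
            self._pid = os.getpid()
        return self.env

    def get(self, key):
        with self._get_env().begin(write=False) as txn:
            return txn.get(key)

Alternatively, passing multiprocessing_context="spawn" to the DataLoader avoids fork inheritance entirely, at the cost of slower worker startup.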

Errors/exceptions Encountered

It gets stuck with no error output; Ctrl+C prints nothing.
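Since Ctrl+C yields nothing, one way to see where the hung processes are stuck (a sketch using only the standard library; the choice of SIGUSR1 is arbitrary) is to register a traceback dump before training starts:

import faulthandler
import signal

# Dump every thread's traceback when the hung process receives SIGUSR1,
# e.g. via `kill -USR1 <pid>` while the job is wedged.
faulthandler.register(signal.SIGUSR1, all_threads=True)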

Describe What You Expected To Happen

I expected the read-only transactions to complete and training to proceed.

Describe What Happened Instead

The Python processes hang, GPU utilization drops to 0, GPU memory stays allocated, and training stops.

Sander-houqi avatar Nov 06 '23 04:11 Sander-houqi

Can anyone help? @jnwatson @dw Thanks.

Sander-houqi avatar Nov 07 '23 02:11 Sander-houqi

Maybe remove the try/except. Also, do you see the "load time: ..." log line?

orena1 avatar Dec 11 '23 22:12 orena1
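(A side note on that question: with the script as written, the "load time" message can never appear, because logging.info is dropped at the root logger's default WARNING level. A minimal sketch to make it visible:)

import logging

# Lower the root logger's threshold so the logging.info("load time: ...")
# call in _init_db actually reaches stderr.
logging.basicConfig(level=logging.INFO)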