
lmdb stuck/hangs when using multiprocessing workers (PyTorch DataLoader).


Affected Operating Systems

  • Linux

Affected py-lmdb Version

lmdb=1.4.1

py-lmdb Installation Method

pip install lmdb

Using bundled or distribution-provided LMDB library?

Bundled

Distribution name and LMDB library version

(0, 9, 29)

Machine "free -m" output

$ free -m

              total        used        free      shared  buff/cache   available
Mem:         515461      181423        8654        3241      325382      329357
Swap:             0           0           0

Other important machine info

linux version: Linux version 3.10.0-1127.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Tue Mar 31 23:36:51 UTC 2020

os: Ubuntu 18.04.6 LTS

python version: 3.10.11; pytorch version: 1.13.0+cu116

Describe Your Problem

I train using a PyTorch DataLoader that reads from an LMDB file holding about 60 million images and captions (roughly 2.5 TB). When I train with num_workers > 0 (e.g. 4), training gets stuck at the start or at some intermediate step (e.g. step 1500). My code follows https://github.com/Lyken17/Efficient-PyTorch. The full training code is complex, but I can reliably reproduce the problem with the simplified code below; with the lazy initialization shown there, the hang is delayed to around step 5000 compared with opening the database in the __init__ function.

P.S.: calling _init_db from the __init__ function produces the same error.


import lmdb
import six,random
from PIL import Image
import time
import pyarrow as pa
import logging
import traceback
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate
from torchvision import transforms


def mycollate_fn(batch):
    # Drop samples that failed to load (returned None) before collating.
    batch = list(filter(lambda x: x is not None, batch))
    return default_collate(batch)

class Laion_load(Dataset):
    def __init__(self, ann_paths):
        
        self.env = None  # opened lazily, per worker process, in _init_db
        self.length = 62363814
        self.ann_paths = ann_paths
        
        self.totensor = transforms.Compose(
                [   
                    transforms.Resize((224, 224)),
                    transforms.ToTensor(),
                ]
            )
        
    
    def _init_db(self):
        # Opened lazily so that each DataLoader worker creates its own
        # environment handle after fork instead of inheriting the parent's.
        self.db_path = self.ann_paths[0]
        st = time.time()
        self.env = lmdb.open(self.db_path, subdir=False,
                             readonly=True, lock=False,
                             readahead=False, meminit=False, max_readers=128)
        with self.env.begin(write=False) as txn:
            self.length = pa.deserialize(txn.get(b'__len__'))
            # self.keys = pa.deserialize(txn.get(b'__keys__'))

        end = time.time()
        logging.info("load time: {}".format(end - st))
            
    def __getitem__(self, index):
        
        try:
            
            if self.env is None:
                self._init_db()
            
            encode_index = str(index).encode('ascii')
            
            with self.env.begin(write=False) as txn:
                byteflow = txn.get(encode_index)
                # byteflow = txn.get(self.keys[index])
            
            imagebuf, org_cap, gen_cap = pa.deserialize(byteflow)
            del byteflow
            
            buf = six.BytesIO()
            buf.write(imagebuf)
            buf.seek(0)
            img = Image.open(buf).convert('RGB')
            
            img = self.totensor(img)
            
            return dict(input=img,
                    org_cap = org_cap,
                    gen_cap = gen_cap)
            
        except Exception as e:
            logging.error('index: {} Exception: {}'.format(index, e))
            logging.error("error detail: {}".format(traceback.format_exc()))
            return None

    def __len__(self):
        return self.length

    def __repr__(self):
        # Use ann_paths[0]: self.db_path only exists after _init_db has run.
        return self.__class__.__name__ + ' (' + self.ann_paths[0] + ')'
 

     
if __name__ == "__main__":

    test_data = Laion_load(ann_paths=["./test.lmdb"])

    # Pass mycollate_fn so batches survive samples that returned None;
    # default_collate would otherwise crash on a None element.
    data_loader = DataLoader(test_data, batch_size=100, num_workers=8,
                             shuffle=True, collate_fn=mycollate_fn)
    for index, item in enumerate(data_loader):
        print('aa:', index)

    print("done")
    
    

The job is launched with multiple processes via DeepSpeed or torchrun, e.g.: deepspeed --num_gpus=8 --master_port 6666 build_lmdb_datasets.py --deepspeed ./config.json or torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 build_lmdb_datasets.py
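A minimal sketch of a fork-safety guard, under the assumption that the hang comes from an LMDB environment handle crossing a fork (the ForkSafeLMDB name and _pid attribute below are hypothetical, not part of the original code; the open() flags mirror the reproduction above):

import os

import lmdb


class ForkSafeLMDB:
    """Reopen the LMDB environment whenever the owning process changes."""

    def __init__(self, db_path):
        self.db_path = db_path
        self.env = None
        self._pid = None  # pid that opened self.env

    def _get_env(self):
        # After a fork (e.g. a DataLoader worker starting), os.getpid()
        # differs from the pid that opened the environment, so reopen it
        # rather than reuse a handle inherited from the parent.
        if self.env is None or self._pid != os.getpid():
            self.env = lmdb.open(self.db_path, subdir=False, readonly=True,
                                 lock=False, readahead=False, meminit=False,
                                 max_readers=128)
            self._pid = os.getpid()
        return self.env

    def get(self, key):
        with self._get_env().begin(write=False) as txn:
            return txn.get(key)

Alternatively, passing multiprocessing_context="spawn" to the DataLoader avoids fork inheritance entirely, at the cost of slower worker startup.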

Errors/exceptions Encountered

It gets stuck with no error output; Ctrl+C prints nothing.
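Since Ctrl+C yields nothing, one way to see where the hung processes are stuck (a sketch using only the standard library; the choice of SIGUSR1 is arbitrary) is to register a traceback dump before training starts:

import faulthandler
import signal

# Dump every thread's traceback when the hung process receives SIGUSR1,
# e.g. via `kill -USR1 <pid>` while the job is wedged.
faulthandler.register(signal.SIGUSR1, all_threads=True)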

Describe What You Expected To Happen

I expected the read-only transactions to complete and training to proceed.

Describe What Happened Instead

The Python processes hang, GPU utilization drops to 0, GPU memory stays allocated, and training stops.

Sander-houqi avatar Nov 06 '23 04:11 Sander-houqi

Can anyone help? @jnwatson @dw Thanks.

Sander-houqi avatar Nov 07 '23 02:11 Sander-houqi

Maybe remove the try/except. Also, do you see the "load time: ..." log line?

orena1 avatar Dec 11 '23 22:12 orena1
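(A side note on that question: with the script as written, the "load time" message can never appear, because logging.info is dropped at the root logger's default WARNING level. A minimal sketch to make it visible:)

import logging

# Lower the root logger's threshold so the logging.info("load time: ...")
# call in _init_db actually reaches stderr.
logging.basicConfig(level=logging.INFO)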