Efficient-PyTorch

Without DDP, using only LMDB, it is very slow — even slower than plain imread

Edwardmark opened this issue 4 years ago · 4 comments

import os
import os.path as osp

import lmdb

# raw_reader and dumps_pyarrow are helpers from tools/folder2lmdb.py in this repo:
# raw_reader returns the raw bytes of an image file, dumps_pyarrow serializes with pyarrow.

def folder2lmdb(anno_file, name="train", write_frequency=5000, num_workers=16):
    # Parse the annotation file: each line is "<image_path> <label fields...>".
    ids = []
    annotation = []
    for line in open(anno_file, 'r'):
        filename = line.strip().split()[0]
        ids.append(filename)
        annotation.append(line.strip().split()[1:])
    lmdb_path = osp.join("app_%s.lmdb" % name)
    isdir = os.path.isdir(lmdb_path)

    print("Generate LMDB to %s" % lmdb_path)
    db = lmdb.open(lmdb_path, subdir=isdir,
                   map_size=1099511627776 * 2, readonly=False,
                   meminit=False, map_async=True)

    print(len(ids), len(annotation))
    txn = db.begin(write=True)
    idx = 0
    for filename, label in zip(ids, annotation):
        print(filename, label)
        # Store the raw (still encoded) image bytes together with the label.
        image = raw_reader(filename)
        txn.put(u'{}'.format(idx).encode('ascii'), dumps_pyarrow((image, label)))
        if idx % write_frequency == 0:
            print("[%d/%d]" % (idx, len(annotation)))
            txn.commit()
            txn = db.begin(write=True)
        idx += 1

    # finish iterating through the dataset
    txn.commit()
    # idx already equals the number of records written, so the keys are range(idx);
    # range(idx + 1) would create one key with no corresponding entry.
    keys = [u'{}'.format(k).encode('ascii') for k in range(idx)]
    with db.begin(write=True) as txn:
        txn.put(b'__keys__', dumps_pyarrow(keys))
        txn.put(b'__len__', dumps_pyarrow(len(keys)))

    print("Flushing database ...")
    db.sync()
    db.close()
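
For reference, a minimal call to the function above might look like this (the annotation-file name is a placeholder; each of its lines is expected to be "<image_path> <label fields...>"):

# Hypothetical usage; "train_anno.txt" is a placeholder path.
folder2lmdb("train_anno.txt", name="train", write_frequency=5000)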

import os.path as osp

import cv2
import lmdb
import numpy as np
import pyarrow as pa
import six
import torch
import torch.utils.data as data
from PIL import Image


class DetectionLMDB(data.Dataset):
    def __init__(self, db_path, transform=None, target_transform=None, dataset_name='WiderFace'):
        self.db_path = db_path
        # The environment is opened read-only once, in the main process.
        self.env = lmdb.open(db_path, subdir=osp.isdir(db_path),
                             readonly=True, lock=False,
                             readahead=False, meminit=False)
        with self.env.begin(write=False) as txn:
            # self.length = txn.stat()['entries'] - 1
            self.length = pa.deserialize(txn.get(b'__len__'))
            self.keys = pa.deserialize(txn.get(b'__keys__'))

        self.transform = transform
        self.target_transform = target_transform

        self.name = dataset_name
        self.annotation = list()
        self.counter = 0

    def __getitem__(self, index):
        im, gt, h, w = self.pull_item(index)
        return im, gt

    def pull_item(self, index):
        img, target = None, None
        env = self.env
        with env.begin(write=False) as txn:
            byteflow = txn.get(self.keys[index])
        unpacked = pa.deserialize(byteflow)

        # load image: decode the stored raw bytes, then convert RGB -> BGR for OpenCV
        imgbuf = unpacked[0]
        buf = six.BytesIO()
        buf.write(imgbuf)
        buf.seek(0)
        img = Image.open(buf).convert('RGB')
        img = cv2.cvtColor(np.asarray(img), cv2.COLOR_RGB2BGR)
        height, width, channels = img.shape
        # load label
        target = unpacked[1]

        if self.target_transform is not None:
            target = self.target_transform(target, width, height)

        if self.transform is not None:
            target = np.array(target)
            img, boxes, labels, poses, angles = self.transform(
                img, target[:, :4], target[:, 4], target[:, 5], target[:, 6])
            target = np.hstack((boxes, np.expand_dims(labels, axis=1),
                                np.expand_dims(poses, axis=1),
                                np.expand_dims(angles, axis=1)))

        return torch.from_numpy(img).permute(2, 0, 1), target, height, width

    def __len__(self):
        return self.length

    def __repr__(self):
        return self.__class__.__name__ + ' (' + self.db_path + ')'
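
A minimal sketch of how such a dataset is usually driven: multiple DataLoader workers (rather than DDP itself) are what typically hide the LMDB read latency. The database path and the collate function here are placeholders, not code from this issue:

import torch
from torch.utils.data import DataLoader

def detection_collate(batch):
    # Hypothetical collate for detection: targets vary per image,
    # so keep them in a plain list instead of stacking them.
    images = torch.stack([item[0] for item in batch], 0)
    targets = [item[1] for item in batch]
    return images, targets

# "app_train.lmdb" is a placeholder path; in practice a transform that
# resizes/normalizes the images to a fixed size would also be passed,
# otherwise stacking into a batch fails.
dataset = DetectionLMDB("app_train.lmdb")
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=8, pin_memory=True,
                    collate_fn=detection_collate)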

I generated the LMDB with the code above and used DetectionLMDB as the dataset, but loading is very slow and I don't know why. Does it have to be combined with DDP?

Edwardmark · Aug 10 '20
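
One pattern that is often suggested when an LMDB-backed Dataset is used with num_workers > 0 is to open the lmdb.Environment lazily inside each worker instead of in __init__, so the handle is never shared across forked processes. This is a generic pattern, not a fix confirmed in this thread; a minimal sketch:

import lmdb
import pyarrow as pa
import torch.utils.data as data

class LazyLMDB(data.Dataset):
    # Hypothetical variant of DetectionLMDB: the environment is opened on
    # first access in each worker process rather than in __init__.
    def __init__(self, db_path):
        self.db_path = db_path
        self.env = None
        # Read length/keys once with a temporary, read-only environment.
        env = lmdb.open(db_path, readonly=True, lock=False)
        with env.begin(write=False) as txn:
            self.length = pa.deserialize(txn.get(b'__len__'))
            self.keys = pa.deserialize(txn.get(b'__keys__'))
        env.close()

    def _init_db(self):
        self.env = lmdb.open(self.db_path, readonly=True, lock=False,
                             readahead=False, meminit=False)

    def __getitem__(self, index):
        if self.env is None:
            self._init_db()
        with self.env.begin(write=False) as txn:
            byteflow = txn.get(self.keys[index])
        return pa.deserialize(byteflow)

    def __len__(self):
        return self.length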

I have the same problem — it is slower than the built-in Dataset. Did you solve it in the end?

codermckee · Nov 17 '20

These scripts are legacy code from a long time ago (~ torch 0.4), and I'm not sure whether the dataloader has changed since then. Let me test it and see.

Lyken17 · Nov 17 '20

At first I stored the pre-processed features in the LMDB; they were so large that loading became slow. Once I switched to storing only the encoded buffer of each single image, it was fast.

leijuzi · Aug 05 '21
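
leijuzi's point is easy to check: an encoded JPEG buffer is usually an order of magnitude smaller than the decoded (or pre-processed) array, so storing raw bytes keeps the LMDB compact and the reads fast. A rough sketch, with a placeholder file path:

import numpy as np
from PIL import Image

path = "example.jpg"  # placeholder

with open(path, "rb") as f:
    raw_bytes = f.read()  # what folder2lmdb stores via raw_reader

decoded = np.asarray(Image.open(path).convert("RGB"))  # what a pre-decoded entry would store

print("encoded JPEG :", len(raw_bytes), "bytes")
print("decoded array:", decoded.nbytes, "bytes")  # typically 10-20x larger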

@Lyken17 Hi, I first tried PyTorch 1.10 (CUDA 10.2, Python 3.8) on a single GPU (1080 Ti), and it is too slow; the log is as follows:

Epoch: [0][0/10010]     Time 15.811 (15.811)    Data 14.544 (14.544)    Loss 7.0312 (7.0312)    Acc@1 0.000 (0.000)     Acc@5 0.781 (0.781)
Epoch: [0][10/10010]    Time 0.213 (4.024)      Data 0.000 (3.770)      Loss 7.3495 (7.1619)    Acc@1 0.000 (0.213)     Acc@5 0.000 (0.497)
Epoch: [0][20/10010]    Time 10.217 (4.017)     Data 10.129 (3.817)     Loss 7.2931 (7.2333)    Acc@1 0.000 (0.223)     Acc@5 0.000 (0.595)
Epoch: [0][30/10010]    Time 0.213 (3.740)      Data 0.000 (3.556)      Loss 7.0012 (7.1996)    Acc@1 0.000 (0.176)     Acc@5 0.000 (0.580)
Epoch: [0][40/10010]    Time 8.978 (3.729)      Data 8.890 (3.556)      Loss 7.0080 (7.1619)    Acc@1 0.781 (0.210)     Acc@5 0.781 (0.534)
Epoch: [0][50/10010]    Time 0.220 (3.661)      Data 0.000 (3.492)      Loss 6.9565 (7.1282)    Acc@1 0.000 (0.199)     Acc@5 0.000 (0.597)
Epoch: [0][60/10010]    Time 7.797 (3.635)      Data 7.710 (3.471)      Loss 6.9137 (7.0951)    Acc@1 0.000 (0.218)     Acc@5 0.000 (0.602)
Epoch: [0][70/10010]    Time 0.214 (3.665)      Data 0.000 (3.503)      Loss 6.9065 (7.0728)    Acc@1 0.000 (0.220)     Acc@5 0.000 (0.572)
Epoch: [0][80/10010]    Time 7.347 (3.636)      Data 7.260 (3.477)      Loss 6.8719 (7.0524)    Acc@1 0.000 (0.212)     Acc@5 0.781 (0.637)
Epoch: [0][90/10010]    Time 0.216 (3.590)      Data 0.000 (3.431)      Loss 6.9107 (7.0356)    Acc@1 0.000 (0.206)     Acc@5 0.781 (0.687)
Epoch: [0][100/10010]   Time 9.313 (3.629)      Data 9.219 (3.473)      Loss 6.9006 (7.0217)    Acc@1 0.000 (0.217)     Acc@5 0.000 (0.696)
Epoch: [0][110/10010]   Time 0.212 (3.577)      Data 0.000 (3.421)      Loss 6.8484 (7.0093)    Acc@1 0.000 (0.211)     Acc@5 2.344 (0.739)
Epoch: [0][120/10010]   Time 11.809 (3.600)     Data 11.722 (3.445)     Loss 6.8965 (6.9977)    Acc@1 0.781 (0.213)     Acc@5 1.562 (0.781)
Epoch: [0][130/10010]   Time 0.215 (3.534)      Data 0.000 (3.379)      Loss 6.8403 (6.9883)    Acc@1 0.000 (0.209)     Acc@5 0.000 (0.805)
Epoch: [0][140/10010]   Time 11.093 (3.551)     Data 11.000 (3.400)     Loss 6.9016 (6.9800)    Acc@1 0.000 (0.199)     Acc@5 0.000 (0.803)
Epoch: [0][150/10010]   Time 4.364 (3.523)      Data 4.276 (3.373)      Loss 6.8721 (6.9722)    Acc@1 0.000 (0.191)     Acc@5 0.000 (0.771)
Epoch: [0][160/10010]   Time 9.092 (3.525)      Data 9.004 (3.375)      Loss 6.8635 (6.9640)    Acc@1 0.000 (0.199)     Acc@5 0.781 (0.791)
Epoch: [0][170/10010]   Time 5.724 (3.507)      Data 5.637 (3.359)      Loss 6.8689 (6.9573)    Acc@1 0.000 (0.201)     Acc@5 0.781 (0.777)
Epoch: [0][180/10010]   Time 9.218 (3.506)      Data 9.124 (3.360)      Loss 6.7048 (6.9496)    Acc@1 0.781 (0.207)     Acc@5 3.125 (0.803)
Epoch: [0][190/10010]   Time 3.789 (3.481)      Data 3.700 (3.335)      Loss 6.8398 (6.9441)    Acc@1 0.000 (0.209)     Acc@5 0.000 (0.826)
Epoch: [0][200/10010]   Time 11.521 (3.492)     Data 11.433 (3.347)     Loss 6.8196 (6.9367)    Acc@1 0.000 (0.218)     Acc@5 0.000 (0.875)
Epoch: [0][210/10010]   Time 1.611 (3.465)      Data 1.523 (3.321)      Loss 6.7499 (6.9297)    Acc@1 2.344 (0.233)     Acc@5 2.344 (0.896)
Epoch: [0][220/10010]   Time 11.472 (3.480)     Data 11.383 (3.337)     Loss 6.7838 (6.9230)    Acc@1 0.781 (0.255)     Acc@5 1.562 (0.937)
Epoch: [0][230/10010]   Time 0.212 (3.443)      Data 0.000 (3.299)      Loss 6.8092 (6.9169)    Acc@1 0.000 (0.257)     Acc@5 0.781 (0.944)
Epoch: [0][240/10010]   Time 10.698 (3.472)     Data 10.610 (3.328)     Loss 6.8725 (6.9105)    Acc@1 0.000 (0.253)     Acc@5 0.000 (0.969)
Epoch: [0][250/10010]   Time 0.217 (3.451)      Data 0.000 (3.307)      Loss 6.8506 (6.9055)    Acc@1 0.000 (0.246)     Acc@5 0.000 (0.980)
Epoch: [0][260/10010]   Time 9.317 (3.456)      Data 9.229 (3.312)      Loss 6.7118 (6.9010)    Acc@1 0.000 (0.263)     Acc@5 1.562 (0.988)
Epoch: [0][270/10010]   Time 0.212 (3.439)      Data 0.000 (3.295)      Loss 6.7731 (6.8963)    Acc@1 0.781 (0.277)     Acc@5 1.562 (1.038)
Epoch: [0][280/10010]   Time 11.279 (3.458)     Data 11.191 (3.314)     Loss 6.8488 (6.8909)    Acc@1 0.000 (0.286)     Acc@5 0.781 (1.054)
Epoch: [0][290/10010]   Time 0.214 (3.436)      Data 0.000 (3.292)      Loss 6.7565 (6.8860)    Acc@1 0.000 (0.290)     Acc@5 0.781 (1.079)
Epoch: [0][300/10010]   Time 12.405 (3.458)     Data 12.317 (3.313)     Loss 6.7233 (6.8805)    Acc@1 0.000 (0.298)     Acc@5 1.562 (1.121)
Epoch: [0][310/10010]   Time 0.213 (3.426)      Data 0.000 (3.282)      Loss 6.7484 (6.8755)    Acc@1 0.000 (0.306)     Acc@5 2.344 (1.156)
Epoch: [0][320/10010]   Time 13.653 (3.442)     Data 13.559 (3.298)     Loss 6.7439 (6.8712)    Acc@1 0.000 (0.309)     Acc@5 1.562 (1.173)
Epoch: [0][330/10010]   Time 0.212 (3.418)      Data 0.000 (3.273)      Loss 6.7267 (6.8670)    Acc@1 0.781 (0.314)     Acc@5 2.344 (1.204)
Epoch: [0][340/10010]   Time 13.209 (3.435)     Data 13.121 (3.289)     Loss 6.7553 (6.8636)    Acc@1 1.562 (0.314)     Acc@5 1.562 (1.210)
Epoch: [0][350/10010]   Time 0.218 (3.412)      Data 0.000 (3.265)      Loss 6.6885 (6.8588)    Acc@1 0.000 (0.318)     Acc@5 3.125 (1.249)
Epoch: [0][360/10010]   Time 12.825 (3.427)     Data 12.731 (3.280)     Loss 6.6241 (6.8540)    Acc@1 0.000 (0.316)     Acc@5 2.344 (1.260)
Epoch: [0][370/10010]   Time 0.213 (3.408)      Data 0.000 (3.260)      Loss 6.8046 (6.8504)    Acc@1 0.000 (0.322)     Acc@5 1.562 (1.289)
Epoch: [0][380/10010]   Time 11.702 (3.415)     Data 11.615 (3.267)     Loss 6.7234 (6.8459)    Acc@1 1.562 (0.334)     Acc@5 2.344 (1.312)
Epoch: [0][390/10010]   Time 0.218 (3.399)      Data 0.000 (3.249)      Loss 6.7012 (6.8410)    Acc@1 0.000 (0.342)     Acc@5 2.344 (1.343)
Epoch: [0][400/10010]   Time 12.231 (3.413)     Data 12.144 (3.264)     Loss 6.7159 (6.8370)    Acc@1 0.000 (0.343)     Acc@5 1.562 (1.356)
Epoch: [0][410/10010]   Time 0.213 (3.396)      Data 0.000 (3.245)      Loss 6.5088 (6.8320)    Acc@1 0.000 (0.348)     Acc@5 3.125 (1.382)
Epoch: [0][420/10010]   Time 12.972 (3.407)     Data 12.883 (3.256)     Loss 6.6504 (6.8275)    Acc@1 0.781 (0.349)     Acc@5 4.688 (1.403)
Epoch: [0][430/10010]   Time 0.212 (3.393)      Data 0.000 (3.242)      Loss 6.6490 (6.8246)    Acc@1 0.000 (0.352)     Acc@5 3.906 (1.434)
Epoch: [0][440/10010]   Time 11.984 (3.406)     Data 11.896 (3.255)     Loss 6.7207 (6.8209)    Acc@1 0.781 (0.358)     Acc@5 1.562 (1.465)
Epoch: [0][450/10010]   Time 0.212 (3.387)      Data 0.000 (3.235)      Loss 6.5495 (6.8161)    Acc@1 0.000 (0.357)     Acc@5 0.000 (1.483)
Epoch: [0][460/10010]   Time 11.841 (3.396)     Data 11.748 (3.244)     Loss 6.6327 (6.8123)    Acc@1 0.781 (0.364)     Acc@5 4.688 (1.527)
Epoch: [0][470/10010]   Time 0.212 (3.383)      Data 0.000 (3.231)      Loss 6.5489 (6.8081)    Acc@1 0.781 (0.370)     Acc@5 7.031 (1.558)
Epoch: [0][480/10010]   Time 8.418 (3.389)      Data 8.331 (3.237)      Loss 6.6245 (6.8034)    Acc@1 0.781 (0.377)     Acc@5 1.562 (1.569)
Epoch: [0][490/10010]   Time 0.211 (3.388)      Data 0.000 (3.237)      Loss 6.6849 (6.7994)    Acc@1 1.562 (0.380)     Acc@5 2.344 (1.593)
Epoch: [0][500/10010]   Time 6.984 (3.388)      Data 6.890 (3.237)      Loss 6.4890 (6.7949)    Acc@1 0.781 (0.379)     Acc@5 3.906 (1.616)
Epoch: [0][510/10010]   Time 0.212 (3.391)      Data 0.000 (3.239)      Loss 6.6416 (6.7910)    Acc@1 0.781 (0.382)     Acc@5 2.344 (1.642)
Epoch: [0][520/10010]   Time 2.660 (3.382)      Data 2.572 (3.231)      Loss 6.5715 (6.7870)    Acc@1 0.781 (0.385)     Acc@5 1.562 (1.660)
Epoch: [0][530/10010]   Time 0.212 (3.388)      Data 0.000 (3.236)      Loss 6.5645 (6.7825)    Acc@1 0.781 (0.393)     Acc@5 2.344 (1.680)
Epoch: [0][540/10010]   Time 1.908 (3.379)      Data 1.820 (3.228)      Loss 6.4077 (6.7779)    Acc@1 2.344 (0.394)     Acc@5 3.906 (1.692)
Epoch: [0][550/10010]   Time 0.213 (3.381)      Data 0.000 (3.230)      Loss 6.5599 (6.7736)    Acc@1 0.000 (0.397)     Acc@5 0.781 (1.704)
Epoch: [0][560/10010]   Time 0.856 (3.369)      Data 0.768 (3.218)      Loss 6.6386 (6.7695)    Acc@1 0.781 (0.401)     Acc@5 1.562 (1.732)
Epoch: [0][570/10010]   Time 0.229 (3.377)      Data 0.000 (3.226)      Loss 6.5827 (6.7652)    Acc@1 0.781 (0.409)     Acc@5 3.125 (1.760)
Epoch: [0][580/10010]   Time 0.975 (3.364)      Data 0.887 (3.213)      Loss 6.4518 (6.7610)    Acc@1 0.781 (0.413)     Acc@5 5.469 (1.779)
Epoch: [0][590/10010]   Time 0.212 (3.370)      Data 0.000 (3.219)      Loss 6.5656 (6.7565)    Acc@1 0.000 (0.428)     Acc@5 2.344 (1.823)
Epoch: [0][600/10010]   Time 0.212 (3.355)      Data 0.046 (3.203)      Loss 6.4239 (6.7520)    Acc@1 0.781 (0.437)     Acc@5 3.125 (1.851)
Epoch: [0][610/10010]   Time 0.211 (3.363)      Data 0.000 (3.212)      Loss 6.3226 (6.7474)    Acc@1 1.562 (0.445)     Acc@5 6.250 (1.880)
Epoch: [0][620/10010]   Time 0.214 (3.350)      Data 0.000 (3.198)      Loss 6.5112 (6.7432)    Acc@1 1.562 (0.452)     Acc@5 5.469 (1.906)
Epoch: [0][630/10010]   Time 0.226 (3.354)      Data 0.000 (3.201)      Loss 6.4474 (6.7382)    Acc@1 0.781 (0.458)     Acc@5 3.125 (1.946)
Epoch: [0][640/10010]   Time 0.211 (3.341)      Data 0.000 (3.188)      Loss 6.5718 (6.7347)    Acc@1 0.781 (0.463)     Acc@5 3.906 (1.967)
Epoch: [0][650/10010]   Time 0.214 (3.347)      Data 0.000 (3.194)      Loss 6.5053 (6.7297)    Acc@1 0.781 (0.472)     Acc@5 1.562 (2.008)
Epoch: [0][660/10010]   Time 0.212 (3.343)      Data 0.000 (3.189)      Loss 6.3718 (6.7246)    Acc@1 0.781 (0.482)     Acc@5 3.906 (2.044)
Epoch: [0][670/10010]   Time 0.223 (3.366)      Data 0.000 (3.212)      Loss 6.3855 (6.7196)    Acc@1 0.781 (0.496)     Acc@5 3.906 (2.095)
Epoch: [0][680/10010]   Time 0.212 (3.358)      Data 0.000 (3.204)      Loss 6.5520 (6.7149)    Acc@1 0.781 (0.507)     Acc@5 3.906 (2.129)
Epoch: [0][690/10010]   Time 0.212 (3.370)      Data 0.000 (3.216)      Loss 6.3960 (6.7098)    Acc@1 2.344 (0.510)     Acc@5 7.031 (2.156)
Epoch: [0][700/10010]   Time 0.214 (3.360)      Data 0.000 (3.205)      Loss 6.4797 (6.7055)    Acc@1 0.781 (0.519)     Acc@5 2.344 (2.190)
Epoch: [0][710/10010]   Time 0.227 (3.368)      Data 0.000 (3.212)      Loss 6.3497 (6.7008)    Acc@1 3.125 (0.531)     Acc@5 4.688 (2.217)
Epoch: [0][720/10010]   Time 0.213 (3.358)      Data 0.000 (3.203)      Loss 6.3555 (6.6961)    Acc@1 2.344 (0.543)     Acc@5 6.250 (2.256)
Epoch: [0][730/10010]   Time 0.207 (3.376)      Data 0.000 (3.220)      Loss 6.5028 (6.6923)    Acc@1 0.000 (0.544)     Acc@5 2.344 (2.267)
Epoch: [0][740/10010]   Time 0.210 (3.365)      Data 0.000 (3.209)      Loss 6.2173 (6.6880)    Acc@1 2.344 (0.557)     Acc@5 5.469 (2.313)
Epoch: [0][750/10010]   Time 0.209 (3.372)      Data 0.000 (3.215)      Loss 6.5205 (6.6841)    Acc@1 0.000 (0.564)     Acc@5 2.344 (2.335)
Epoch: [0][760/10010]   Time 0.209 (3.359)      Data 0.000 (3.202)      Loss 6.2149 (6.6788)    Acc@1 1.562 (0.571)     Acc@5 6.250 (2.367)
Epoch: [0][770/10010]   Time 0.209 (3.364)      Data 0.000 (3.207)      Loss 6.4612 (6.6749)    Acc@1 1.562 (0.586)     Acc@5 3.906 (2.403)
Epoch: [0][780/10010]   Time 0.208 (3.353)      Data 0.000 (3.196)      Loss 6.3526 (6.6705)    Acc@1 0.000 (0.598)     Acc@5 3.906 (2.439)
Epoch: [0][790/10010]   Time 0.210 (3.359)      Data 0.000 (3.202)      Loss 6.2106 (6.6650)    Acc@1 0.781 (0.607)     Acc@5 3.906 (2.469)
Epoch: [0][800/10010]   Time 0.209 (3.353)      Data 0.000 (3.195)      Loss 6.1517 (6.6601)    Acc@1 3.906 (0.615)     Acc@5 8.594 (2.503)

After I saw this issue, I tried installing PyTorch 0.4.1 (Python 3.6, CUDA 9.0) to rerun the code, but I got the following error message:

main.py:87: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
  warnings.warn('You have chosen a specific GPU. This will completely '
=> creating model 'resnet18'
Traceback (most recent call last):
  File "main.py", line 344, in <module>
    main()
  File "main.py", line 152, in main
    normalize,
  File "/home/sirius/document/siriusShare/Clustering-Face/arcface-pytorch-master/code/Efficient-PyTorch-master/tools/folder2lmdb.py", line 31, in __init__
    self.length =pa.deserialize(txn.get(b'__len__'))
  File "pyarrow/serialization.pxi", line 458, in pyarrow.lib.deserialize
  File "pyarrow/serialization.pxi", line 420, in pyarrow.lib.deserialize_from
  File "pyarrow/serialization.pxi", line 397, in pyarrow.lib.read_serialized
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Cannot read a negative number of bytes from BufferReader.

Hi, can you tell us which Python, CUDA, PyTorch, and pyarrow versions you were using? Thanks very much for your help. (I've spent weeks trying to solve this problem; I tried HDF5 and DALI before, but they did not solve it. Even the official ImageNet classification training shows GPU utilization oscillating 100%, 0%, 100%, 0%, 100%, 0%, ...)

lizhenstat · Nov 05 '21
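
For what it's worth, pyarrow.serialize / pyarrow.deserialize were deprecated in newer pyarrow releases and later removed, so version mismatches around them are a common source of trouble. A pickle-based drop-in is a frequently used replacement when rebuilding the LMDB; a hedged sketch of such helpers (not the repo's code, and the database must be regenerated with whatever serializer the reader uses):

import pickle

def dumps_data(obj):
    # Possible replacement for dumps_pyarrow when writing the LMDB.
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

def loads_data(buf):
    # Possible replacement for pa.deserialize when reading the LMDB.
    return pickle.loads(buf)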