LaTeX-OCR
compatible with `torch.utils.data.DataLoader`
This PR should have been submitted 5 hours ago, but something weird happened.
```python
for im in ims:
    print(im, self.img_list.index(im), len([_ for _ in self.img_list if _ == im]))
```
When I tried to print the image paths to verify that my modification is correct, I found the same path printed 2-4 times in the log file. So I thought there might be something wrong with my code and tried to figure it out.
```
dataset/data/val/0097787.png 5175 1 7762 -2807379882385328485
...
dataset/data/val/0097787.png 5175 1 7762 -2807379882385328485
...
dataset/data/val/0097787.png 5175 1 7762 -2807379882385328485
```
It turns out the reason is that each subprocess executes the `__iter__(self)` method and therefore re-shuffles the dataset again:
https://github.com/lukas-blecher/LaTeX-OCR/blob/9f974c9c855535e1d1435236f7367cbefe6bbdda/pix2tex/dataset/dataset.py#L89-L99
One image path gets shuffled multiple times, ends up at a different position each time, and is then sliced into a different subprocess by

```python
self.pairs = self.pairs[self.start:self.end]
```
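To illustrate the mechanism, here is a toy sketch (made-up numbers, not the repo's code): if every worker independently shuffles the full list and then takes its own slice, the slices overlap, so some samples are served several times and others not at all.

```python
import random

pairs = list(range(10))            # stand-in for self.pairs
num_workers, per_worker = 2, 5     # two workers, each meant to serve half

served = []
for worker_id in range(num_workers):
    local = pairs[:]               # every worker starts from the same full list
    random.shuffle(local)          # ...and shuffles it independently
    start = worker_id * per_worker
    served += local[start:start + per_worker]

print(sorted(served))  # usually contains duplicates and misses some of 0..9
```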
I also tried setting the shuffle parameter on the DataLoader, but got the error below.
```
Traceback (most recent call last):
  File "../test/test_dataset.py", line 27, in <module>
    dataset, num_workers=8, worker_init_fn=worker_init_fn, shuffle=True)
  File "/home/yhtao/anaconda3/envs/latex_ocr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 236, in __init__
    "shuffle option, but got shuffle={}".format(shuffle))
ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True
```
I think the only way to shuffle every epoch is not to shuffle the whole dataset up front, but to shuffle `self.pairs` inside each subprocess:
```python
self.pairs = self.pairs[self.start:self.end]
if self.shuffle:
    self.pairs = np.random.permutation(np.array(self.pairs, dtype=object))
else:
    self.pairs = np.array(self.pairs, dtype=object)
self.size = len(self.pairs)
```
Feel free to modify any inappropriate part. ^_^
Here is the test code.
```python
import time
import torch
from pix2tex.dataset.dataset import Im2LatexDataset, worker_init_fn
from pix2tex.utils import *
from tqdm import tqdm

im2latex = Im2LatexDataset()
dataset = im2latex.load(
    "/mnt/d/git_repository/LaTeX-OCR/pix2tex/dataset/data/val.pkl")
dataset.update(batchsize=10, test=True, shuffle=True)
# pbar = tqdm(enumerate(iter(dataset)), total=len(dataset))
dataloader = torch.utils.data.DataLoader(
    dataset, num_workers=8, worker_init_fn=worker_init_fn)

if __name__ == '__main__':
    start_time = time.time()
    for e in range(0, 2):
        pbar = tqdm(enumerate(iter(dataloader)), total=len(dataloader))
        for i, (seq, im) in pbar:
            if seq is None or im is None:
                continue
        print("=" * 50)
    end_time = time.time()
    # print with 3 decimal places
    print("Time taken: %.3f seconds" % (end_time - start_time))
```
Found a solution for the shuffling problem. Not perfect, but I can't think of anything better. Also moved the worker init into the `__iter__` function.
Test:
```python
import time
import numpy as np
from pix2tex.dataset.dataset import Im2LatexDataset
from pix2tex.utils import *
from tqdm import tqdm
from torch.utils.data import DataLoader
from hashlib import blake2b

dataset = Im2LatexDataset().load("/mnt/d/git_repository/LaTeX-OCR/pix2tex/dataset/data/val.pkl")
dataset.update(batchsize=10, test=True, shuffle=True)
dataloader = DataLoader(dataset, batch_size=None, num_workers=4)


def tostr(tokens):
    return post_process(token2str(tokens, dataset.tokenizer)[0])


if __name__ == '__main__':
    start_time = time.time()
    for e in range(0, 2):
        seqhashs = []
        pbar = tqdm(enumerate(iter(dataloader)), total=len(dataloader))
        for i, (seq, im) in pbar:
            seqhashs.extend([blake2b(bytes(tostr(s), encoding='utf-8'),
                                     digest_size=8).hexdigest() for s in seq['input_ids']])
        dataset._shuffle()  # is necessary now for shuffling; not perfect
        print('Only unique samples?', len(seqhashs) == np.unique(seqhashs).shape[0])
        print("=" * 50)
    end_time = time.time()
    # print with 3 decimal places
    print("Time taken: %.3f seconds" % (end_time - start_time))
```
I think that's good enough. Writing a class that wraps `torch.utils.data.DataLoader` and calls `dataset._shuffle()` internally would also be an option, but it adds quite a bit of redundant code, so I'm not sure. Writing a whole class just to solve the shuffling may be extravagant.
One small thing I want to point out about your test code is that `seq` can be repeated, because the handwritten dataset has many different writing styles mapping to the same equation.
Yes, you're right. We can write a custom Dataloader class as well that could handle most things internally.
> One small thing I want to point out about your test code is that `seq` can be repeated, because the handwritten dataset has many different writing styles mapping to the same equation.
Thanks for pointing that out. I only wanted to check that I made no mistake while coding and that all samples are being served.
Can't wait to use the new data loader. Do we merge it directly, or write a custom Dataloader class? I could do that as programming practice.
I would inherit from the torch DataLoader and overwrite the `__iter__` and later the `__init__` functions to match the functionality of the parent:
- in `__iter__`: call the dataset's shuffle
- in `__init__`: set the dataset batchsize and shuffle, then call super

Something like that. I will quickly test it, but I don't have much time the next couple of days :)
like this
```python
class Dataloader(DataLoader):
    def __init__(self, dataset: Im2LatexDataset, batch_size=1, shuffle=False, *args, **kwargs):
        self.dataset = dataset
        self.dataset.update(batchsize=batch_size, shuffle=shuffle)
        super().__init__(self.dataset, *args, shuffle=False, batch_size=None, **kwargs)

    def __iter__(self):
        self.dataset._shuffle()
        return super().__iter__()
```
That way we can call the data loader similar to the vanilla torch DataLoader:
```python
dataset = Im2LatexDataset().load(path)
dataloader = Dataloader(dataset, batch_size=10, shuffle=True, num_workers=4)
```
Feel free to go from here.
Next steps would be to add `num_workers` to the config file (default 0 is probably best) and provide it in case it is not present in the config, i.e. `args.num_workers = args.get('num_workers', 0)` in `utils.parse_args`; a rough sketch of that follows below.
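A hedged sketch of where that default could live (this is not the repo's actual `parse_args`; it only assumes `args` supports both attribute and dict-style access, e.g. a Munch):

```python
def parse_args(args):
    # ... existing argument handling ...
    # fall back to 0 worker processes when the config file does not specify it
    args.num_workers = args.get('num_workers', 0)
    return args
```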
Then replace the old dataloader in train and eval.
Also check if `pin_memory` is working and faster.
Got it
GPU6, new version:

```yaml
gpu_devices: [6]  # [0,1,2,3,4,5,6,7]
num_workers: 20
...
```

GPU7, previous version:

```yaml
gpu_devices: [7]  # [0,1,2,3,4,5,6,7]
...
```

Otherwise the same config.
Training time comparison. GPU6 group:

```
BLEU: 0.044, ED: 3.46e+00, ACC: 0.035:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.176, ED: 1.26e+00, ACC: 0.119:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.280, ED: 6.88e-01, ACC: 0.208:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.370, ED: 6.28e-01, ACC: 0.224:   0%|          | 0/301 [00:00<?, ?it/s]
Loss: 0.3471:  60%|██████    | 4648/7767 [08:54<04:58, 10.44it/s]
```
GPU7 group:

```
BLEU: 0.017, ED: 7.88e+00, ACC: 0.012:   0%|          | 0/389 [00:08<?, ?it/s]
BLEU: 0.169, ED: 1.08e+00, ACC: 0.059:   0%|          | 0/389 [00:07<?, ?it/s]
Loss: 0.6331:  30%|███       | 2396/7891 [09:34<19:28, 4.70it/s]
```
Thanks for your code review. I will respond after the test is over; I only have the last `pin_memory` comparison left.
> Also check if `pin_memory` is working and faster.
`pin_memory` does not bring any additional speedup in this situation.
```python
class Dataloader(DataLoader):
    def __init__(self, dataset: Im2LatexDataset, shuffle=False, *args, **kwargs):
        self.dataset = dataset
        self.tokenizer = dataset.tokenizer
        self.dataset.update(shuffle=shuffle, *args, **kwargs)
        # to verify pin_memory is passed through correctly
        print("Here is the Dataloader class", kwargs.get('pin_memory', False))
        super().__init__(self.dataset,
                         *args,
                         shuffle=False,
                         batch_size=None,
                         num_workers=kwargs.get('num_workers', 0),
                         pin_memory=kwargs.get('pin_memory', False))
```
Shell output:

```
Here is the Dataloader class True
Here is the Dataloader class True
BLEU: 0.044, ED: 3.46e+00, ACC: 0.035:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.176, ED: 1.26e+00, ACC: 0.119:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.280, ED: 6.88e-01, ACC: 0.208:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.370, ED: 6.28e-01, ACC: 0.224:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.433, ED: 6.02e-01, ACC: 0.267:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.517, ED: 4.16e-01, ACC: 0.280:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.552, ED: 4.09e-01, ACC: 0.341:   0%|          | 0/301 [00:01<?, ?it/s]
Loss: 0.2323: 100%|█████████▉| 7752/7767 [14:17<00:01, 9.04it/s]
BLEU: 0.575, ED: 4.12e-01, ACC: 0.335:   3%|▎         | 10/301 [00:20<10:04, 2.08s/it]
Loss: 0.2353:  16%|█▌        | 1232/7767 [02:46<13:28, 8.09it/s]
```
I've implemented it the way I was talking about. Feel free to comment.
- eval needs the tokenizer: https://github.com/lukas-blecher/LaTeX-OCR/blob/d0251fa3ba130e4a29faf4642214fe9f86e60573/pix2tex/eval.py#L55-L56
- `pin_memory` should work in theory: with `pin_memory=False` the data sits in ordinary pageable memory, while with `pin_memory=True` it is kept in pinned (page-locked) memory pages. I'm not sure why it doesn't help in my scenario; maybe some further configuration is needed.
- Thanks for catching it.
- Yes, probably some more code is needed. See https://pytorch.org/docs/stable/data.html#memory-pinning
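A hedged sketch of the approach those docs describe, applied to this kind of batch (names like `Pix2texBatch` are illustrative, not the PR's code): batches that are not plain tensors need a `pin_memory()` method so the DataLoader can pin them when `pin_memory=True`.

```python
class Pix2texBatch:
    """Hypothetical wrapper around the (seq, im) pair yielded by the dataset."""

    def __init__(self, seq, im):
        self.seq = seq  # dict of tensors, e.g. {'input_ids': ..., 'attention_mask': ...}
        self.im = im    # batched image tensor

    def pin_memory(self):
        # called by the DataLoader on each batch when pin_memory=True
        self.seq = {k: v.pin_memory() for k, v in self.seq.items()}
        self.im = self.im.pin_memory()
        return self
```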
The gray curve is the new version. All configs are the same. This is kind of unexpected: the purple curve achieves a higher BLEU at epoch 5, but the gray one gets a lower BLEU even at epoch 20.
That's weird. Something has to be off here; it should perform very similarly. I don't expect the exact same curves as before, even with the same seed, but this is not a good sign. Is the other run from before, or with `num_workers=0`? Was the grey line at least faster? Can you set the x axis to wall time for comparison?
The purple run is based on https://github.com/lukas-blecher/LaTeX-OCR/commit/dbf75d97f5d256ad3eae130ea6688bf2396df18d. Here is the wall-time comparison.
~~I thought the team workspace would share the training logs automatically, but it's blank. I am trying to find a way to share the previous log.~~
All runs have been moved to the team workspace, feel free to compare.
The new version is roughly 2 epochs further along than the previous one at the same wall time.
So none of them are with the new dataloader setup?
> So none of them are with the new dataloader setup?
The grey one is with `num_workers=35`, the purple one without. That's the only difference between their configs.
The test data in https://github.com/lukas-blecher/LaTeX-OCR/pull/154#issuecomment-1135895560 is with `dim=256`; in that case GPU utilization averages around 20%. In the other comparison, https://github.com/lukas-blecher/LaTeX-OCR/pull/154#issuecomment-1136795322, `dim=756` and GPU utilization is approximately 60%. In the first group, the new version is almost 2x faster than the previous one. In the second group, the gap doesn't reach 2x, but is about 2 epochs. That's my understanding.
I don't have spare GPUs to reproduce the phenomenon these days. After this run finishes, I will try the latest commit in this PR again.
> So none of them are with the new dataloader setup?
>
> The grey one is with `num_workers=35`, the purple one without. That's the only difference between their configs.
But the grey run is also based on dbf75d97f5d256ad3eae130ea6688bf2396df18d, so the `num_workers` argument doesn't change anything because it is not implemented yet.
> The test data in #154 (comment) is with `dim=256`; in that case GPU utilization averages around 20%. In the other comparison, #154 (comment), `dim=756` and GPU utilization is approximately 60%. In the first group, the new version is almost 2x faster than the previous one. In the second group, the gap doesn't reach 2x, but is about 2 epochs.
>
> That's my understanding.
I can't follow your calculation
> The purple run is based on dbf75d9.
>
> But the grey run is also based on https://github.com/lukas-blecher/LaTeX-OCR/commit/dbf75d97f5d256ad3eae130ea6688bf2396df18d so the num_workers argument doesn't change anything because it is not implemented yet.
The purple one is https://github.com/lukas-blecher/LaTeX-OCR/commit/dbf75d97f5d256ad3eae130ea6688bf2396df18d; the grey one is based on 9c6f96e538b49382d6e9f79c695a586bdcff6ddf (with only a small modification regarding the tokenizer and the args/kwargs handling). The grey one's code is the same as the GPU6 new version in https://github.com/lukas-blecher/LaTeX-OCR/pull/154#issuecomment-1135895560.
I will verify the latest commit in this PR and try to reproduce it again, so we can tell whether this is an error in my local code or a real phenomenon.
> I can't follow your calculation
The calculation is not rigorous and may be wrong; I just wanted to try to explain the speedup gap. Hypothetically, `num_workers=35` means up to a 35x speedup of the data loading. When the run spends most of its time on the CPU, speeding that part up 35x has an obvious effect. When the run spends most of its time on the GPU and only a small portion on the CPU, a 35x speedup of the CPU part doesn't mean much.
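That reasoning is essentially Amdahl's law; a rough back-of-the-envelope illustration with made-up time fractions (not measured numbers):

```python
def overall_speedup(cpu_fraction, cpu_speedup=35):
    # Amdahl's law: only the CPU-bound (data loading) share of the time benefits
    return 1 / ((1 - cpu_fraction) + cpu_fraction / cpu_speedup)

print(overall_speedup(0.8))  # data loading dominates -> ~4.5x overall
print(overall_speedup(0.2))  # GPU dominates -> ~1.2x overall
```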
pix2tex-vit-5.27.11.00
- num_workers: 20
- hash: b90567d07f9c0225fe49b626ddcbf20731050b12

pix2tex-vit-5.25.14.08
- hash: dbf75d97f5d256ad3eae130ea6688bf2396df18d

pix2tex-vit-5.24.23.09
- num_workers: 35
- hash: 9c6f96e538b49382d6e9f79c695a586bdcff6ddf
Looking at relative time at the 14-hour mark, pix2tex-vit-5.27.11.00 reached epoch 23 while pix2tex-vit-5.25.14.08 reached epoch 16~17. And the val/bleu curves of the sped-up runs trend closely together, so it's not a coincidence.
Well, that is not very good. I don't know what is going wrong here. It has to do with the new dataloader, right?
> pix2tex-vit-5.27.11.00
> - num_workers: 20
> - hash: b90567d
Are you sure you are on the new version?
I'm sure; that hash is just misleading.
The wandb run id is 10uj5emo. Based on that id, I can tell that the conda env name is `latex_ocr`. Furthermore, checking the pix2tex version installed in `latex_ocr` gives the output below.
```
(base) root@x:/yuhang# conda activate latex_ocr
(latex_ocr) root@x:/yuhang# pip list|grep pix2tex
pix2tex    0.0.26    /yuhang/LaTeX-OCR-TITC
(latex_ocr) root@x:/yuhang# cd ../yuhang/LaTeX-OCR-TITC/
(latex_ocr) root@x:/yuhang/LaTeX-OCR-TITC# git rev-parse HEAD
b90567d07f9c0225fe49b626ddcbf20731050b12
```
The reason the logged hash is dbf75d97f5d256ad3eae130ea6688bf2396df18d is that I ran it from another directory, which is a checkout of your upstream project:
```
(latex_ocr) root@x:/yuhang# cd LaTeX-OCR
(latex_ocr) root@x:/yuhang/LaTeX-OCR# git rev-parse HEAD
dbf75d97f5d256ad3eae130ea6688bf2396df18d
(latex_ocr) root@x:/yuhang/LaTeX-OCR# conda activate pix2tex
(pix2tex) root@x:/yuhang/LaTeX-OCR# pip list|grep pix2tex
pix2tex    0.0.26    /yuhang/LaTeX-OCR
```
I have two copies: `LaTeX-OCR-TITC` is my fork and `LaTeX-OCR` is the upstream repo. Although I ran the training from /yuhang/LaTeX-OCR, the conda env is `latex_ocr`, which was installed from /yuhang/LaTeX-OCR-TITC at hash b90567d07f9c0225fe49b626ddcbf20731050b12. That works because I run it as a module with `-m` rather than running train.py directly.
I have a clue and will keep tracking it. The new version only inherits from the parent class (wrapping it, essentially); it shouldn't change the number of batches.
In the older version, one epoch has 2685 steps (batches) in total:

```
Loss: 0.0162:  52%|█████▏    | 1393/2685 [26:36<17:12, 1.25it/s]
```

But the new version has 2560 steps per epoch:

```
Loss: 0.4345:  63%|██████▎   | 1619/2560 [24:44<11:05, 1.41it/s]
```
I don't think that makes sense. The `Im2LatexDataset` is supposed to control everything related to the batch size and the number of batches; the new data loader should only change the number of CPU processes.
~~You can do your own stuff; I will try to figure it out. These days, because of a GPU shortage, I am trying to implement beam search... This PR is also on my to-do list.~~
Also, the edit-distance curve fluctuates much more severely at the beginning.
I'm just curious: why not implement the data-reading steps with `torch.utils.data.Dataset` instead of the custom `Im2LatexDataset`? If `Im2LatexDataset` were implemented on top of `torch.utils.data.Dataset`, then `torch.utils.data.DataLoader` could be used directly.
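For illustration, a minimal sketch of what such a map-style dataset could look like (hypothetical class and field names; the real `Im2LatexDataset` also handles tokenization, padding and size bucketing, which this skips), so that batching, shuffling and workers are left to the default `DataLoader`:

```python
from PIL import Image
from torch.utils.data import Dataset

class LatexOCRMapDataset(Dataset):
    """Hypothetical map-style version: one (equation, image_path) pair per index."""

    def __init__(self, pairs, tokenizer, transform=None):
        self.pairs = pairs            # list of (equation_string, image_path)
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        eq, path = self.pairs[idx]
        im = Image.open(path).convert('RGB')
        if self.transform is not None:
            im = self.transform(im)
        return self.tokenizer(eq), im
```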