LaTeX-OCR
compatible with `torch.utils.data.DataLoader`
This PR should have been submitted 5 hours ago, but something weird happened.
```python
for im in ims:
    print(im, self.img_list.index(im), len([_ for _ in self.img_list if _ == im]))
```
When I tried to print the image paths to verify that my modification is correct, I found the same path printed 2-4 times in the log file. So I thought there might be something wrong with my code and tried to figure it out.
```
dataset/data/val/0097787.png 5175 1 7762 -2807379882385328485
...
dataset/data/val/0097787.png 5175 1 7762 -2807379882385328485
...
dataset/data/val/0097787.png 5175 1 7762 -2807379882385328485
```
It turns out the reason is that each subprocess executes the `__iter__(self)` method and therefore re-shuffles the dataset again:
https://github.com/lukas-blecher/LaTeX-OCR/blob/9f974c9c855535e1d1435236f7367cbefe6bbdda/pix2tex/dataset/dataset.py#L89-L99
One image path gets shuffled multiple times, ends up at a different position each time, and is then sliced into a different subprocess by

```python
self.pairs = self.pairs[self.start:self.end]
```
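To illustrate the mechanism, here is a toy sketch (made-up numbers, not the repo's code): if every worker independently shuffles the full list and then takes its own slice, the slices overlap, so some samples are served several times and others not at all.

```python
import random

pairs = list(range(10))            # stand-in for self.pairs
num_workers, per_worker = 2, 5     # two workers, each meant to serve half

served = []
for worker_id in range(num_workers):
    local = pairs[:]               # every worker starts from the same full list
    random.shuffle(local)          # ...and shuffles it independently
    start = worker_id * per_worker
    served += local[start:start + per_worker]

print(sorted(served))  # usually contains duplicates and misses some of 0..9
```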
I also tried setting the shuffle parameter on the DataLoader, but got the error below.
```
Traceback (most recent call last):
  File "../test/test_dataset.py", line 27, in <module>
    dataset, num_workers=8, worker_init_fn=worker_init_fn, shuffle=True)
  File "/home/yhtao/anaconda3/envs/latex_ocr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 236, in __init__
    "shuffle option, but got shuffle={}".format(shuffle))
ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True
```
I think the only way to shuffle every epoch is not to shuffle the whole dataset up front, but to shuffle `self.pairs` inside each subprocess:
```python
self.pairs = self.pairs[self.start:self.end]
if self.shuffle:
    self.pairs = np.random.permutation(np.array(self.pairs, dtype=object))
else:
    self.pairs = np.array(self.pairs, dtype=object)
self.size = len(self.pairs)
```
Feel free to modify any inappropriate part. ^_^
Here is the test code.
```python
import time
import torch
from pix2tex.dataset.dataset import Im2LatexDataset, worker_init_fn
from pix2tex.utils import *
from tqdm import tqdm

im2latex = Im2LatexDataset()
dataset = im2latex.load(
    "/mnt/d/git_repository/LaTeX-OCR/pix2tex/dataset/data/val.pkl")
dataset.update(batchsize=10, test=True, shuffle=True)
# pbar = tqdm(enumerate(iter(dataset)), total=len(dataset))
dataloader = torch.utils.data.DataLoader(
    dataset, num_workers=8, worker_init_fn=worker_init_fn)

if __name__ == '__main__':
    start_time = time.time()
    for e in range(0, 2):
        pbar = tqdm(enumerate(iter(dataloader)), total=len(dataloader))
        for i, (seq, im) in pbar:
            if seq is None or im is None:
                continue
        print("=" * 50)
    end_time = time.time()
    # print with 3 decimal places
    print("Time taken: %.3f seconds" % (end_time - start_time))
```
Found a solution for the shuffling problem. Not perfect, but I can't think of anything better. Also moved the worker init into the `__iter__` function.
Test:
```python
import time
import numpy as np
from pix2tex.dataset.dataset import Im2LatexDataset
from pix2tex.utils import *
from tqdm import tqdm
from torch.utils.data import DataLoader
from hashlib import blake2b

dataset = Im2LatexDataset().load("/mnt/d/git_repository/LaTeX-OCR/pix2tex/dataset/data/val.pkl")
dataset.update(batchsize=10, test=True, shuffle=True)
dataloader = DataLoader(dataset, batch_size=None, num_workers=4)


def tostr(tokens):
    return post_process(token2str(tokens, dataset.tokenizer)[0])


if __name__ == '__main__':
    start_time = time.time()
    for e in range(0, 2):
        seqhashs = []
        pbar = tqdm(enumerate(iter(dataloader)), total=len(dataloader))
        for i, (seq, im) in pbar:
            seqhashs.extend([blake2b(bytes(tostr(s), encoding='utf-8'),
                                     digest_size=8).hexdigest() for s in seq['input_ids']])
        dataset._shuffle()  # is necessary now for shuffling; not perfect
        print('Only unique samples?', len(seqhashs) == np.unique(seqhashs).shape[0])
        print("=" * 50)
    end_time = time.time()
    # print with 3 decimal places
    print("Time taken: %.3f seconds" % (end_time - start_time))
```
I think that's good enough. Writing a class that wraps `torch.utils.data.DataLoader` and calls `dataset._shuffle()` internally would also be an option, but it adds quite a bit of redundant code, so I'm not sure. Writing a whole class just to solve the shuffling may be extravagant.
One small thing I want to point out about your test code is that `seq` can be repeated, because the handwritten dataset has many different writing styles mapping to the same equation.
Yes, you're right. We can write a custom Dataloader class as well that could handle most things internally.
> One small thing I want to point out about your test code is that `seq` can be repeated, because the handwritten dataset has many different writing styles mapping to the same equation.
Thanks for pointing that out. I only wanted to check that I made no mistake while coding and that all samples are being served.
Can't wait to use the new data loader. Do we merge it directly, or write a custom Dataloader class? I could do that as programming practice.
I would inherit from the torch DataLoader and overwrite the `__iter__` and later the `__init__` functions to match the functionality of the parent:
- in `__iter__`: call the dataset's shuffle
- in `__init__`: set the dataset batchsize and shuffle, then call super

Something like that. I will quickly test it, but I don't have much time the next couple of days :)
like this
```python
class Dataloader(DataLoader):
    def __init__(self, dataset: Im2LatexDataset, batch_size=1, shuffle=False, *args, **kwargs):
        self.dataset = dataset
        self.dataset.update(batchsize=batch_size, shuffle=shuffle)
        super().__init__(self.dataset, *args, shuffle=False, batch_size=None, **kwargs)

    def __iter__(self):
        self.dataset._shuffle()
        return super().__iter__()
```
That way we can call the data loader similar to the vanilla torch DataLoader:
```python
dataset = Im2LatexDataset().load(path)
dataloader = Dataloader(dataset, batch_size=10, shuffle=True, num_workers=4)
```
Feel free to go from here.
Next steps would be to add `num_workers` to the config file (default 0 is probably best) and provide it in case it is not present in the config, i.e. `args.num_workers = args.get('num_workers', 0)` in `utils.parse_args`; a rough sketch of that follows below.
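A hedged sketch of where that default could live (this is not the repo's actual `parse_args`; it only assumes `args` supports both attribute and dict-style access, e.g. a Munch):

```python
def parse_args(args):
    # ... existing argument handling ...
    # fall back to 0 worker processes when the config file does not specify it
    args.num_workers = args.get('num_workers', 0)
    return args
```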
Then replace the old dataloader in train and eval.
Also check if `pin_memory` is working and faster.
Got it
GPU6, new version:

```yaml
gpu_devices: [6]  # [0,1,2,3,4,5,6,7]
num_workers: 20
...
```

GPU7, previous version:

```yaml
gpu_devices: [7]  # [0,1,2,3,4,5,6,7]
...
```

Otherwise the same config.
Training time comparison. GPU6 group:

```
BLEU: 0.044, ED: 3.46e+00, ACC: 0.035:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.176, ED: 1.26e+00, ACC: 0.119:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.280, ED: 6.88e-01, ACC: 0.208:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.370, ED: 6.28e-01, ACC: 0.224:   0%|          | 0/301 [00:00<?, ?it/s]
Loss: 0.3471:  60%|██████    | 4648/7767 [08:54<04:58, 10.44it/s]
```
GPU7 group:

```
BLEU: 0.017, ED: 7.88e+00, ACC: 0.012:   0%|          | 0/389 [00:08<?, ?it/s]
BLEU: 0.169, ED: 1.08e+00, ACC: 0.059:   0%|          | 0/389 [00:07<?, ?it/s]
Loss: 0.6331:  30%|███       | 2396/7891 [09:34<19:28, 4.70it/s]
```
Thanks for your code review. I will respond after the test is over; I only have the last `pin_memory` comparison left.
> Also check if `pin_memory` is working and faster.
`pin_memory` does not bring any additional speedup in this situation.
```python
class Dataloader(DataLoader):
    def __init__(self, dataset: Im2LatexDataset, shuffle=False, *args, **kwargs):
        self.dataset = dataset
        self.tokenizer = dataset.tokenizer
        self.dataset.update(shuffle=shuffle, *args, **kwargs)
        # to verify pin_memory is passed through correctly
        print("Here is the Dataloader class", kwargs.get('pin_memory', False))
        super().__init__(self.dataset,
                         *args,
                         shuffle=False,
                         batch_size=None,
                         num_workers=kwargs.get('num_workers', 0),
                         pin_memory=kwargs.get('pin_memory', False))
```
Shell output:

```
Here is the Dataloader class True
Here is the Dataloader class True
BLEU: 0.044, ED: 3.46e+00, ACC: 0.035:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.176, ED: 1.26e+00, ACC: 0.119:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.280, ED: 6.88e-01, ACC: 0.208:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.370, ED: 6.28e-01, ACC: 0.224:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.433, ED: 6.02e-01, ACC: 0.267:   0%|          | 0/301 [00:01<?, ?it/s]
BLEU: 0.517, ED: 4.16e-01, ACC: 0.280:   0%|          | 0/301 [00:00<?, ?it/s]
BLEU: 0.552, ED: 4.09e-01, ACC: 0.341:   0%|          | 0/301 [00:01<?, ?it/s]
Loss: 0.2323: 100%|█████████▉| 7752/7767 [14:17<00:01, 9.04it/s]
BLEU: 0.575, ED: 4.12e-01, ACC: 0.335:   3%|▎         | 10/301 [00:20<10:04, 2.08s/it]
Loss: 0.2353:  16%|█▌        | 1232/7767 [02:46<13:28, 8.09it/s]
```
I've implemented it the way I was talking about. Feel free to comment.
- eval needs the tokenizer: https://github.com/lukas-blecher/LaTeX-OCR/blob/d0251fa3ba130e4a29faf4642214fe9f86e60573/pix2tex/eval.py#L55-L56
- `pin_memory` should work in theory: with `pin_memory=False` the data sits in ordinary pageable memory, while with `pin_memory=True` it is kept in pinned (page-locked) memory pages. I'm not sure why it doesn't help in my scenario; maybe some further configuration is needed.
- Thanks for catching it.
- Yes, probably some more code is needed. See https://pytorch.org/docs/stable/data.html#memory-pinning
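A hedged sketch of the approach those docs describe, applied to this kind of batch (names like `Pix2texBatch` are illustrative, not the PR's code): batches that are not plain tensors need a `pin_memory()` method so the DataLoader can pin them when `pin_memory=True`.

```python
class Pix2texBatch:
    """Hypothetical wrapper around the (seq, im) pair yielded by the dataset."""

    def __init__(self, seq, im):
        self.seq = seq  # dict of tensors, e.g. {'input_ids': ..., 'attention_mask': ...}
        self.im = im    # batched image tensor

    def pin_memory(self):
        # called by the DataLoader on each batch when pin_memory=True
        self.seq = {k: v.pin_memory() for k, v in self.seq.items()}
        self.im = self.im.pin_memory()
        return self
```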
The gray curve is the new version. All configs are the same. This is kind of unexpected: the purple curve achieves a higher BLEU at epoch 5, but the gray one gets a lower BLEU even at epoch 20.
That's weird. Something has to be off here; it should perform very similarly. I don't expect the exact same curves as before, even with the same seed, but this is not a good sign. Is the other run from before, or with `num_workers=0`? Was the grey line at least faster? Can you set the x axis to wall time for comparison?
The purple run is based on https://github.com/lukas-blecher/LaTeX-OCR/commit/dbf75d97f5d256ad3eae130ea6688bf2396df18d. Here is the wall-time comparison.
~~I thought the team workspace would share the training logs automatically, but it's blank. I am trying to find a way to share the previous log.~~
All runs have been moved to the team workspace, feel free to compare.
The new version is roughly 2 epochs further along than the previous one at the same wall time.
So none of them are with the new dataloader setup?
> So none of them are with the new dataloader setup?
The grey one is with `num_workers=35`, the purple one without. That's the only difference between their configs.
The test data in https://github.com/lukas-blecher/LaTeX-OCR/pull/154#issuecomment-1135895560 is with `dim=256`; in that case GPU utilization averages around 20%. In the other comparison, https://github.com/lukas-blecher/LaTeX-OCR/pull/154#issuecomment-1136795322, `dim=756` and GPU utilization is approximately 60%. In the first group, the new version is almost 2x faster than the previous one. In the second group, the gap doesn't reach 2x, but is about 2 epochs. That's my understanding.
I don't have spare GPUs to reproduce the phenomenon these days. After this run finishes, I will try the latest commit in this PR again.
> So none of them are with the new dataloader setup?
>
> The grey one is with `num_workers=35`, the purple one without. That's the only difference between their configs.
But the grey run is also based on dbf75d97f5d256ad3eae130ea6688bf2396df18d, so the `num_workers` argument doesn't change anything because it is not implemented yet.
> The test data in #154 (comment) is with `dim=256`; in that case GPU utilization averages around 20%. In the other comparison, #154 (comment), `dim=756` and GPU utilization is approximately 60%. In the first group, the new version is almost 2x faster than the previous one. In the second group, the gap doesn't reach 2x, but is about 2 epochs.
>
> That's my understanding.
I can't follow your calculation
> The purple run is based on dbf75d9.
>
> But the grey run is also based on https://github.com/lukas-blecher/LaTeX-OCR/commit/dbf75d97f5d256ad3eae130ea6688bf2396df18d so the num_workers argument doesn't change anything because it is not implemented yet.
The purple one is https://github.com/lukas-blecher/LaTeX-OCR/commit/dbf75d97f5d256ad3eae130ea6688bf2396df18d; the grey one is based on 9c6f96e538b49382d6e9f79c695a586bdcff6ddf (with only a small modification regarding the tokenizer and the args/kwargs handling). The grey one's code is the same as the GPU6 new version in https://github.com/lukas-blecher/LaTeX-OCR/pull/154#issuecomment-1135895560.
I will verify the latest commit in this PR and try to reproduce it again, so we can tell whether this is an error in my local code or a real phenomenon.
> I can't follow your calculation
The calculation is not rigorous and may be wrong; I just wanted to try to explain the speedup gap. Hypothetically, `num_workers=35` means up to a 35x speedup of the data loading. When the run spends most of its time on the CPU, speeding that part up 35x has an obvious effect. When the run spends most of its time on the GPU and only a small portion on the CPU, a 35x speedup of the CPU part doesn't mean much.
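That reasoning is essentially Amdahl's law; a rough back-of-the-envelope illustration with made-up time fractions (not measured numbers):

```python
def overall_speedup(cpu_fraction, cpu_speedup=35):
    # Amdahl's law: only the CPU-bound (data loading) share of the time benefits
    return 1 / ((1 - cpu_fraction) + cpu_fraction / cpu_speedup)

print(overall_speedup(0.8))  # data loading dominates -> ~4.5x overall
print(overall_speedup(0.2))  # GPU dominates -> ~1.2x overall
```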
pix2tex-vit-5.27.11.00
- num_workers: 20
- hash: b90567d07f9c0225fe49b626ddcbf20731050b12

pix2tex-vit-5.25.14.08
- hash: dbf75d97f5d256ad3eae130ea6688bf2396df18d

pix2tex-vit-5.24.23.09
- num_workers: 35
- hash: 9c6f96e538b49382d6e9f79c695a586bdcff6ddf
Looking at relative time at the 14-hour mark, pix2tex-vit-5.27.11.00 reached epoch 23 while pix2tex-vit-5.25.14.08 reached epoch 16~17. And the val/bleu curves of the sped-up runs trend closely together, so it's not a coincidence.
Well, that is not very good. I don't know what is going wrong here. It has to do with the new dataloader, right?
> pix2tex-vit-5.27.11.00
> - num_workers: 20
> - hash: b90567d
Are you sure you are on the new version?
I'm sure; that hash is just misleading.
The wandb run id is 10uj5emo. Based on that id, I can tell that the conda env name is `latex_ocr`. Furthermore, checking the pix2tex version installed in `latex_ocr` gives the output below.
```
(base) root@x:/yuhang# conda activate latex_ocr
(latex_ocr) root@x:/yuhang# pip list|grep pix2tex
pix2tex    0.0.26    /yuhang/LaTeX-OCR-TITC
(latex_ocr) root@x:/yuhang# cd ../yuhang/LaTeX-OCR-TITC/
(latex_ocr) root@x:/yuhang/LaTeX-OCR-TITC# git rev-parse HEAD
b90567d07f9c0225fe49b626ddcbf20731050b12
```
The reason the logged hash is dbf75d97f5d256ad3eae130ea6688bf2396df18d is that I ran it from another directory, which is a checkout of your upstream project:
```
(latex_ocr) root@x:/yuhang# cd LaTeX-OCR
(latex_ocr) root@x:/yuhang/LaTeX-OCR# git rev-parse HEAD
dbf75d97f5d256ad3eae130ea6688bf2396df18d
(latex_ocr) root@x:/yuhang/LaTeX-OCR# conda activate pix2tex
(pix2tex) root@x:/yuhang/LaTeX-OCR# pip list|grep pix2tex
pix2tex    0.0.26    /yuhang/LaTeX-OCR
```
I have two copies: `LaTeX-OCR-TITC` is my fork and `LaTeX-OCR` is the upstream repo. Although I ran the training from /yuhang/LaTeX-OCR, the conda env is `latex_ocr`, which was installed from /yuhang/LaTeX-OCR-TITC at hash b90567d07f9c0225fe49b626ddcbf20731050b12. That works because I run it as a module with `-m` rather than running train.py directly.
I have a clue and will keep tracking it. The new version only inherits from the parent class (wrapping it, essentially); it shouldn't change the number of batches.
In the older version, one epoch has 2685 steps (batches) in total:

```
Loss: 0.0162:  52%|█████▏    | 1393/2685 [26:36<17:12, 1.25it/s]
```

But the new version has 2560 steps per epoch:

```
Loss: 0.4345:  63%|██████▎   | 1619/2560 [24:44<11:05, 1.41it/s]
```
I don't think that makes sense. The `Im2LatexDataset` is supposed to control everything related to the batch size and the number of batches; the new data loader should only change the number of CPU processes.
~~You can do your own stuff; I will try to figure it out. These days, because of a GPU shortage, I am trying to implement beam search... This PR is also on my to-do list.~~
Also, the edit-distance curve fluctuates much more severely at the beginning.
I'm just curious: why not implement the data-reading steps with `torch.utils.data.Dataset` instead of the custom `Im2LatexDataset`? If `Im2LatexDataset` were implemented on top of `torch.utils.data.Dataset`, then `torch.utils.data.DataLoader` could be used directly.
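For illustration, a minimal sketch of what such a map-style dataset could look like (hypothetical class and field names; the real `Im2LatexDataset` also handles tokenization, padding and size bucketing, which this skips), so that batching, shuffling and workers are left to the default `DataLoader`:

```python
from PIL import Image
from torch.utils.data import Dataset

class LatexOCRMapDataset(Dataset):
    """Hypothetical map-style version: one (equation, image_path) pair per index."""

    def __init__(self, pairs, tokenizer, transform=None):
        self.pairs = pairs            # list of (equation_string, image_path)
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        eq, path = self.pairs[idx]
        im = Image.open(path).convert('RGB')
        if self.transform is not None:
            im = self.transform(im)
        return self.tokenizer(eq), im
```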