
Generate rendered text in PyTorch dataset

Open aleSuglia opened this issue 2 years ago • 1 comment

Hi there,

I would like to use your Python API directly in my PyTorch dataset to render text from a dataset of strings. Unfortunately, I cannot keep my dataset in memory, so I am not able to use the GeneratorFromStrings class. An easy solution would be to instantiate GeneratorFromStrings inside __getitem__ every time I want to generate a new image. However, I suspect this would be very costly because the code would have to load the font again every time. Any suggestions?

aleSuglia, Jun 08 '22 13:06

Your PyTorch dataset could hold a single GeneratorFromStrings object, and you could implement lazy loading for the strings in your dataset.

Since your dataset is only strings, you could most likely split it into smaller files with split -l 20000 filename and load only the file that contains the sample you need.
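A minimal sketch of that idea, assuming the shard files produced by split hold one string per line. The names ShardedLines, RenderedTextDataset and render are hypothetical and made up for illustration; render stands in for whatever long-lived object does the drawing (for example a wrapper around a generator created once), and how to feed it one string at a time is not something this thread confirms. In a real project the dataset class would subclass torch.utils.data.Dataset, which only needs __len__ and __getitem__.

```python
import bisect
from typing import Any, Callable, List, Optional, Sequence, Tuple


class ShardedLines:
    """Random access to lines spread over several text files.

    Only one shard is held in memory at a time (hypothetical helper).
    """

    def __init__(self, shard_paths: Sequence[str]) -> None:
        self.shard_paths = list(shard_paths)
        # Count lines once per shard; only the cumulative counts are kept.
        self.offsets: List[int] = [0]
        for path in self.shard_paths:
            with open(path, encoding="utf-8") as f:
                count = sum(1 for _ in f)
            self.offsets.append(self.offsets[-1] + count)
        self._cached_shard: Optional[int] = None
        self._cached_lines: List[str] = []

    def __len__(self) -> int:
        return self.offsets[-1]

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the shard whose line range contains this index.
        shard = bisect.bisect_right(self.offsets, index) - 1
        if shard != self._cached_shard:
            # Lazy load: read the shard only when it is first needed.
            with open(self.shard_paths[shard], encoding="utf-8") as f:
                self._cached_lines = [line.rstrip("\n") for line in f]
            self._cached_shard = shard
        return self._cached_lines[index - self.offsets[shard]]


class RenderedTextDataset:
    """Map-style dataset: returns (rendered item, source string)."""

    def __init__(self, lines: ShardedLines, render: Callable[[str], Any]) -> None:
        self.lines = lines
        # Assumption: render is built once (fonts loaded once) and reused.
        self.render = render

    def __len__(self) -> int:
        return len(self.lines)

    def __getitem__(self, index: int) -> Tuple[Any, str]:
        text = self.lines[index]
        return self.render(text), text
```

Because only one shard is cached, fully random index order would reload a file on almost every access. Reading indices in order, or shuffling only within a shard, keeps the reloads rare.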

Just an idea. I don't think I have enough visibility into your setup to really help, but you are most likely right that instantiating GeneratorFromStrings every time would be costly.

Belval, Jun 27 '22 02:06