TextRecognitionDataGenerator
Generator is very slow
I am trying to generate images using generate-from-strings, since I have a list of strings to generate from. The issue is that it's very slow, producing only about one image per second.
That is extremely slow, can you post the command that you used?
Also, what hardware are you using?
Here's the script I am using to generate data. I am running on a very powerful cloud machine with 6 CPU cores and around 50 GB of RAM: https://github.com/Mohamed209/TextRecognitionDataGenerator/blob/receipts_ocr/generate_training_lines.py
I'll try and reproduce the issue on my side, I'll report back soon.
Okay so quickly:
- Add multiprocessing
- Properly time the different parts of the script, as not everything is the call to TRDG. You can run py-spy to see function calls and see what is taking time; a minimal timing sketch is shown below.
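For instance, a minimal timing sketch for splitting generation time from saving time, assuming a TRDG generator iterable named `generator` and a `save_lines` helper as in your script (both names are placeholders):

```python
import time

gen_time, save_time = 0.0, 0.0
it = iter(generator)
for _ in range(100):
    t0 = time.perf_counter()
    img, lbl = next(it)   # time spent inside TRDG rendering
    t1 = time.perf_counter()
    save_lines(img, lbl)  # time spent writing to disk
    t2 = time.perf_counter()
    gen_time += t1 - t0
    save_time += t2 - t1

print(f"TRDG: {gen_time:.2f}s, saving: {save_time:.2f}s for 100 images")
```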
As a workaround I used a parallel processing technique to boost generation. I gained more speed, but it's still not optimal: at this new rate, generating my dataset of around half a million images would take about 10 to 12 hours.
```python
from joblib import Parallel, delayed
from tqdm import tqdm

if __name__ == "__main__":
    print("started generating arabic lines :)")
    Parallel(n_jobs=-1)(delayed(save_lines)(img, lbl)
                        for img, lbl in tqdm(mixed_generator))
    print("started generating english lines :)")
    Parallel(n_jobs=-1)(delayed(save_lines)(img, lbl)
                        for img, lbl in tqdm(english_generator))
```
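Note that in this pattern the generator is still consumed sequentially in the parent process, so the image rendering itself is not parallelized; only `save_lines` runs in the workers. A sketch that parallelizes generation too, assuming a hypothetical `make_generator(strings)` factory that builds a TRDG generator over a subset of the strings:

```python
from joblib import Parallel, delayed

def generate_chunk(strings_chunk):
    # Each worker builds its own generator, so the image
    # rendering itself runs in parallel across processes.
    for img, lbl in make_generator(strings_chunk):
        save_lines(img, lbl)

n_jobs = 8  # assumption: tune to your core count
chunks = [strings[i::n_jobs] for i in range(n_jobs)]
Parallel(n_jobs=n_jobs)(delayed(generate_chunk)(c) for c in chunks)
```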
Your initial comment was removed/edited away. But if you really generate images with 40 words/image, a speed of 11-13 imgs/sec is not that bad.
I can try to see if there are low-hanging fruits in the code, but since the project does a lot of image manipulation, I don't know if I will get a big improvement.
@Belval I am generating around 40 characters per image, not 40 words, but that number is the worst case; I have text samples that are much shorter than 40. I feel that if the string to be generated is short, processing should be much faster, but in my case strings contain 10 to 20 characters on average and rendering is still slow. I will investigate this more in the next few days. Here is my new full script: https://github.com/Mohamed209/TextRecognitionDataGenerator/blob/receipts_ocr/generate_training_lines.py
I see. I never benchmarked each option, so maybe try removing one of these lines and measure the impact on processing time (a small timing harness is sketched below):

```python
distorsion_type=np.random.choice(distorsion_type),
skewing_angle=np.random.choice(skewing_angle),
blur=np.random.choice(blur),
```
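One quick way to measure that, as a sketch only: `build_generator` is a hypothetical factory that forwards keyword arguments to the TRDG generator, and the zero values assume TRDG's defaults for these options:

```python
import time

def time_option(**overrides):
    # Build a generator with one option neutralized and time 100 images.
    gen = build_generator(**overrides)
    start = time.perf_counter()
    for _ in zip(range(100), gen):
        pass
    return time.perf_counter() - start

print("baseline:      ", time_option())
print("no blur:       ", time_option(blur=0))
print("no distorsion: ", time_option(distorsion_type=0))
print("no skew:       ", time_option(skewing_angle=0))
```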
Ok I will try
One reason why the `generate()` function is slow is that it reloads the TF graph/session for each text sample! It can easily be rewritten as a class which initializes its own graph/session, loads the model once, and then only uses it for predictions. This can save some 1-2 s per invocation.
```python
import os
import pickle
import random as rnd

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from PIL import Image, ImageColor

# Helper functions from TRDG's handwritten text generator
# (module path assumed; they live next to the original generate()):
from trdg.handwritten_text_generator import (
    _crop_white_borders, _cumsum, _join_images,
    _sample_text, _split_strokes, download_model_weights,
)


class HandwrittenGenerator:
    def __init__(self):
        # Load the model weights once instead of on every generate() call.
        base_dir = download_model_weights()
        model_dir = os.path.join(base_dir, "handwritten_model")
        path = os.path.join(model_dir, "translation.pkl")
        with open(path, "rb") as file:
            self.translation = pickle.load(file)
        self.graph = tf.Graph()
        self.session = tf.compat.v1.Session(graph=self.graph)
        with self.graph.as_default(), self.session.as_default():
            saver = tf.compat.v1.train.import_meta_graph(
                os.path.join(model_dir, "model-29.meta")
            )
            saver.restore(self.session, os.path.join(model_dir, "model-29"))

    def generate(self, text, text_color="black"):
        with self.graph.as_default(), self.session.as_default():
            images = []
            colors = [ImageColor.getrgb(c) for c in text_color.split(",")]
            c1, c2 = colors[0], colors[-1]
            color = "#{:02x}{:02x}{:02x}".format(
                rnd.randint(min(c1[0], c2[0]), max(c1[0], c2[0])),
                rnd.randint(min(c1[1], c2[1]), max(c1[1], c2[1])),
                rnd.randint(min(c1[2], c2[2]), max(c1[2], c2[2])),
            )
            for word in text.split(" "):
                _, window_data, kappa_data, stroke_data, coords = _sample_text(
                    self.session, word, self.translation
                )
                strokes = np.array(stroke_data)
                strokes[:, :2] = np.cumsum(strokes[:, :2], axis=0)
                _, maxx = np.min(strokes[:, 0]), np.max(strokes[:, 0])
                miny, maxy = np.min(strokes[:, 1]), np.max(strokes[:, 1])
                fig, ax = plt.subplots(1, 1)
                fig.patch.set_visible(False)
                ax.axis("off")
                for stroke in _split_strokes(_cumsum(np.array(coords))):
                    plt.plot(stroke[:, 0], -stroke[:, 1], color=color)
                fig.patch.set_alpha(0)
                fig.patch.set_facecolor("none")
                canvas = plt.get_current_fig_manager().canvas
                canvas.draw()
                s, (width, height) = canvas.print_to_buffer()
                image = Image.frombytes("RGBA", (width, height), s)
                mask = Image.new("RGB", (width, height), (0, 0, 0))
                images.append(_crop_white_borders(image))
                plt.close()
            return _join_images(images), mask
```
Then call it as:

```python
# initialize once - takes 1-2 s
generator = HandwrittenGenerator()

for text in texts:
    # < 1 s or more per call, depending on text length
    img, mask = generator.generate(text, "black")
    # ...
```
As for running this in parallel, I'm afraid it would only help when the workload is IO-dominated (which it likely is); otherwise multiple TF sessions would compete for resources. Note that in Docker, TF detects the CPU core count of the host machine, not the container quota, which may result in too many threads competing for limited resources. This can be detected and set in the session config, as sketched below.
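A minimal sketch of capping TF's thread pools, e.g. when creating the session in `HandwrittenGenerator.__init__` above (the thread count here is an assumption; match it to your container's CPU quota):

```python
import tensorflow as tf

n_threads = 6  # assumption: set to the container's CPU quota
config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=n_threads,
    inter_op_parallelism_threads=n_threads,
)
graph = tf.Graph()
session = tf.compat.v1.Session(graph=graph, config=config)
```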
The other reason is that it calls TF's `session.run()` for each stroke in a loop. I'm not sure if this can be improved to run the whole prediction at once.
Another thing is that there's no batching. E.g., for many texts we could perform the steps in parallel, but the code would get more complex.