Text detection model performance
Hi, @robertknight I've been evaluating your text detection model "text-detection-checkpoint-ssfbcj81.pt" and found that its performance is lower than expected for my use case. Could you share more details about:
- The specific datasets used for training this model
- Whether the training process was completed to convergence
- Any specific preprocessing or usage requirements I should be aware of to achieve optimal results
I'm trying to understand if there's something I've missed in my implementation or if the model has known limitations for certain types of documents or text characteristics.
The text detection threshold of this model is different from the default one. See https://github.com/robertknight/ocrs/discussions/160#discussioncomment-12939717. If using Ocrs, changing the threshold currently requires editing this value in the source: https://github.com/robertknight/ocrs/blob/0e85f3bace12b37b15b7b025c53c0d800caa23f0/ocrs/src/detection.rs#L33.
As for the training dataset and process, this is covered in the README: https://github.com/robertknight/ocrs-models?tab=readme-ov-file#datasets.
The metrics for the training run that produced this model are at https://wandb.ai/robertknight/text-detection/runs/ssfbcj81?nw=nwuserrobertknight. Most of the metrics are pixel-level, so unfortunately they are not directly interpretable as "how well does text box extraction work?". For future runs it would be better to gather metrics that more directly reflect the final output after post-processing, or to change the model architecture to predict boxes directly instead of a segmentation mask.
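A box-level metric of the kind described above could be sketched like this. This is a simplified illustration, not code from ocrs-models: it uses axis-aligned boxes, greedy IoU matching, and an assumed 0.5 IoU threshold, whereas a real evaluation (e.g. HierText's protocol) is more involved.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def box_precision_recall(preds, gts, iou_thresh=0.5):
    """Greedily match each predicted box to at most one ground-truth box."""
    matched = set()
    tp = 0
    for p in preds:
        # Best remaining unmatched ground-truth box for this prediction.
        candidates = [(iou(p, g), i) for i, g in enumerate(gts) if i not in matched]
        if candidates:
            best_iou, best_i = max(candidates)
            if best_iou >= iou_thresh:
                tp += 1
                matched.add(best_i)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall

# Example: one correct detection, one missed ground-truth box.
p, r = box_precision_recall([(0, 0, 10, 10)], [(0, 0, 10, 10), (20, 20, 30, 30)])
print(p, r)  # 1.0 0.5
```

Unlike pixel-level scores, numbers computed this way directly reflect how many text boxes survive post-processing intact.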
Hi @robertknight, I followed what you said above. My results after 100 epochs look like this:
'recall': '0.401', 'split_frac': '0.040', 'merged_frac': '0.089', 'precision': '0.443'
dataset: HierText
Did I do anything wrong or miss something necessary?
Any suggestions would be highly appreciated.
Did you start training from scratch or did you try to fine-tune an existing checkpoint?
It is helpful to visualize the outputs at different stages to understand errors better. The model itself outputs a pixel-level text/not-text probability map. This is then thresholded to get a binary text/not-text classification. Finally, post-processing finds connected components in the mask and computes the minimum-area oriented bounding rectangle of each component. You can get a feel for what these look like using the --text-map, --text-mask and --png flags of the Ocrs CLI tool (see ocrs --help for info).
If you are training or fine-tuning your own model, visualizing training progress using the Weights and Biases integration is helpful. You can check these metrics against previous training runs at https://wandb.ai/robertknight/text-detection?nw=nwuserrobertknight.
Hi @robertknight, thank you for your quick reply. I will follow what you suggested.