Feature Request: New Text Detection Model Support
Version Info
Python version: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Python executable: D:\GalTranslation\BallonsTranslator_dev_src_with_gitpython_20241124\ballontrans_pylibs_win\python.exe
Version: 1.4.0
Branch: dev
Commit hash: a515a420e3334680c17ca7c6f800e9df2bf513c4
Type of Request
New Feature
Description
Model: https://huggingface.co/ogkalu/comic-text-and-bubble-detector. A Transformers model fine-tuned from RT-DETR-v2 r50vd, with CPU/GPU inference. It performs much better at text detection in complex scenes than the BallonsTranslator integrated models ctd and ysgyolo.
Detection examples below, all using default settings:
comic-text-and-bubble-detector:
comic-text-detector:
Pictures
No response
Additional Information
No response
The point is, ctd usually mixes sentences from different bubbles, while ogkalu/comic-text-and-bubble-detector hardly ever does.
Sample code from PekingU/rtdetr_r50vd:
```python
import torch
import requests
from PIL import Image
from transformers import RTDetrForObjectDetection, RTDetrImageProcessor

# Sample image from the RT-DETR example
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Processor and model are pulled straight from the Hugging Face Hub
image_processor = RTDetrImageProcessor.from_pretrained("ogkalu/comic-text-and-bubble-detector")
model = RTDetrForObjectDetection.from_pretrained("ogkalu/comic-text-and-bubble-detector")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into per-image detections above the score threshold
results = image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3
)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")
```
The result should be a list with a single dict in it, structured like this:
```python
[
    {
        'scores': torch.tensor([score1, score2, ...]),
        'labels': torch.tensor([int1, int2, int3, ...]),  # 0 is 'bubble', 1 is 'text_bubble', 2 is 'text_free'
        'boxes': torch.tensor([[xmin1, ymin1, xmax1, ymax1],
                               [xmin2, ymin2, xmax2, ymax2],
                               ...])
    }
]
```
An actual output:

```python
[
    {
        'scores': torch.tensor([0.9821, 0.9801, 0.9799, 0.9738, 0.9730, 0.9723, 0.9722,
                                0.9721, 0.9708, 0.9704, 0.9699, 0.9682, 0.9673, 0.9599,
                                0.9558, 0.9532, 0.9528, 0.9512, 0.9483, 0.9421, 0.8599]),
        'labels': torch.tensor([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 2]),
        'boxes': torch.tensor([[-4.5021e-01, 6.9262e+02, 3.7822e+02, 9.2742e+02],
                               [ 4.7005e+02, 3.7906e+01, 7.5990e+02, 1.8624e+02],
                               [ 9.0669e+02, 5.0702e+02, 1.3555e+03, 7.2286e+02],
                               [ 1.0446e+03, 8.3269e+01, 1.3330e+03, 2.1791e+02],
                               [ 9.3973e+02, 3.5836e+02, 1.3344e+03, 5.1491e+02],
                               [ 3.3274e+02, 5.0518e+02, 6.1163e+02, 6.6103e+02],
                               [ 9.4728e+02, 5.3617e+02, 1.3300e+03, 7.0094e+02],  # label 1
                               [ 9.9570e+01, 3.5952e+02, 4.9691e+02, 5.5726e+02],
                               [ 1.1040e+02, 3.7685e+02, 4.7161e+02, 5.1155e+02],  # label 1
                               [ 9.7417e+02, 3.7178e+02, 1.3234e+03, 4.8909e+02],  # label 1
                               [ 6.7811e+02, 9.7702e+02, 9.3171e+02, 1.1245e+03],
                               [ 6.8589e+02, 1.5547e+02, 9.1349e+02, 2.8003e+02],
                               [ 5.3744e+01, 7.1619e+02, 3.3000e+02, 9.0388e+02],  # label 1
                               [ 7.4587e+02, 1.1171e+03, 9.5668e+02, 1.2425e+03],
                               [ 1.0777e+03, 1.0008e+02, 1.3193e+03, 1.8794e+02],  # label 1
                               [ 4.9696e+02, 6.4532e+01, 7.2389e+02, 1.5935e+02],  # label 1
                               [ 7.2174e+02, 9.8582e+02, 8.8026e+02, 1.1155e+03],  # label 1
                               [ 7.2020e+02, 1.7782e+02, 8.8041e+02, 2.6326e+02],  # label 1
                               [ 3.8203e+02, 5.4301e+02, 5.6116e+02, 6.4346e+02],  # label 1
                               [ 7.7961e+02, 1.1316e+03, 9.3794e+02, 1.2208e+03],  # label 1
                               [ 5.6984e+02, 4.3259e+02, 8.8979e+02, 5.8168e+02]])  # label 2
    }
]
```
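Since only labels 1 and 2 are actual text regions, a minimal sketch of using this output as a text detector could look like the following. It consumes the `results` from the sample code above; the helper name and the score cutoff are illustrative, not from the model card:

```python
# Illustrative helper: split one post-processed result into text boxes and
# bubble outlines, using the id2label mapping shown above
# (0 = 'bubble', 1 = 'text_bubble', 2 = 'text_free').
TEXT_LABELS = {1, 2}

def split_detections(result, min_score=0.3):
    text_boxes, bubble_boxes = [], []
    for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
        if score.item() < min_score:
            continue
        bucket = text_boxes if label.item() in TEXT_LABELS else bubble_boxes
        bucket.append([round(v, 2) for v in box.tolist()])
    return text_boxes, bubble_boxes

text_boxes, bubble_boxes = split_detections(results[0])
```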
Inference notebook: https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/RT_DETR_v2_inference.ipynb
Fine-tuning notebook: https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/RT_DETR_v2_finetune_on_a_custom_dataset.ipynb
I can embed it as a text detector.
https://github.com/dmMaze/BallonsTranslator/issues/863#issuecomment-2817111839
https://github.com/user-attachments/assets/128faac1-566a-4e4e-8433-3d9852a6a9a2
"Try the latest model provided here. It was trained on 220,000 images and should still be able to handle the image you sent. If the recognition isn't good, feel free to send it to me, and we can optimize the model later. I've been tired lately, so I haven't continued collecting data. The next training might be when I reach 250,000 or 300,000 images. But for now, it's still sufficient."
https://github.com/user-attachments/assets/27b002f2-0651-4d2c-895f-f3f4c6392e34
"I found the original comic by using image search. I think the margin of error is within an acceptable range."
SHANA.bandicam.2025-04-22.21-22-38-222.mp4
I tried it with the link you provided; it is indeed much stronger than the original ctd, but missed text and mixed-up text boxes still happen from time to time.
Bro, if you still have the energy, you could try training PekingU/rtdetr_v2_r50vd, or Paddle's PP-OCRv4_server_det. I haven't tried rtdetr's pretrained model, but Paddle's pretrained model is already very strong and can be used directly, though its training and inference framework is a bit more cumbersome.
Haha, I can't get to 100%, but it's already far stronger than CTD. Running 1,000 pages of manga through it now is incomparably easier than with CTD.
We'll see when I have time later. I've been grinding for 9 months; it's too tiring and too tedious. The data can be converted directly for those two to use, but I can't be bothered to dig into it; I'd have to learn deployment from scratch.
Besides, there isn't much horizontal-text data to begin with. Out of 220,000 images, maybe 10,000-20,000? And that's a generous estimate; most of it is bubble data and vertical-strip data.
Waiting for your good news, haha. First let's wait for bropines to add your model.
That link is a 404.
I accidentally typed an extra character; this one should be right: PekingU/rtdetr_v2_r50vd
It's not explained there; for deployment you still have to go to Bilibili.
~~I'm just thinking... does it make sense? I just haven't seen examples of INLINE text detection. So it will detect text bubbles, and INLINE text will be skipped?~~
UPD. I checked it out. It doesn't seem bad. I will try to add it within two days. I'm stuck with my studies, I'm sorry.
By the way, your latest detector has a random bug where it detects the same block of text twice. I suppose this can be corrected with the parameters, but which ones... I haven't figured it out yet.
I only know how to train models; the rest I don't understand. I learned by following tutorials on video sites. I have no technical skills, just a passion for manga and the drive to take action. I know that by adjusting the NMS threshold (0.75) and the confidence threshold (0.65), the recognition results can vary slightly. I often encounter the issue where one text box is detected as two or three rectangles. From the perspective of reading manga it's not really an issue unless you look closely, but from an object-detection standpoint it's a big problem. Since my dataset is too small to compete with the massive datasets used by large companies, I have to make do with it.
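For the duplicate-box issue mentioned above, one consumer-side workaround would be an extra IoU-based NMS pass over the post-processed result. A minimal sketch using torchvision; the 0.5 IoU threshold is an assumption, not a value from this thread:

```python
from torchvision.ops import batched_nms

def drop_duplicate_boxes(result, iou_threshold=0.5):
    # Per-class NMS: keeps only the highest-scoring box among heavily
    # overlapping detections of the same label, so a text box is not
    # suppressed by the bubble box that contains it.
    keep = batched_nms(result["boxes"], result["scores"], result["labels"], iou_threshold)
    return {key: value[keep] for key, value in result.items()}

# Usage with the post-processed output from the sample code above:
# results[0] = drop_duplicate_boxes(results[0])
```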
https://github.com/dmMaze/BallonsTranslator/issues/866 Can you address this issue? Is it possible to add an option so the user can control whether the rectangles get merged?
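A rough sketch of what such an opt-in merge could look like, union-merging boxes whose IoU exceeds a configurable threshold (all names and the threshold are illustrative, not existing BallonsTranslator options):

```python
def box_iou(a, b):
    # Intersection-over-union of two [xmin, ymin, xmax, ymax] boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def merge_rects(boxes, enabled=True, iou_threshold=0.2):
    # Greedily union-merge overlapping rectangles; `enabled` would be the
    # user-facing toggle asked about above.
    if not enabled:
        return [list(b) for b in boxes]
    merged = []
    for box in boxes:
        for m in merged:
            if box_iou(m, box) >= iou_threshold:
                m[0], m[1] = min(m[0], box[0]), min(m[1], box[1])
                m[2], m[3] = max(m[2], box[2]), max(m[3], box[3])
                break
        else:
            merged.append(list(box))
    return merged
```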
Listen, it's actually not a bad model at all. Of course, I didn't test what you described above, but this one is really good.
Also, I plan to write a separate discussion about detector parameters. I've basically read through YOLO and understood what the parameters are responsible for.
"I see you've loaded this model. Can you push an update? I can't wait to try it out."
I'll make a couple of edits. Plus, I'll check the loading of the model files. Right now it is downloaded automatically from the HF server.
I’m training an RTDETR with over 27,000 of my own images, and I’m not using the V2 version. https://docs.ultralytics.com/zh/models/rtdetr
For now, I decided to add what was requested. Then we'll think about how much better your version is than Ogkalu's.
https://gist.github.com/bropines/d9fd69bec63793220cbb59bc39fbedd0
For now, test it like this: put the models in the data/models/ctbd folder. In theory, if you train the same way, your model should load automatically, but I'm not sure. I'm a little tired...
Well, after the tests:
I can use it, but I probably won't add it. Why? Because this detector detects bubbles and the text in the bubbles, and it does that very well, but unfortunately it does not build a mask of text characters like ctd does. That is, during training it is important for us to create not only a layer of bubbles but also a layer for the mask map, which is then passed to inpainting... In short, there is a lot of work for which I don't have the experience or time. The test module is above.
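For what it's worth, a very crude stopgap would be rasterizing the detected text boxes into a rectangular binary mask. This is nothing like ctd's character-level mask map, just an illustration of the gap; the helper and its parameters are hypothetical:

```python
import numpy as np

def boxes_to_mask(text_boxes, height, width, pad=2):
    # Fill each detected text box into a binary mask. A rectangle fill is a
    # poor substitute for the character-level mask that inpainting expects,
    # which is exactly the limitation described above.
    mask = np.zeros((height, width), dtype=np.uint8)
    for xmin, ymin, xmax, ymax in text_boxes:
        x0, y0 = max(int(xmin) - pad, 0), max(int(ymin) - pad, 0)
        x1, y1 = min(int(xmax) + pad, width), min(int(ymax) + pad, height)
        mask[y0:y1, x0:x1] = 255
    return mask
```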
Maybe that's why mit48px cannot work with it, while other OCRs like one-ocr and llm-ocr work well with it.