Feature Request: New Text Detection Model Support
Version Info
Python version: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Python executable: D:\GalTranslation\BallonsTranslator_dev_src_with_gitpython_20241124\ballontrans_pylibs_win\python.exe
Version: 1.4.0
Branch: dev
Commit hash: a515a420e3334680c17ca7c6f800e9df2bf513c4
Type of Request
New Feature
Description
Model: https://huggingface.co/ogkalu/comic-text-and-bubble-detector. A Transformers model fine-tuned from RT-DETR-v2 r50vd, with CPU/GPU inference. It performs much better at text detection in complex scenes than the BallonsTranslator integrated models ctd and ysgyolo.
Detection examples below, all using default settings:
comic-text-and-bubble-detector:
comic-text-detector:
Pictures
No response
Additional Information
No response
The point is, ctd usually mixes sentences from different bubbles, while ogkalu/comic-text-and-bubble-detector hardly ever does.
Sample code from PekingU/rtdetr_r50vd:
```python
import torch
import requests
from PIL import Image
from transformers import RTDetrForObjectDetection, RTDetrImageProcessor

# Sample image from the RT-DETR example
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Processor and model are pulled straight from the Hugging Face Hub
image_processor = RTDetrImageProcessor.from_pretrained("ogkalu/comic-text-and-bubble-detector")
model = RTDetrForObjectDetection.from_pretrained("ogkalu/comic-text-and-bubble-detector")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into per-image detections above the score threshold
results = image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3
)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")
```
The result should be a list with a single dict in it, structured like this:
```python
[
    {
        'scores': torch.tensor([score1, score2, ...]),
        'labels': torch.tensor([int1, int2, int3, ...]),  # 0 is 'bubble', 1 is 'text_bubble', 2 is 'text_free'
        'boxes': torch.tensor([[xmin1, ymin1, xmax1, ymax1],
                               [xmin2, ymin2, xmax2, ymax2],
                               ...])
    }
]
```
An actual output:

```python
[
    {
        'scores': torch.tensor([0.9821, 0.9801, 0.9799, 0.9738, 0.9730, 0.9723, 0.9722,
                                0.9721, 0.9708, 0.9704, 0.9699, 0.9682, 0.9673, 0.9599,
                                0.9558, 0.9532, 0.9528, 0.9512, 0.9483, 0.9421, 0.8599]),
        'labels': torch.tensor([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 2]),
        'boxes': torch.tensor([[-4.5021e-01, 6.9262e+02, 3.7822e+02, 9.2742e+02],
                               [ 4.7005e+02, 3.7906e+01, 7.5990e+02, 1.8624e+02],
                               [ 9.0669e+02, 5.0702e+02, 1.3555e+03, 7.2286e+02],
                               [ 1.0446e+03, 8.3269e+01, 1.3330e+03, 2.1791e+02],
                               [ 9.3973e+02, 3.5836e+02, 1.3344e+03, 5.1491e+02],
                               [ 3.3274e+02, 5.0518e+02, 6.1163e+02, 6.6103e+02],
                               [ 9.4728e+02, 5.3617e+02, 1.3300e+03, 7.0094e+02],  # label 1
                               [ 9.9570e+01, 3.5952e+02, 4.9691e+02, 5.5726e+02],
                               [ 1.1040e+02, 3.7685e+02, 4.7161e+02, 5.1155e+02],  # label 1
                               [ 9.7417e+02, 3.7178e+02, 1.3234e+03, 4.8909e+02],  # label 1
                               [ 6.7811e+02, 9.7702e+02, 9.3171e+02, 1.1245e+03],
                               [ 6.8589e+02, 1.5547e+02, 9.1349e+02, 2.8003e+02],
                               [ 5.3744e+01, 7.1619e+02, 3.3000e+02, 9.0388e+02],  # label 1
                               [ 7.4587e+02, 1.1171e+03, 9.5668e+02, 1.2425e+03],
                               [ 1.0777e+03, 1.0008e+02, 1.3193e+03, 1.8794e+02],  # label 1
                               [ 4.9696e+02, 6.4532e+01, 7.2389e+02, 1.5935e+02],  # label 1
                               [ 7.2174e+02, 9.8582e+02, 8.8026e+02, 1.1155e+03],  # label 1
                               [ 7.2020e+02, 1.7782e+02, 8.8041e+02, 2.6326e+02],  # label 1
                               [ 3.8203e+02, 5.4301e+02, 5.6116e+02, 6.4346e+02],  # label 1
                               [ 7.7961e+02, 1.1316e+03, 9.3794e+02, 1.2208e+03],  # label 1
                               [ 5.6984e+02, 4.3259e+02, 8.8979e+02, 5.8168e+02]])  # label 2
    }
]
```
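Since only labels 1 and 2 are actual text regions, a minimal sketch of using this output as a text detector could look like the following. It consumes the `results` from the sample code above; the helper name and the score cutoff are illustrative, not from the model card:

```python
# Illustrative helper: split one post-processed result into text boxes and
# bubble outlines, using the id2label mapping shown above
# (0 = 'bubble', 1 = 'text_bubble', 2 = 'text_free').
TEXT_LABELS = {1, 2}

def split_detections(result, min_score=0.3):
    text_boxes, bubble_boxes = [], []
    for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
        if score.item() < min_score:
            continue
        bucket = text_boxes if label.item() in TEXT_LABELS else bubble_boxes
        bucket.append([round(v, 2) for v in box.tolist()])
    return text_boxes, bubble_boxes

text_boxes, bubble_boxes = split_detections(results[0])
```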
Inference notebook: https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/RT_DETR_v2_inference.ipynb
Fine-tuning notebook: https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/RT_DETR_v2_finetune_on_a_custom_dataset.ipynb
I can embed it as a text detector.
https://github.com/dmMaze/BallonsTranslator/issues/863#issuecomment-2817111839
https://github.com/user-attachments/assets/128faac1-566a-4e4e-8433-3d9852a6a9a2
"Try the latest model provided here. It was trained on 220,000 images and should still be able to handle the image you sent. If the recognition isn't good, feel free to send it to me, and we can optimize the model later. I've been tired lately, so I haven't continued collecting data. The next training might be when I reach 250,000 or 300,000 images. But for now, it's still sufficient."
https://github.com/user-attachments/assets/27b002f2-0651-4d2c-895f-f3f4c6392e34
"I found the original comic by using image search. I think the margin of error is within an acceptable range."
SHANA.bandicam.2025-04-22.21-22-38-222.mp4
I tried it with the link you provided; it is indeed much stronger than the original ctd, but missed text and mixed-up text boxes still happen from time to time.
Bro, if you still have the energy, you could try training PekingU/rtdetr_v2_r50vd, or Paddle's PP-OCRv4_server_det. I haven't tried rtdetr's pretrained model, but Paddle's pretrained model is already very strong and can be used directly, though its training and inference framework is a bit more cumbersome.
Haha, I can't get to 100%, but it's already far stronger than CTD. Running 1,000 pages of manga through it now is incomparably easier than with CTD.
We'll see when I have time later. I've been grinding for 9 months; it's too tiring and too tedious. The data can be converted directly for those two to use, but I can't be bothered to dig into it; I'd have to learn deployment from scratch.
Besides, there isn't much horizontal-text data to begin with. Out of 220,000 images, maybe 10,000-20,000? And that's a generous estimate; most of it is bubble data and vertical-strip data.
Waiting for your good news, haha. First let's wait for bropines to add your model.
That link is a 404.
I accidentally typed an extra character; this one should be right: PekingU/rtdetr_v2_r50vd
It's not explained there; for deployment you still have to go to Bilibili.
~~I'm just thinking... does it make sense? I just haven't seen examples of INLINE text detection. So it will detect text bubbles, and INLINE text will be skipped?~~
UPD. I checked it out. It doesn't seem bad. I will try to add it within two days. I'm stuck with my studies, I'm sorry.
By the way, your latest detector has a random bug where it detects the same block of text twice. I suppose this can be corrected with the parameters, but which ones... I haven't figured it out yet.
I only know how to train models; the rest I don't understand. I learned by following tutorials on video sites. I have no technical skills, just a passion for manga and the drive to take action. I know that by adjusting the NMS threshold (0.75) and the confidence threshold (0.65), the recognition results can vary slightly. I often encounter the issue where one text box is detected as two or three rectangles. From the perspective of reading manga it's not really an issue unless you look closely, but from an object-detection standpoint it's a big problem. Since my dataset is too small to compete with the massive datasets used by large companies, I have to make do with it.
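For the duplicate-box issue mentioned above, one consumer-side workaround would be an extra IoU-based NMS pass over the post-processed result. A minimal sketch using torchvision; the 0.5 IoU threshold is an assumption, not a value from this thread:

```python
from torchvision.ops import batched_nms

def drop_duplicate_boxes(result, iou_threshold=0.5):
    # Per-class NMS: keeps only the highest-scoring box among heavily
    # overlapping detections of the same label, so a text box is not
    # suppressed by the bubble box that contains it.
    keep = batched_nms(result["boxes"], result["scores"], result["labels"], iou_threshold)
    return {key: value[keep] for key, value in result.items()}

# Usage with the post-processed output from the sample code above:
# results[0] = drop_duplicate_boxes(results[0])
```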
https://github.com/dmMaze/BallonsTranslator/issues/866 Can you address this issue? Is it possible to add an option so the user can control whether the rectangles get merged?
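A rough sketch of what such an opt-in merge could look like, union-merging boxes whose IoU exceeds a configurable threshold (all names and the threshold are illustrative, not existing BallonsTranslator options):

```python
def box_iou(a, b):
    # Intersection-over-union of two [xmin, ymin, xmax, ymax] boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def merge_rects(boxes, enabled=True, iou_threshold=0.2):
    # Greedily union-merge overlapping rectangles; `enabled` would be the
    # user-facing toggle asked about above.
    if not enabled:
        return [list(b) for b in boxes]
    merged = []
    for box in boxes:
        for m in merged:
            if box_iou(m, box) >= iou_threshold:
                m[0], m[1] = min(m[0], box[0]), min(m[1], box[1])
                m[2], m[3] = max(m[2], box[2]), max(m[3], box[3])
                break
        else:
            merged.append(list(box))
    return merged
```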
Listen, it's actually not a bad model at all. Of course, I didn't test what you described above, but this one is really good.
Also, I plan to write a separate discussion about detector parameters. I've basically read through YOLO and understood what the parameters are responsible for.
"I see you've loaded this model. Can you push an update? I can't wait to try it out."
I'll make a couple of edits. Plus, I'll check the loading of the model files. Right now it is downloaded automatically from the HF server.
I’m training an RTDETR with over 27,000 of my own images, and I’m not using the V2 version. https://docs.ultralytics.com/zh/models/rtdetr
For now, I decided to add what was requested. Then we'll think about how much better your version is than Ogkalu's.
https://gist.github.com/bropines/d9fd69bec63793220cbb59bc39fbedd0
For now, test it like this: put the models in the data/models/ctbd folder. In theory, if you train the same way, your model should load automatically, but I'm not sure. I'm a little tired...
Well, after the tests:
I can use it, but I probably won't add it. Why? Because this detector detects bubbles and the text in the bubbles, and it does that very well, but unfortunately it does not build a mask of text characters like ctd does. That is, during training it is important for us to create not only a layer of bubbles but also a layer for the mask map, which is then passed to inpainting... In short, there is a lot of work for which I don't have the experience or time. The test module is above.
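For what it's worth, a very crude stopgap would be rasterizing the detected text boxes into a rectangular binary mask. This is nothing like ctd's character-level mask map, just an illustration of the gap; the helper and its parameters are hypothetical:

```python
import numpy as np

def boxes_to_mask(text_boxes, height, width, pad=2):
    # Fill each detected text box into a binary mask. A rectangle fill is a
    # poor substitute for the character-level mask that inpainting expects,
    # which is exactly the limitation described above.
    mask = np.zeros((height, width), dtype=np.uint8)
    for xmin, ymin, xmax, ymax in text_boxes:
        x0, y0 = max(int(xmin) - pad, 0), max(int(ymin) - pad, 0)
        x1, y1 = min(int(xmax) + pad, width), min(int(ymax) + pad, height)
        mask[y0:y1, x0:x1] = 255
    return mask
```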
Maybe that's why mit48px cannot work with it, while other OCRs like one-ocr and llm-ocr work well with it.