ocrs

Japanese Support

Open dezyh opened this issue 1 year ago • 5 comments

I'd like to try to implement OCR support for Japanese (when time permits). I don't expect to finish anything soon, as I'm very inexperienced with OCR. I'm mainly opening this issue to track and coordinate my work in case anyone else is interested in contributing or wants to offer advice.

I'd personally like to focus the OCR on manga content initially. However, I see general-purpose OCR as the final goal. If we get to that point, I'm not sure whether a separate model optimized for manga would be in scope for this project's direction.

Related

  • https://github.com/robertknight/ocrs/issues/8

1. Challenges

There are a few major differences from Latin scripts which will need to be addressed.

a) Kanji

There are many more kanji than there are Latin characters: probably around 2,000 common kanji, and on the order of 10,000 kanji in current use.
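To get a feel for how large the recognition alphabet could become, here is a minimal sketch that classifies characters by Unicode block. The helper names (`script_of`, `script_counts`) are hypothetical and not part of ocrs, and only the main CJK Unified Ideographs block is considered. That block is shared with Chinese and Korean usage, so a Japanese model would likely target a subset of it.

```python
from collections import Counter

# Inclusive Unicode code point ranges for the scripts used in Japanese text.
HIRAGANA = (0x3040, 0x309F)
KATAKANA = (0x30A0, 0x30FF)
KANJI = (0x4E00, 0x9FFF)  # main CJK Unified Ideographs block only


def script_of(ch: str) -> str:
    """Return a rough script label for a single character."""
    cp = ord(ch)
    if HIRAGANA[0] <= cp <= HIRAGANA[1]:
        return "hiragana"
    if KATAKANA[0] <= cp <= KATAKANA[1]:
        return "katakana"
    if KANJI[0] <= cp <= KANJI[1]:
        return "kanji"
    return "other"


def script_counts(text: str) -> Counter:
    """Count how many characters of each script appear in the text."""
    return Counter(script_of(ch) for ch in text)


# Number of code points in the main ideograph block, an upper bound on
# the kanji classes a model restricted to this block could need.
KANJI_BLOCK_SIZE = KANJI[1] - KANJI[0] + 1
```

Running `script_counts` over a candidate training corpus is one way to decide which kanji actually need to be in the model's alphabet.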

b) Layout: Horizontal / Vertical

Text can be written either vertically (縦書き) or horizontally (横書き).

For example: image
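As a rough illustration of what the two layouts mean for reading order, here is a simplified sketch. It assumes character boxes have already been snapped to shared line or column coordinates, which a real layout engine would have to do itself. The `Box` type and `reading_order` function are hypothetical and not part of ocrs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    text: str
    x: float  # left edge of the character box
    y: float  # top edge of the character box


def reading_order(boxes: List[Box], vertical: bool) -> List[Box]:
    """Sort boxes into reading order for the given layout."""
    if vertical:
        # Vertical text: columns run right to left, each column top to bottom.
        return sorted(boxes, key=lambda b: (-b.x, b.y))
    # Horizontal text: lines run top to bottom, each line left to right.
    return sorted(boxes, key=lambda b: (b.y, b.x))


def join_text(boxes: List[Box], vertical: bool) -> str:
    """Concatenate box contents in reading order."""
    return "".join(b.text for b in reading_order(boxes, vertical))
```

The same set of boxes yields different strings depending on the `vertical` flag, which is why layout detection has to happen before (or jointly with) recognition.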

c) Annotations: Furigana / Ruby text

Text can have annotations either on the right (for vertical text) or above (for horizontal text).

This text usually explains how certain words written in kanji should be read, but authors can also use it to provide synonyms, nuances, etc. It is therefore valuable to extract during OCR. However, since it only adds information to the base text, it should be possible to separate it from the base text in the OCR output.

This is definitely going to require the WIP layout engine.

For example: image
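One possible way to keep annotations separate from the base text in the output is sketched below. This is only an assumption about what such an output could look like: `RubySpan` and `AnnotatedLine` are hypothetical types, not an existing ocrs API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RubySpan:
    start: int    # index of the first annotated character in the base text
    end: int      # index one past the last annotated character
    reading: str  # the furigana / ruby text attached to base[start:end]


@dataclass
class AnnotatedLine:
    base: str
    ruby: List[RubySpan] = field(default_factory=list)

    def plain(self) -> str:
        """Base text only, with annotations dropped."""
        return self.base

    def with_readings(self) -> str:
        """Base text with each reading inlined in parentheses."""
        out = []
        pos = 0
        for span in sorted(self.ruby, key=lambda s: s.start):
            out.append(self.base[pos:span.start])
            out.append(f"{self.base[span.start:span.end]}({span.reading})")
            pos = span.end
        out.append(self.base[pos:])
        return "".join(out)


line = AnnotatedLine("漢字を読む", [RubySpan(0, 2, "かんじ"), RubySpan(3, 4, "よ")])
```

Here `line.plain()` returns the base text unchanged, while `line.with_readings()` gives `漢字(かんじ)を読(よ)む`, so consumers can choose whether they want the annotations.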

d) Fonts / Handwriting

I think a variety of fonts should be supported. However, I think handwriting would be too difficult initially, as there can be quite a big difference between digital and handwritten characters. I propose implementing a working OCR engine/model for digital text first, and then training and optimizing for handwritten text later.

2. Training Data

I will need to conduct more research into this...

a) Datasets

  • Manga109-s
    • Available for commercial use (with some nuances)
    • Cannot be redistributed (should such a dataset even be considered, given ocrs's requirements for datasets?)

b) Synthetic data

In the absence of a good dataset, one possibility is to generate synthetic data. This was used in robertknight/mana-ocr in this synthetic data generator. I'm thinking we could start with this until a good dataset is found, made, or becomes available for use.
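As a rough idea of the first half of such a pipeline, here is a minimal sketch that samples ground-truth snippets and rendering parameters from a text corpus. It deliberately stops before rendering, since that needs an image library and Japanese fonts. The `Sample` type, `make_samples` function and parameter ranges are all hypothetical, and this is not how the generator mentioned above is implemented.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    text: str       # ground-truth label for the rendered image
    vertical: bool  # whether to render the text vertically
    font_size: int  # font size in pixels (arbitrary illustrative range)


def make_samples(corpus: str, n: int, seed: int = 0, max_len: int = 12) -> List[Sample]:
    """Pick n random snippets from the corpus with random render settings."""
    rng = random.Random(seed)  # seeded so that a dataset can be regenerated
    lines = [ln.strip() for ln in corpus.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("corpus contains no non-empty lines")
    samples = []
    for _ in range(n):
        line = rng.choice(lines)
        length = rng.randint(1, min(max_len, len(line)))
        start = rng.randint(0, len(line) - length)
        samples.append(
            Sample(
                text=line[start:start + length],
                vertical=rng.random() < 0.5,
                font_size=rng.randint(16, 48),
            )
        )
    return samples
```

Each `Sample` would then be passed to a renderer that draws the text with a randomly chosen font and background, producing an image paired with its known label.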

Related Projects

  • https://github.com/kha-white/manga-ocr

dezyh Dec 19 '24 13:12

I'm not sure whether a separate model optimized for manga would be in scope for this project's direction.

Yes. Script, language or task-specific models are welcome.

In general there is a trade-off between model size and capacity, so even though it is possible to create one large model which recognizes "everything", smaller and more limited models can still be useful.

robertknight Dec 19 '24 14:12

Cannot be redistributed (should such a dataset even be considered, given ocrs's requirements for datasets?)

Ocrs's "core" models need to be trained exclusively on openly licensed data, but additional models trained on more restrictive datasets can be created and published, as long as the usage terms are clearly identified.

robertknight Dec 19 '24 14:12

https://en.wikipedia.org/wiki/Copyright_law_of_Japan#Public_domain tldr:

The 1899 law protected copyrighted works for 30 years after the author's death. Law changes promulgated in 1970 extended the duration to 50 years. In 2004 Japan further extended the copyright term to 70 years for cinematographic works. At the end of 2018, the 70-year term was applied to all works.

Unfortunately, public-domain manga might be scarce.

Ocrs's "core" models need to be trained exclusively on openly licensed data, but additional models trained on more restrictive datasets can be created and published, as long as the usage terms are clearly identified.

https://en.wikipedia.org/wiki/Aozora_Bunko I know of this website. They post all the novels from authors who have been dead for 70+ years, e.g. this famous novel. I believe the full site is on GitHub: https://github.com/aozorabunko/aozorabunko

aramrw Jan 14 '25 10:01

@dezyh Is this something you're still working towards? Or have you found another existing tool you could use?

I've been looking for a tool that can do on-device Japanese OCR and translation. I'm not off to a good start, since there are seemingly no OCR solutions available for free or at a reasonable price.

MarcG2 Apr 30 '25 23:04

@dezyh Is this something you're still working towards? Or have you found another existing tool you could use?

I've been looking for a tool that can do on-device Japanese OCR and translation. I'm not off to a good start, since there are seemingly no OCR solutions available for free or at a reasonable price.

Tesseract by Google is free and open source, and so is PaddleOCR, and I'm sure there are lots more.

I haven't used PaddleOCR, but I have used the Tesseract Rust bindings and they seem to be alright.
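For anyone who wants to try Tesseract quickly, here is a minimal sketch that calls the `tesseract` command-line tool from Python rather than going through the Rust bindings. It assumes the `tesseract` binary and the Japanese language data (`jpn` and `jpn_vert`) are installed. The function names are hypothetical, and the page segmentation modes are just reasonable starting points (6 for a uniform block of text, 5 for a vertically aligned block).

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, vertical: bool = False) -> List[str]:
    """Build a tesseract CLI invocation that prints recognized text to stdout."""
    language = "jpn_vert" if vertical else "jpn"
    page_seg_mode = "5" if vertical else "6"
    return ["tesseract", image_path, "stdout", "-l", language, "--psm", page_seg_mode]


def run_ocr(image_path: str, vertical: bool = False) -> str:
    """Run tesseract on an image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, vertical),
        check=True,           # raise CalledProcessError if tesseract fails
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode output as str
    )
    return result.stdout
```

For example, `run_ocr("page.png", vertical=True)` would attempt to read a vertically written page, where `page.png` is a placeholder path.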

aramrw Jun 14 '25 09:06