ramSeraph
I will check that on some sample documents. Thanks for the clarifications.
@vinayak-mehta Would you mind if I post this on the Indian FOSS and open data channels? This tool has been extremely helpful in dealing with the PDF crap the Indian Government puts...
The Malayalam names part of the dataset is also available at https://huggingface.co/datasets/santhosh/english-malayalam-names
I found one possible dataset of printed documents in multiple languages: [Wikisource](https://wikisource.org/wiki/Main_Page). It has text and images at the page level, originally created using an existing OCR engine (Google Vision/Tesseract)...
I think the extension picking logic is picking 'nbvz6aNGQo68xa4NtWH26A' as the tile type. ```javascript function a(e) { var t = e.template , r = "" , a = e.type ,...