Hiromu Hota
Hi all, I'm happy to take "4 - Sharing models and tokenizers"!
I'm wondering if anyone wants to discuss things more casually and in real time. There is a #japanese-translations channel in HF's Discord, hope to see you guys there! > We've also created...
@younesbelkada Thanks for your interest! I believe you can use this http://hf.co/join/discord (or see https://discuss.huggingface.co/t/join-the-hugging-face-discord/11263).
@younesbelkada and those who are interested, once you've joined HF's discord, react to this message in #course-translations with a :hugging_earth: emoji to show **hidden** channels like #japanese-translations.
Could https://spacy.io/api/goldparse#align be useful for aligning two lists of words?
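To make the question concrete, here is a rough sketch of what aligning two word lists could look like. Note that it does not use spaCy's align function from the link above; it swaps in `difflib.SequenceMatcher` from the Python standard library instead. The helper name `align_words` and the sample word lists are made up for illustration.

```python
from difflib import SequenceMatcher


def align_words(a, b):
    """Return (i, j) index pairs where a[i] and b[j] are matched words."""
    # autojunk=False disables the "popular element" heuristic, so that
    # frequent words are not silently ignored on long inputs.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        # Each block describes a run of `size` equal items starting at
        # a[block.a] and b[block.b]; the final block has size 0.
        for k in range(block.size):
            pairs.append((block.a + k, block.b + k))
    return pairs


# Hypothetical example: one list has extra words the other lacks.
pdf_words = ["1", "Sample", "Markdown", "2", "text"]
html_words = ["Sample", "Markdown", "text"]
print(align_words(pdf_words, html_words))
```

Words that exist in only one list (here "1" and "2") simply get no pair, so the caller can see which words were left unaligned.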
@lukehsiao What exactly was the issue? I recognize that "the PDF words, the 1, 2, and 3 appear, whereas they do not in the HTML." but I couldn't see that...
Thank you guys for your recollections. I think the 1st problem is easier to solve but the 2nd one is much harder or even **wouldn't fix**. They would never be...
I've tested pdftotree for the first time and learned that `pdftotree tests/data/pdf_simple/md.pdf` gives me ```html Sample Markdown ``` I think we can just take this approach that embeds coordinates (top,...
According to https://documents.icar-us.eu/documents/2016/12/report-on-file-formats-for-hand-written-text-recognition-htr-material.pdf, there are 4 major file formats for OCR:

1. PAGE XML
2. ALTO XML
3. ABBYY FineReader XML
4. hOCR

I'd propose to support hOCR because: 1....
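For readers unfamiliar with hOCR: it is plain HTML in which each word is typically a `span` with class `ocrx_word`, and its bounding box is stored in the `title` attribute as `bbox x0 y0 x1 y1`. Below is a minimal sketch of reading words and coordinates back out using only the Python standard library. The class name `HocrWordParser` and the sample markup (including its coordinates) are invented for illustration and are not pdftotree output.

```python
import re
from html.parser import HTMLParser

# hOCR stores the box as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (word, (x0, y0, x1, y1)) tuples from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(int(g) for g in match.groups()) if match else None

    def handle_data(self, data):
        # Only text that directly follows an ocrx_word start tag is kept.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


# Made-up sample markup, not real tool output.
sample = (
    '<span class="ocrx_word" title="bbox 36 92 96 116">Sample</span> '
    '<span class="ocrx_word" title="bbox 102 92 190 116">Markdown</span>'
)
parser = HocrWordParser()
parser.feed(sample)
parser.close()
print(parser.words)
```

Because the coordinates live in ordinary HTML attributes, the same file stays viewable in a browser while remaining machine-readable.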
The apply/reducer architecture, which has been used by the snorkel-extraction project, may be used here too.