marker
Player's Handbook D&D 5th Edition table parsing issues
So I pushed up an OCR'd copy of the PHB, ran the first ten pages, and got this: https://gist.github.com/krainboltgreene/48712b8947e20b4594259f90087ae181
A few things: some of these issues come from the OCR of the PDF itself, but I suspect some may be issues with marker.
Some OCR engines annoyingly put spaces between characters. I think it's due to their expected character spacing heuristics. I suspect that is what is happening. I'm going to try to train an OCR model that doesn't do this in the next couple of months.
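In the meantime, a rough post-processing pass can undo some of the spacing. This is only a sketch and not part of marker: the function name `collapse_spaced_letters` and the `min_run` parameter are made up for illustration, and it assumes the OCR output separates letters within a word by exactly one space and separates words by two or more spaces.

```python
import re


def collapse_spaced_letters(text: str, min_run: int = 3) -> str:
    """Join runs of single letters separated by single spaces.

    Hypothetical helper, not part of marker. Assumes letters inside a
    word are split by exactly one space and that real word boundaries
    show up as two or more spaces (or punctuation).
    """
    # A run is min_run or more lone letters, each followed by one space,
    # that neither starts nor ends in the middle of a longer word.
    pattern = re.compile(
        rf"(?<![A-Za-z])(?:[A-Za-z] ){{{min_run - 1},}}[A-Za-z](?![A-Za-z])"
    )
    # Drop the single spaces inside each matched run.
    return pattern.sub(lambda m: m.group(0).replace(" ", ""), text)


if __name__ == "__main__":
    print(collapse_spaced_letters("P l a y e r  H a n d b o o k"))
```

Ordinary prose such as "I am a cat" is left alone because no run of three lone letters occurs, but genuine sequences of single letters (for example "A B C" as answer labels) would be merged too, so this is likely too aggressive to apply blindly.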
Did you try it with the postprocessor model enabled? (set ENABLE_EDITOR_MODEL). That might improve things.
Can you share the source pdf?