correct ratio determination for noise estimation

Open rmast opened this issue 3 years ago • 5 comments

I solved issue https://github.com/internetarchive/archive-pdf-tools/issues/52 myself.

rmast avatar Jun 25 '22 17:06 rmast

Thanks -- I will review this tonight or tomorrow at the latest, I'm mostly on the road today.

MerlijnWajer avatar Jun 26 '22 10:06 MerlijnWajer

The second commit is for solving this error: https://github.com/internetarchive/archive-pdf-tools/issues/55#issuecomment-1166449630

rmast avatar Jun 26 '22 11:06 rmast

btw, I think I fixed this in 3c20a464f53ca0524268e35b998036d18b380b45 - can you confirm?

MerlijnWajer avatar Nov 21 '22 14:11 MerlijnWajer

> btw, I think I fixed this in 3c20a46 - can you confirm?

I haven't set everything up again and retested, but I read through the issues to see what we were trying to solve. In https://github.com/internetarchive/archive-pdf-tools/issues/52, specifically https://github.com/internetarchive/archive-pdf-tools/issues/52#issuecomment-1169598636, I found an inline patch to mrc.py for the inversion that I don't see reflected in your fix. So I can imagine that not all inversion cases are handled correctly.

The issue with the double text (Array) is caused by a segmentation bug in Tesseract, which I tried to crack during my summer holiday. However, Tesseract has too little testing capacity and core knowledge to allow the core changes needed to repair this segmentation, which is what caused the superior EasyOCR segmentation to emerge.

At the end of my summer holiday this year I tried to build a completely new inversion based on the EasyOCR segmentation, with an algorithm that compares the inner color and the outer color of the segments it finds to decide whether to invert. Unfortunately I didn't have the time to mold it into a working product.
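
To make the inner/outer comparison concrete, here is a rough sketch of the idea, not the code I was working on. Everything in it is an assumption for illustration: the function name `needs_inversion`, the `(x0, y0, x1, y1)` box format, the `margin` parameter, and the rule itself (average brightness inside the box versus the median brightness of a thin ring around it). It takes a grayscale image as a list of rows and is independent of EasyOCR and mrc.py.

```python
from statistics import mean, median


def needs_inversion(gray, box, margin=2):
    """Guess whether a text segment is light text on a dark background.

    gray   -- grayscale image as a list of rows (0 = black, 255 = white)
    box    -- hypothetical segment box (x0, y0, x1, y1), end-exclusive
    margin -- width in pixels of the ring sampled around the box
    """
    x0, y0, x1, y1 = box
    height, width = len(gray), len(gray[0])

    # Expand the box by `margin` pixels, clipped to the image bounds.
    ox0, oy0 = max(x0 - margin, 0), max(y0 - margin, 0)
    ox1, oy1 = min(x1 + margin, width), min(y1 + margin, height)

    inner, outer = [], []
    for y in range(oy0, oy1):
        for x in range(ox0, ox1):
            if x0 <= x < x1 and y0 <= y < y1:
                inner.append(gray[y][x])   # pixel inside the segment
            else:
                outer.append(gray[y][x])   # pixel in the surrounding ring

    # Without both samples there is nothing to compare; do not invert.
    if not inner or not outer:
        return False

    # The ring is assumed to be mostly background. If the segment is on
    # average brighter than that background, the text is probably lighter
    # than its surroundings, so the segment is a candidate for inversion.
    return mean(inner) > median(outer)
```

With a dark page and a few bright strokes inside the box this returns `True`, and for the same image with the colors flipped it returns `False`. A real implementation would need to cope with gradients, colored backgrounds and boxes that touch each other, which this sketch ignores.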

rmast avatar Nov 30 '22 09:11 rmast

This Christmas holiday my attention has been distracted by the new AI programming capabilities of OpenAI Codex, riding on the ChatGPT hype. As I'm really bad at Cython programming, I'm trying to let Codex improve my code for a new context-sensitive inverter and make it consistent. I wonder whether there is another approach for interpreting and segmenting documents at a more intelligent level: https://x-decoder-vl.github.io/

rmast avatar Dec 30 '22 14:12 rmast