
Help in debugging conversion of a PDF to text

Open bkosowski opened this issue 1 year ago • 0 comments

I'm trying to convert a PDF file (free and openly available): Alejandro Villamor - Subtracting Suffering - An Anti-Aggregationist Approach to Suffering in Nature (2024).pdf

Using the following command:

docling --device cuda --num-threads 8 --table-mode accurate  --ocr-lang en --from pdf --to text --ocr --verbose "Alejandro Villamor - Subtracting Suffering - An Anti-Aggregationist Approach to Suffering in Nature (2024).pdf" --debug-visualize-ocr --debug-visualize-cells --debug-visualize-layout

with the following version:

docling --version
Docling version: 2.14.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Debug images are generated, but they're not very helpful. I'm posting examples for OCR, cells, and layout:

Cells: cells_page_00001
Layout: postprocessed_layout_page_00001
OCR: ocr_page_00001

The generated file: Alejandro Villamor - Subtracting Suffering - An Anti-Aggregationist Approach to Suffering in Nature (2024).txt

The problem is that in the PDF, on the second page, just above the line separating the main text from the footnotes, there is this text:

Ontological Prevalence of Suffering in Nature. There is an ontological prevalence of suffering over welfare in nature. That is, the net sum or iterative comparison (one by one) of

This fragment has not been converted, so it's missing from the generated text file. (Of course, this is just one example of text from the PDF that is missing.)
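To rule out the possibility that the fragment is present but merely reflowed across lines, the generated text can be searched after collapsing whitespace. This is a minimal sketch using only the standard library; the helper names `normalize` and `contains_fragment` are hypothetical, and the `sample` string stands in for the contents of the generated .txt file.

```python
import re


def normalize(text: str) -> str:
    """Collapse every run of whitespace to one space and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_fragment(document_text: str, fragment: str) -> bool:
    """Return True if the fragment occurs in the text, ignoring line breaks and case."""
    return normalize(fragment) in normalize(document_text)


fragment = "Ontological Prevalence of Suffering in Nature."

# Stand-in for the converted output; in practice, read the generated .txt file.
sample = (
    "conduct a controversial inference from the following statement:\n"
    "Ontological   Prevalence of\nSuffering in Nature. There is an"
)

print(contains_fragment(sample, fragment))
```

The check does not rejoin words hyphenated at line ends, so a negative result on a hyphenated fragment would need a closer look.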

I investigated further by converting the PDF pages to PNG images with ImageMagick:

convert -background white -alpha remove -density 300 +antialias -interpolate Nearest -quality 90 "Alejandro Villamor - Subtracting Suffering - An Anti-Aggregationist Approach to Suffering in Nature (2024).pdf" /mnt/d/imgs/page-%d.png

and then running easyocr manually:

easyocr --download_enabled True --detector True --decoder beamsearch --workers 4 --paragraph True --lang en --gpu False --verbose True -f "D:\imgs\page-1.png"

After a long while, it produced the following output:

D:\AI\Tools\docling\venv\Lib\site-packages\easyocr\utils.py:221: RuntimeWarning: overflow encountered in scalar add
  curr.entries[labeling].prTotal += prBlank + prNonBlank
D:\AI\Tools\docling\venv\Lib\site-packages\easyocr\utils.py:248: RuntimeWarning: overflow encountered in scalar add
  curr.entries[newLabeling].prNonBlank += prNonBlank
D:\AI\Tools\docling\venv\Lib\site-packages\easyocr\utils.py:249: RuntimeWarning: overflow encountered in scalar add
  curr.entries[newLabeling].prTotal += prNonBlank
D:\AI\Tools\docling\venv\Lib\site-packages\easyocr\utils.py:219: RuntimeWarning: overflow encountered in scalar add
  curr.entries[labeling].prNonBlank += prNonBlank
[[[656, 289], [1821, 289], [1821, 420], [656, 420]], 'Subtracting Suffering: An Anti-Aggregationist A Alejandro Villamor Iglesias']
[[[1141, 504], [1338, 504], [1338, 560], [1141, 560]], 'Resumen']
[[[344, 576], [2137, 576], [2137, 1422], [344, 1422]], 'En los ultimos anos, cada vez es ma prevalencia del sufrimiento sobre el b Esta creencia suele coincid Una axiologia sensocentrista segun la cual lo moralmente relevante placer y dolor: Esta combinacion conduce tiene una enorme relevancia moral, Est y argumenta, en su lugar, que podria no ser coherente: La afirmacion de que existe una prevalencia ontologica, en abstracto, del s embargo, no sucede lo mismo al respecto de su puede considerar que un calculo agregacionista sea moralmente valioso, estri pues no hay sujeto que lo sienta. No obstante, podria mantenerse la necesidad d una intervencion positiva en la naturaleza  Palabras clave: agregacionismo,  antiagregacionismo, etica animal, sufrimiento animal, intervencionismo.']
[[[351, 1537], [666, 1537], [666, 1586], [351, 1586]], '1. Introduction']
[[[344, 1605], [2141, 1605], [2141, 2277], [344, 2277]], "In recent decades; more and more p suffering of non-human animals. In aca phenomenon translates into a growing theoretical interest in the suffering of wild animals (e.g;: Dawkins, 1995; Rolston III, 1992 Horta, 2010a, 2010b, 2015; Faria, 2016; Villamor, 2 Although not a necessary condition;' most of these authors maintain that som that suffering predominates over well-being aggregationist  component? into their  theories, these positions  conduct a controversial inference from the following statement: Ontological Prevalence of Suffering in Nature. There is an ontological prevalence of suffering over welfare in nature: That i comparison (one by one) of"]
[[[344, 2398], [2135, 2398], [2135, 2883], [344, 2883]], "It is important to remember that there is no relation of necessity between consequentialism and aggregationism: Some   theories, such as Maximin or Leximin, are clearly  consequentialists but not aggregationists (Hirose, 2015, 30-31) Likewise, as Hirose has shown, a be present in deontological theories such as Scanlon's th 2 Even though the consequences could be s a conception of additive aggregation. As Larry Temkin emphasize for example, one might have principles o on weighted  totals, like  prioritarianism, OI on the highest or best   achievements, like some forms   of perfectionism, 0 on the wellbeing of those who are worst off, like max"]
[[[1000, 2925], [1479, 2925], [1479, 2974], [1000, 2974]], 'RHV, 2024, No 26,243-267']
[[[1025, 3043], [1054, 3043], [1054, 3058], [1025, 3058]], 'CC']
[[[1074, 3032], [1473, 3032], [1473, 3083], [1074, 3083]], 'CC BY-NC-ND BY Nc ND']
[[[1200, 3157], [1279, 3157], [1279, 3206], [1200, 3206]], '244']

As can be seen, the missing fragment is present there:

Ontological Prevalence of Suffering in Nature. There is an ontological prevalence of suffering over welfare in nature: That i comparison (one by one) of
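To compare the OCR result with the layout debug image, it may help to find which easyocr box contains the fragment and convert its pixel coordinates to points. The sketch below assumes the easyocr output was saved as text lines in the format shown above, and that the images were rendered at 300 DPI (the `-density 300` used earlier). The helper names `parse_easyocr_lines` and `find_boxes` are hypothetical, and coordinates are measured from the top-left corner of the page image.

```python
import ast


def parse_easyocr_lines(lines):
    """Yield (box, text) pairs from easyocr output lines, skipping warnings."""
    for line in lines:
        line = line.strip()
        if not line.startswith("[["):
            continue  # warning or traceback line, not a result
        try:
            box, text = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # malformed or truncated line
        yield box, text


def find_boxes(lines, needle, dpi=300):
    """Return (x0, y0, x1, y1) in points for each box whose text contains needle."""
    scale = 72 / dpi  # 72 points per inch
    needle = needle.lower()
    hits = []
    for box, text in parse_easyocr_lines(lines):
        if needle in text.lower():
            xs = [point[0] for point in box]
            ys = [point[1] for point in box]
            hits.append(tuple(round(v * scale, 2)
                              for v in (min(xs), min(ys), max(xs), max(ys))))
    return hits


# Two lines taken from the output above: one warning, one result.
sample_lines = [
    r"D:\AI\Tools\docling\venv\Lib\site-packages\easyocr\utils.py:221: RuntimeWarning: overflow encountered in scalar add",
    "[[[351, 1537], [666, 1537], [666, 1586], [351, 1586]], '1. Introduction']",
]

print(find_boxes(sample_lines, "introduction"))
```

Searching for a short, distinctive phrase such as "Ontological Prevalence" is more robust than the full sentence, because the OCR text differs slightly from the PDF text.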

So the OCR itself seems to be working fine. Some other step in the conversion fails, but I don't know which one.

Because of that, I don't know where I should post a specific bug report: here or in a dependent project. Could you please help?

bkosowski, Dec 20 '24 20:12