Add support for encoding detection when default encoding is not correct

Open potaninmt opened this issue 1 year ago • 11 comments

I noticed a bug when extracting text from PDFs that contain text in different encodings. For example, for one of my documents I had to convert the extracted text from windows-1251 to windows-1252 to make it readable. Would it be possible to extract the text successfully even when a document contains many different encodings? It is even possible for each token in a PDF to have its own encoding.

[image]

potaninmt avatar Dec 14 '24 21:12 potaninmt

@potaninmt thanks for raising the issue, can you share a sample pdf? Thanks!

BobLd avatar Dec 15 '24 00:12 BobLd

@BobLd Thanks, I'm attaching the file! Example.pdf

potaninmt avatar Dec 15 '24 08:12 potaninmt

@potaninmt unfortunately, I believe this is not a pdfpig issue...

Firefox, Edge and Acrobat Reader are not able to copy the text properly (which is an indicator that the pdf document is not correctly built). In Firefox, copying the following image

gives

ñëîæíî ïîêàçàòü, ÷òî åñëè ðàññìîòðåòü áåñêîíå÷íóþ ïîëèíîìèàëüíóþ ñèñòåìó, òî îíà ñîéäåòñÿ ê íåêîòîðîé êîíå÷

BobLd avatar Dec 15 '24 12:12 BobLd

Your best chance of getting correct text extracted might be OCR. Or did you manage to play around with windows-1251 / windows-1252 to get the correct text after extraction?
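For reference, the garbled Firefox sample above looks like windows-1251 bytes that were decoded as windows-1252. A minimal sketch of undoing that, assuming exactly this pair of code pages; `Reinterpret` is a hypothetical helper, not part of PdfPig:

```csharp
using System;
using System.Text;

// Makes the windows-125x code pages available on .NET Core / .NET 5+.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Console.OutputEncoding = Encoding.UTF8;

// Hypothetical helper: re-read text that was decoded with the wrong code page.
string Reinterpret(string garbled, Encoding wrong, Encoding actual)
{
    byte[] raw = wrong.GetBytes(garbled); // recover the original bytes
    return actual.GetString(raw);         // decode them with the intended code page
}

var cp1252 = Encoding.GetEncoding("windows-1252");
var cp1251 = Encoding.GetEncoding("windows-1251");

// First words of the text copied from Firefox.
string repaired = Reinterpret("ñëîæíî ïîêàçàòü, ÷òî", cp1252, cp1251);
Console.WriteLine(repaired); // prints: сложно показать, что
```

This only works when the extracted characters all exist in windows-1252; anything else is replaced with `?` by `GetBytes` and is lost.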

BobLd avatar Dec 15 '24 12:12 BobLd

@BobLd The thing is that there are a lot of such broken PDFs on the internet, but there are solutions to fix them, so maybe it would be useful for you to implement one.

There is this service, which is able to fix encoding errors automatically: https://2cyr.com/decode/?lang=ru

There is also a project on GitHub that can automatically detect the encoding of a text file: https://github.com/yinyue200/ude?tab=readme-ov-file

potaninmt avatar Dec 15 '24 19:12 potaninmt

@BobLd [image]

potaninmt avatar Dec 15 '24 19:12 potaninmt

Hi @potaninmt, thanks a lot for pointing me to this library. I wasn't aware it even existed and it's extremely interesting.

I had a quick look and the Ude library is based on the Mozilla Universal Charset Detector (MUCD). All implementations of the MUCD I could find are under the MPL/GPL/LGPL licenses, which are not really compatible with PdfPig's license.

One option would be to release a separate NuGet package under that same license, leaving PdfPig's own license untouched. It would also be possible to do our own implementation (trickier).

I'll have a look at the first option, hopefully in the short term. Happy for you to give any implementation advice.

Some references (mainly for me) I found on the topic:

  • https://www-archive.mozilla.org/projects/intl/universalcharsetdetection
  • https://www.unicode.org/iuc/iuc19/a322.html
  • https://cs229.stanford.edu/proj2007/KimPark-AutomaticDetectionOfCharacterEncodingAndLanguages.pdf
  • https://web.archive.org/web/20100724020531/http://sourceforge.net/projects/jchardet/files/jchardet/1.1/jchardet-1.1.zip/download
  • https://sourceforge.net/projects/cpdetector/files/cpdetector/
  • https://stackoverflow.com/questions/4520184/how-to-detect-the-character-encoding-of-a-text-file#4522251

BobLd avatar Dec 16 '24 19:12 BobLd

https://github.com/CharsetDetector/UTF-unknown

Charltsing avatar Dec 17 '24 08:12 Charltsing

@Charltsing Thank you!

potaninmt avatar Dec 17 '24 16:12 potaninmt

@BobLd Thank you for reading and responding! I think yes, it would be useful to correct the encodings in the PDF, considering that I've actually encountered quite a few problematic documents. I don't have much experience with encodings, but I had the following idea for an algorithm. Take a specific list of encodings:

  • UTF-8
  • Unicode
  • Windows-1251
  • Windows-1252
  • ...

Then, by brute force (encode the text into bytes with one encoding, then decode those bytes with another encoding), find the combination that gives the most plausible text. The number of combinations is small: for 4 encodings it is 16.
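The brute-force idea above could be sketched roughly as follows. The plausibility score is an assumption made purely for illustration (the share of Cyrillic and plain ASCII characters, which only suits documents expected to be in Russian), and `Strict`, `Plausibility` and `BestGuess` are hypothetical names:

```csharp
using System;
using System.Text;

// Makes the windows-125x code pages available on .NET Core / .NET 5+.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Throwing fallbacks make impossible (encode, decode) pairs fail instead of producing '?'.
Encoding Strict(string name) =>
    Encoding.GetEncoding(name, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

// Assumed heuristic: fraction of characters that are Cyrillic or ordinary ASCII text.
double Plausibility(string s)
{
    if (s.Length == 0) return 0;
    int good = 0;
    foreach (char c in s)
    {
        bool cyrillic = c >= '\u0400' && c <= '\u04FF';
        bool plainAscii = c < 128 &&
            (char.IsLetterOrDigit(c) || char.IsWhiteSpace(c) || char.IsPunctuation(c));
        if (cyrillic || plainAscii) good++;
    }
    return (double)good / s.Length;
}

// Tries every (wrong, actual) pair and keeps the candidate that beats the original text.
string BestGuess(string text, Encoding[] encodings)
{
    string best = text;
    double bestScore = Plausibility(text);
    foreach (var wrong in encodings)
    {
        foreach (var actual in encodings)
        {
            try
            {
                string candidate = actual.GetString(wrong.GetBytes(text));
                double score = Plausibility(candidate);
                if (score > bestScore) { best = candidate; bestScore = score; }
            }
            catch (EncoderFallbackException) { } // text not representable in 'wrong'
            catch (DecoderFallbackException) { } // bytes not valid in 'actual'
        }
    }
    return best;
}

// 4 encodings give 4 x 4 = 16 combinations.
Encoding[] candidates =
{
    Strict("utf-8"), Strict("utf-16"), Strict("windows-1251"), Strict("windows-1252")
};

string guess = BestGuess("ñëîæíî ïîêàçàòü", candidates);
Console.WriteLine(guess);
```

Because a candidate must score strictly higher than the original to replace it, text that already looks fine is left unchanged. A real implementation would need a language-independent score, which is the hard part.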

Useful links: https://en.wikipedia.org/wiki/Byte_order_mark

potaninmt avatar Dec 17 '24 17:12 potaninmt

@potaninmt I want to take a second look at the issue here. ~~Do you mind sharing how you convert from windows-1251 to windows-1252?~~

Regarding your comment "determine the maximum plausibility of the text", do you have an idea how this could be achieved? I'd be interested in giving your approach a go.

Also, below is an example of how to use the UtfUnknown NuGet package (under the MPL/GPL/LGPL licenses) with PdfPig. One point to note: I'm not sure that always using windows-1252 as the source encoding works.

using System;
using System.Text;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;
using UtfUnknown;

// Make the windows-125x code pages available on .NET Core / .NET 5+.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Console.OutputEncoding = Encoding.UTF8;

using (var doc = PdfDocument.Open("Example.pdf"))
{
    // Assumption: the extracted text is the original bytes misread as windows-1252.
    var encoding1252 = Encoding.GetEncoding("windows-1252");

    var page = doc.GetPage(3);
    var letters = page.Letters;

    var words = NearestNeighbourWordExtractor.Instance.GetWords(letters);
    var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

    foreach (var block in blocks)
    {
        string text = block.Text;

        // Recover the raw bytes, then let UtfUnknown guess their real encoding.
        byte[] bytes = encoding1252.GetBytes(text);
        var result = CharsetDetector.DetectFromBytes(bytes);

        Console.WriteLine(result);

        Console.WriteLine("TEXT:");
        // Decode with the detected encoding; keep the text as-is if nothing usable was detected.
        var detected = result.Detected?.Encoding;
        string corrected = detected != null ? detected.GetString(bytes) : text;

        Console.WriteLine(corrected);
        Console.WriteLine();
    }
}

Output

Detected: Detected koi8-r with confidence of 0.32337818. (BOM: False),
Details:
 - Detected koi8-r with confidence of 0.32337818. (BOM: False)
TEXT:
йнмяоейр ондцнрнбкем ярсдемрюлх, ме опнундхк
опнт педюйрспс х лнфер яндепфюрэ ньхайх
якедхре гю намнбкемхълх мю VK.COM/TEACHINMSU

Detected: Detected windows-1251 with confidence of 0.928417. (BOM: False),
Details:
 - Detected windows-1251 with confidence of 0.928417. (BOM: False)
TEXT:
Содержание

Detected: Detected windows-1251 with confidence of 0.7587563. (BOM: False),
Details:
 - Detected windows-1251 with confidence of 0.7587563. (BOM: False)
TEXT:
Лекция 1
Вложенное аффинное алгебраическое многообразие . . . . . . . . . . . . .
Морфизм. Изоморфизм. Автоморфизм . . . . . . . . . . . . . . . . . . . . .
Вычисление групп автоморфизмов . . . . . . . . . . . . . . . . . . . . . . .
Многомерный алгебраический тор . . . . . . . . . . . . . . . . . . . . . . .
Аффинная плоскость . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Открытые проблемы групп автоморфизмов аффинных алгебраических
многообразий . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
 - Detected ascii with confidence of 1. (BOM: False)
TEXT:
5
5
6
7
9
10

Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
 - Detected ascii with confidence of 1. (BOM: False)
TEXT:
11

Detected: Detected windows-1251 with confidence of 0.7283408. (BOM: False),
Details:
 - Detected windows-1251 with confidence of 0.7283408. (BOM: False)
TEXT:
Лекция 2
Формулировка теоремы Юнга . . . . . . . . . . . . . . . . . . . . . . . . . .
Дифференцирование. Локальное нильпотентное дифференцирование . . .
Многоугольники Ньютона . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Доказательство теоремы Юнга . . . . . . . . . . . . . . . . . . . . . . . . .

Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
 - Detected ascii with confidence of 1. (BOM: False)
TEXT:
14
14
14
16
17

Detected: Detected windows-1251 with confidence of 0.72706634. (BOM: False),
Details:
 - Detected windows-1251 with confidence of 0.72706634. (BOM: False)
TEXT:
Лекция 3
24
Автоморфизм Нагаты . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
Формулировка теоремы Шестакова-Умирбаева. Теорема Смит . . . . . . .
25
Алгебра Пуассона. Алгебра Ли . . . . . . . . . . . . . . . . . . . . . . . . .
26
*-редуцированная пара многочленов . . . . . . . . . . . . . . . . . . . . . .
28
Основная техническая лемма . . . . . . . . . . . . . . . . . . . . . . . . . .
28
Автоморфизмы от трех переменных, не понижающие степень при редукции 30

Detected: Detected windows-1251 with confidence of 0.7789312. (BOM: False),
Details:
 - Detected windows-1251 with confidence of 0.7789312. (BOM: False)
TEXT:
Лекция 4
Завершение доказательств прошлой лекции . . . . . . . . . . . . . . . . . .
Идеи доказательства теоремы Шестакова-Умирбаева . . . . . . . . . . . .
Группы ручных автоморфизмов, свободное произведение . . . . . . . . . .
Амальгамированное произведение . . . . . . . . . . . . . . . . . . . . . . .
Алгебры многочленов от двух переменных есть амальгамированное про-
изведение их подгрупп . . . . . . . . . . . . . . . . . . . . . . . . . . .

Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
 - Detected ascii with confidence of 1. (BOM: False)
TEXT:
32
32
33
34
34

Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
 - Detected ascii with confidence of 1. (BOM: False)
TEXT:
34

Detected: Detected windows-1251 with confidence of 0.7882279. (BOM: False),
Details:
 - Detected windows-1251 with confidence of 0.7882279. (BOM: False)
TEXT:
Лекция 5
Автоморфизм Нагаты дикий над кольцом целых чисел . . . . . . . . . . .
Жјсткие многообразия . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Если группа автоморфизмов алгебраическая, то либо X жјсткое, либо
прямая. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Предложение о единственном максимальном торе. . . . . . . . . . . . . . .

Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
 - Detected ascii with confidence of 1. (BOM: False)
TEXT:
40
40
42

Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
 - Detected ascii with confidence of 1. (BOM: False)
TEXT:
42
45

Detected: Detected windows-1251 with confidence of 0.7396398. (BOM: False),
Details:
 - Detected windows-1251 with confidence of 0.7396398. (BOM: False)
TEXT:
Лекция 6
Доказательство лемм прошлой лекции. . . . . . . . . . . . . . . . . . . . .
Aut(X) - конечное расширение тора. . . . . . . . . . . . . . . . . . . . . . .

Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
 - Detected ascii with confidence of 1. (BOM: False)
TEXT:
48
48
50

Detected: Detected windows-1251 with confidence of 0.84952277. (BOM: False),
Details:
 - Detected windows-1251 with confidence of 0.84952277. (BOM: False)
TEXT:
Лекция 7
Неприводимые изолированные полуинварианты. . . . . . . . . . . . . . . .
Доказательство жесткости многообразия и максимальности тора. . . . . .

Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
 - Detected ascii with confidence of 1. (BOM: False)
TEXT:
58
58
59

Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
 - Detected ascii with confidence of 1. (BOM: False)
TEXT:
3

Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
 - Detected ascii with confidence of 1. (BOM: False)
TEXT:
????????-??????????????
????????? ??? ????? ?.?. ??????????
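The last block above comes back as question marks, most likely because its text was already correct Cyrillic: windows-1252 cannot represent those letters, so the `GetBytes` step replaced each of them with `?`. A minimal sketch, under that assumption, of a guard that only attempts the re-decoding when the text survives the trip through windows-1252 losslessly; `TryGetBytes` is a hypothetical helper, not part of PdfPig or UtfUnknown:

```csharp
using System;
using System.Text;

// Makes the windows-125x code pages available on .NET Core / .NET 5+.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// ExceptionFallback makes GetBytes throw instead of silently writing '?'.
var strict1252 = Encoding.GetEncoding(
    "windows-1252", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

// Hypothetical helper: true only if every character can be encoded without loss.
bool TryGetBytes(string text, Encoding encoding, out byte[] bytes)
{
    try
    {
        bytes = encoding.GetBytes(text);
        return true;
    }
    catch (EncoderFallbackException)
    {
        bytes = Array.Empty<byte>();
        return false;
    }
}

bool garbledIsEncodable = TryGetBytes("Ëåêöèÿ 1", strict1252, out _);  // mojibake form of "Лекция 1"
bool cyrillicIsEncodable = TryGetBytes("Лекция 1", strict1252, out _); // already correct text

Console.WriteLine(garbledIsEncodable);  // True
Console.WriteLine(cyrillicIsEncodable); // False
```

Blocks for which the helper returns false could then be printed unchanged rather than passed to the charset detector.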

BobLd avatar Apr 21 '25 14:04 BobLd

I think OCR would be a better approach in this case. Closing in line with #1095

EliotJones avatar Jul 20 '25 01:07 EliotJones