Add support for encoding detection when default encoding is not correct
I noticed a bug when extracting text from PDFs that contain several different encodings. For example, for one document I had to convert windows-1251 to windows-1252 to get readable text. Would it be possible to extract the text successfully even when a document mixes many encodings? It is even possible that each token in a PDF has its own encoding.
@potaninmt thanks for raising the issue, can you share a sample PDF? Thanks!
@BobLd Thanks, I'm attaching the file! Example.pdf
@potaninmt unfortunately, I believe this is not a PdfPig issue.
Firefox, Edge and Acrobat Reader are not able to copy the text properly either, which indicates that the PDF document is not built correctly. In Firefox, copying a passage from the document gives:
ñëîæíî ïîêàçàòü, ÷òî åñëè ðàññìîòðåòü áåñêîíå÷íóþ ïîëèíîìèàëüíóþ ñèñòåìó, òî îíà ñîéäåòñÿ ê íåêîòîðîé êîíå÷
Your best chance of getting the correct text extracted might be OCR. Or did you manage to play around with windows-1251 / windows-1252 to get the correct text after extraction?
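For reference, the windows-1251 / windows-1252 round trip can be sketched as below. This is only an illustration, not PdfPig API: `Redecode` is a hypothetical helper name, the snippet assumes .NET 6 or later, and it assumes the garbled text is really windows-1251 bytes that were decoded as windows-1252, which may not hold for every broken PDF.

```csharp
using System;
using System.Text;

// Code pages such as windows-1251 are not available on .NET Core / .NET 5+
// until this provider is registered.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Console.OutputEncoding = Encoding.UTF8;

// Hypothetical helper: turn the garbled string back into bytes using the
// encoding it was wrongly decoded with, then decode those bytes again
// with the encoding we believe is the right one.
static string Redecode(string garbled, string wrongName, string rightName)
{
    Encoding wrong = Encoding.GetEncoding(wrongName);
    Encoding right = Encoding.GetEncoding(rightName);
    return right.GetString(wrong.GetBytes(garbled));
}

// First two words of the text copied from Firefox.
Console.WriteLine(Redecode("ñëîæíî ïîêàçàòü", "windows-1252", "windows-1251"));
```

With the first two words of the Firefox sample, this should give back the Russian words "сложно показать", since each windows-1252 character maps to exactly one byte and that byte is a Cyrillic letter in windows-1251.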
@BobLd The thing is that there are a lot of such broken PDFs on the internet, but there are ways to fix them, so maybe it would be useful for you to implement one:
- This service can automatically fix encoding errors: https://2cyr.com/decode/?lang=ru
- This GitHub project can automatically detect the encoding of a text file: https://github.com/yinyue200/ude?tab=readme-ov-file
@BobLd
Hi @potaninmt, thanks a lot for pointing me to this library. I wasn't aware it even existed and it's extremely interesting.
I had a quick look and the Ude library is based on the Mozilla Universal Charset Detector (MUCD). All implementations of the MUCD I could find are under the MPL/GPL2/LGPL2.1 license, which is not really compatible with the PdfPig license.
One option would be to release a separate NuGet package under the same license as the detector, leaving PdfPig's own license untouched. It would also be possible to write our own implementation (trickier).
I'll have a look at the first option, hopefully in the short term. Happy for you to give any implementation advice.
Some references (mainly for me) I found on the topic:
- https://www-archive.mozilla.org/projects/intl/universalcharsetdetection
- https://www.unicode.org/iuc/iuc19/a322.html
- https://cs229.stanford.edu/proj2007/KimPark-AutomaticDetectionOfCharacterEncodingAndLanguages.pdf
- https://web.archive.org/web/20100724020531/http://sourceforge.net/projects/jchardet/files/jchardet/1.1/jchardet-1.1.zip/download
- https://sourceforge.net/projects/cpdetector/files/cpdetector/
- https://stackoverflow.com/questions/4520184/how-to-detect-the-character-encoding-of-a-text-file#4522251
https://github.com/CharsetDetector/UTF-unknown
@Charltsing Thank you!
@BobLd Thank you for reading and responding! Yes, I think it would be useful to correct the encodings in the PDF, considering that I've actually encountered quite a few problematic documents. I don't have much experience with encodings, but I had the following idea for an algorithm. Take a specific list of encodings:
- UTF-8
- Unicode
- Windows-1251
- Windows-1252
- ...

Then, by brute force (encode the text into bytes with one encoding, decode those bytes with another encoding), find the combination that gives the most plausible text. The number of combinations is small: for 4 encodings it is 16.
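The brute-force idea above could be sketched roughly as follows. Everything here is an assumption made for illustration: the names `Plausibility` and `BestRepair` are hypothetical, the encoding list is arbitrary, and the score (share of letters that belong to a hand-picked set of frequent Russian letters) only makes sense if the expected language is Russian. A real implementation would need a much better language model. The snippet assumes .NET 6 or later.

```csharp
using System;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Console.OutputEncoding = Encoding.UTF8;

string[] names = { "utf-8", "windows-1251", "windows-1252", "koi8-r" };

// Hypothetical score: share of letters that are frequent Russian letters.
static double Plausibility(string s)
{
    const string common = "оеаинтсрвлкмдпу";
    int letters = 0, hits = 0;
    foreach (char c in s.ToLowerInvariant())
    {
        if (!char.IsLetter(c)) continue;
        letters++;
        if (common.IndexOf(c) >= 0) hits++;
    }
    return letters == 0 ? 0.0 : (double)hits / letters;
}

// Try every (encode with A, decode with B) pair and keep the best-scoring text.
static string BestRepair(string text, string[] names)
{
    string best = text;
    double bestScore = Plausibility(text);
    foreach (string from in names)
    {
        foreach (string to in names)
        {
            if (from == to) continue;
            try
            {
                // Exception fallbacks make impossible conversions fail loudly
                // instead of silently producing '?' characters.
                var src = Encoding.GetEncoding(from, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                var dst = Encoding.GetEncoding(to, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                string candidate = dst.GetString(src.GetBytes(text));
                double score = Plausibility(candidate);
                if (score > bestScore) { best = candidate; bestScore = score; }
            }
            catch (EncoderFallbackException) { /* text not representable in 'from' */ }
            catch (DecoderFallbackException) { /* bytes not valid in 'to' */ }
        }
    }
    return best;
}

Console.WriteLine(BestRepair("ñëîæíî ïîêàçàòü", names));
```

Note that a frequency-based score like this separates Cyrillic from Latin mojibake easily, but it is much weaker at telling two Cyrillic code pages apart (for example windows-1251 versus koi8-r), because both decodings consist entirely of Cyrillic letters.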
Useful links: https://en.wikipedia.org/wiki/Byte_order_mark
@potaninmt I want to take a second look at the issue here. ~~Do you mind sharing how you convert from windows-1251 to windows-1252?~~
Regarding your comment "determine the maximum plausibility of the text", do you have an idea how this could be achieved? I'd be interested in giving your approach a go.
Also, below is an example of how to use the UtfUnknown NuGet package (under the MPL/GPL2/LGPL2.1 license) with PdfPig. One point to note: I'm not sure that always using windows-1252 as the source encoding works.
using System;
using System.Text;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;
using UtfUnknown;

// On .NET Core / .NET 5+, code pages such as windows-1252 need this provider.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Console.OutputEncoding = Encoding.UTF8;

using (var doc = PdfDocument.Open("Example.pdf"))
{
    // Assumption: the garbled text came from bytes decoded as windows-1252.
    var encoding1252 = Encoding.GetEncoding("windows-1252");
    var page = doc.GetPage(3);
    var letters = page.Letters;
    var words = NearestNeighbourWordExtractor.Instance.GetWords(letters);
    var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

    foreach (var block in blocks)
    {
        string text = block.Text;
        // Recover the raw bytes, then let UtfUnknown guess their real encoding.
        byte[] bytes = encoding1252.GetBytes(text);
        var result = CharsetDetector.DetectFromBytes(bytes);
        Console.WriteLine(result);
        Console.WriteLine("TEXT:");
        // Detected (or its Encoding) can be null; keep the original text then.
        var detected = result.Detected?.Encoding;
        string corrected = detected != null ? detected.GetString(bytes) : text;
        Console.WriteLine(corrected);
        Console.WriteLine();
    }
}
Output
Detected: Detected koi8-r with confidence of 0.32337818. (BOM: False),
Details:
- Detected koi8-r with confidence of 0.32337818. (BOM: False)
TEXT:
йнмяоейр ондцнрнбкем ярсдемрюлх, ме опнундхк
опнт педюйрспс х лнфер яндепфюрэ ньхайх
якедхре гю намнбкемхълх мю VK.COM/TEACHINMSU
Detected: Detected windows-1251 with confidence of 0.928417. (BOM: False),
Details:
- Detected windows-1251 with confidence of 0.928417. (BOM: False)
TEXT:
Содержание
Detected: Detected windows-1251 with confidence of 0.7587563. (BOM: False),
Details:
- Detected windows-1251 with confidence of 0.7587563. (BOM: False)
TEXT:
Лекция 1
Вложенное аффинное алгебраическое многообразие . . . . . . . . . . . . .
Морфизм. Изоморфизм. Автоморфизм . . . . . . . . . . . . . . . . . . . . .
Вычисление групп автоморфизмов . . . . . . . . . . . . . . . . . . . . . . .
Многомерный алгебраический тор . . . . . . . . . . . . . . . . . . . . . . .
Аффинная плоскость . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Открытые проблемы групп автоморфизмов аффинных алгебраических
многообразий . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
- Detected ascii with confidence of 1. (BOM: False)
TEXT:
5
5
6
7
9
10
Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
- Detected ascii with confidence of 1. (BOM: False)
TEXT:
11
Detected: Detected windows-1251 with confidence of 0.7283408. (BOM: False),
Details:
- Detected windows-1251 with confidence of 0.7283408. (BOM: False)
TEXT:
Лекция 2
Формулировка теоремы Юнга . . . . . . . . . . . . . . . . . . . . . . . . . .
Дифференцирование. Локальное нильпотентное дифференцирование . . .
Многоугольники Ньютона . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Доказательство теоремы Юнга . . . . . . . . . . . . . . . . . . . . . . . . .
Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
- Detected ascii with confidence of 1. (BOM: False)
TEXT:
14
14
14
16
17
Detected: Detected windows-1251 with confidence of 0.72706634. (BOM: False),
Details:
- Detected windows-1251 with confidence of 0.72706634. (BOM: False)
TEXT:
Лекция 3
24
Автоморфизм Нагаты . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
Формулировка теоремы Шестакова-Умирбаева. Теорема Смит . . . . . . .
25
Алгебра Пуассона. Алгебра Ли . . . . . . . . . . . . . . . . . . . . . . . . .
26
*-редуцированная пара многочленов . . . . . . . . . . . . . . . . . . . . . .
28
Основная техническая лемма . . . . . . . . . . . . . . . . . . . . . . . . . .
28
Автоморфизмы от трех переменных, не понижающие степень при редукции 30
Detected: Detected windows-1251 with confidence of 0.7789312. (BOM: False),
Details:
- Detected windows-1251 with confidence of 0.7789312. (BOM: False)
TEXT:
Лекция 4
Завершение доказательств прошлой лекции . . . . . . . . . . . . . . . . . .
Идеи доказательства теоремы Шестакова-Умирбаева . . . . . . . . . . . .
Группы ручных автоморфизмов, свободное произведение . . . . . . . . . .
Амальгамированное произведение . . . . . . . . . . . . . . . . . . . . . . .
Алгебры многочленов от двух переменных есть амальгамированное про-
изведение их подгрупп . . . . . . . . . . . . . . . . . . . . . . . . . . .
Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
- Detected ascii with confidence of 1. (BOM: False)
TEXT:
32
32
33
34
34
Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
- Detected ascii with confidence of 1. (BOM: False)
TEXT:
34
Detected: Detected windows-1251 with confidence of 0.7882279. (BOM: False),
Details:
- Detected windows-1251 with confidence of 0.7882279. (BOM: False)
TEXT:
Лекция 5
Автоморфизм Нагаты дикий над кольцом целых чисел . . . . . . . . . . .
Жјсткие многообразия . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Если группа автоморфизмов алгебраическая, то либо X жјсткое, либо
прямая. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Предложение о единственном максимальном торе. . . . . . . . . . . . . . .
Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
- Detected ascii with confidence of 1. (BOM: False)
TEXT:
40
40
42
Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
- Detected ascii with confidence of 1. (BOM: False)
TEXT:
42
45
Detected: Detected windows-1251 with confidence of 0.7396398. (BOM: False),
Details:
- Detected windows-1251 with confidence of 0.7396398. (BOM: False)
TEXT:
Лекция 6
Доказательство лемм прошлой лекции. . . . . . . . . . . . . . . . . . . . .
Aut(X) - конечное расширение тора. . . . . . . . . . . . . . . . . . . . . . .
Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
- Detected ascii with confidence of 1. (BOM: False)
TEXT:
48
48
50
Detected: Detected windows-1251 with confidence of 0.84952277. (BOM: False),
Details:
- Detected windows-1251 with confidence of 0.84952277. (BOM: False)
TEXT:
Лекция 7
Неприводимые изолированные полуинварианты. . . . . . . . . . . . . . . .
Доказательство жесткости многообразия и максимальности тора. . . . . .
Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
- Detected ascii with confidence of 1. (BOM: False)
TEXT:
58
58
59
Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
- Detected ascii with confidence of 1. (BOM: False)
TEXT:
3
Detected: Detected ascii with confidence of 1. (BOM: False),
Details:
- Detected ascii with confidence of 1. (BOM: False)
TEXT:
????????-??????????????
????????? ??? ????? ?.?. ??????????
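In the output above, the one wrong guess (koi8-r for the first block) has a confidence of about 0.32, while the correct windows-1251 guesses are all above 0.72. One possible mitigation, sketched below, is to trust the detector only above a threshold and otherwise fall back to a default encoding. The helper name `DecodeWithFallback`, the 0.5 threshold and the windows-1251 fallback are all assumptions chosen for this document, not part of UtfUnknown or PdfPig. To stay self-contained, the detected encoding and confidence are passed in by hand rather than taken from a `CharsetDetector` call.

```csharp
using System;
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Console.OutputEncoding = Encoding.UTF8;

// Hypothetical helper: use the detected encoding only when the detector is
// confident enough; otherwise decode with a caller-supplied fallback.
static string DecodeWithFallback(byte[] bytes, Encoding detected, float confidence,
                                 Encoding fallback, float minConfidence = 0.5f)
{
    Encoding chosen = detected != null && confidence >= minConfidence ? detected : fallback;
    return chosen.GetString(bytes);
}

Encoding cp1251 = Encoding.GetEncoding("windows-1251");
Encoding koi8r = Encoding.GetEncoding("koi8-r");

// Bytes of the first word of the misdetected heading, as windows-1251.
byte[] heading = cp1251.GetBytes("КОНСПЕКТ");

// Low confidence (as in the first block): the fallback is used.
Console.WriteLine(DecodeWithFallback(heading, koi8r, 0.32f, cp1251));
// High confidence: the detector's guess is trusted, even though it is wrong here.
Console.WriteLine(DecodeWithFallback(heading, koi8r, 0.93f, cp1251));
```

The second call reproduces the garbled "йнмяоейр" from the first block, which shows the limit of this approach: a threshold only helps when wrong guesses also happen to have low confidence.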
I think OCR would be a better approach in this case. Closing in line with #1095