
Improve paragraph detection

Open Pawwwle opened this issue 2 years ago • 4 comments

What happened?

The application does not preserve paragraphs (newlines) in the captured text. Please fix this, because inserting paragraph breaks manually is very cumbersome.

Screenshot 2023-08-27 195351

How did you install NormCap?

MSI installer (Windows)

Operating System + Version?

Windows 10/11

[Linux only] Display Server (DS) + Desktop environment (DE)?

No response

Debug log output?*

No response

Pawwwle avatar Sep 08 '23 07:09 Pawwwle

Hi @Pawwwle, what you are experiencing is NormCap's "parse" mode, which tries to detect certain common text layouts and automatically (re-)formats the text accordingly.

In your example, NormCap detects the selected text as a "Paragraph" (multiple lines of continuous text). In that situation you usually don't want to preserve the line breaks, which is why they get removed.

However, in your case this is a false detection. NormCap should instead have detected the text as "Multiline" (multiple lines of text that are not continuous, like lists). Then it would have preserved the line breaks.

Workaround: As a short-term solution, when the "parse" mode fails and returns unexpected results, try switching NormCap's "Capture Mode" to "raw", which will output the text exactly as detected by Tesseract (including line breaks).



The desired solution: (to be implemented) The "Paragraph" heuristic should be improved by taking the dimensions of the detection boxes into account:

  • Lines of similar length should indicate "Paragraphs", different length indicate "Multilines"
  • Relatively small gaps between lines should indicate "Paragraphs", larger gaps between lines indicate "Multilines"
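To make the two rules above concrete, here is a minimal sketch of how such a heuristic could look. It is not NormCap's actual code: the `LineBox` structure, the `classify_layout` function, and both thresholds are assumptions chosen for illustration, and real values would need tuning against Tesseract's output.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class LineBox:
    """Bounding box of one detected text line in pixels (hypothetical structure)."""

    left: int
    top: int
    width: int
    height: int


def classify_layout(
    lines: list[LineBox],
    length_tolerance: float = 0.15,  # assumed threshold, not taken from NormCap
    gap_factor: float = 0.8,  # assumed threshold, not taken from NormCap
) -> str:
    """Return "paragraph" or "multiline" based on line widths and vertical gaps."""
    if len(lines) < 2:
        return "multiline"

    # The last line of a paragraph is often shorter, so ignore it when
    # comparing lengths (only if enough lines remain to compare).
    body = lines[:-1] if len(lines) > 2 else lines
    widths = [box.width for box in body]
    avg_width = mean(widths) or 1
    similar_length = pstdev(widths) / avg_width <= length_tolerance

    # Vertical gap between the bottom of one line and the top of the next,
    # compared against the average line height.
    gaps = [nxt.top - (cur.top + cur.height) for cur, nxt in zip(lines, lines[1:])]
    avg_height = mean(box.height for box in lines)
    small_gaps = max(gaps) <= gap_factor * avg_height

    return "paragraph" if similar_length and small_gaps else "multiline"
```

With these assumed thresholds, four tightly spaced lines of roughly equal width (with a shorter last line) are classified as a paragraph, while lines of clearly different widths, or lines separated by gaps larger than a line height, are classified as multiline.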

dynobo avatar Sep 19 '23 16:09 dynobo

I ran further tests. Unfortunately, neither mode reflects the original text layout.

Screenshot 2023-09-19 201415

Screenshot 2023-09-19 201516

Pawwwle avatar Sep 19 '23 18:09 Pawwwle

Thanks for trying, @Pawwwle. This does indeed seem odd. I'll take a look at it and hopefully get it improved a bit for the final 0.5.0 version :slightly_smiling_face:

dynobo avatar Oct 08 '23 21:10 dynobo

I was able to identify an issue in "raw" mode which caused some missing line breaks. I was also able to improve the paragraph parsing a bit:

With https://github.com/dynobo/normcap/pull/552/commits/8ad2f6ab3957686d85cd285354bcf557f8a7ac1a, when detecting the image ... test image

... then "raw"-mode comes quite close to original layout:

The desired solution: (to be implemented)
The "Paragraph" heuristic should be improved by taking the dimensions of the detection boxes into account:

« Lines of similar length should indicate "Paragraphs”, different length indicate "Multilines"

« Relatively small gaps between lines should indicate "Paragraphs”, larger gaps between lines indicate "Multilines"

... while "parse"-mode does swallow the first line-break:

The desired solution: (to be implemented) The "Paragraph" heuristic should be improved by taking the dimensions of the detection boxes into account:
« Lines of similar length should indicate "Paragraphs", different length indicate "Multilines"
« Relatively small gaps between lines should indicate "Paragraphs", larger gaps between lines indicate "Multilines"

This is not ideal here, but in most cases you don't want to preserve such intra-paragraph line-breaks, as those should be added by the application you are pasting the text into, depending on the supported line-width. So I guess this is fine.
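The behaviour described here can be illustrated with a small sketch: single line breaks inside a paragraph are replaced by spaces, while blank-line paragraph breaks are kept as one line break. This is only an approximation that matches the outputs shown in this thread, not NormCap's actual implementation, and `join_paragraph_lines` is a hypothetical helper name.

```python
import re


def join_paragraph_lines(raw_text: str) -> str:
    """Collapse intra-paragraph line breaks, keep paragraph breaks as one newline."""
    # Paragraphs are assumed to be separated by at least one blank line.
    paragraphs = re.split(r"\n\s*\n", raw_text.strip())
    # Inside each paragraph, any run of whitespace (including newlines)
    # becomes a single space.
    return "\n".join(" ".join(paragraph.split()) for paragraph in paragraphs)


raw = "first line\nsecond line\n\nnew paragraph\ncontinues here"
print(join_paragraph_lines(raw))
```

The application receiving the pasted text can then re-wrap each paragraph to its own line width.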

The results for the example picture in the initial bug report are also a bit better, but Tesseract detects only an intra-paragraph line break between the first and second bullet points, while it detects a paragraph break between the second and third bullet points:

"parse"-mode:

Bądź Smart! i kupuj oraz sprzedawaj z darmową dostawą na Allegro Lokalnie
E kupujesz bez kosztów przesytki przy zakupie za min. 45 zł od jednego sprzedawcy G sprzedajesz z darmową dostawą do Paczkomatów
0 zwiększasz atrakcyjność swoich ogłoszeń, dzięki oznaczeniu Smart!

"raw"-mode:

Bądź Smart! i kupuj oraz sprzedawaj z darmową dostawą na Allegro Lokalnie

Eb kupujesz bez kosztów przesytki przy zakupie za min. 45 zł od jednego sprzedawcy
G sprzedajesz z darmową dostawą do Paczkomatów

9 zwiększasz atrakcyjność swoich ogłoszeń, dzieki oznaczeniu Smart!

I'm afraid this is as good as I can get it for now, as this is a limitation of Tesseract... :slightly_frowning_face:

dynobo avatar Nov 05 '23 02:11 dynobo