How to split a scan into single pages

Open TeoColuccio opened this issue 4 years ago • 8 comments

I have this page. Can I split this A3 scan into two A4 pages during export to PDF? [Screenshot from 2020-04-06 16-16-41]

TeoColuccio avatar Apr 06 '20 14:04 TeoColuccio

No, this is currently not possible.

manisandro avatar Apr 06 '20 14:04 manisandro

This would indeed be a very useful feature

TeoColuccio avatar Apr 06 '20 14:04 TeoColuccio

It would probably fall under defining a manual page layout before recognition, which is much easier than attempting to fix up the layout afterwards.

manisandro avatar Apr 06 '20 20:04 manisandro

OK, so I also posted this on the Tesseract forum: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/TcbG4-vB8NM

For now, a workaround I found is to first split each page with PDF Arranger and then use gImageReader.
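
For scans that are still image files, the same split can be roughly sketched in a script. The snippet below only computes the two crop boxes for a landscape double-page scan. It assumes the fold sits exactly on the vertical midline, and the name `split_boxes` is made up for illustration; it is not part of gImageReader, Tesseract or PDF Arranger.

```python
from typing import List, Tuple

# A crop box as (left, upper, right, lower), in pixels.
Box = Tuple[int, int, int, int]


def split_boxes(width: int, height: int) -> List[Box]:
    """Return crop boxes for the left and right halves of a double-page scan.

    Assumption: the two pages meet exactly at the vertical midline.
    If the width is odd, the right half is one pixel wider.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    mid = width // 2
    left_page = (0, 0, mid, height)
    right_page = (mid, 0, width, height)
    return [left_page, right_page]
```

A landscape A3 scan at 300 dpi is about 4961 x 3508 pixels, so `split_boxes(4961, 3508)` gives one box 2480 pixels wide and one 2481 pixels wide, each close to a portrait A4 page. With an imaging library such as Pillow, each box could then be passed to `Image.crop` and the halves saved as separate files before OCR.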

TeoColuccio avatar Apr 07 '20 07:04 TeoColuccio

Prior to OCR with gImageReader, I preprocess all scans using ScanTailor Advanced. It has all the functionality asked for here (and much more). I recommend this workflow highly, as OCR results and visual looks of the resulting documents are much, much better. You will find it here: https://github.com/4lex4/scantailor-advanced

Jossi2 avatar May 14 '20 20:05 Jossi2

OK, I'll try it. Until now, I have simply used PDF Arranger, which does the job as well.

TeoColuccio avatar May 15 '20 16:05 TeoColuccio

Jossi2 wrote: "Prior to OCR with gImageReader, I preprocess all scans using ScanTailor Advanced. It has all the functionality asked for here (and much more). I recommend this workflow highly, as OCR results and visual looks of the resulting documents are much, much better."

Question: In its output filter, ScanTailor has the setting "mode->mixed", which creates a "picture zones" layer (tab "picture zones").

If I understand correctly, ScanTailor then includes two layers when it writes the TIFF output file: a) a picture layer and b) a text layer.

Do you use this function of ScanTailor?

It looks like ScanTailor separates text from pictures better than gImageReader/Tesseract does in books that mix pictures and text.

I guess that when gImageReader is given such a TIFF generated by ScanTailor, it applies OCR only to the text layer, so the result would be better. Is that right? Does gImageReader support the text-layer and picture-layer output in ScanTailor TIFF files?

I would be very interested in an answer. Thank you.

Golddouble avatar Jul 04 '22 18:07 Golddouble

Sorry, but this is a function of ScanTailor Advanced I've never used so far, and I'm not sure how it works at all. For my purposes, a layer consisting of the (processed) scan image of the whole page plus the text layer produced by OCR have always been sufficient as I have no need to extract individual pictures from the finished PDF later on.

Jossi2 avatar Jul 04 '22 20:07 Jossi2