[BUG] Text extraction for lists does not work as expected

Open traitman opened this issue 1 year ago • 2 comments

Description

Text that sits on the same line as a list marker in the PDF does not end up on the same line after extraction to text.

For example:

6. Miu to kekkon shitai desu.

becomes

6.

Miu to kekkon shitai desu.

Expected Behavior

Text extraction for lists should keep each marker on the same line as its text. The first list should result in:

6. Miu to kekkon shitai desu.

7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun) 
8. ŌZOYA HARUYA: A... sumimasen. Boku wa Miu-san to kekkon shitai desu. 

9. O, o-jō-san o kudasai.

10. TENDŌ RAIZŌ: Kimi wa... Nani ga hoshii? Kane ka? Ie ka? A?

11. Soretomo uchi no kaisha ga hoshii no ka?

12. TENŌ MIU: Papa!

The second list should result in:

ENGLISH

1. ŌZORA HARUYA: (Tendō Family)Ah, nice to meet you. I am Ōzora Haruya.
2. TENZO RAIZO: Huh?
3. ŌZORAHARUYA: ah, um...sir, please give me your daughter.
4. TENDOU RAIZO: Huh?
5. ŌZORA HARUYA: I...I want to be with Miu forever.
6. I want to marry Miu. 
7. TENDO RAIZO: I? Miu? (harrumph)
8. ŌZORA HARUYA: I...I'm sorry. I want to marry Miu. 

Actual Behavior

List extraction separates the markers from their text.

For example, the first list is extracted as:

6.

Miu to kekkon shitai desu.

7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun) 
8. ŌZOYA HARUYA: A... sumimasen. Boku wa Miu-san to kekkon shitai desu. 

9.

10.

11.

12.

O, o-jō-san o kudasai.

TENDŌ RAIZŌ: Kimi wa... Nani ga hoshii? Kane ka? Ie ka? A?

Soretomo uchi no kaisha ga hoshii no ka?

TENŌ MIU: Papa!

The second list is extracted as:

ENGLISH

1. ŌZORA HARUYA: 
2. TENZO RAIZO: 
3. ŌZORAHARUYA: 
4. TENDOU RAIZO: 
5. ŌZORA HARUYA: 
6. I want to marry Miu. 
7. TENDO RAIZO: 
8. ŌZORA HARUYA:  (Tendō Family)Ah, nice to meet you. I am Ōzora Haruya.

Huh?

ah, um...sir, please give me your daughter.

Huh?

I...I want to be with Miu forever.

I? Miu? (harrumph)

I...I'm sorry. I want to marry Miu.
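
As a possible stopgap for the first list, the orphaned markers could be paired with the text lines that follow them after extraction. The sketch below is not part of unipdf; `rejoinListMarkers` and `markerOnly` are hypothetical names, and the approach assumes that markers and their text come out in the same relative order, as in the output above. It does not handle the second list, where the marker keeps the speaker label and only the spoken text is displaced.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// markerOnly matches a line that holds nothing but a numeric list marker,
// such as "6." or "12.".
var markerOnly = regexp.MustCompile(`^\d+\.$`)

// rejoinListMarkers is a hypothetical post-processing heuristic: it queues
// marker-only lines and attaches each one, in order, to the next line that
// carries text. Blank lines are dropped while markers are still waiting.
func rejoinListMarkers(text string) string {
	var out []string
	var pending []string // markers still waiting for their text

	for _, line := range strings.Split(text, "\n") {
		trimmed := strings.TrimSpace(line)
		switch {
		case trimmed == "":
			// Keep blank lines only when no marker is waiting for text.
			if len(pending) == 0 {
				out = append(out, line)
			}
		case markerOnly.MatchString(trimmed):
			pending = append(pending, trimmed)
		case len(pending) > 0:
			out = append(out, pending[0]+" "+trimmed)
			pending = pending[1:]
		default:
			out = append(out, line)
		}
	}

	// Markers that never found a text line are emitted unchanged.
	out = append(out, pending...)
	return strings.Join(out, "\n")
}

func main() {
	extracted := "6.\n\nMiu to kekkon shitai desu.\n\n9.\n\n10.\n\n" +
		"O, o-jō-san o kudasai.\n\nTENDŌ RAIZŌ: Kimi wa..."
	fmt.Println(rejoinListMarkers(extracted))
}
```

This only masks the symptom in the extracted string; the underlying line grouping in the extractor would still need a fix.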

Attachments

PDF file that demonstrates the issue: B_S4L4_p4_github.pdf

Self-contained code snippet that reproduces it:

package main

import (
	"fmt"
	"os"

	"github.com/unidoc/unipdf/v3/extractor"
	"github.com/unidoc/unipdf/v3/model"
)

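func main() {
	// Require the input PDF path as the only argument.
	if len(os.Args) < 2 {
		fmt.Fprintf(os.Stderr, "usage: %s input.pdf\n", os.Args[0])
		os.Exit(1)
	}

	if err := outputPdfText(os.Args[1]); err != nil {
		panic(err)
	}
}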

// outputPdfText prints out contents of PDF file to stdout.
func outputPdfText(inputPath string) error {
	f, err := os.Open(inputPath)
	if err != nil {
		return err
	}

	defer f.Close()

	pdfReader, err := model.NewPdfReader(f)
	if err != nil {
		return err
	}

	numPages, err := pdfReader.GetNumPages()
	if err != nil {
		return err
	}

	fmt.Printf("--------------------\n")
	fmt.Printf("PDF to text extraction:\n")
	fmt.Printf("--------------------\n")
	for i := 0; i < numPages; i++ {
		pageNum := i + 1

		page, err := pdfReader.GetPage(pageNum)
		if err != nil {
			return err
		}

		ex, err := extractor.New(page)
		if err != nil {
			return err
		}

		pt, _, _, err := ex.ExtractPageText()
		if err != nil {
			return err
		}

		text := pt.Text()
		// text, err := ex.ExtractText()
		// if err != nil {
		// 	return err
		// }

		fmt.Println("------------------------------")
		fmt.Printf("Page %d:\n", pageNum)
		fmt.Printf("\"%s\"\n", text)
		fmt.Println("------------------------------")
	}

	return nil
}

traitman, Jan 25 '23 15:01