unipdf
[BUG] Text extraction for lists does not work as expected
Description
Text that sits on a single line of a list in the PDF is no longer on a single line after extraction to text.
For example:
6. Miu to kekkon shitai desu.
becomes
6.
Miu to kekkon shitai desu.
Expected Behavior
Text extraction should keep each list number on the same line as its item text. The first list should be extracted as:
6. Miu to kekkon shitai desu.
7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun)
8. ŌZOYA HARUYA: A... sumimasen. Boku wa Miu-san to kekkon shitai desu.
9. O, o-jō-san o kudasai.
10. TENDŌ RAIZŌ: Kimi wa... Nani ga hoshii? Kane ka? Ie ka? A?
11. Soretomo uchi no kaisha ga hoshii no ka?
12. TENŌ MIU: Papa!
The second list should be extracted as:
ENGLISH
1. ŌZORA HARUYA: (Tendō Family)Ah, nice to meet you. I am Ōzora Haruya.
2. TENZO RAIZO: Huh?
3. ŌZORAHARUYA: ah, um...sir, please give me your daughter.
4. TENDOU RAIZO: Huh?
5. ŌZORA HARUYA: I...I want to be with Miu forever.
6. I want to marry Miu.
7. TENDO RAIZO: I? Miu? (harrumph)
8. ŌZORA HARUYA: I...I'm sorry. I want to marry Miu.
Actual Behavior
List extraction separates the list numbers from their item text.
For example, the first list is extracted as:
6.
Miu to kekkon shitai desu.
7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun)
8. ŌZOYA HARUYA: A... sumimasen. Boku wa Miu-san to kekkon shitai desu.
9.
10.
11.
12.
O, o-jō-san o kudasai.
TENDŌ RAIZŌ: Kimi wa... Nani ga hoshii? Kane ka? Ie ka? A?
Soretomo uchi no kaisha ga hoshii no ka?
TENŌ MIU: Papa!
The second list is extracted as:
ENGLISH
1. ŌZORA HARUYA:
2. TENZO RAIZO:
3. ŌZORAHARUYA:
4. TENDOU RAIZO:
5. ŌZORA HARUYA:
6. I want to marry Miu.
7. TENDO RAIZO:
8. ŌZORA HARUYA: (Tendō Family)Ah, nice to meet you. I am Ōzora Haruya.
Huh?
ah, um...sir, please give me your daughter.
Huh?
I...I want to be with Miu forever.
I? Miu? (harrumph)
I...I'm sorry. I want to marry Miu.
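The first failure mode above (a list number left alone on its line) can be checked for mechanically. The following is only a minimal sketch for triaging extracted text, not part of unipdf: the helper name `orphanedMarkers` and the sample string are made up for illustration, and it does not catch the second failure mode, where the speaker label stays with the number but the sentence is moved.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bareMarker matches a line that holds only a list number such as "6." or "12.".
var bareMarker = regexp.MustCompile(`^\d+\.$`)

// orphanedMarkers (hypothetical helper) returns the 1-based line numbers of
// lines in text that consist of a list number with no item text after it.
func orphanedMarkers(text string) []int {
	var out []int
	for i, line := range strings.Split(text, "\n") {
		if bareMarker.MatchString(strings.TrimSpace(line)) {
			out = append(out, i+1)
		}
	}
	return out
}

func main() {
	// Shortened sample modeled on the actual output shown above.
	extracted := "6.\nMiu to kekkon shitai desu.\n7. TENDŌ RAIZŌ: Ore.... Miu .... (Fun)\n9.\n10.\nO, o-jō-san o kudasai."
	fmt.Println(orphanedMarkers(extracted)) // prints [1 4 5]
}
```

An empty result for a page would suggest that every list number kept at least some text on its line.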
Attachments
PDF file that demonstrates the issue: B_S4L4_p4_github.pdf
Self-contained code snippet that reproduces it:
package main

import (
	"fmt"
	"os"

	"github.com/unidoc/unipdf/v3/extractor"
	"github.com/unidoc/unipdf/v3/model"
)

func main() {
	if err := outputPdfText(os.Args[1]); err != nil {
		panic(err)
	}
}

// outputPdfText prints out contents of PDF file to stdout.
func outputPdfText(inputPath string) error {
	f, err := os.Open(inputPath)
	if err != nil {
		return err
	}
	defer f.Close()

	pdfReader, err := model.NewPdfReader(f)
	if err != nil {
		return err
	}

	numPages, err := pdfReader.GetNumPages()
	if err != nil {
		return err
	}

	fmt.Printf("--------------------\n")
	fmt.Printf("PDF to text extraction:\n")
	fmt.Printf("--------------------\n")
	for i := 0; i < numPages; i++ {
		pageNum := i + 1

		page, err := pdfReader.GetPage(pageNum)
		if err != nil {
			return err
		}

		ex, err := extractor.New(page)
		if err != nil {
			return err
		}

		pt, _, _, err := ex.ExtractPageText()
		if err != nil {
			return err
		}
		text := pt.Text()

		// text, err := ex.ExtractText()
		// if err != nil {
		// 	return err
		// }

		fmt.Println("------------------------------")
		fmt.Printf("Page %d:\n", pageNum)
		fmt.Printf("\"%s\"\n", text)
		fmt.Println("------------------------------")
	}

	return nil
}
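To make the expected behavior concrete, the sketch below shows one way positioned fragments could be regrouped into lines by baseline, so that a list number and the text next to it end up together. It is an illustration only and not how unipdf works internally. The `mark` type, `groupByBaseline`, the tolerance value, and all coordinates are hypothetical. It assumes that a position is available for each fragment (unipdf's page text appears to expose text marks with bounding boxes, but that mapping is not shown here).

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// mark is a hypothetical stand-in for one positioned text fragment.
type mark struct {
	Text string
	X, Y float64 // left edge and baseline in PDF points (Y grows upward)
}

// groupByBaseline joins marks whose baseline is within tol of the first mark
// of a row into one line. Lines are returned top to bottom, and fragments
// within a line are ordered left to right.
func groupByBaseline(marks []mark, tol float64) []string {
	sorted := append([]mark(nil), marks...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var rows [][]mark
	for _, m := range sorted {
		n := len(rows)
		if n > 0 && math.Abs(rows[n-1][0].Y-m.Y) <= tol {
			rows[n-1] = append(rows[n-1], m)
		} else {
			rows = append(rows, []mark{m})
		}
	}

	lines := make([]string, 0, len(rows))
	for _, row := range rows {
		sort.SliceStable(row, func(i, j int) bool { return row[i].X < row[j].X })
		parts := make([]string, len(row))
		for i, m := range row {
			parts[i] = m.Text
		}
		lines = append(lines, strings.Join(parts, " "))
	}
	return lines
}

func main() {
	// Invented coordinates: the number and its text share (almost) one baseline.
	marks := []mark{
		{"Miu to kekkon shitai desu.", 90, 700},
		{"6.", 72, 700.4},
		{"7.", 72, 686},
		{"TENDŌ RAIZŌ: Ore.... Miu .... (Fun)", 90, 686},
	}
	for _, line := range groupByBaseline(marks, 1.0) {
		fmt.Println(line)
	}
}
```

With these made-up inputs each list number is printed on the same line as its item text, which is the result I expect from the extractor for the attached PDF.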