langchainjs icon indicating copy to clipboard operation
langchainjs copied to clipboard

Improves pdf parsing

Open Nishkalkashyap opened this issue 2 years ago • 5 comments

The below changes improve upon PDFLoader.parse method. Instead of adding a newline character after every TextItem, we only add a newline if the TextItem has an EOL character. This significantly improves the output text. Attaching an example below -

Before:

Contents

Foreword
 
iii
Preface
 
v

1.
 
Real Numbers
 
1

1.1
 
Introduction
 
1
1.2
 
Euclid's Division Lemma
 
2
1.3
 
The Fundamental Theorem of Arithmetic
 
7
1.4
 
Revisiting Irrational Numbers
 
11
1.5
 
Revisiting Rational Numbers and Their Decimal Expansions
 
15
1.6
 
Summary

After:

Contents
Foreword iii
Preface v
1. Real Numbers 1
1.1 Introduction 1
1.2 Euclid's Division Lemma 2
1.3 The Fundamental Theorem of Arithmetic 7
1.4 Revisiting Irrational Numbers 11
1.5 Revisiting Rational Numbers and Their Decimal Expansions 15
1.6 Summary 18

Nishkalkashyap avatar Apr 22 '23 05:04 Nishkalkashyap

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
langchainjs-docs ✅ Ready (Inspect) Visit Preview Apr 22, 2023 6:11pm

vercel[bot] avatar Apr 22 '23 05:04 vercel[bot]

That doesn't seem to generalise to all pdfs, see eg the output for this pdf https://arxiv.org/pdf/1706.03762.pdf

Screenshot 2023-04-22 at 18 58 20

nfcampos avatar Apr 22 '23 17:04 nfcampos

I've committed here the solution used by https://gitlab.com/autokent/pdf-parse, do you want to try on your pdf and see if that's better?

nfcampos avatar Apr 22 '23 18:04 nfcampos

That doesn't seem to generalise to all pdfs, see eg the output for this pdf https://arxiv.org/pdf/1706.03762.pdf

Are you sure you ran the PDF against my commit? Because I executed it against the PDF you mentioned (https://arxiv.org/pdf/1706.03762.pdf) and I'm getting a different result, as shown below.

Perhaps your console is not rendering the \n character and that's why the result seems off?

image

Nishkalkashyap avatar Apr 23 '23 10:04 Nishkalkashyap

I've committed here the solution used by https://gitlab.com/autokent/pdf-parse, do you want to try on your pdf and see if that's better?

Unfortunately, the solution that you have committed doesn't seem to work for my pdfs. Example - red side of the diff editor is the output from my commit and green side would be from your commit.

image

Nishkalkashyap avatar Apr 23 '23 10:04 Nishkalkashyap