
Problems when parsing older papers in PDF format

Open XueliPan opened this issue 1 year ago • 2 comments

Hi, thanks for this great toolkit! I tried papermage on several PDF files. It works really well with recent papers, but when I tried to parse some papers published in 1980 and 1989, papermage failed to parse the sentences: each "sentence" comes out as a fragment of only one or two words.

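# Assumption: `recipe` is the default CoreRecipe shown in the papermage README.
from papermage.recipes import CoreRecipe

recipe = CoreRecipe()
doc = recipe.run("1980.pdf")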
for sen in doc.sentences:
    print(sen.text)
'''
output:
Received
January
1978;
revised
October
1979;
accepted
December 1979
References
1.
Avery,
K.
R.
,
and
Avery,
C.
A.
Design
and
development
of an interactive
statistical
system
(SIPS).
Proc.
Comptr.
Sci.
and
Statistics: 8th
Ann.
Symp.
on
'''

XueliPan avatar Dec 15 '23 21:12 XueliPan

Interesting! Could you send me the PDF so I can have a look at it? Older PDFs are not something we have really investigated much.

kyleclo avatar Dec 19 '23 01:12 kyleclo

1980.pdf, 1989.pdf: these are the two PDF files that I tested. Thanks!

XueliPan avatar Dec 19 '23 11:12 XueliPan
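
For anyone who wants to check whether their own PDFs hit the same problem, here is a rough sketch of a check for the fragmentation shown in the output above. It is not part of papermage: the helper names `mean_words_per_sentence` and `looks_fragmented` and the threshold of 3 words are made up for illustration, and it only looks at plain strings, so it makes no assumptions about the library internals.

```python
from statistics import mean


def mean_words_per_sentence(sentences):
    """Average number of whitespace-separated words per non-empty sentence."""
    counts = [len(s.split()) for s in sentences if s.strip()]
    return mean(counts) if counts else 0.0


def looks_fragmented(sentences, threshold=3.0):
    """Heuristic (hypothetical): flag output whose sentences average
    fewer than `threshold` words. Empty input is not flagged."""
    avg = mean_words_per_sentence(sentences)
    return 0 < avg < threshold


# The first few "sentences" from the 1980.pdf output in this thread.
fragments = ["Received", "January", "1978;", "revised",
             "October", "1979;", "accepted", "December 1979"]

# Ordinary prose for comparison.
prose = ["Hi, thanks for this great toolkit!",
         "It works really well with recent papers."]

print(looks_fragmented(fragments))
print(looks_fragmented(prose))
```

With a parsed document like the one in the original post, the input would presumably be `[sen.text for sen in doc.sentences]`. The threshold is arbitrary and may need tuning for reference lists or tables, where short "sentences" are normal.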