Issues with Landscape PDFs

Open reisner opened this issue 3 years ago • 0 comments

Hi there,

I see a parsing issue with landscape PDFs. For example, this one.

When I run

from tika import parser
parser.from_file("https://pub-edmonton.escribemeetings.com/filestream.ashx?DocumentId=24237")['content']

I get a bunch of short words that look like:

...
nm\ne\nn\nt \n2\n \n\n P\na\ng\ne\n 1\n o\nf \n5\n \n\nR\ne\np\no\nrt\n: \nC\nR\n_\n2\n9\n9\n2\n \n\n M\na\ntu\nre\n N\ne\nig\nh\nb\no\nu\nrh\no\no\nd\n O\nv\ne\nrl\na\ny\n R\ne\ng\nu\nla\nti\no\nn\ns\n \n\n 81\n4.\n1 \nG\nen\ner\nal\n P\nur\npo\nse\n: \nT\nhe\n p\nur\npo\nse\n o\nf \nth\nis\n O\nve\nrl\nay\n is\n to\n\n e\nns\nur\ne \nth\nat\n n\new\n\n l\now\n\n d\nen\nsi\nty\n d\nev\nel\nop\nm\nen\nt \nin\n E\ndm\n\non\nto\nn’\ns \nm\nat\nur\ne \n\nre\nsi\nde\nnt\nia\nl n\n\nei\ngh\nbo\nur\nho\nod\n\ns \nis\n s\nen\nsi\nti\nve\n in\n\n s\nca\nle\n to\n\n e\nxi\nst\nin\ng \nde\nve\nlo\npm\n\nen\nt, \nm\nai\nnt\nai\nns\n th\n\ne \ntr\nad\nit\nio\nna\nl \n\nch\nar\nac\nte\nr \nan\nd \npe\nde\nst\nri\nan\n-f\nri\nen\ndl\ny \nde\nsi\ngn\n o\nf \nth\ne \nst\nre\net\nsc\nap\ne,\n e\nns\nur\nes\n p\nri\nva\ncy\n a\nnd\n s\nun\nli\ngh\nt p\n\nen\net\nra\nti\non\n o\nn \n\nad\nja\nce\nnt\n p\nro\npe\nrt\nie\ns \nan\nd \npr\nov\nid\nes\n o\npp\nor\ntu\nni\nty\n f\nor\n d\nis\ncu\nss\nio\nn \nbe\ntw\nee\nn \nap\npl\nic\nan\nts\n a\nnd\n n\nei\ngh\nbo\nur\nin\ng \n\naf\nfe\nct\ned\n p\nar\nti\nes\n w\nhe\nn \na \nde\nve\nlo\npm\n\nen\nt p\n\nro\npo\nse\ns \nto\n v\nar\ny \nth\ne \nO\nve\nrl\nay\n r\neg\nul\nat\nio\nns\n.  \n\n 81\n4.\n3 \nD\
...

The text is all there, but line breaks are inserted in the middle of words, so the output comes back as chunks of one or two characters each. Is there a special setting for landscape PDFs?
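
A rough way to quantify the symptom is to measure how many newline-separated chunks are very short. This is only a minimal sketch: the helper name `fragment_stats` and the 2-character threshold are my own assumptions for illustration, not part of tika-python, and the sample string is a short excerpt of the output above.

```python
def fragment_stats(text):
    """Return (number of non-empty chunks, share of chunks with <= 2 characters).

    Hypothetical diagnostic helper; the 2-character threshold is an assumption.
    """
    # Split on newlines and drop surrounding whitespace and empty chunks.
    chunks = [c.strip() for c in text.split("\n")]
    chunks = [c for c in chunks if c]
    if not chunks:
        return 0, 0.0
    short = sum(1 for c in chunks if len(c) <= 2)
    return len(chunks), short / len(chunks)


# Short excerpt of the fragmented output shown above.
sample = "R\ne\np\no\nrt\n: \nC\nR\n_\n2\n9\n9\n2\n"
count, short_share = fragment_stats(sample)
print(count, short_share)
```

A share close to 1.0 suggests the fragmented extraction described here, while normally extracted prose should score much lower.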

reisner, May 17 '21 19:05