pdf-to-markdown
Support Python 3.x
Python 2 is going to be deprecated; let's support Python 3.x.
Some issues were pointed out in https://github.com/johnlinp/pdf-to-markdown/issues/17#issuecomment-509132956
I converted the existing code base to Python 3 using 2to3, installed the dist, and tried running it. It gives this error:
Traceback (most recent call last):
  File "/usr/local/bin/pdf2md", line 4, in <module>
    __import__('pkg_resources').run_script('pdf-to-markdown==0.1.0', 'pdf2md')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1469, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/EGG-INFO/scripts/pdf2md", line 32, in <module>
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/EGG-INFO/scripts/pdf2md", line 27, in main
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/pdf2md/writer.py", line 27, in write
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/pdf2md/writer.py", line 50, in _write_simple
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/pdf2md/pile.py", line 74, in gen_markdown
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/pdf2md/pile.py", line 266, in _gen_paragraph_markdown
  File "/usr/local/lib/python3.7/dist-packages/pdf_to_markdown-0.1.0-py3.7.egg/pdf2md/syntax.py", line 47, in pattern
  File "/usr/lib/python3.7/re.py", line 183, in search
    return _compile(pattern, flags).search(string)
TypeError: cannot use a string pattern on a bytes-like object
I first thought the problem was in re.match or re.search, but I guess the content is arriving as bytes rather than as a string. It looks like an encoding/decoding issue, and it happens even when parsing English-only text. I also get:
TypeError: can only concatenate str (not "bytes") to str
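Both errors can be reproduced outside the project with a minimal sketch. It assumes the text gets encoded to bytes somewhere before it reaches the regex; the sample string, the pattern, and the helper names `try_search` and `try_concat` are made up for illustration and are not taken from the pdf2md code.

```python
import re

text = "1. Introduction"        # hypothetical paragraph text (str)
encoded = text.encode("utf8")   # bytes in Python 3; still a str in Python 2


def try_search(value):
    """Search with a str pattern; return the TypeError message, or None on success."""
    try:
        re.search(r"^\d+\.", value)
        return None
    except TypeError as exc:
        return str(exc)


def try_concat(value):
    """Concatenate a str prefix with value; return the TypeError message, or None."""
    try:
        "# " + value
        return None
    except TypeError as exc:
        return str(exc)


# A str value works in both cases; the encoded bytes value fails in both.
print(try_search(text), try_concat(text))
print(try_search(encoded))
print(try_concat(encoded))
```

In Python 2, `.encode('utf8')` on a plain ASCII string still gave a `str`, which is why the same code did not fail there.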
I only wanted to report the error. I might try to work on it when I have some time.
I am not sure if this is the correct way to do it, but adding .decode(encoding="utf-8") fixes it, and the extension works perfectly with all files, including the example file in the repo.
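A rough sketch of that workaround, assuming the value may reach the regex or the string concatenation as bytes. The helper name `to_text` and the sample data are hypothetical and do not come from the repository.

```python
import re


def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding=encoding)
    return value


raw = "1. Introduction".encode("utf8")   # hypothetical bytes content

# Decoding before use lets the str pattern and str concatenation work again.
match = re.search(r"^\d+\.", to_text(raw))
heading = "# " + to_text(raw)
print(match.group(0), heading)
```

Values that are already `str` pass through unchanged, so the helper can be applied at the point of use without touching the existing `.encode` calls.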
Hi @nidhi-wgl,
According to @nella17's PR (#22), we can see that simply removing the .encode('utf8') part should work. Please see https://github.com/johnlinp/pdf-to-markdown/pull/22/commits/6791abf93da7c2aa79ab3e7cd4ae87957bcae271.
Thanks @nella17!
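To illustrate why dropping the encode call is enough in Python 3, here is a small sketch. The function names and the sample text are hypothetical and are not copied from the PR; they only contrast the two behaviours.

```python
def heading_with_encode(text):
    # Python 2 habit: encode before returning. In Python 3 this yields bytes.
    return text.encode("utf8")


def heading_without_encode(text):
    # Python 3: str is already Unicode text, so it can be used as-is.
    return text


sample = "Résumé"  # hypothetical non-ASCII text
print(type(heading_with_encode(sample)).__name__)
print(type(heading_without_encode(sample)).__name__)

# Only the un-encoded value can be joined with other str values.
line = "# " + heading_without_encode(sample)
print(line)
```

If the output is later written to a file, the encoding can be chosen there instead, for example with `open(path, "w", encoding="utf-8")`.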
Yeah, that is another way around it. I didn't want to remove .encode or any existing code, so I was proposing to add the decode line for anyone who wants to run the code in Python 3.