tika-python
How can I handle parsing failures when iterating over 100k+ files at once?
I'm using the Apache Tika Python client (tika-python) to parse PDF files, and I have more than a million documents. Tika seems to hit some limit: after parsing roughly 100k files, it starts failing to parse new PDFs when I call:
from tika import parser
parsed = parser.from_file('/path/to/file')
Is this a common issue? How can I handle it? Is it possible to restart Tika directly from my Python code so that parsing keeps working? Any help would be appreciated.
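To make the question concrete, here is a minimal sketch of the retry-and-restart loop being asked about. It is not a tika-python feature: `parse_all`, `restart`, `max_attempts`, and `delay_s` are hypothetical names introduced for illustration. The parse function is passed in as an argument, so `parser.from_file` from the snippet above could be supplied as `parse`. How to actually restart the Tika server depends on how it is run, so `restart` is left as a user-supplied hook.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def parse_all(
    paths: Iterable[str],
    parse: Callable[[str], dict],
    restart: Optional[Callable[[], None]] = None,  # hypothetical restart hook
    max_attempts: int = 3,
    delay_s: float = 1.0,
) -> Tuple[Dict[str, dict], List[str]]:
    """Parse each path, retrying on failure.

    Returns (results, failed): parsed output keyed by path, and the list
    of paths that still failed after max_attempts tries.
    """
    results: Dict[str, dict] = {}
    failed: List[str] = []
    for path in paths:
        for attempt in range(1, max_attempts + 1):
            try:
                results[path] = parse(path)
                break  # success, move on to the next file
            except Exception:
                if attempt == max_attempts:
                    failed.append(path)  # give up on this file only
                    break
                if restart is not None:
                    restart()  # assumption: brings the server back up
                time.sleep(delay_s)  # brief pause before retrying
    return results, failed


# Possible usage (assumes tika-python is installed and file_paths is defined):
# from tika import parser
# results, failed = parse_all(file_paths, parser.from_file)
```

Collecting failures in a list instead of aborting means one bad PDF does not stop a million-file run, and the failed paths can be retried separately afterwards.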