Parsr
Error in TableDetection2 script: FileNotFoundError: [Errno 2] No such file or directory: 'java': 'java'
To reproduce: Run the V1.1.0 Docker image and try to extract tables with TableDetection2 enabled.
parsr_1 | File "/usr/local/lib/python3.7/dist-packages/tabula/io.py", line 85, in _run
parsr_1 | check=True,
parsr_1 | File "/usr/lib/python3.7/subprocess.py", line 472, in run
parsr_1 | with Popen(*popenargs, **kwargs) as process:
parsr_1 | File "/usr/lib/python3.7/subprocess.py", line 775, in __init__
parsr_1 | restore_signals, start_new_session)
parsr_1 | File "/usr/lib/python3.7/subprocess.py", line 1522, in _execute_child
parsr_1 | raise child_exception_type(errno_num, err_msg, err_filename)
parsr_1 | FileNotFoundError: [Errno 2] No such file or directory: 'java': 'java'
parsr_1 |
parsr_1 | During handling of the above exception, another exception occurred:
parsr_1 |
parsr_1 | Traceback (most recent call last):
parsr_1 | File "/opt/app-root/src/dist/assets/TableDetection2Script.py", line 212, in <module>
parsr_1 | main()
parsr_1 | File "/opt/app-root/src/dist/assets/TableDetection2Script.py", line 188, in main
parsr_1 | tables2 = tabula.read_pdf(pdf_file, stream=True, pages='all', output_format="json")
parsr_1 | File "/usr/local/lib/python3.7/dist-packages/tabula/io.py", line 322, in read_pdf
parsr_1 | output = _run(java_options, kwargs, path, encoding)
parsr_1 | File "/usr/local/lib/python3.7/dist-packages/tabula/io.py", line 91, in _run
parsr_1 | raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)
parsr_1 | tabula.errors.JavaNotFoundError: `java` command is not found from this Python process.Please ensure Java is installed and PATH is set for `java`
parsr_1 |
Hello @jfilter,
Could you provide us the file you tried to parse?
It happens for me with all files that don't contain a table (and thus table1 does not find a table). Here is an example (it's a letter, but it's public information). My config:
{
  "version": 0.9,
  "extractor": {
    "pdf": "pdfminer",
    "ocr": "tesseract",
    "language": ["deu"]
  },
  "cleaner": [
    "drawing-detection",
    [
      "image-detection",
      {
        "ocrImages": false
      }
    ],
    "out-of-page-removal",
    [
      "whitespace-removal",
      {
        "minWidth": 0
      }
    ],
    [
      "redundancy-detection",
      {
        "minOverlap": 0.5
      }
    ],
    [
      "table-detection",
      {
        "runConfig": [
          {
            "pages": [],
            "flavor": "lattice"
          }
        ]
      }
    ],
    [
      "table-detection-2",
      {
        "runConfig": [
          {
            "pages": []
          }
        ]
      }
    ],
    [
      "header-footer-detection",
      {
        "ignorePages": [],
        "maxMarginPercentage": 15
      }
    ],
    "words-to-line-new",
    [
      "reading-order-detection",
      {
        "minVerticalGapWidth": 5,
        "minColumnWidthInPagePercent": 15
      }
    ],
    [
      "lines-to-paragraph",
      {
        "tolerance": 0.25
      }
    ],
    "page-number-detection",
    "hierarchy-detection"
  ],
  "output": {
    "granularity": "word",
    "includeMarginals": true,
    "includeDrawings": true,
    "formats": {
      "json": true,
      "text": false,
      "csv": true,
      "markdown": false,
      "pdf": false,
      "simpleJson": false
    }
  }
}
00014_012720_Stellungnahme_BV-Augen%C3%A4rzte_RefE__JVEG-%C3%84ndG.pdf
Ok, thanks. Which OS are you using?
I run it with the Docker image on Ubuntu and macOS.
I see exactly the same. It looks like there is no JDK in the Parsr base Docker image.
+1 to what @NadiaRom said: Java is not installed.
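For anyone who wants to confirm this diagnosis in their own container, here is a minimal sketch that checks whether a `java` executable is visible on the PATH of the Python process, which is what the `JavaNotFoundError` message in the traceback complains about. The helper names `find_executable` and `java_version` are made up for illustration and are not part of Parsr or tabula-py.

```python
import shutil
import subprocess


def find_executable(name="java"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


def java_version(java_path):
    """Return the first line printed by `java -version`.

    The JVM writes its version banner to stderr, so stderr is read first.
    """
    result = subprocess.run(
        [java_path, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # spelled this way to stay Python 3.7 compatible
    )
    output = result.stderr or result.stdout
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_executable("java")
    if path is None:
        print("java not found on PATH; tabula-py cannot launch its Java backend")
    else:
        print("java found at", path, "->", java_version(path))
```

Running this with the same `python3` interpreter that executes `TableDetection2Script.py` should print the "not found" line if Java really is missing from the image.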
Same here with the newest docker container when going through the official "Jupyter Notebook Demo" tutorial:
[2021-05-21T10:59:31] INFO (parsr-api/7 on d11c50a15136): executing command: python3 /opt/app-root/src/dist/assets/TableDetection2Script.py /tmp/cc7fc6b8253399c96cbef5f0a7107a.pdf all
[2021-05-21T10:59:33] INFO (parsr-api/7 on d11c50a15136): executing command error: Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/tabula/io.py", line 85, in _run
check=True,
File "/usr/lib/python3.7/subprocess.py", line 472, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib/python3.7/subprocess.py", line 775, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.7/subprocess.py", line 1522, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'java': 'java'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/app-root/src/dist/assets/TableDetection2Script.py", line 212, in <module>
main()
File "/opt/app-root/src/dist/assets/TableDetection2Script.py", line 188, in main
tables2 = tabula.read_pdf(pdf_file, stream=True, pages='all', output_format="json")
File "/usr/local/lib/python3.7/dist-packages/tabula/io.py", line 322, in read_pdf
output = _run(java_options, kwargs, path, encoding)
File "/usr/local/lib/python3.7/dist-packages/tabula/io.py", line 91, in _run
raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)
tabula.errors.JavaNotFoundError: `java` command is not found from this Python process.Please ensure Java is installed and PATH is set for `java`
Same here on Windows using the latest Docker image from Docker Hub. It fails at exactly the same place in the process.
What is the best workaround for this? Should we modify the docker image to add a layer for installing/setting the path to Java?
@jbrry: If you comment out the "table-detection-2" part of the serverConfig.json file, you won't see this error.
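Since standard JSON has no comment syntax, "commenting out" in practice means deleting the entry from the `cleaner` list. A rough sketch of doing that programmatically follows, assuming the config has the same shape as the one posted earlier in this thread (each cleaner entry is either a bare module name or a `[name, options]` pair). The function name `remove_cleaner_module` and the small example config are illustrative only.

```python
import json


def remove_cleaner_module(config, module_name):
    """Return a copy of `config` whose cleaner list omits `module_name`.

    Cleaner entries are either a bare string such as "page-number-detection"
    or a [name, options] pair such as ["table-detection-2", {...}].
    """
    def name_of(entry):
        return entry if isinstance(entry, str) else entry[0]

    cleaned = dict(config)  # shallow copy; the input dict is not modified
    cleaned["cleaner"] = [
        entry
        for entry in config.get("cleaner", [])
        if name_of(entry) != module_name
    ]
    return cleaned


if __name__ == "__main__":
    # Trimmed-down example config, not the full one from the thread.
    example = {
        "version": 0.9,
        "cleaner": [
            "out-of-page-removal",
            ["table-detection-2", {"runConfig": [{"pages": []}]}],
            "hierarchy-detection",
        ],
    }
    print(json.dumps(remove_cleaner_module(example, "table-detection-2"), indent=2))
```

This only sidesteps the error by skipping the module that needs Java; it does not restore the table detection that module would have provided.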