
Upload to Datastore errors when DOWNLOAD_PREVIEW_ONLY=True

Open twdbben opened this issue 1 year ago • 2 comments

Describe the bug

I'm running datapusher-plus via datapusher-plus-docker with the following config parameters set:

PREVIEW_ROWS=1000
ADD_SUMMARY_STATS_RESOURCE=True
SUMMARY_STATS_WITH_PREVIEW=True
DOWNLOAD_PREVIEW_ONLY=True

The problem appears to be tied to DOWNLOAD_PREVIEW_ONLY=True. With it set, pushing a resource to DP+ fails with the errors shown below.

Setting DOWNLOAD_PREVIEW_ONLY=False makes the errors go away.

These are my test resource files that are failing:

TCEQ-TEST.xlsx TCEQ-TEST.csv

When I push the attached XLSX file, I get this error:

datapusher-plus  | --- Logging error ---
datapusher-plus  | Traceback (most recent call last):
datapusher-plus  |   File "/srv/app/src/datapusher-plus/datapusher/jobs.py", line 630, in push_to_datastore
datapusher-plus  |     qsv_excel = subprocess.run(
datapusher-plus  |   File "/usr/lib/python3.10/subprocess.py", line 524, in run
datapusher-plus  |     raise CalledProcessError(retcode, process.args,
datapusher-plus  | subprocess.CalledProcessError: Command '['/usr/local/bin/qsvdp', 'excel', '/tmp/tmp8s4qgo7c.XLSX', '--sheet', '0', '--trim', '--output', '/tmp/tmp7ns3tj6h.csv']' returned non-zero exit status 1.
datapusher-plus  |
datapusher-plus  | During handling of the above exception, another exception occurred:
datapusher-plus  |
datapusher-plus  | Traceback (most recent call last):
datapusher-plus  |   File "/usr/lib/python3.10/logging/handlers.py", line 1057, in emit
datapusher-plus  |     smtp = smtplib.SMTP(self.mailhost, port, timeout=self.timeout)
datapusher-plus  |   File "/usr/lib/python3.10/smtplib.py", line 255, in __init__
datapusher-plus  |     (code, msg) = self.connect(host, port)
datapusher-plus  |   File "/usr/lib/python3.10/smtplib.py", line 341, in connect
datapusher-plus  |     self.sock = self._get_socket(host, port, self.timeout)
datapusher-plus  |   File "/usr/lib/python3.10/smtplib.py", line 312, in _get_socket
datapusher-plus  |     return socket.create_connection((host, port), timeout,
datapusher-plus  |   File "/usr/lib/python3.10/socket.py", line 845, in create_connection
datapusher-plus  |     raise err
datapusher-plus  |   File "/usr/lib/python3.10/socket.py", line 833, in create_connection
datapusher-plus  |     sock.connect(sa)
datapusher-plus  | ConnectionRefusedError: [Errno 111] Connection refused
datapusher-plus  | Call stack:
datapusher-plus  |   File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
datapusher-plus  |     self._bootstrap_inner()
datapusher-plus  |   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
datapusher-plus  |     self.run()
datapusher-plus  |   File "/usr/lib/python3.10/threading.py", line 953, in run
datapusher-plus  |     self._target(*self._args, **self._kwargs)
datapusher-plus  |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 83, in _worker
datapusher-plus  |     work_item.run()
datapusher-plus  |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
datapusher-plus  |     result = self.fn(*self.args, **self.kwargs)
datapusher-plus  |   File "/usr/lib/ckan/dpplus_venv/lib/python3.10/site-packages/apscheduler/executors/base.py", line 125, in run_job
datapusher-plus  |     retval = job.func(*job.args, **job.kwargs)
datapusher-plus  |   File "/srv/app/src/datapusher-plus/datapusher/jobs.py", line 646, in push_to_datastore
datapusher-plus  |     logger.error(
datapusher-plus  | Message: "Upload aborted. Cannot export spreadsheet(?) to CSV: Command '['/usr/local/bin/qsvdp', 'excel', '/tmp/tmp8s4qgo7c.XLSX', '--sheet', '0', '--trim', '--output', '/tmp/tmp7ns3tj6h.csv']' returned non-zero exit status 1."
datapusher-plus  | Arguments: ()
datapusher-plus  | 2023-06-26 16:53:33,176 WARNING Is the file encrypted or is not a spreadsheet?
datapusher-plus  | FILE ATTRIBUTES: /tmp/tmp8s4qgo7c.XLSX: Microsoft Excel 2007+
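
One possible explanation, which is an assumption and not something the log confirms: an XLSX file is a ZIP archive whose directory sits at the end of the file, so a copy cut short by a preview-only download would no longer open, even though its first bytes still identify it as Excel 2007+. A minimal sketch for checking whether a downloaded temp file is a complete archive (the function name `is_complete_zip` is hypothetical and not part of DP+):

```python
import zipfile


def is_complete_zip(source):
    """Return True if `source` (a path or a binary file object) is a ZIP
    archive that opens and whose members all pass their CRC check."""
    # is_zipfile looks for the end-of-central-directory record, which a
    # truncated archive will normally be missing.
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        return archive.testzip() is None


# Example (path taken from the log above; the file only exists inside the
# container while the job is running):
# print(is_complete_zip("/tmp/tmp8s4qgo7c.XLSX"))
```

If this returns False for the temp file but True for the original upload, that would point at the download step and not at qsvdp itself.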

When I try with the same XLSX file converted to a CSV, I get the following error:

datapusher-plus  | 2023-06-26 17:02:57,076 INFO Fetching from: http://192.168.7.200:5000/dataset/8cbdffdb-1cef-4c9d-84fd-005fde129962/resource/9af29c46-4f37-4c8f-9021-09bf7af88f9b/download/tceq-test.csv...
datapusher-plus  | 127.0.0.1 - - [26/Jun/2023:17:02:57 +0000] "GET /job/3b1c2e8d-29de-4d65-87b8-e3d800129cfe HTTP/1.1" 200 1111 "-" "python-requests/2.25.1"
datapusher-plus  | 2023-06-26 17:02:57,161 INFO Downloading only first 1,000 row preview from 5.31MB file...
datapusher-plus  | 2023-06-26 17:02:57,170 INFO Fetched 0.09MB file in 0.09 seconds.
datapusher-plus  | 2023-06-26 17:02:57,177 INFO ANALYZING WITH QSV..
datapusher-plus  | 2023-06-26 17:02:57,184 INFO Normalizing/UTF-8 transcoding CSV...
datapusher-plus  | Invalid CSV. Last valid row (4): CSV error: record 4 (line: 5, byte: 446): found record with 23 fields, but the previous record has 3 fields
datapusher-plus  | 2023-06-26 17:02:57,237 ERROR Job aborted as the file cannot be normalized/transcoded: Command '['/usr/local/bin/qsvdp', 'input', '/tmp/tmpso60e8jy..csv', '--trim-headers', '--output', '/tmp/tmp3zaav2od.csv']' returned non-zero exit status 1..
datapusher-plus  | --- Logging error ---
datapusher-plus  | Traceback (most recent call last):
datapusher-plus  |   File "/srv/app/src/datapusher-plus/datapusher/jobs.py", line 692, in push_to_datastore
datapusher-plus  |     subprocess.run(
datapusher-plus  |   File "/usr/lib/python3.10/subprocess.py", line 524, in run
datapusher-plus  |     raise CalledProcessError(retcode, process.args,
datapusher-plus  | subprocess.CalledProcessError: Command '['/usr/local/bin/qsvdp', 'input', '/tmp/tmpso60e8jy..csv', '--trim-headers', '--output', '/tmp/tmp3zaav2od.csv']' returned non-zero exit status 1.
datapusher-plus  |
datapusher-plus  | During handling of the above exception, another exception occurred:
datapusher-plus  |
datapusher-plus  | Traceback (most recent call last):
datapusher-plus  |   File "/usr/lib/python3.10/logging/handlers.py", line 1057, in emit
datapusher-plus  |     smtp = smtplib.SMTP(self.mailhost, port, timeout=self.timeout)
datapusher-plus  |   File "/usr/lib/python3.10/smtplib.py", line 255, in __init__
datapusher-plus  |     (code, msg) = self.connect(host, port)
datapusher-plus  |   File "/usr/lib/python3.10/smtplib.py", line 341, in connect
datapusher-plus  |     self.sock = self._get_socket(host, port, self.timeout)
datapusher-plus  |   File "/usr/lib/python3.10/smtplib.py", line 312, in _get_socket
datapusher-plus  |     return socket.create_connection((host, port), timeout,
datapusher-plus  |   File "/usr/lib/python3.10/socket.py", line 845, in create_connection
datapusher-plus  |     raise err
datapusher-plus  |   File "/usr/lib/python3.10/socket.py", line 833, in create_connection
datapusher-plus  |     sock.connect(sa)
datapusher-plus  | ConnectionRefusedError: [Errno 111] Connection refused
datapusher-plus  | Call stack:
datapusher-plus  |   File "/usr/lib/python3.10/threading.py", line 973, in _bootstrap
datapusher-plus  |     self._bootstrap_inner()
datapusher-plus  |   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
datapusher-plus  |     self.run()
datapusher-plus  |   File "/usr/lib/python3.10/threading.py", line 953, in run
datapusher-plus  |     self._target(*self._args, **self._kwargs)
datapusher-plus  |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 83, in _worker
datapusher-plus  |     work_item.run()
datapusher-plus  |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
datapusher-plus  |     result = self.fn(*self.args, **self.kwargs)
datapusher-plus  |   File "/usr/lib/ckan/dpplus_venv/lib/python3.10/site-packages/apscheduler/executors/base.py", line 125, in run_job
datapusher-plus  |     retval = job.func(*job.args, **job.kwargs)
datapusher-plus  |   File "/srv/app/src/datapusher-plus/datapusher/jobs.py", line 706, in push_to_datastore
datapusher-plus  |     logger.error(
datapusher-plus  | Message: "Job aborted as the file cannot be normalized/transcoded: Command '['/usr/local/bin/qsvdp', 'input', '/tmp/tmpso60e8jy..csv', '--trim-headers', '--output', '/tmp/tmp3zaav2od.csv']' returned non-zero exit status 1.."
datapusher-plus  | Arguments: ()
datapusher-plus  | 127.0.0.1 - - [26/Jun/2023:17:03:01 +0000] "GET /job/3b1c2e8d-29de-4d65-87b8-e3d800129cfe HTTP/1.1" 200 2217 "-" "python-requests/2.25.1"
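
The "found record with 23 fields, but the previous record has 3 fields" message suggests the number of fields changes within the first few records of the file. A rough sketch for locating where that happens, using only Python's standard `csv` module (the helper name `field_count_changes` is hypothetical, and its 1-based record numbers may not line up with qsv's numbering):

```python
import csv


def field_count_changes(lines):
    """Return (record_number, field_count) for every record whose field
    count differs from the record before it. Record numbers are 1-based."""
    changes = []
    previous = None
    for number, record in enumerate(csv.reader(lines), start=1):
        if previous is not None and len(record) != previous:
            changes.append((number, len(record)))
        previous = len(record)
    return changes


# Example against a local copy of the test file:
# with open("tceq-test.csv", newline="") as handle:
#     print(field_count_changes(handle))
```

An empty list means every record has the same number of fields. Any entries show the records where the width changes, which can then be compared with the line and byte offset qsvdp reports.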

twdbben avatar Jun 26 '23 17:06 twdbben