DocsGPT
🐛 Bug Report: EmptyFileError occurs every time
📜 Description
Each time I try to upload a file and train, the backend log says:
docsgpt-worker-1 | [2023-10-07 13:06:41,814: INFO/MainProcess] Task application.api.user.tasks.ingest[7334f70e-d91b-480c-bc4c-29bf63104297] received
docsgpt-worker-1 | [2023-10-07 13:06:41,815: WARNING/ForkPoolWorker-6] inputs/local/CirC.pdf
docsgpt-worker-1 | [2023-10-07 13:06:43,083: WARNING/ForkPoolWorker-6] <Response [502]>
docsgpt-worker-1 | [2023-10-07 13:06:43,088: ERROR/ForkPoolWorker-6] Task application.api.user.tasks.ingest[7334f70e-d91b-480c-bc4c-29bf63104297] raised unexpected: EmptyFileError('Cannot read an empty file')
docsgpt-worker-1 | Traceback (most recent call last):
docsgpt-worker-1 | File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 451, in trace_task
docsgpt-worker-1 | R = retval = fun(*args, **kwargs)
docsgpt-worker-1 | File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 734, in __protected_call__
docsgpt-worker-1 | return self.run(*args, **kwargs)
docsgpt-worker-1 | File "/app/application/api/user/tasks.py", line 6, in ingest
docsgpt-worker-1 | resp = ingest_worker(self, directory, formats, name_job, filename, user)
docsgpt-worker-1 | File "/app/application/worker.py", line 73, in ingest_worker
docsgpt-worker-1 | exclude_hidden=exclude, file_metadata=metadata_from_filename).load_data()
docsgpt-worker-1 | File "/app/application/parser/file/bulk.py", line 146, in load_data
docsgpt-worker-1 | data = parser.parse_file(input_file, errors=self.errors)
docsgpt-worker-1 | File "/app/application/parser/file/docs_parser.py", line 28, in parse_file
docsgpt-worker-1 | pdf = PyPDF2.PdfReader(fp)
docsgpt-worker-1 | File "/usr/local/lib/python3.10/site-packages/PyPDF2/_reader.py", line 319, in __init__
docsgpt-worker-1 | self.read(stream)
docsgpt-worker-1 | File "/usr/local/lib/python3.10/site-packages/PyPDF2/_reader.py", line 1414, in read
docsgpt-worker-1 | self._basic_validation(stream)
docsgpt-worker-1 | File "/usr/local/lib/python3.10/site-packages/PyPDF2/_reader.py", line 1455, in _basic_validation
docsgpt-worker-1 | raise EmptyFileError("Cannot read an empty file")
docsgpt-worker-1 | PyPDF2.errors.EmptyFileError: Cannot read an empty file
docsgpt-backend-1 | [2023-10-07 13:06:46 +0000] [8] [ERROR] Error handling request /api/task_status?task_id=7334f70e-d91b-480c-bc4c-29bf63104297
docsgpt-backend-1 | Traceback (most recent call last):
docsgpt-backend-1 | File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/sync.py", line 136, in handle
docsgpt-backend-1 | self.handle_request(listener, req, client, addr)
docsgpt-backend-1 | File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/sync.py", line 179, in handle_request
docsgpt-backend-1 | respiter = self.wsgi(environ, resp.start_response)
docsgpt-backend-1 | File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2552, in __call__
docsgpt-backend-1 | return self.wsgi_app(environ, start_response)
docsgpt-backend-1 | File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2532, in wsgi_app
docsgpt-backend-1 | response = self.handle_exception(e)
docsgpt-backend-1 | File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2529, in wsgi_app
docsgpt-backend-1 | response = self.full_dispatch_request()
docsgpt-backend-1 | File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1826, in full_dispatch_request
docsgpt-backend-1 | return self.finalize_request(rv)
docsgpt-backend-1 | File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1845, in finalize_request
docsgpt-backend-1 | response = self.make_response(rv)
docsgpt-backend-1 | File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2157, in make_response
docsgpt-backend-1 | rv = self.json.response(rv)
docsgpt-backend-1 | File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 309, in response
docsgpt-backend-1 | f"{self.dumps(obj, **dump_args)}\n", mimetype=mimetype
docsgpt-backend-1 | File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 230, in dumps
docsgpt-backend-1 | return json.dumps(obj, **kwargs)
docsgpt-backend-1 | File "/usr/local/lib/python3.10/json/__init__.py", line 238, in dumps
docsgpt-backend-1 | **kw).encode(obj)
docsgpt-backend-1 | File "/usr/local/lib/python3.10/json/encoder.py", line 201, in encode
docsgpt-backend-1 | chunks = list(chunks)
docsgpt-backend-1 | File "/usr/local/lib/python3.10/json/encoder.py", line 431, in _iterencode
docsgpt-backend-1 | yield from _iterencode_dict(o, _current_indent_level)
docsgpt-backend-1 | File "/usr/local/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
docsgpt-backend-1 | yield from chunks
docsgpt-backend-1 | File "/usr/local/lib/python3.10/json/encoder.py", line 438, in _iterencode
docsgpt-backend-1 | o = _default(o)
docsgpt-backend-1 | File "/usr/local/lib/python3.10/site-packages/flask/json/provider.py", line 122, in _default
docsgpt-backend-1 | raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")
docsgpt-backend-1 | TypeError: Object of type EmptyFileError is not JSON serializable
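The backend traceback ends in `TypeError: Object of type EmptyFileError is not JSON serializable`, which suggests the task status response contains the raised exception object itself. A minimal sketch of one way to avoid that secondary error is to turn exceptions into plain data before building the JSON response. The helper name `to_json_safe`, the payload shape, and the stand-in `EmptyFileError` class are illustrative assumptions, not DocsGPT code.

```python
import json


def to_json_safe(result):
    """Return a JSON-serializable stand-in for a task result.

    Exceptions are reduced to their class name and message; any other
    value is passed through unchanged.
    """
    if isinstance(result, BaseException):
        return {"error": type(result).__name__, "message": str(result)}
    return result


class EmptyFileError(Exception):
    """Stand-in for PyPDF2.errors.EmptyFileError so the sketch is self-contained."""


# Hypothetical status payload; the real endpoint's fields may differ.
payload = {
    "status": "FAILURE",
    "result": to_json_safe(EmptyFileError("Cannot read an empty file")),
}
print(json.dumps(payload))
```

With something like this in place, the status endpoint could report the parsing failure to the frontend instead of raising a second error while serializing it.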
I double-checked inside the worker container that the file I uploaded exists, and it does.
# pwd
/app/application/inputs/local
# ls
CirC.pdf Silph.pdf
# cd CirC.pdf
# ls -l
total 716
-rw-r--r-- 1 root root 733100 Oct 7 13:06 CirC.pdf
BTW, because of my network conditions, I modified docker-compose.yaml to use a proxy: I added http_proxy, https_proxy, and all_proxy to the environment of the backend and worker services. I don't think this introduced new bugs, since I can use the chat without errors.
👟 Reproduction steps
Run ./run-with-docker-compose.sh.
Open the web UI and upload a PDF.
Click Train.
👍 Expected behavior
successfully trained
👎 Actual Behavior with Screenshots
At frontend: training... 0% At backend: error log above
💻 Operating system
macOS
What browsers are you seeing the problem on?
Microsoft Edge
🤖 What development environment are you experiencing this bug on?
Docker
🔒 Did you set the correct environment variables in the right path? List the environment variable names (not values please!)
I followed the setup instructions, plus http_proxy, https_proxy, all_proxy.
📃 Provide any additional context for the Bug.
No response
📖 Relevant log output
No response
👀 Have you spent some time to check if this bug has been raised before?
- [X] I checked and didn't find similar issue
🔗 Are you willing to submit PR?
None
🧑⚖️ Code of Conduct
- [X] I agree to follow this project's Code of Conduct
It seems that after clicking the Choose Files button and selecting a file, the file isn't actually uploaded to the backend. When the Train button is clicked, uploading and training happen together, which may produce unpredictable results.
OK, so currently choosing a file happens only in the frontend; once you click Train, it submits the file and starts a training job.
But what's interesting is why it's crashing for you.
Can you explain it in more detail, please?
Did you manage to fix it, or does the error occur only sometimes? Maybe we need to introduce some delay?
Yes, it occurs every time. I think introducing some delay might solve this problem, but I haven't read the source code yet, and I'm sorry that I'm not able to fix it myself.
The details are just as you describe: choosing a file happens only in the frontend, and after I click Train the backend tells me the file is empty. I am sure it's not empty.
The same thing happens on another computer running Windows.
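The worker log prints `<Response [502]>` immediately before the `EmptyFileError`, so one thing worth ruling out is that the bytes handed to the PDF parser came from a failed HTTP response rather than from the uploaded file. A minimal sketch of such a guard follows, assuming the worker fetches the file over HTTP before parsing it (the thread does not confirm this). The function name `save_download` and its parameters are hypothetical.

```python
import os
import tempfile


def save_download(status_code, content, dest_path):
    """Write downloaded bytes to dest_path only if the response looks usable.

    Raises RuntimeError on a non-2xx status or an empty body, so the
    failure is reported at the download step instead of surfacing later
    as a parser error about an empty file.
    """
    if not 200 <= status_code < 300:
        raise RuntimeError(f"download failed with HTTP {status_code}")
    if not content:
        raise RuntimeError("download returned an empty body")
    with open(dest_path, "wb") as fp:
        fp.write(content)
    return os.path.getsize(dest_path)


# Illustrative usage with made-up data, not a real request.
target = os.path.join(tempfile.mkdtemp(), "example.pdf")
try:
    save_download(502, b"", target)
except RuntimeError as exc:
    print(f"refusing to parse: {exc}")
```

If the guard fires, the problem would lie in how the worker reaches the backend (for example through the configured proxy) rather than in the PDF itself.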
I want to work on this.
@parag477 please try replicating it, thank you!
Thanks a lot! @parag477 @dartpain