[BUG] - Docker: Resource punkt not found
Description
I'm unable to launch the Docker container after a clean pull
(WSL2, Docker 24.0.5, image digest sha256:d239cbf3733c58a065c516ab2d936929487cedf39098813d0aada47bbb540f07).
Reproduction steps
docker run -e GRADIO_SERVER_NAME=0.0.0.0 -e GRADIO_SERVER_PORT=7860 -p 7860:7860 -it --rm taprosoft/kotaemon:v1.0
Logs
Warning: Cannot statically find a gradio demo called demo. Reload work may fail.
Watching: '/app' '/app'
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/llama_index/core/utils.py", line 65, in __init__
nltk.data.find("tokenizers/punkt")
File "/usr/local/lib/python3.10/site-packages/nltk/data.py", line 579, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt
Searched in:
- '/root/nltk_data'
- '/usr/local/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/local/lib/python3.10/site-packages/llama_index/core/_static/nltk_cache'
**********************************************************************
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/app.py", line 13, in <module>
from ktem.main import App # noqa
File "/app/libs/ktem/ktem/main.py", line 2, in <module>
from ktem.app import BaseApp
File "/app/libs/ktem/ktem/app.py", line 8, in <module>
from ktem.components import reasonings
File "/app/libs/ktem/ktem/components.py", line 11, in <module>
from kotaemon.base import BaseComponent
File "/app/libs/kotaemon/kotaemon/base/__init__.py", line 1, in <module>
from .component import BaseComponent, Node, Param, lazy
File "/app/libs/kotaemon/kotaemon/base/component.py", line 6, in <module>
from kotaemon.base.schema import Document
File "/app/libs/kotaemon/kotaemon/base/schema.py", line 8, in <module>
from llama_index.core.bridge.pydantic import Field
File "/usr/local/lib/python3.10/site-packages/llama_index/core/__init__.py", line 10, in <module>
from llama_index.core.base.response.schema import Response
File "/usr/local/lib/python3.10/site-packages/llama_index/core/base/response/schema.py", line 9, in <module>
from llama_index.core.schema import NodeWithScore
File "/usr/local/lib/python3.10/site-packages/llama_index/core/schema.py", line 18, in <module>
from llama_index.core.utils import SAMPLE_TEXT, truncate_text
File "/usr/local/lib/python3.10/site-packages/llama_index/core/utils.py", line 89, in <module>
globals_helper = GlobalsHelper()
File "/usr/local/lib/python3.10/site-packages/llama_index/core/utils.py", line 67, in __init__
nltk.download("punkt_tab", download_dir=self._nltk_data_dir)
File "/usr/local/lib/python3.10/site-packages/nltk/downloader.py", line 774, in download
for msg in self.incr_download(info_or_id, download_dir, force):
File "/usr/local/lib/python3.10/site-packages/nltk/downloader.py", line 629, in incr_download
info = self._info_or_id(info_or_id)
File "/usr/local/lib/python3.10/site-packages/nltk/downloader.py", line 603, in _info_or_id
return self.info(info_or_id)
File "/usr/local/lib/python3.10/site-packages/nltk/downloader.py", line 1006, in info
self._update_index()
File "/usr/local/lib/python3.10/site-packages/nltk/downloader.py", line 949, in _update_index
ElementTree.parse(urlopen(self._url)).getroot()
File "/usr/local/lib/python3.10/xml/etree/ElementTree.py", line 1222, in parse
tree.parse(source, parser)
File "/usr/local/lib/python3.10/xml/etree/ElementTree.py", line 580, in parse
self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: unclosed token: line 80, column 4
The same happens for me. Is there any solution to this?
This is likely an issue with WSL 2 and the host's firewall settings blocking the download that nltk triggers.
I don't recommend disabling the firewall, but temporarily doing so might be a quick workaround. See here for a more detailed guide.
I am on a corporate network, so it could be some firewall issue. However, loads of other Docker images can pull from the internet just fine (although Alpine is known to be sketchy with connectivity).
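One way to tell a firewall or proxy problem apart from a bug in the image is to fetch the NLTK package index yourself and see whether it parses, since the traceback ends in a ParseError raised while nltk reads that index. The following is a minimal sketch, not part of kotaemon or nltk. The names INDEX_URL, classify_index and check_index are made up for illustration, and the URL is an assumption about nltk's default index location that should be checked against the installed version.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Assumption: nltk's default package index URL. Verify it against
# nltk.downloader.Downloader.DEFAULT_URL in the installed nltk version.
INDEX_URL = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml"


def classify_index(payload: bytes) -> str:
    """Return 'ok' if the payload parses as XML, else a short description."""
    try:
        ET.fromstring(payload)
    except ET.ParseError as exc:
        # A truncated download or an HTML block page from a proxy ends up here.
        return f"not valid XML ({exc})"
    return "ok"


def check_index(url: str = INDEX_URL, timeout: float = 10.0) -> str:
    """Fetch the index and report whether it arrived intact."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except OSError as exc:  # URLError, HTTPError and timeouts are OSError subclasses
        return f"could not reach index ({exc})"
    return classify_index(body)
```

Running print(check_index()) inside the container and again on the host should show whether the index is being blocked or altered somewhere between the two.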
I had the same problem: only punkt failed to download, so the app would not start. Here is what I tried; I hope it helps.
- Make sure your internet connection is OK, for example with:
  ping google.com
- Manually download the NLTK package. Once inside the container, run the following command to download and cache the NLTK data:
  python -c "import nltk; nltk.download('punkt')"
- After confirming the data is cached, save the container as a new image to avoid re-downloading the data in the future. If the container was started with --rm, as in the reproduction command, run this from another terminal before exiting, because --rm deletes the container on exit:
  docker commit <container_id> taprosoft/kotaemon:v1.1
- From now on, you can use the new image taprosoft/kotaemon:v1.1 to start containers, which will skip the NLTK data download step:
  docker run -e GRADIO_SERVER_NAME=0.0.0.0 -e GRADIO_SERVER_PORT=7860 -p 7860:7860 -it --rm taprosoft/kotaemon:v1.1
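To confirm that the data really is cached before committing the image, you can check the directories listed in the LookupError from the logs. This is a rough sketch that only approximates nltk's own lookup by checking for an unpacked folder or a .zip archive. The helper name find_resource is hypothetical, and SEARCH_DIRS is copied from the error message with duplicates removed.

```python
from pathlib import Path
from typing import Iterable, Optional

# Directories taken from the "Searched in:" section of the LookupError.
SEARCH_DIRS = [
    "/root/nltk_data",
    "/usr/local/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/local/lib/nltk_data",
    "/usr/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/python3.10/site-packages/llama_index/core/_static/nltk_cache",
]


def find_resource(resource: str, search_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first search dir holding the resource as a folder or .zip."""
    for directory in search_dirs:
        candidate = Path(directory) / resource
        # Accept either the unpacked folder or the downloaded archive.
        if candidate.exists() or candidate.with_suffix(".zip").exists():
            return Path(directory)
    return None


if __name__ == "__main__":
    location = find_resource("tokenizers/punkt", SEARCH_DIRS)
    print(location if location else "punkt not found in any search directory")
```

If this reports that punkt was not found, the download went to a directory outside the search path or did not complete, and committing the image at that point would not help.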