[BUG] - Docker: Resource punkt not found

Open corani opened this issue 1 year ago • 4 comments

Description

I'm unable to launch the docker container after a clean pull

(WSL2, Docker 24.0.5, image digest sha256:d239cbf3733c58a065c516ab2d936929487cedf39098813d0aada47bbb540f07)

Reproduction steps

docker run -e GRADIO_SERVER_NAME=0.0.0.0 -e GRADIO_SERVER_PORT=7860 -p 7860:7860 -it --rm taprosoft/kotaemon:v1.0

Screenshots

No response

Logs

Warning: Cannot statically find a gradio demo called demo. Reload work may fail.
Watching: '/app' '/app'

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/llama_index/core/utils.py", line 65, in __init__
    nltk.data.find("tokenizers/punkt")
  File "/usr/local/lib/python3.10/site-packages/nltk/data.py", line 579, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt

  Searched in:
    - '/root/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/local/lib/python3.10/site-packages/llama_index/core/_static/nltk_cache'
**********************************************************************


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/app.py", line 13, in <module>
    from ktem.main import App  # noqa
  File "/app/libs/ktem/ktem/main.py", line 2, in <module>
    from ktem.app import BaseApp
  File "/app/libs/ktem/ktem/app.py", line 8, in <module>
    from ktem.components import reasonings
  File "/app/libs/ktem/ktem/components.py", line 11, in <module>
    from kotaemon.base import BaseComponent
  File "/app/libs/kotaemon/kotaemon/base/__init__.py", line 1, in <module>
    from .component import BaseComponent, Node, Param, lazy
  File "/app/libs/kotaemon/kotaemon/base/component.py", line 6, in <module>
    from kotaemon.base.schema import Document
  File "/app/libs/kotaemon/kotaemon/base/schema.py", line 8, in <module>
    from llama_index.core.bridge.pydantic import Field
  File "/usr/local/lib/python3.10/site-packages/llama_index/core/__init__.py", line 10, in <module>
    from llama_index.core.base.response.schema import Response
  File "/usr/local/lib/python3.10/site-packages/llama_index/core/base/response/schema.py", line 9, in <module>
    from llama_index.core.schema import NodeWithScore
  File "/usr/local/lib/python3.10/site-packages/llama_index/core/schema.py", line 18, in <module>
    from llama_index.core.utils import SAMPLE_TEXT, truncate_text
  File "/usr/local/lib/python3.10/site-packages/llama_index/core/utils.py", line 89, in <module>
    globals_helper = GlobalsHelper()
  File "/usr/local/lib/python3.10/site-packages/llama_index/core/utils.py", line 67, in __init__
    nltk.download("punkt_tab", download_dir=self._nltk_data_dir)
  File "/usr/local/lib/python3.10/site-packages/nltk/downloader.py", line 774, in download
    for msg in self.incr_download(info_or_id, download_dir, force):
  File "/usr/local/lib/python3.10/site-packages/nltk/downloader.py", line 629, in incr_download
    info = self._info_or_id(info_or_id)
  File "/usr/local/lib/python3.10/site-packages/nltk/downloader.py", line 603, in _info_or_id
    return self.info(info_or_id)
  File "/usr/local/lib/python3.10/site-packages/nltk/downloader.py", line 1006, in info
    self._update_index()
  File "/usr/local/lib/python3.10/site-packages/nltk/downloader.py", line 949, in _update_index
    ElementTree.parse(urlopen(self._url)).getroot()
  File "/usr/local/lib/python3.10/xml/etree/ElementTree.py", line 1222, in parse
    tree.parse(source, parser)
  File "/usr/local/lib/python3.10/xml/etree/ElementTree.py", line 580, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: unclosed token: line 80, column 4

Browsers

No response

OS

No response

Additional information

No response

corani · Sep 03 '24 03:09

The same happens for me. Is there any solution to this?

Neon-20 · Sep 03 '24 07:09

This is likely an issue with WSL 2 and the host's firewall settings, which prevent the download that nltk triggers at startup.
Not that I recommend disabling the firewall, but doing so might be a quick workaround. See here for a more detailed guide.
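
One way to test this theory is to fetch the NLTK package index from inside the container and check whether it arrives as well-formed XML, since the traceback ends in a ParseError raised while parsing that index. The sketch below uses only the standard library. The URL is an assumption (it is believed to be the NLTK downloader's default index location, so compare it with the downloader settings of your installed nltk version), and the function names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: default index URL of the NLTK downloader; verify against your nltk version.
NLTK_INDEX_URL = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml"


def check_index_payload(payload: bytes) -> str:
    """Return 'ok' if the payload parses as XML, else a short failure description."""
    try:
        ET.fromstring(payload)
    except ET.ParseError as exc:
        # A proxy or firewall that truncates or rewrites the response ends up here.
        return f"not well-formed XML: {exc}"
    return "ok"


def fetch_and_check(url: str = NLTK_INDEX_URL, timeout: float = 10.0) -> str:
    """Download the index and report whether it arrived intact."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = response.read()
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return f"download failed: {exc}"
    return check_index_payload(payload)
```

Calling fetch_and_check() inside the container and again on the host should show whether the index is blocked, truncated, or replaced on only one side. A "not well-formed XML" result would match the error in the logs above.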

ngnhng · Sep 04 '24 09:09

I am on a corporate network, so it could be some firewall issue. However, loads of other Docker images can pull from the internet just fine (although Alpine is known to be sketchy with connectivity).

corani · Sep 04 '24 09:09

I had the same problem: only punkt failed to download and run correctly. Here is how I tried to solve it; I hope these ideas help.

  1. Make sure your internet connection works, for example by running ping google.com.

  2. Manually download the NLTK data. Once inside the container, run the following command to download and cache the punkt resource.

    python -c "import nltk; nltk.download('punkt')"

  3. After confirming the data is cached, exit the container. Then, save the container as a new image to avoid re-downloading the data in the future.

    docker commit <container_id> taprosoft/kotaemon:v1.1

  4. From now on, you can use the new image taprosoft/kotaemon:v1.1 to start containers, which will skip the NLTK data download step.

    docker run -e GRADIO_SERVER_NAME=0.0.0.0 -e GRADIO_SERVER_PORT=7860 -p 7860:7860 -it --rm taprosoft/kotaemon:v1.1
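
Step 3 says to confirm the data is cached but not how. A minimal sketch of such a check follows, assuming punkt is stored either as an unpacked tokenizers/punkt directory or as tokenizers/punkt.zip under one of the directories listed in the traceback. The function name is hypothetical, and this is a simplification: the authoritative check is still nltk.data.find("tokenizers/punkt"), as used in the traceback.

```python
import os
from typing import Iterable, Optional

# Directories copied from the "Searched in" list of the traceback in this issue.
SEARCH_DIRS = [
    "/root/nltk_data",
    "/usr/local/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/local/lib/nltk_data",
    "/usr/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/python3.10/site-packages/llama_index/core/_static/nltk_cache",
]


def find_cached_punkt(search_dirs: Iterable[str] = SEARCH_DIRS) -> Optional[str]:
    """Return the first path that holds punkt data, or None if nothing is cached."""
    for base in search_dirs:
        unpacked = os.path.join(base, "tokenizers", "punkt")
        zipped = unpacked + ".zip"
        if os.path.isdir(unpacked):
            return unpacked
        if os.path.isfile(zipped):
            return zipped
    return None
```

If find_cached_punkt() returns None inside the container, the download in step 2 did not land in a directory that the app searches, and committing the image would not help.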

ArtemisYi · Sep 04 '24 10:09