Unstructured text extractor missing nltk/data punkt
Self Checks
- [X] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [X] I have searched for existing issues search for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (我已阅读并同意 Language Policy).
- [X] Please do not modify this template :) and fill in all the required fields.
Dify version
0.6.8
Cloud or Self Hosted
Self Hosted (Docker), Self Hosted (Source)
Steps to reproduce
Upload textfile using unstructured ETL type to knowledge.
✔️ Expected Behavior
import nltk punk before usage.
❌ Actual Behavior
nltk is installed with unstructured, but maybe needed to be imported in the unstructured extractors?
Error: with indexing estimate api, core/rag/extractor/unstructured/unstructured_text_extractor.py in extract at line 29
LookupError:
**********************************************************************
Resource [93mpunkt[0m not found.
Please use the NLTK Downloader to obtain the resource:
[31m>>> import nltk
>>> nltk.download('punkt')
[0m
For more information see: https://www.nltk.org/data.html
Attempted to load [93mtokenizers/punkt/PY3/english.pickle[0m
Searched in:
- '/root/nltk_data'
- '/usr/local/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
Suggestion:
import nltk
# Download the 'punkt' resource
nltk.download('punkt')
To resolve the issue with the NLTK 'punkt' resource not being found in Dify version 0.6.8, you need to ensure that the 'punkt' resource is downloaded. You can do this by adding the following code to download the 'punkt' resource if it is not already available.
Here is the updated UnstructuredTextExtractor class with the necessary code to download the 'punkt' resource:
import logging
import nltk
from core.rag.extractor.extractor_base import BaseExtractor
from core.rag.models.document import Document
logger = logging.getLogger(__name__)
class UnstructuredTextExtractor(BaseExtractor):
"""Load msg files.
Args:
file_path: Path to the file to load.
"""
def __init__(
self,
file_path: str,
api_url: str
):
"""Initialize with file path."""
self._file_path = file_path
self._api_url = api_url
# Ensure the 'punkt' resource is downloaded
try:
nltk.data.find('tokenizers/punkt')
except LookupError:
nltk.download('punkt')
def extract(self) -> list[Document]:
from unstructured.partition.text import partition_text
elements = partition_text(filename=self._file_path)
from unstructured.chunking.title import chunk_by_title
chunks = chunk_by_title(elements, max_characters=2000, combine_text_under_n_chars=2000)
documents = []
for chunk in chunks:
text = chunk.text.strip()
documents.append(Document(page_content=text))
return documents
This code checks if the 'punkt' tokenizer is available and downloads it if it is not, ensuring that the NLTK 'punkt' resource is available when needed [1].
Details
About Dosu This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
To continue the conversation, mention @dosu.