Unstructured text extractor missing nltk/data punkt

Open AndyMik90 opened this issue 1 year ago • 1 comments

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.6.8

Cloud or Self Hosted

Self Hosted (Docker), Self Hosted (Source)

Steps to reproduce

Upload a text file to a knowledge base using the Unstructured ETL type.

✔️ Expected Behavior

The NLTK punkt resource should be downloaded before it is used.

❌ Actual Behavior

NLTK is installed along with unstructured, but the punkt resource is missing. Maybe it needs to be downloaded in the unstructured extractors?

Error: with indexing estimate api, core/rag/extractor/unstructured/unstructured_text_extractor.py in extract at line 29

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/root/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
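To check whether any of the directories listed in the traceback actually contains the tokenizer, a minimal stdlib-only sketch might look like the following. The helper name `locate_punkt` is hypothetical, and the sketch assumes the resource is unpacked as a `tokenizers/punkt` directory; it does not try to reproduce NLTK's own lookup logic.

```python
from pathlib import Path
from typing import Iterable, Optional

# Directories copied from the "Searched in" list of the traceback (duplicates removed).
SEARCHED_DIRS = [
    "/root/nltk_data",
    "/usr/local/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/local/lib/nltk_data",
    "/usr/share/nltk_data",
    "/usr/lib/nltk_data",
]


def locate_punkt(search_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first directory that holds tokenizers/punkt, or None.

    Hypothetical helper: it only checks for an unpacked directory.
    """
    for entry in search_dirs:
        if not entry:
            # The traceback also lists an empty path; skip it here.
            continue
        base = Path(entry)
        if (base / "tokenizers" / "punkt").is_dir():
            return base
    return None


if __name__ == "__main__":
    found = locate_punkt(SEARCHED_DIRS)
    print(found if found is not None else "punkt not found in any searched directory")
```

Running this inside the container that raised the error shows whether the data is missing entirely or just installed somewhere NLTK does not search.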

Suggestion:

import nltk

# Download the 'punkt' resource
nltk.download('punkt')

May 26 '24 07:05 AndyMik90

To resolve the NLTK 'punkt' LookupError in Dify version 0.6.8, make sure the 'punkt' resource is downloaded before the extractor runs. One way is to check for the resource in the extractor and download it if it is not already available.

Here is the updated UnstructuredTextExtractor class with the necessary code to download the 'punkt' resource:

import logging
import nltk

from core.rag.extractor.extractor_base import BaseExtractor
from core.rag.models.document import Document

logger = logging.getLogger(__name__)

class UnstructuredTextExtractor(BaseExtractor):
    """Load text files.

    Args:
        file_path: Path to the file to load.
    """

    def __init__(
        self,
        file_path: str,
        api_url: str
    ):
        """Initialize with file path."""
        self._file_path = file_path
        self._api_url = api_url

        # Ensure the 'punkt' resource is downloaded
        try:
            nltk.data.find('tokenizers/punkt')
        except LookupError:
            nltk.download('punkt')

    def extract(self) -> list[Document]:
        from unstructured.partition.text import partition_text

        elements = partition_text(filename=self._file_path)
        from unstructured.chunking.title import chunk_by_title
        chunks = chunk_by_title(elements, max_characters=2000, combine_text_under_n_chars=2000)
        documents = []
        for chunk in chunks:
            text = chunk.text.strip()
            documents.append(Document(page_content=text))

        return documents

This code checks if the 'punkt' tokenizer is available and downloads it if it is not, ensuring that the NLTK 'punkt' resource is available when needed ^[1].
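The same check-then-download pattern can also be pulled out of the constructor into a small reusable helper, sketched below. The name `ensure_nltk_resource` and its `find` and `download` parameters are assumptions made for illustration; they are not part of Dify or NLTK. By default the helper falls back to `nltk.data.find` and `nltk.download`, the two calls used in the class above.

```python
from typing import Any, Callable, Optional


def ensure_nltk_resource(
    resource_path: str,
    package: str,
    find: Optional[Callable[[str], Any]] = None,
    download: Optional[Callable[[str], Any]] = None,
) -> bool:
    """Download `package` if `resource_path` cannot be found.

    Hypothetical helper. Returns True if a download was triggered,
    False if the resource was already available.
    """
    if find is None or download is None:
        # Import lazily so the helper can be exercised with stand-in callables.
        import nltk

        find = find or nltk.data.find
        download = download or nltk.download

    try:
        find(resource_path)
        return False
    except LookupError:
        # NLTK raises LookupError when the resource is not in its search path.
        download(package)
        return True
```

With this in place, the constructor could call `ensure_nltk_resource('tokenizers/punkt', 'punkt')` instead of repeating the try/except block. Accepting the two callables as parameters is only a design choice to keep the helper testable without network access.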

May 26 '24 07:05 dosubot[bot]