Is nltk.download("punkt_tab") needed on every initialization?

Open • nosoyyo opened this issue 7 months ago • 1 comment

Hi, I've found your package and tools extremely useful and have been doing extensive research with them. However, I'm encountering a minor but frustrating issue: from RealtimeTTS import TextToAudioStream takes minutes to load.

After investigating, I traced it to this function in stream2sentence.py:

def initialize_nltk(debug=False):
    """Initializes NLTK by downloading required data for sentence tokenization."""
    global nltk_initialized
    if nltk_initialized:
        return

    logging.info("Initializing NLTK Tokenizer")

    try:
        import nltk
        nltk.download("punkt_tab", quiet=not debug)
        nltk_initialized = True
    except Exception as e:
        print(f"Error initializing nltk tokenizer: {e}")
        nltk_initialized = False

The specific culprit is nltk.download("punkt_tab", quiet=not debug). As a temporary workaround, I comment it out, which brings the load time down to an acceptable few seconds.

In my network environment (China, with proxy issues), this call hangs. Interestingly, everything works perfectly even when the download fails after minutes of waiting.

Since users usually have all the necessary NLTK components after the first few runs, and the TTS functions work flawlessly without this download, could you consider either:

  • Removing this mandatory download, or
  • Implementing a check to only download when truly necessary?
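The second option could look roughly like the sketch below. It assumes that the data lives under the resource name "tokenizers/punkt_tab" and that nltk.data.find raises LookupError when it is missing. The function name ensure_punkt_tab and the find/download parameters are hypothetical and not part of stream2sentence; the parameters exist only so the logic can be tried without network access.

```python
import logging

RESOURCE = "tokenizers/punkt_tab"  # assumed location of the punkt_tab data


def ensure_punkt_tab(find=None, download=None, debug=False):
    """Return True if punkt_tab is usable, downloading only when it is missing.

    `find` and `download` default to nltk.data.find and nltk.download.
    They are hypothetical parameters added for this sketch.
    """
    if find is None or download is None:
        import nltk  # imported lazily, as in the original function
        find = find or nltk.data.find
        download = download or nltk.download

    try:
        find(RESOURCE)  # raises LookupError when the data is not installed
        return True     # already available locally: no network call
    except LookupError:
        logging.info("punkt_tab not found locally, downloading")

    try:
        return bool(download("punkt_tab", quiet=not debug))
    except Exception as e:  # e.g. proxy or connection errors
        logging.warning("Could not download punkt_tab: %s", e)
        return False
```

With a check like this, initialize_nltk would only touch the network on a machine that does not have the data yet.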

Anyways, thanks a lot for making this great tool😄

nosoyyo • Jun 12 '25 08:06

Yes, that makes sense. The data is supposed to be cached after it has been downloaded successfully once.

I had a similar problem once with another user. We solved it by using a different internet connection a single time, just to download the NLTK data. After that, his problems were gone.
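A one-time download like that could be scripted roughly as follows. This is only a sketch: the helper names and the data_dir argument are hypothetical, and it relies on nltk.download accepting a download_dir argument and on nltk.data.path being the list of directories NLTK searches. Whether this removes the delay on a particular network is not guaranteed.

```python
import os


def fetch_punkt_tab_once(data_dir, download=None):
    """Download punkt_tab into data_dir; run once on a connection that works."""
    if download is None:
        import nltk
        download = nltk.download
    os.makedirs(data_dir, exist_ok=True)
    return bool(download("punkt_tab", download_dir=data_dir))


def use_local_nltk_data(data_dir, search_path=None):
    """Put data_dir at the front of NLTK's search path (nltk.data.path)."""
    if search_path is None:
        import nltk
        search_path = nltk.data.path
    if data_dir not in search_path:
        search_path.insert(0, data_dir)
    return search_path
```

The first helper would be run once where the download succeeds; the second would be called before the tokenizer is first used, so NLTK finds the local copy.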

KoljaB • Jun 12 '25 08:06