mmdetection icon indicating copy to clipboard operation
mmdetection copied to clipboard

Solving issue #11362. Creating nltk_data in the $HOME directory.

Open javierganan99 opened this issue 1 year ago • 4 comments

Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily get feedback. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

The motivation is to solve issue #11362. NLTK default download creates a weird folder in the project directory level named ~/nltk_data, when it seems the intention is to create this at the $HOME directory.

Modification

I solve this issue using os library to efectively substitute ~ by $HOME directory. I modify it in the code and also in the related documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDet or MMPreTrain.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

javierganan99 avatar Feb 21 '24 13:02 javierganan99

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Feb 21 '24 13:02 CLAassistant

@javierganan99 Please fix the lint

hhaAndroid avatar Feb 23 '24 09:02 hhaAndroid

I think that the error in the unittest is not related with my PR

javierganan99 avatar Feb 23 '24 17:02 javierganan99

Suggesting Enhancement to This PR

Thanks @javierganan99 for this fix. Your os.path.expanduser() approach correctly solves the tilde expansion problem.

I'd like to suggest a small enhancement to make this PR more flexible for different deployment environments.


Why Tilde Expansion Alone May Not Be Sufficient

While tilde expansion fixes the immediate issue, hardcoding any specific path can still cause problems in containerized or restricted environments where the home directory may not be writable due to permission constraints.

Different environments often require different paths:

  • Local development: ~/nltk_data
  • Containerized environments: /tmp/nltk_data or /opt/nltk_data
  • Shared storage: /mnt/shared/nltk_data

Solution: Use NLTK_DATA Environment Variable

NLTK itself uses the NLTK_DATA environment variable as the standard way to specify data location. By respecting this convention, we can handle various permission constraints across different environments while maintaining backward compatibility.

Proposed change (only 2 lines added):

import os

def find_noun_phrases(caption: str) -> tuple:
    try:
        import nltk

        # Use NLTK_DATA if set, otherwise expand ~/nltk_data
        nltk_data_dir = os.environ.get('NLTK_DATA')
        if nltk_data_dir is None:
            nltk_data_dir = os.path.expanduser('~/nltk_data')

        nltk.download('punkt', download_dir=nltk_data_dir)
        nltk.download('averaged_perceptron_tagger', download_dir=nltk_data_dir)
    except ImportError:
        raise RuntimeError('nltk is not installed...')

Why this is better:

  1. Respects NLTK's standard convention
  2. Handles permission constraints in various environments
  3. Maintains your fix: Still uses ~/nltk_data as fallback when NLTK_DATA is not set
  4. Zero breaking changes: Default behavior is identical to your PR

Offer to Contribute

I'm happy to:

  • Add this 2-line change to your PR (if you grant me access to your branch)
  • Add unit tests if needed

Or you can apply this yourself - it's a minimal change.

Anyway, I hope this PR gets active and merged (since I also struggled with the tilde expansion bug and had to work around it).

What do you think? @javierganan99 @hhaAndroid

gridhra avatar Nov 23 '25 07:11 gridhra