Solving issue #11362. Creating nltk_data in the $HOME directory.
Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily get feedback. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
Motivation
The motivation is to solve issue #11362. NLTK default download creates a weird folder in the project directory level named ~/nltk_data, when it seems the intention is to create this at the $HOME directory.
Modification
I solve this issue using os library to efectively substitute ~ by $HOME directory. I modify it in the code and also in the related documentation.
Checklist
- Pre-commit or other linting tools are used to fix the potential lint issues.
- The modification is covered by complete unit tests. If not, please add more unit test to ensure the correctness.
- If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDet or MMPreTrain.
- The documentation has been modified accordingly, like docstring or example tutorials.
@javierganan99 Please fix the lint
I think that the error in the unittest is not related with my PR
Suggesting Enhancement to This PR
Thanks @javierganan99 for this fix. Your os.path.expanduser() approach correctly solves the tilde expansion problem.
I'd like to suggest a small enhancement to make this PR more flexible for different deployment environments.
Why Tilde Expansion Alone May Not Be Sufficient
While tilde expansion fixes the immediate issue, hardcoding any specific path can still cause problems in containerized or restricted environments where the home directory may not be writable due to permission constraints.
Different environments often require different paths:
- Local development:
~/nltk_data - Containerized environments:
/tmp/nltk_dataor/opt/nltk_data - Shared storage:
/mnt/shared/nltk_data
Solution: Use NLTK_DATA Environment Variable
NLTK itself uses the NLTK_DATA environment variable as the standard way to specify data location. By respecting this convention, we can handle various permission constraints across different environments while maintaining backward compatibility.
Proposed change (only 2 lines added):
import os
def find_noun_phrases(caption: str) -> tuple:
try:
import nltk
# Use NLTK_DATA if set, otherwise expand ~/nltk_data
nltk_data_dir = os.environ.get('NLTK_DATA')
if nltk_data_dir is None:
nltk_data_dir = os.path.expanduser('~/nltk_data')
nltk.download('punkt', download_dir=nltk_data_dir)
nltk.download('averaged_perceptron_tagger', download_dir=nltk_data_dir)
except ImportError:
raise RuntimeError('nltk is not installed...')
Why this is better:
- Respects NLTK's standard convention
- Handles permission constraints in various environments
- Maintains your fix: Still uses
~/nltk_dataas fallback whenNLTK_DATAis not set - Zero breaking changes: Default behavior is identical to your PR
Offer to Contribute
I'm happy to:
- Add this 2-line change to your PR (if you grant me access to your branch)
- Add unit tests if needed
Or you can apply this yourself - it's a minimal change.
Anyway, I hope this PR gets active and merged (since I also struggled with the tilde expansion bug and had to work around it).
What do you think? @javierganan99 @hhaAndroid