Make soft404 package cohesive with modern Python packaging structure
Overview
This PR addresses issue #[number] by transforming the soft404 package from a collection of scripts and notebooks into a professional, modern Python package with proper structure, entry points, and documentation.
Changes Made
🎯 New Command-Line Interface
Added soft404/__main__.py to enable direct module execution:
# Predict from an HTML file
python -m soft404 page.html
# Predict from inline HTML
python -m soft404 --html '<h1>Page not found</h1>'
# Show help
python -m soft404 --help
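A minimal sketch of how a `__main__.py` could support the three invocations above. This is not the file added by the PR: the function names `build_parser`, `read_html`, and `main` are hypothetical, and the sketch assumes the package exposes `soft404.probability(html)`. The predictor is injectable so the argument handling can be tried without a trained model.

```python
import argparse


def build_parser():
    """Build a parser accepting either a file path or an inline --html string."""
    parser = argparse.ArgumentParser(
        prog="python -m soft404",
        description="Estimate the probability that a page is a soft 404.",
    )
    parser.add_argument("path", nargs="?", help="path to an HTML file")
    parser.add_argument("--html", help="inline HTML to classify instead of a file")
    return parser


def read_html(args):
    """Return the HTML text selected by the parsed arguments."""
    if args.html is not None:
        return args.html
    with open(args.path, encoding="utf-8") as f:
        return f.read()


def main(argv=None, predict=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Exactly one input source must be given.
    if (args.path is None) == (args.html is None):
        parser.error("provide exactly one of: an HTML file path or --html")
    if predict is None:
        # Assumption: the package exposes soft404.probability(html).
        from soft404 import probability as predict
    print(predict(read_html(args)))
    return 0


# Demo with a stand-in predictor, so no trained model is needed.
main(["--html", "<h1>Page not found</h1>"], predict=lambda html: 0.5)
```

In a real `__main__.py` the file would typically end with `sys.exit(main())` so that `python -m soft404` runs it.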
📦 Console Entry Points
Added console scripts in setup.py, which pip installs onto the PATH of the active environment:
# Training tool (previously ./soft404/train.py)
soft404-train items --save clf.joblib
# Conversion tool (previously ./soft404/convert_to_text.py)
soft404-convert pages.jl.gz items
Users no longer need to know the exact paths to scripts or use ./ prefixes.
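For readers unfamiliar with console scripts, the sketch below shows the shape such a declaration usually takes in setup.py and how each entry maps a command name to a function. The `:main` targets are assumptions; the PR text names the commands and the original script files but not the functions behind them.

```python
# Hypothetical excerpt from setup.py. The ``:main`` targets are assumptions;
# the PR does not name the functions behind each command.
ENTRY_POINTS = {
    "console_scripts": [
        "soft404-train = soft404.train:main",
        "soft404-convert = soft404.convert_to_text:main",
    ],
}
# In setup.py this dict would be passed as setup(..., entry_points=ENTRY_POINTS).


def split_entry_point(spec):
    """Split 'command = package.module:function' into its three parts."""
    command, _, target = spec.partition("=")
    module, _, function = target.strip().partition(":")
    return command.strip(), module, function


for spec in ENTRY_POINTS["console_scripts"]:
    command, module, function = split_entry_point(spec)
    print(f"{command} -> from {module} import {function}")
```

At install time pip generates a small wrapper executable per entry that imports the module and calls the named function.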
🔧 Dependency Management
- Fixed deprecated imports: Updated sklearn.externals.joblib to direct joblib imports in predict.py and train.py
- Added explicit dependencies: Added joblib to install_requires since it's no longer bundled with scikit-learn
- Organized dev dependencies: Added extras_require for clean separation:
pip install soft404 # Production use (minimal dependencies)
pip install soft404[dev] # Development/training (includes all tools)
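The split between production and development dependencies can be pictured roughly as follows. Only `joblib` and `scikit-learn` are taken from the PR text; the entries under `dev` are illustrative placeholders, since the real list is not shown here, and `requirements_for` is a hypothetical helper used only to show what each install command pulls in.

```python
# Hypothetical setup.py excerpt. "joblib" and "scikit-learn" come from the PR
# text; the entries under "dev" are placeholders, not the project's real list.
INSTALL_REQUIRES = ["joblib", "scikit-learn"]
EXTRAS_REQUIRE = {"dev": ["pytest", "jupyter"]}
# In setup.py these would be passed as
# setup(..., install_requires=INSTALL_REQUIRES, extras_require=EXTRAS_REQUIRE).


def requirements_for(extra=None):
    """Return the requirements pip would be asked for, with an optional extra."""
    requirements = list(INSTALL_REQUIRES)
    if extra is not None:
        requirements += EXTRAS_REQUIRE[extra]
    return requirements


print(requirements_for())       # plain `pip install soft404`
print(requirements_for("dev"))  # `pip install soft404[dev]`
```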
🐍 Python Version Support
Updated classifiers to officially declare support for Python 3.6 through 3.12, making the package's compatibility clear to users and tools.
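As an illustration, the classifier list for that range could be generated rather than typed out. This is a sketch under the assumption that there is one classifier per minor version; whether the PR also sets `python_requires` is not stated, so that line is hypothetical.

```python
# Assumption: one "Programming Language" classifier per minor version, 3.6 to 3.12.
CLASSIFIERS = [
    "Programming Language :: Python :: 3",
] + [f"Programming Language :: Python :: 3.{minor}" for minor in range(6, 13)]

# A matching lower bound; hypothetical, not confirmed by the PR.
PYTHON_REQUIRES = ">=3.6"

for classifier in CLASSIFIERS:
    print(classifier)
```

Classifiers are informational metadata, while `python_requires` is what pip actually enforces at install time.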
📚 Documentation
Updated README.rst with:
- Command-line usage examples
- New installation instructions
- Development setup guide
- Clear distinction between production and development use
Added IMPROVEMENTS.md with a detailed summary of all changes and a migration guide.
🔄 Backward Compatibility
Added a compatibility layer in predict.py to load older pre-trained models that were saved with sklearn < 0.23. Existing code continues to work without changes.
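The PR text does not show how the compatibility layer works. One common approach, sketched here as an assumption, relies on the fact that unpickling looks classes up by module path: registering the new module under the old path in `sys.modules` lets old pickles resolve. The helper `alias_module` is hypothetical, and the demo uses the standard-library `json` module as a stand-in so it runs without scikit-learn installed.

```python
import importlib
import sys


def alias_module(old_name, new_name):
    """Register the module ``new_name`` under ``old_name`` in sys.modules."""
    module = importlib.import_module(new_name)
    # setdefault keeps an existing entry if old_name was already imported.
    sys.modules.setdefault(old_name, module)
    return sys.modules[old_name]


# Stand-in demo: make "legacy_json" resolve to the stdlib json module.
alias_module("legacy_json", "json")
import legacy_json

print(legacy_json.dumps({"ok": True}))
```

Under this assumption, the equivalent call for old models would be `alias_module("sklearn.externals.joblib", "joblib")`, made before the model file is loaded.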
Benefits
Before:
# Required knowing exact paths
./soft404/train.py items
./soft404/convert_to_text.py pages.jl.gz items
# Required writing Python code for predictions
python -c "from soft404 import probability; print(probability('<h1>Page not found</h1>'))"
# Dev dependencies in separate file
pip install -r requirements_dev.txt
After:
# Professional command-line tools
soft404-train items
soft404-convert pages.jl.gz items
# User-friendly CLI
python -m soft404 page.html
# Modern dependency management
pip install soft404[dev]
Testing
- ✅ All utility tests pass (3/3)
- ✅ Package imports successfully
- ✅ Entry points registered and functional
- ✅ Command-line interface works correctly
- ✅ Backward compatibility maintained
Migration
Existing users: No changes required! All existing Python API code continues to work.
New users: Can take advantage of the new console commands and the command-line interface.
Statistics
- Files changed: 6 (4 modified, 2 new)
- Lines added: 226
- No breaking changes
The soft404 package is now a cohesive, professional Python package following modern best practices and ready for easy use and distribution.
Original prompt
This section details the original issue you should resolve.
<issue_title>make cohesive</issue_title> <issue_description>Lot of notebooks. make this a real thing, setup.py pyinit.whatever etc.</issue_description>
Fixes TeamHG-Memex/soft404#34