Add CLI commands for browsing and searching OpenML datasets
Add three new CLI subcommands under 'openml datasets':
- openml datasets list: List datasets with optional filtering
- openml datasets info: Display detailed dataset information
- openml datasets search: Search datasets by name (case-insensitive)
Features:
- Support for multiple filter options (tag, status, size, instances, features, classes)
- Output formatting (table/json) with verbose mode
- Pagination support (offset, size)
- Comprehensive test suite with mocked API calls
- Proper error handling
Addresses ESoC 2025 goal of improving user experience of the dataset catalogue.
Related to issue #1503: Add CLI Commands for browsing and searching OpenML datasets
Metadata
- Reference Issue: Related to issue #1503 (Add CLI Commands for browsing and searching OpenML datasets)
- New Tests Added: Yes
- Documentation Updated: No (CLI help text serves as documentation)
- Change Log Entry: "Add CLI commands for browsing and searching OpenML datasets:
openml datasets list, openml datasets info, and openml datasets search"
Details
What does this PR implement/fix?
This PR adds three new CLI subcommands under openml datasets to improve the user experience of the dataset catalogue:
- openml datasets list: List datasets with optional filtering (tag, status, data_name, number_instances, number_features, number_classes, pagination, output format)
- openml datasets info <dataset_id>: Display detailed information about a specific dataset, including qualities, features, and metadata
- openml datasets search <query>: Search datasets by name with case-insensitive matching
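The case-insensitive name matching behind the search subcommand could look roughly like the sketch below. It is an illustration rather than the code in this PR: the function name `search_by_name` and the record layout (dicts with `did` and `name` keys) are assumptions, and the sample rows are made up for demonstration.

```python
from __future__ import annotations

from typing import Iterable


def search_by_name(records: Iterable[dict], query: str) -> list[dict]:
    """Return the records whose 'name' contains `query`, ignoring case."""
    needle = query.casefold()
    return [r for r in records if needle in str(r.get("name", "")).casefold()]


# Illustrative rows only; in the CLI the rows would come from
# openml.datasets.list_datasets() (hypothetical record layout).
sample = [
    {"did": 1, "name": "Iris"},
    {"did": 2, "name": "iris-small"},
    {"did": 3, "name": "credit-g"},
]

matches = search_by_name(sample, "IRIS")
print([m["did"] for m in matches])
```

Using `casefold()` instead of `lower()` makes the comparison robust for non-ASCII dataset names as well.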
Why is this change necessary? What is the problem it solves?
Currently, users must write Python code to browse or search OpenML datasets, even for simple tasks like listing available datasets or finding a specific dataset. This creates a barrier to entry and makes the dataset catalogue less accessible. Adding CLI commands allows users to interact with the dataset catalogue directly from the command line without writing code.
This directly addresses the ESoC 2025 goal of "Improving user experience of the dataset catalogue in AIoD and OpenML".
How can I reproduce the issue this PR is solving and its solution?
Before (requires Python code):
import openml
datasets = openml.datasets.list_datasets(size=10)
for did, dataset in datasets.items():
    print(f"{did}: {dataset['name']}")
After (CLI commands):
# List first 10 datasets
openml datasets list --size 10
# Search for iris datasets
openml datasets search iris
# Get detailed info about a dataset
openml datasets info 61
# List datasets with a specific tag, formatted as table
openml datasets list --tag study_14 --format table --verbose
# Filter by number of instances
openml datasets list --number-instances "100..1000"
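For readers curious how such subcommands can be wired up, here is a minimal `argparse` sketch of the command layout shown above. It is not the parser from this PR: `build_parser` is a hypothetical name, the `--offset` flag name and the `table` default are assumptions, and only the flags that appear in the examples are included.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a nested parser: openml datasets {list,info,search}."""
    parser = argparse.ArgumentParser(prog="openml")
    subparsers = parser.add_subparsers(dest="command", required=True)

    datasets = subparsers.add_parser("datasets", help="Browse and search datasets")
    actions = datasets.add_subparsers(dest="action", required=True)

    list_cmd = actions.add_parser("list", help="List datasets")
    list_cmd.add_argument("--size", type=int, default=None)
    list_cmd.add_argument("--offset", type=int, default=None)  # assumed flag name
    list_cmd.add_argument("--tag")
    list_cmd.add_argument("--number-instances")  # range string such as "100..1000"
    list_cmd.add_argument("--format", choices=["table", "json"], default="table")  # assumed default
    list_cmd.add_argument("--verbose", action="store_true")

    info_cmd = actions.add_parser("info", help="Show details for one dataset")
    info_cmd.add_argument("dataset_id", type=int)

    search_cmd = actions.add_parser("search", help="Search datasets by name")
    search_cmd.add_argument("query")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["datasets", "list", "--size", "10", "--format", "json"])
print(args.action, args.size, args.format)
```

Each subcommand would then dispatch to its handler based on `args.action`; that dispatch is left out here.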
Implementation Details:
- Added three new functions in openml/cli.py: datasets_list(), datasets_info(), datasets_search()
- Added helper function _format_output() for consistent output formatting (table/JSON)
- Integrated into main CLI parser with proper argument handling
- Added comprehensive test suite in tests/test_openml/test_cli.py (11 test cases)
- Uses existing openml.datasets.list_datasets() and openml.datasets.get_dataset() functions - no changes to core API
- Follows existing CLI patterns (similar to configure command)
- All tests use mocked API calls to avoid requiring server connections
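To give a feel for the table/JSON formatting helper, the sketch below shows one way such a function might be written. The name `format_output`, its signature, and the empty-result message are assumptions for illustration; the PR's `_format_output()` may be structured differently.

```python
from __future__ import annotations

import json


def format_output(rows: list[dict], fmt: str = "table") -> str:
    """Render rows as indented JSON or as a plain, column-aligned text table."""
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if not rows:
        return "No datasets found."  # assumed message

    headers = list(rows[0].keys())
    # Column width is the longest cell (or the header) in that column.
    widths = {h: max(len(h), *(len(str(r.get(h, ""))) for r in rows)) for h in headers}

    def line(values) -> str:
        cells = (str(v).ljust(widths[h]) for h, v in zip(headers, values))
        return "  ".join(cells).rstrip()

    out = [line(headers), line("-" * widths[h] for h in headers)]
    out += [line(r.get(h, "") for h in headers) for r in rows]
    return "\n".join(out)


# Illustrative row only, not real server output.
rows = [{"did": 61, "name": "iris"}]
print(format_output(rows))
print(format_output(rows, "json"))
```

Returning a string rather than printing inside the helper keeps it easy to assert on in tests that mock the API calls.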