Add CLI commands for browsing and searching OpenML datasets

Open pankajbaid567 opened this issue 1 month ago • 0 comments

Add three new CLI subcommands under 'openml datasets':

openml datasets list: List datasets with optional filtering
openml datasets info: Display detailed dataset information
openml datasets search: Search datasets by name (case-insensitive)

Features:

Support for multiple filter options (tag, status, size, instances, features, classes)
Output formatting (table/json) with verbose mode
Pagination support (offset, size)
Comprehensive test suite with mocked API calls
Proper error handling

Addresses the ESoC 2025 goal of improving the user experience of the dataset catalogue.

Metadata

Reference Issue: Related to issue #1503 Add CLI Commands for browsing and searching OpenML datasets
New Tests Added: Yes
Documentation Updated: No (CLI help text serves as documentation)
Change Log Entry: "Add CLI commands for browsing and searching OpenML datasets: openml datasets list, openml datasets info, and openml datasets search"

Details

What does this PR implement/fix?

This PR adds three new CLI subcommands under openml datasets to improve the user experience of the dataset catalogue:

openml datasets list - List datasets with optional filters (tag, status, data_name, number_instances, number_features, number_classes), pagination, and a choice of output format
openml datasets info <dataset_id> - Display detailed information about a specific dataset including qualities, features, and metadata
openml datasets search <query> - Search datasets by name with case-insensitive matching
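
The name search described above could work roughly as follows. This is a minimal sketch rather than the PR's code: `search_by_name` is a hypothetical helper, and substring matching over records with a `name` field is an assumption (the description only states that matching is case-insensitive).

```python
from __future__ import annotations

from typing import Iterable


def search_by_name(records: Iterable[dict], query: str) -> list[dict]:
    """Return the records whose 'name' contains `query`, ignoring case.

    Hypothetical helper: substring matching is an assumption, not
    something the PR description specifies.
    """
    # casefold() is a more aggressive lower() intended for caseless comparison.
    needle = query.casefold()
    return [r for r in records if needle in str(r.get("name", "")).casefold()]


if __name__ == "__main__":
    # Made-up records, for illustration only.
    catalogue = [
        {"did": 1, "name": "Iris"},
        {"did": 2, "name": "anneal"},
        {"did": 3, "name": "IRIS-small"},
    ]
    for match in search_by_name(catalogue, "iris"):
        print(f"{match['did']}: {match['name']}")
```

Filtering on the client side like this keeps the sketch independent of any server-side search support, at the cost of first fetching the records to be searched.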

Why is this change necessary? What is the problem it solves?

Currently, users must write Python code to browse or search OpenML datasets, even for simple tasks like listing available datasets or finding a specific dataset. This creates a barrier to entry and makes the dataset catalogue less accessible. Adding CLI commands allows users to interact with the dataset catalogue directly from the command line without writing code.

This directly addresses the ESoC 2025 goal of "Improving user experience of the dataset catalogue in AIoD and OpenML".

How can I reproduce the issue this PR is solving and its solution?

Before (requires Python code):

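import openml

# Recent openml-python releases return a pandas DataFrame here
# (older releases returned a dict keyed by dataset ID).
datasets = openml.datasets.list_datasets(size=10)
for did, name in zip(datasets["did"], datasets["name"]):
    print(f"{did}: {name}")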

After (CLI commands):

# List first 10 datasets
openml datasets list --size 10

# Search for iris datasets
openml datasets search iris

# Get detailed info about a dataset
openml datasets info 61

# List datasets with a specific tag, formatted as table
openml datasets list --tag study_14 --format table --verbose

# Filter by number of instances
openml datasets list --number-instances "100..1000"
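
For readers curious how command lines like the ones above might be parsed, here is a minimal argparse sketch. It is not the parser from the PR: `build_parser` is a hypothetical name, the `--offset` flag name and the `table` default for `--format` are assumptions, and only a subset of the filters is shown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the `openml datasets ...` examples above."""
    parser = argparse.ArgumentParser(prog="openml")
    commands = parser.add_subparsers(dest="command", required=True)

    datasets = commands.add_parser("datasets", help="Browse and search datasets")
    actions = datasets.add_subparsers(dest="action", required=True)

    list_cmd = actions.add_parser("list", help="List datasets")
    list_cmd.add_argument("--tag")
    list_cmd.add_argument("--size", type=int)
    list_cmd.add_argument("--offset", type=int)  # flag name is an assumption
    # Kept as a string so range expressions such as "100..1000" pass through.
    list_cmd.add_argument("--number-instances")
    list_cmd.add_argument(
        "--format", choices=["table", "json"], default="table"  # default is an assumption
    )
    list_cmd.add_argument("--verbose", action="store_true")

    info_cmd = actions.add_parser("info", help="Show details for one dataset")
    info_cmd.add_argument("dataset_id", type=int)

    search_cmd = actions.add_parser("search", help="Search datasets by name")
    search_cmd.add_argument("query")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["datasets", "list", "--size", "10"])
    print(args.action, args.size)
```

Nesting one subparser group inside another mirrors the two-level `openml datasets <action>` shape, so further actions could be added without touching the top-level parser.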

Implementation Details:

Added three new functions in openml/cli.py: datasets_list(), datasets_info(), datasets_search()
Added helper function _format_output() for consistent output formatting (table/JSON)
Integrated into main CLI parser with proper argument handling
Added comprehensive test suite in tests/test_openml/test_cli.py (11 test cases)
Uses existing openml.datasets.list_datasets() and openml.datasets.get_dataset() functions - no changes to core API
Follows existing CLI patterns (similar to the configure command)
All tests use mocked API calls to avoid requiring server connections
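
To illustrate the output formatting and mocked-call testing mentioned in the list above, here is a small sketch. The names `format_output` and `list_command` are hypothetical, as are the table layout and the idea of passing the API function in as a parameter; the PR's own `_format_output()` and its tests may look quite different.

```python
from __future__ import annotations

import json
from unittest import mock


def format_output(rows: list[dict], fmt: str = "table") -> str:
    """Render rows as JSON or as a plain aligned-column table (hypothetical helper)."""
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt != "table":
        raise ValueError(f"unknown format: {fmt!r}")
    if not rows:
        return "No datasets found."
    columns = list(rows[0])
    # Each column is as wide as its longest cell or its header, whichever is larger.
    widths = {c: max(len(c), *(len(str(r.get(c, ""))) for r in rows)) for c in columns}
    lines = ["  ".join(c.ljust(widths[c]) for c in columns)]
    for r in rows:
        lines.append("  ".join(str(r.get(c, "")).ljust(widths[c]) for c in columns))
    return "\n".join(line.rstrip() for line in lines)


def list_command(list_fn, size: int, fmt: str) -> str:
    """Fetch rows through `list_fn` (a stand-in for the API call) and format them."""
    return format_output(list_fn(size=size), fmt)


if __name__ == "__main__":
    # A mock replaces the real API call, so no server is contacted.
    fake_api = mock.Mock(return_value=[{"did": 1, "name": "example"}])
    print(list_command(fake_api, size=10, fmt="table"))
    fake_api.assert_called_once_with(size=10)
```

Injecting the listing function keeps this sketch runnable without openml installed; patching the library function with `unittest.mock.patch` would be an equally reasonable way to get the same isolation.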

Nov 24 '25 14:11 pankajbaid567