feat(envs): add concatenate-safe dataset creation and concatenation utility
- Add `concatenate_safe` argument to `make_dataset` to standardize column types for safe concatenation
- Serialize `info` dictionaries as JSON strings to ensure compatible column types
- Always standardize key columns (`prompt`, `completion`, `answer`, `task`, `reward`, `info`) when `concatenate_safe` is `True`
- Convert incompatible column types to strings using PyArrow checks before dataset creation
- Implement static method `concatenate_datasets` to merge multiple datasets with schema alignment
- Handle missing columns, type inconsistencies, and an optional `split` column in the concatenated dataset
- Rename `parse_completion_tokens` to `process_completion_tokens` for clarity
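The serialization and type-standardization steps above can be sketched roughly as follows. This is an illustrative sketch, not the library's actual implementation; `standardize_row` is a hypothetical helper, and the `isinstance` check stands in for the PyArrow type checks mentioned above:

```python
import json

def standardize_row(row):
    """Sketch: serialize a dict-valued 'info' column to a JSON string and
    coerce other non-scalar values to strings, so rows produced by
    different environments end up with compatible column types."""
    out = {}
    for key, value in row.items():
        if key == "info" and isinstance(value, dict):
            # JSON keeps the info column both readable and type-stable
            out[key] = json.dumps(value, sort_keys=True)
        elif isinstance(value, (list, dict)):
            # stand-in for the PyArrow incompatible-type check
            out[key] = str(value)
        else:
            out[key] = value
    return out
```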
## Description
This PR fixes a dataset concatenation issue in the verifiers library: datasets created from `env.make_dataset` could have different columns with different PyArrow types, which prevented concatenating multiple datasets.
Fixes #321
The fix adds schema standardization to the `make_dataset` method and enhances the `concatenate_datasets` static method to handle type incompatibilities. This lets users run evaluations against different environments and push the results as a single dataset, with a split for each benchmark.
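A minimal sketch of the schema-alignment idea, using plain lists of dicts in place of real dataset objects (`align_and_concatenate` is a hypothetical stand-in for the `concatenate_datasets` static method, not the verifiers API): give every dataset the union of all columns, fill gaps with `None`, and tag each row with its source split.

```python
def align_and_concatenate(named_datasets):
    """Merge {split_name: list_of_rows} datasets, aligning columns and
    recording which split each row came from."""
    # union of all column names across every dataset
    columns = sorted({c for rows in named_datasets.values()
                        for row in rows for c in row})
    merged = []
    for split, rows in named_datasets.items():
        for row in rows:
            aligned = {c: row.get(c) for c in columns}  # missing -> None
            aligned["split"] = split                    # split tracking
            merged.append(aligned)
    return merged
```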
Key improvements:
- Enhanced `make_dataset` method to ensure consistent schemas with standard columns
- Improved `concatenate_datasets` method with intelligent type inference and standardization
- Added `concatenate_safe` parameter (default: `True`) to ensure compatibility by default
- Proper handling of the `info` column to ensure consistent formatting
- Added split tracking to identify the source of each example in concatenated datasets
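Because concatenate-safe creation stores `info` as a JSON string, downstream code can decode it back to a dict. A small hypothetical helper (not part of the library's API) to illustrate the round trip:

```python
import json

def decode_info(rows):
    """Return copies of the rows with the JSON-encoded 'info' column
    parsed back into a dict for downstream analysis."""
    return [dict(row, info=json.loads(row["info"])) for row in rows]
```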
## Type of Change
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [ ] Test improvement
## Testing
- [x] All existing tests pass
- [x] New tests have been added to cover the changes
- [x] Tests have been run locally with `uv run pytest`
## Checklist
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] Any dependent changes have been merged and published
## Additional Notes
This fix addresses the issue described in the original report, where prime-rl had to implement a custom `make_dataset` function to concatenate datasets from different environments. That workaround is now upstreamed to the verifiers library with test coverage.

With this change, datasets from different environments can be concatenated regardless of their original schema differences, with consistent type handling and split identification.