feat(envs): add concatenate-safe dataset creation and concatenation utility
- Add `concatenate_safe` argument to `make_dataset` to standardize column types for safe concatenation
- Serialize `info` dictionaries as JSON strings to ensure compatible column types
- Always standardize key columns (`prompt`, `completion`, `answer`, `task`, `reward`, `info`) when `concatenate_safe` is `True`
- Convert incompatible column types to strings using PyArrow checks before dataset creation
- Implement static method `concatenate_datasets` to merge multiple datasets with schema alignment
- Handle missing columns, type inconsistencies, and an optional `split` column in the concatenated dataset
- Rename `parse_completion_tokens` to `process_completion_tokens` for clarity
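The serialization and type-standardization steps above can be sketched roughly as follows. This is an illustrative sketch, not the library's actual implementation; `standardize_row` is a hypothetical helper, and the `isinstance` check stands in for the PyArrow type checks mentioned above:

```python
import json

def standardize_row(row):
    """Sketch: serialize a dict-valued 'info' column to a JSON string and
    coerce other non-scalar values to strings, so rows produced by
    different environments end up with compatible column types."""
    out = {}
    for key, value in row.items():
        if key == "info" and isinstance(value, dict):
            # JSON keeps the info column both readable and type-stable
            out[key] = json.dumps(value, sort_keys=True)
        elif isinstance(value, (list, dict)):
            # stand-in for the PyArrow incompatible-type check
            out[key] = str(value)
        else:
            out[key] = value
    return out
```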
## Description
This PR fixes a dataset concatenation issue in the verifiers library: datasets created from `env.make_dataset` could have different columns with different PyArrow types, which prevented concatenating multiple datasets.
Fixes #321
The fix adds schema standardization to the `make_dataset` method and enhances the `concatenate_datasets` static method to handle type incompatibilities. This lets users run evaluations against different environments and push the results as a single dataset, with a split for each benchmark.
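A minimal sketch of the schema-alignment idea, using plain lists of dicts in place of real dataset objects (`align_and_concatenate` is a hypothetical stand-in for the `concatenate_datasets` static method, not the verifiers API): give every dataset the union of all columns, fill gaps with `None`, and tag each row with its source split.

```python
def align_and_concatenate(named_datasets):
    """Merge {split_name: list_of_rows} datasets, aligning columns and
    recording which split each row came from."""
    # union of all column names across every dataset
    columns = sorted({c for rows in named_datasets.values()
                        for row in rows for c in row})
    merged = []
    for split, rows in named_datasets.items():
        for row in rows:
            aligned = {c: row.get(c) for c in columns}  # missing -> None
            aligned["split"] = split                    # split tracking
            merged.append(aligned)
    return merged
```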
Key improvements:
- Enhanced `make_dataset` method to ensure consistent schemas with standard columns
- Improved `concatenate_datasets` method with intelligent type inference and standardization
- Added `concatenate_safe` parameter (default: `True`) to ensure compatibility by default
- Proper handling of the `info` column to ensure consistent formatting
- Added split tracking to identify the source of each example in concatenated datasets
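Because concatenate-safe creation stores `info` as a JSON string, downstream code can decode it back to a dict. A small hypothetical helper (not part of the library's API) to illustrate the round trip:

```python
import json

def decode_info(rows):
    """Return copies of the rows with the JSON-encoded 'info' column
    parsed back into a dict for downstream analysis."""
    return [dict(row, info=json.loads(row["info"])) for row in rows]
```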
## Type of Change
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [ ] Test improvement
## Testing
- [x] All existing tests pass
- [x] New tests have been added to cover the changes
- [x] Tests have been run locally with `uv run pytest`
## Checklist
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] Any dependent changes have been merged and published
## Additional Notes
This fix addresses the issue described in the original report, where prime-rl had to implement a custom `make_dataset` function to concatenate datasets from different environments. That workaround is now upstreamed to the verifiers library with test coverage.

With this change, datasets from different environments can be concatenated regardless of their original schema differences, with consistent type handling and split identification.