feat: data gen pipeline interface modifications
Description
Fixes #1511. Refactors the data generation pipeline for better flexibility. Key changes include:
- CoTDataGenerator, SelfInstructPipeline, SelfImprovingCoTPipeline, and EvolInstructPipeline inherit from BaseDataGenPipeline for a unified interface.
- Added parameters for better pipeline control:
  - batch_size: controls processing chunk size
  - max_workers: manages parallel processing resources
  - save_intermediate: enables checkpoint saving during processing
  - results_key: specifies JSON output structure
- Added support for multiple input formats (file paths, JSONL strings, lists of prompts or texts)
- Standardized result saving when output paths are specified.
- Improved JSON parsing and error handling, and added logging for better debugging and monitoring.
Documentation, tests, and examples have been updated to match the new interface.
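
For reviewers who want a quick mental model of how the listed parameters could interact, here is a minimal sketch. It is not the code from this PR: `SketchPipeline`, `load_inputs`, `process_one`, `run`, `output_path`, and `UppercasePipeline` are hypothetical names used only for illustration, and the real BaseDataGenPipeline may differ in signatures and behavior. Only batch_size, max_workers, save_intermediate, and results_key are taken from the description above.

```python
import json
import os
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List, Optional, Union


def load_inputs(data: Union[str, List[Any]]) -> List[Any]:
    """Hypothetical helper: accept a list, a JSONL file path, or a JSONL string."""
    if isinstance(data, list):
        return list(data)
    if os.path.isfile(data):
        with open(data, encoding="utf-8") as f:
            data = f.read()
    # Treat the remaining string as JSONL: one JSON value per non-empty line.
    return [json.loads(line) for line in data.splitlines() if line.strip()]


class SketchPipeline(ABC):
    """Illustrative base class; NOT the actual BaseDataGenPipeline."""

    def __init__(
        self,
        batch_size: int = 2,
        max_workers: int = 2,
        save_intermediate: bool = False,
        results_key: str = "results",
        output_path: Optional[str] = None,  # assumed name, not from the PR
    ) -> None:
        self.batch_size = batch_size
        self.max_workers = max_workers
        self.save_intermediate = save_intermediate
        self.results_key = results_key
        self.output_path = output_path

    @abstractmethod
    def process_one(self, item: Any) -> Dict[str, Any]:
        """Subclasses turn one input record into one output record."""

    def save(self, results: List[Dict[str, Any]]) -> None:
        # Results are only written when an output path was given.
        if self.output_path:
            with open(self.output_path, "w", encoding="utf-8") as f:
                json.dump({self.results_key: results}, f, indent=2)

    def run(self, data: Union[str, List[Any]]) -> List[Dict[str, Any]]:
        items = load_inputs(data)
        results: List[Dict[str, Any]] = []
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            for start in range(0, len(items), self.batch_size):
                batch = items[start:start + self.batch_size]
                # pool.map preserves input order within the batch.
                results.extend(pool.map(self.process_one, batch))
                if self.save_intermediate:
                    self.save(results)  # checkpoint after each batch
        self.save(results)
        return results


class UppercasePipeline(SketchPipeline):
    """Toy subclass standing in for a real generator."""

    def process_one(self, item: Any) -> Dict[str, Any]:
        text = item["prompt"] if isinstance(item, dict) else str(item)
        return {"prompt": text, "output": text.upper()}


if __name__ == "__main__":
    pipeline = UppercasePipeline(batch_size=2, max_workers=2)
    print(pipeline.run(["hello", "world", "again"]))
```

In this sketch every subclass only implements `process_one`, while input normalization, batching, parallelism, and saving live in the base class, which is the kind of unified interface the description refers to.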
Checklist
Go over all the following points, and put an x in all the boxes that apply.
- [x] I have read the CONTRIBUTION guide (required)
- [x] I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
- [x] I have checked if any dependencies need to be added or updated in pyproject.toml and uv lock
- [x] I have updated the tests accordingly (required for a bug fix or a new feature)
- [x] I have updated the documentation if needed:
- [x] I have added examples if this is a new feature
If you are unsure about any of these, don't hesitate to ask. We are here to help!
it seems that we may also need to modify the related cookbooks
most of the cookbooks are for specific camel-ai versions. It's a good idea to update them. If you think it's necessary to modify them for this PR, let me know.
Fixed the pre-commit errors, formatting issues, and variable type incompatibilities.
thanks @hesamsheikh left some comments
Thanks @fengju0213 for the helpful comments!
https://github.com/camel-ai/camel/pull/2270 Implemented several modifications in this PR, integrating multiple features into base.py.