feat: data gen pipeline interface modifications

Open hesamsheikh opened this issue 8 months ago • 5 comments

Description

Fixes #1511. Refactors the data generation pipeline for better flexibility. Key changes include:

  • CoTDataGenerator, SelfInstructPipeline, SelfImprovingCoTPipeline, and EvolInstructPipeline inherit from BaseDataGenPipeline for a unified interface.
  • Added parameters for better pipeline control:
      • batch_size: controls processing chunk size
      • max_workers: manages parallel processing resources
      • save_intermediate: enables checkpoint saving during processing
      • results_key: specifies JSON output structure
  • Added support for multiple input formats (file paths, JSONL strings, lists of prompts or texts)
  • Standardized result saving when output paths are specified.
  • Reworked JSON parsing and error handling, and added logging for better debugging and monitoring.
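
As a rough illustration of the interface described above, here is a minimal sketch of how a shared base class could combine these parameters with the three input formats. It is not the code from this PR: `SketchDataGenPipeline`, `load_inputs`, `process_item`, `generate`, `save`, `output_path`, and the `"text"` field are hypothetical names chosen for the example. Only `batch_size`, `max_workers`, `save_intermediate`, and `results_key` come from the description, and their defaults here are assumptions.

```python
import json
import os
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List, Optional, Union


class SketchDataGenPipeline(ABC):
    """Hypothetical stand-in for a unified base pipeline (not the camel-ai class)."""

    def __init__(
        self,
        batch_size: int = 4,              # items processed per chunk (assumed default)
        max_workers: int = 2,             # worker threads per chunk (assumed default)
        save_intermediate: bool = False,  # write a checkpoint after every chunk
        results_key: str = "results",     # top-level key of the saved JSON object
        output_path: Optional[str] = None,  # hypothetical: where results are written
    ) -> None:
        self.batch_size = batch_size
        self.max_workers = max_workers
        self.save_intermediate = save_intermediate
        self.results_key = results_key
        self.output_path = output_path

    def load_inputs(self, data: Union[str, List[Any]]) -> List[Dict[str, Any]]:
        """Normalize a list, a file path, or a JSONL string into a list of dicts."""
        if isinstance(data, list):
            # Wrap bare prompts/texts so every item has the same shape.
            return [d if isinstance(d, dict) else {"text": d} for d in data]
        if os.path.isfile(data):
            with open(data, encoding="utf-8") as f:
                data = f.read()
        # Treat whatever is left as JSONL: one JSON object per non-empty line.
        return [json.loads(line) for line in data.splitlines() if line.strip()]

    @abstractmethod
    def process_item(self, item: Dict[str, Any]) -> Dict[str, Any]:
        """Subclasses implement the actual generation step for one item."""

    def save(self, results: List[Dict[str, Any]]) -> None:
        """Write results under results_key; a no-op when no output path is set."""
        if self.output_path is None:
            return
        with open(self.output_path, "w", encoding="utf-8") as f:
            json.dump({self.results_key: results}, f, indent=2)

    def generate(self, data: Union[str, List[Any]]) -> List[Dict[str, Any]]:
        items = self.load_inputs(data)
        results: List[Dict[str, Any]] = []
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            for start in range(0, len(items), self.batch_size):
                batch = items[start:start + self.batch_size]
                # pool.map preserves input order within the chunk.
                results.extend(pool.map(self.process_item, batch))
                if self.save_intermediate:
                    self.save(results)
        self.save(results)
        return results


class UppercasePipeline(SketchDataGenPipeline):
    """Toy subclass: stands in for a real generator such as a CoT pipeline."""

    def process_item(self, item: Dict[str, Any]) -> Dict[str, Any]:
        return {"text": item["text"].upper()}
```

With this sketch, `UppercasePipeline(batch_size=2).generate(["a", "b", "c"])` and the same call with a JSONL string or a path to a JSONL file all go through one code path. That is the main appeal of pushing input handling and saving into the base class.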

Documentation, tests, and examples are updated to match the new interface.

Checklist

Go over all the following points, and put an x in all the boxes that apply.

  • [x] I have read the CONTRIBUTION guide (required)
  • [x] I have linked this PR to an issue using the Development section on the right sidebar or by adding Fixes #issue-number in the PR description (required)
  • [x] I have checked if any dependencies need to be added or updated in pyproject.toml and uv lock
  • [x] I have updated the tests accordingly (required for a bug fix or a new feature)
  • [x] I have updated the documentation if needed:
  • [x] I have added examples if this is a new feature

If you are unsure about any of these, don't hesitate to ask. We are here to help!

hesamsheikh avatar Apr 13 '25 00:04 hesamsheikh

It seems we may also need to update the related cookbooks.

zjrwtx avatar Apr 13 '25 03:04 zjrwtx

> it seems that we may also need to modify the related cookbooks

Most of the cookbooks are pinned to specific camel-ai versions. Updating them is a good idea; if you think it is necessary to do so as part of this PR, let me know.

hesamsheikh avatar Apr 13 '25 11:04 hesamsheikh

Fixed the pre-commit errors, formatting issues, and variable type incompatibilities.

hesamsheikh avatar Apr 14 '25 11:04 hesamsheikh

Thanks @hesamsheikh, I left some comments.

Thanks @fengju0213 for the helpful comments!

hesamsheikh avatar Apr 18 '25 14:04 hesamsheikh

https://github.com/camel-ai/camel/pull/2270 Implemented several modifications in this PR, integrating multiple features into base.py.

fengju0213 avatar Apr 24 '25 09:04 fengju0213