data-juicer icon indicating copy to clipboard operation
data-juicer copied to clipboard

是否支持ShareGPT多轮对话数据清洗

Open MLikeWater opened this issue 7 months ago • 0 comments

Before Asking 在提问之前

  • [x] I have read the README carefully. 我已经仔细阅读了 README 上的操作指引。

  • [x] I have pulled the latest code of main branch to run again and the problem still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。

Search before asking 先搜索,再提问

  • [x] I have searched the Data-Juicer issues and found no similar questions. 我已经在 issue列表 中搜索但是没有发现类似的问题。

Question

请问data-juicer是否支持sharegpt多轮对话格式的语料清洗,如脱敏、标注、去重、质量评估等,是否有示例,谢谢。

[
  {
    "conversations": [
      {
        "from": "system",
        "value": "..."
      },
      {
        "from": "human",
        "value": "user instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      },
      {
        "from": "human",
        "value": "user instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      },
      ...
    ]
  },
  {
    "conversations": [
      {
        "from": "system",
        "value": "..."
      },
      {
        "from": "human",
        "value": "user instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      },
      ...
    ]
  },
  ...
]

Additional 额外信息

No response

MLikeWater avatar Mar 29 '25 12:03 MLikeWater