data-juicer
data-juicer copied to clipboard
是否支持ShareGPT多轮对话数据清洗
Before Asking 在提问之前
-
[x] I have read the README carefully. 我已经仔细阅读了 README 上的操作指引。
-
[x] I have pulled the latest code of main branch to run again and the problem still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。
Search before asking 先搜索,再提问
- [x] I have searched the Data-Juicer issues and found no similar questions. 我已经在 issue列表 中搜索但是没有发现类似的问题。
Question
请问data-juicer是否支持sharegpt多轮对话格式的语料清洗,如脱敏、标注、去重、质量评估等,是否有示例,谢谢。
[
{
"conversations": [
{
"from": "system",
"value": "..."
},
{
"from": "human",
"value": "user instruction"
},
{
"from": "gpt",
"value": "model response"
},
{
"from": "human",
"value": "user instruction"
},
{
"from": "gpt",
"value": "model response"
},
...
]
},
{
"conversations": [
{
"from": "system",
"value": "..."
},
{
"from": "human",
"value": "user instruction"
},
{
"from": "gpt",
"value": "model response"
},
...
]
},
...
]
Additional 额外信息
No response