Yilun Huang comments

Results 39 comments of Yilun Huang

Error when running python tools/process_data.py --config configs/demo/process.yaml after setting up the environment

Close this stale issue.

process_data.py pre-start is too slow (the data processing script starts too slowly)

Yes. For a quick run or a small dataset, it's indeed a problem. We think loading some heavy dependencies for the first time might cause this. But we will profile...

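How to prepare my own data format before using operators such as image_caption_mapper

Hi @Crazy-JY, thanks for your interest in and use of Data-Juicer! In short, each sample in the dataset needs to be organized in the format described [here](https://github.com/modelscope/data-juicer/blob/main/tools/fmt_conversion/README_ZH.md). In your case, if you only need to use image_caption_mapper to process a few existing images, then besides the images you also need a dataset file. Taking the jsonl format as an example, you may need to create a `dataset.jsonl` file for these images, in which the sample for each image can simply be prepared as:

```json
{ "text": "<__dj__image>", "images": ["/path/to/img1"] }
```

Because the initial images have no corresponding captions, the text field contains only the image special token as a placeholder, indicating that the sample contains one image; the images field is a list holding the path of the image that belongs to the sample. This dataset can be generated with this code snippet:

```python
import os
import jsonlines
from data_juicer.utils.mm_utils import SpecialTokens

image_dir = 'data'  # path to the directory containing the images
dataset_file...
```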
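The sample layout described in this comment can be illustrated with a minimal, standard-library-only sketch. It is not the author's snippet: the helper names `build_samples` and `write_jsonl` and the extension filter `IMAGE_EXTS` are hypothetical, and the placeholder string `<__dj__image>` is assumed to match the value Data-Juicer exposes as `SpecialTokens.image`; in real use, importing the token from `data_juicer.utils.mm_utils` is safer than hard-coding it.

```python
import json
import os

# Assumption: the image placeholder token used by Data-Juicer. In the library
# it is available as SpecialTokens.image in data_juicer.utils.mm_utils.
IMAGE_TOKEN = "<__dj__image>"

# Hypothetical filter so that non-image files in the directory are skipped.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def build_samples(image_dir):
    """Return one sample dict per image file found directly in image_dir."""
    samples = []
    for name in sorted(os.listdir(image_dir)):
        if os.path.splitext(name)[1].lower() in IMAGE_EXTS:
            samples.append({
                # No caption yet: the text holds only the image placeholder.
                "text": IMAGE_TOKEN,
                # The images field lists the path of this sample's image.
                "images": [os.path.join(image_dir, name)],
            })
    return samples


def write_jsonl(samples, dataset_file):
    """Write samples in JSON Lines format: one JSON object per line."""
    with open(dataset_file, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

Calling `write_jsonl(build_samples('data'), 'dataset.jsonl')` would then produce one line per image under `data`, and the resulting file can be referenced as the dataset path in the processing config.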

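How to prepare my own data format before using operators such as image_caption_mapper

By default, the `image_captioning_mapper` operator supports models like BLIP-2. VLMs such as the InternVL2_5-2B you are using have their own tokenization and their own generate or chat interfaces, so they do not match this operator's implementation well. We suggest implementing a new operator based on this operator's implementation and the InternVL2 usage examples. The later runs produced no output because they reused the cache from the first, failed run. When testing, you can set `use_cache: false` in the config file to disable the cache, and turn it back on for large-scale data processing.

How to prepare my own data format before using operators such as image_caption_mapper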

Close this stale issue.

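The website reports an error when using the playground provided through JupyterLab

Hi, we have restored the playground, so please try again. Its content is somewhat out of date at the moment; we plan to update it soon, so stay tuned!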

Add mllm_mapper

Closing because this PR was included in PR #550.

[NewOp] Add generate_challenging_qa_mapper based on MindGYM principles

Please merge the latest main branch and run pre-commit locally.

Evalscope evaluator & MedEval evaluator for dj-sandbox

Please merge the latest main branch.

Evalscope evaluator & MedEval evaluator for dj-sandbox

> > Please merge the latest main branch.
>
> There may be an issue with pre-commit regarding the following three files. In my local pre-commit process, the `import wandb`...