Telegram: support chat history exported from Settings
In the settings, make_dataset_args.platform has been changed to telegram, and my_id has been set to the correct from_id. The dataset looks like this:

dataset/
└── telegram/
    ├── ChatExport_2025-01-01/
    │   ├── result.json
    │   └── chats/
    │       ├── chat_01/
    │       ├── chat_02/
    │       └── chat_03/
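After running weclone-cli make-dataset, the ChatExport_2025-01-01 folder is detected. I'm not sure whether the problem is related to the data format exported by Telegram Desktop. I did not select Photo when exporting, and the chats folder contains only stickers. I only want to process the text content. Deleting the chats folder did not help either.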
Could you post the format of your result.json? Are you on the Mac client? Your export structure looks different from the one produced on Windows.
I exported with both the Telegram Desktop client on Mac (Mac seems to have two clients) and the Windows client, and both produce this same format:
{
  "about": "Here is the data you requested. Remember: Telegram is ad free, it doesn't use your data for ad targeting and doesn't sell it to others. Telegram only keeps the information it needs to function as a secure and feature-rich cloud service.\n\nCheck out Settings > Privacy & Security on Telegram's mobile apps for the relevant settings.",
  "chats": {
    "about": "This page lists all chats from this export.",
    "list": [
      {
        "name": "Mino",
        "type": "personal_chat",
        "id": 1838962236,
        "messages": [
          { "id": 1, "type": "message", "date": "2025-07-11T15:36:11", "date_unixtime": "1752219371", "from": "Mino", "from_id": "user1838962236", "text": "hello", "text_entities": [ { "type": "plain", "text": "hello" } ] },
          { "id": 2, "type": "message", "date": "2025-07-11T15:36:23", "date_unixtime": "1752219383", "from": "Mino", "from_id": "user1838962236", "text": "测试 1", "text_entities": [ { "type": "plain", "text": "测试 1" } ] },
          { "id": 3, "type": "message", "date": "2025-07-11T15:36:41", "date_unixtime": "1752219401", "from": "Mino", "from_id": "user7791826849", "text": "测试 2", "text_entities": [ { "type": "plain", "text": "测试 2" } ] },
          { "id": 4, "type": "message", "date": "2025-07-11T15:36:50", "date_unixtime": "1752219410", "from": "Mino", "from_id": "user1838962236", "text": "测试 3", "text_entities": [ { "type": "plain", "text": "测试 3" } ] },
          { "id": 5, "type": "message", "date": "2025-07-11T15:36:56", "date_unixtime": "1752219416", "from": "Mino", "from_id": "user1838962236", "photo": "chats/chat_1/photos/photo_1@11-07-2025_15-36-56.jpg", "photo_file_size": 82416, "width": 1280, "height": 669, "text": "", "text_entities": [] },
          { "id": 6, "type": "message", "date": "2025-07-11T15:37:04", "date_unixtime": "1752219424", "from": "Mino", "from_id": "user1838962236", "file": "(File not included. Change data exporting settings to download.)", "file_name": "AnimatedSticker.tgs", "file_size": 25546, "thumbnail": "(File not included. Change data exporting settings to download.)", "thumbnail_file_size": 4952, "media_type": "sticker", "sticker_emoji": "😂", "mime_type": "application/x-tgsticker", "width": 512, "height": 512, "text": "", "text_entities": [] },
          { "id": 7, "type": "message", "date": "2025-07-11T15:37:12", "date_unixtime": "1752219432", "from": "Mino", "from_id": "user7791826849", "text": "测试 4", "text_entities": [ { "type": "plain", "text": "测试 4" } ] },
          { "id": 8, "type": "message", "date": "2025-07-11T15:37:26", "date_unixtime": "1752219446", "from": "Mino", "from_id": "user7791826849", "photo": "chats/chat_1/photos/photo_2@11-07-2025_15-37-26.jpg", "photo_file_size": 140425, "width": 1280, "height": 853, "text": "", "text_entities": [] },
          { "id": 9, "type": "message", "date": "2025-07-11T15:37:32", "date_unixtime": "1752219452", "from": "Mino", "from_id": "user7791826849", "file": "(File not included. Change data exporting settings to download.)", "file_name": "AnimatedSticker.tgs", "file_size": 8244, "thumbnail": "(File not included. Change data exporting settings to download.)", "thumbnail_file_size": 2750, "media_type": "sticker", "sticker_emoji": "😂", "mime_type": "application/x-tgsticker", "width": 512, "height": 512, "text": "", "text_entities": [] }
        ]
      }
    ]
  }
}
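For anyone who only needs the text, here is a minimal Python sketch of how the full-export result.json shown above could be filtered down to text messages. It is not WeClone's own loader; the function names (flatten_text, iter_text_messages, load_export) are made up for illustration. It also assumes that "text" may arrive as a list mixing strings and entity dicts when a message contains formatting, which does not occur in the sample above.

```python
import json
from pathlib import Path


def flatten_text(text):
    """Return message text as one string.

    In the sample "text" is always a plain string; the list branch is an
    assumption for messages that carry formatting entities.
    """
    if isinstance(text, str):
        return text
    parts = []
    for piece in text:
        parts.append(piece if isinstance(piece, str) else piece.get("text", ""))
    return "".join(parts)


def iter_text_messages(export):
    """Yield (chat_name, from_id, text) for every message with non-empty text.

    Photo and sticker messages in the sample have an empty "text" field,
    so they are dropped without touching the chats/ folder at all.
    """
    for chat in export.get("chats", {}).get("list", []):
        for msg in chat.get("messages", []):
            if msg.get("type") != "message":
                continue
            text = flatten_text(msg.get("text", ""))
            if text.strip():
                yield chat.get("name"), msg.get("from_id"), text


def load_export(path):
    """Read a result.json file (hypothetical helper, UTF-8 assumed)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling list(iter_text_messages(load_export("dataset/telegram/ChatExport_2025-01-01/result.json"))) would then give only the text rows, with from_id available for comparison against my_id.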
The test content is roughly as shown above. The directory layout is:
This is the version I'm on; its export format is completely different from yours.
Got it. I exported a single chat from the menu in the top-right corner of that chat, while you exported everything at once from Settings.
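A rough Python sketch of how the two export styles could be brought to one shape before processing. The full-export layout (chats.list) comes from the result.json posted in this thread. The single-chat layout is an assumption, since no sample was posted here: the sketch guesses that a single-chat result.json is one chat object with "messages" at the top level. normalize_chats is a hypothetical helper, not part of weclone-cli.

```python
def normalize_chats(export):
    """Return a list of chat objects for either export style.

    - Full export from Settings: chats live under export["chats"]["list"].
    - Single-chat export (assumed layout): the file itself is one chat
      object with a top-level "messages" list.
    """
    if "chats" in export:
        return list(export["chats"].get("list", []))
    if "messages" in export:
        return [export]
    return []
```

With this, downstream code can loop over normalize_chats(data) without caring which export button was used.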
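Understood. One more question: can I train on Telegram data and WeChat data together? Does the preprocessing step wipe the data that was preprocessed earlier? I ask because the settings only allow a single platform.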
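Preprocessing does clear the previous output. You can preprocess each platform separately and then manually merge the results into a single sft-my.json file.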
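The manual merge could look something like the Python sketch below. It assumes each preprocessed file is a JSON array of records at the top level, which this thread does not confirm, so check one of your files first. merge_sft_files and the backup file names in the usage note are hypothetical.

```python
import json
from pathlib import Path


def merge_sft_files(inputs, output):
    """Concatenate several JSON-array files into one and return the record count.

    Assumption: every input file holds a top-level JSON list.
    """
    merged = []
    for path in inputs:
        records = json.loads(Path(path).read_text(encoding="utf-8"))
        if not isinstance(records, list):
            raise ValueError(f"{path} does not contain a top-level JSON array")
        merged.extend(records)
    # ensure_ascii=False keeps Chinese text readable in the output file.
    Path(output).write_text(
        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(merged)
```

A typical flow would be: preprocess with platform set to telegram and copy the result aside (for example as sft-telegram.json), repeat for WeChat (sft-wechat.json), then call merge_sft_files(["sft-telegram.json", "sft-wechat.json"], "sft-my.json").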