
How to skip to the last step and generate JSONL from Arrow format

Open gongysh2004 opened this issue 7 months ago • 0 comments

Before Asking

  • [x] I have read the README carefully.

  • [x] I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • [x] I have searched the Data-Juicer issues and found no similar questions.

Question

I have run through all the steps (mappers, filters) and reached the last step, "Creating json from Arrow format", but it uses up all of my 2 TB of memory, and the OOM killer even killed my SSH service. To debug this, I want to run the last step directly, skipping the earlier ones.

python -u $DJ_PATH/tools/process_data.py --config "$CONFIG" 2>&1 | tee "dj_log_${CONFIG%.*}$(date +"%Y%m%d%H%M%S").log"

`document_simhash_deduplicator_compute_hash (num_proc=32): 100%|##########| 24814668/24814668 [33:27<00:00, 12363.53 examples/s]
2025-04-18 18:45:40 | INFO | data_juicer.ops.deduplicator.document_simhash_deduplicator:137 - Start querying 24814668 samples.
2025-04-18 18:49:10 | INFO | data_juicer.ops.deduplicator.document_simhash_deduplicator:143 - Querying done, found 29337 matches.
2025-04-18 18:53:28 | INFO | data_juicer.ops.deduplicator.document_simhash_deduplicator:190 - Found 1586 clusters and 5290 hashes.
Filter: 100%|##########| 24814668/24814668 [12:36<00:00, 32793.28 examples/s]
2025-04-18 19:06:06 | INFO | data_juicer.ops.deduplicator.document_simhash_deduplicator:224 - Keep 24808074 samples after SimHash dedup.
2025-04-18 19:06:14 | INFO | data_juicer.core.data:226 - [6/6] OP [document_simhash_deduplicator] Done in 3612.918s. Left 24808074 samples.
2025-04-18 19:06:19 | INFO | data_juicer.utils.logger_utils:230 - Processing finished with: Warnings: 4 Errors: 0
Error/Warning details can be found in the log file [/root/autodl-tmp/minideepseek/v3/data/deep_clean/djed_slimpajama/log/export_djed_slimpajama.jsonl_time_20250418164741.txt] and its related log files.
2025-04-18 19:06:32 | INFO | data_juicer.core.executor:211 - All OPs are done in 7321.544s.
2025-04-18 19:06:32 | INFO | data_juicer.core.executor:214 - Exporting dataset to disk...
2025-04-18 19:06:32 | INFO | data_juicer.core.exporter:111 - Exporting computed stats into a single file...
Creating json from Arrow format: 0%| | 0/24809 00:00<?, ?ba/s`
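
For debugging the export step on its own, one option is to write the JSONL outside the pipeline in small batches, so that only one batch is held in memory at a time. The following is a minimal sketch of that idea, not Data-Juicer's exporter: the helper names `iter_batches` and `write_jsonl` are hypothetical, and it assumes the processed rows can be iterated as plain Python dicts (for example from whatever Arrow reader is available).

```python
from __future__ import annotations

import json
from typing import Iterable, Iterator


def iter_batches(records: Iterable, batch_size: int) -> Iterator[list]:
    """Group an iterable into lists of at most batch_size items (hypothetical helper)."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter batch
        yield batch


def write_jsonl(records: Iterable[dict], path: str, batch_size: int = 1000) -> int:
    """Write records to a JSONL file one batch at a time and return the row count.

    Only one batch is serialized at a time, so peak memory depends on
    batch_size rather than on the size of the whole dataset.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for batch in iter_batches(records, batch_size):
            f.write("".join(json.dumps(r, ensure_ascii=False) + "\n" for r in batch))
            written += len(batch)
    return written
```

If the rows are passed in as a lazy iterator rather than a fully materialized list, memory use should stay roughly proportional to `batch_size`. How to obtain such an iterator from the intermediate Arrow data depends on how the pipeline stores it, which is not shown in this issue.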

Additional

No response

gongysh2004, Apr 19 '25 03:04