Question: How does DLRover integrate with Llama Factory?

Open heting-bes opened this issue 1 year ago • 1 comments

Aug 21 '24 06:08 heting-bes

My first instinct was to modify examples/pytorch/nanogpt/elastic_job.yaml:

command:
  - /bin/bash
  - -c
  - "dlrover-run --network-check --nnodes=$NODE_NUM
    --nproc_per_node=1 --max_restarts=1
    ./examples/pytorch/nanogpt/train.py
    --data_dir /data/nanogpt/"

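After changing it to the following, it fails with an error saying the file llamafactory-cli cannot be found. Does this mean dlrover-run must be followed by a train.py file?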

command:
  - /bin/bash
  - -c
  - "dlrover-run --network-check --nnodes=$NODE_NUM
    --nproc_per_node=1 --max_restarts=1
    llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml"

Aug 21 '24 06:08 heting-bes

You should encapsulate the usage of your CLI within your own training script.
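A minimal sketch of what such a wrapper script might look like. It rests on assumptions not confirmed in this thread: that `llamafactory.train.tuner.run_exp` is the function behind `llamafactory-cli train`, and that it reads its configuration from `sys.argv`. The file name `train_llamafactory.py` and the helper `training_args` are hypothetical.

```python
"""train_llamafactory.py: hypothetical wrapper that gives dlrover-run a Python script to launch."""
import sys


def training_args(argv):
    """Return everything after the script name, i.e. the arguments left for the trainer."""
    return list(argv[1:])


def main():
    # Imported lazily so this file can be loaded without LLaMA-Factory installed.
    # Assumption: run_exp is the entry point used by `llamafactory-cli train`
    # and parses sys.argv itself (for example a YAML config path).
    from llamafactory.train.tuner import run_exp

    run_exp()


if __name__ == "__main__":
    if training_args(sys.argv):
        main()
    else:
        # No config given: print a hint instead of starting a run.
        print("usage: train_llamafactory.py <config.yaml>", file=sys.stderr)
```

With a file like this in place, the last line of the command in the job spec would presumably become `./train_llamafactory.py examples/train_lora/llama3_lora_sft.yaml` instead of `llamafactory-cli train ...`, so that dlrover-run is handed a script path in the same way as in the nanogpt example.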

Nov 27 '24 09:11 BalaBalaYi

This issue has been automatically marked as stale because it has not had recent activity.

Feb 26 '25 01:02 github-actions[bot]

This issue is being automatically closed due to inactivity.

Mar 05 '25 01:03 github-actions[bot]