xtuner issues

4

Traceback (most recent call last): File "/home/wumao/xtuner-main/xtuner/tools/model_converters/pth_to_hf.py", line 158, in main() File "/home/wumao/xtuner-main/xtuner/tools/model_converters/pth_to_hf.py", line 78, in main model = BUILDER.build(cfg.model) File "/home/wumao/miniconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build return self.build_func(cfg, *args, **kwargs,...

1518630367

Meaning of `time` in the log output

2

Log output: ``` eta: 0:00:03 time: 0.0768 data_time: 0.0137 memory: 9806 ``` What do `time` and `data_time` each mean? Do they include the time spent on gradient backpropagation? What is the unit of `memory`? Thanks.

shockjiang
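To inspect such a line programmatically, here is a minimal sketch that splits it into fields and reports what share of the iteration time is `data_time`. It assumes, without confirmation from the thread, that `data_time` is counted inside `time`. The helper name `parse_log_fields` is hypothetical and not part of xtuner or mmengine.

```python
import re


def parse_log_fields(line: str) -> dict:
    """Split 'key: value' pairs from a training log line into a dict of strings."""
    return dict(re.findall(r"(\w+):\s+(\S+)", line))


line = "eta: 0:00:03 time: 0.0768 data_time: 0.0137 memory: 9806"
fields = parse_log_fields(line)

# Assumption: data_time is a portion of time, so the ratio is a fraction in [0, 1].
data_share = float(fields["data_time"]) / float(fields["time"])
print(f"data loading share of iteration time: {data_share:.1%}")
```

A large share would point at the data pipeline rather than the model as the bottleneck, provided the assumption above holds.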

Unable to start training; mmengine seems to be the problem

4

During training, the program stops after printing the following. How can this be resolved? `2024-05-15 09:29:44.939294: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-05-15 09:29:44.939347: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register...

Dominic23331

Error: FileNotFoundError: [Errno 2] No such file or directory: '/app/work_dirs/chatglm2_6b_qlora_lawyer_e3_copy/20240514_035914/vis_data/eval_outputs_iter_499.txt'

1

After changing the TensorBoard log directory, fine-tuning fails when it reaches save_step with FileNotFoundError: [Errno 2] No such file or directory: '/app/work_dirs/chatglm2_6b_qlora_lawyer_e3_copy/20240514_035914/vis_data/eval_outputs_iter_499.txt'. Debugging shows that once save_step is reached, xtuner calls the _save_eval_output method in evaluate_chat_hook.py: ```python def _save_eval_output(self, runner, eval_outputs): save_path = os.path.join(runner.log_dir, 'vis_data', f'eval_outputs_iter_{runner.iter}.txt') with open(save_path, 'w', encoding='utf-8') as f: for i, output in...

rcejzibjks38
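The error suggests that the `vis_data` directory does not exist under the log directory when the file is opened. A minimal sketch of one possible workaround is to create the parent directory before writing. `save_eval_output` below is a hypothetical standalone function that mirrors the path logic quoted in the report; it is not xtuner's actual code or fix, and the loop body is an assumption because the original snippet is truncated.

```python
import os


def save_eval_output(log_dir: str, iteration: int, eval_outputs: list) -> str:
    """Write eval outputs to <log_dir>/vis_data/eval_outputs_iter_<iteration>.txt."""
    save_dir = os.path.join(log_dir, "vis_data")
    # Create the directory in case a relocated log dir means it was never made.
    os.makedirs(save_dir, exist_ok=True)
    save_path = os.path.join(save_dir, f"eval_outputs_iter_{iteration}.txt")
    with open(save_path, "w", encoding="utf-8") as f:
        # Assumed output format; the original loop body is not shown in the report.
        for i, output in enumerate(eval_outputs):
            f.write(f"Eval output {i + 1}:\n{output}\n\n")
    return save_path
```

Using `exist_ok=True` keeps the call safe when the directory is already present, so the same code path works for both the default and a customized log location.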

How to pretrain llama3 with a 128k sequence length on 8*A100?

2

According to the chart in the README this should be trainable, but I keep running out of memory (OOM).

1518630367

Sample count decreases while the data is being loaded

1

As shown in the figure, there were originally more than 30k samples, but only about 4k remain at the end. How can I track down the cause? The script is as follows: [rm -rf llama3_finetune_pth/* output_dir=llama3_finetune_pth config_py=xtuner/configs/llama/llama3_8b_instruct/llama3_8b_instruct_qlora_alpaca_e3.py CUDA_VISIBLE_DEVICES=0,1 NPROC_PER_NODE=2 xtuner train ${config_py} --work-dir ${output_dir} --deepspeed deepspeed_zero2 --seed 1024](url)

Jason8Kang
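One possible cause, which the report does not confirm, is sequence packing. If the dataset config concatenates tokenized samples into fixed-length sequences (xtuner configs commonly expose a `pack_to_max_length` option), the reported count is the number of packed sequences rather than the number of original samples. The sketch below only illustrates the arithmetic; `packed_count` and all numbers in it are hypothetical.

```python
import math


def packed_count(sample_lengths: list, max_length: int) -> int:
    """Number of fixed-length sequences after concatenating all samples' tokens."""
    total_tokens = sum(sample_lengths)
    return math.ceil(total_tokens / max_length)


# Hypothetical data: 30,000 samples of 280 tokens each, packed to 2048 tokens.
lengths = [280] * 30_000
print(len(lengths), "samples ->", packed_count(lengths, 2048), "packed sequences")
```

If this is what is happening, rerunning the data-loading step with packing disabled should report the original sample count, which would distinguish packing from samples being filtered out.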

OSError: We couldn't connect to 'https://huggingface.co' to load this file

16

Traceback (most recent call last): File "/root/autodl-tmp/xtuner/xtuner/tools/model_converters/pth_to_hf.py", line 168, in main() File "/root/autodl-tmp/xtuner/xtuner/tools/model_converters/pth_to_hf.py", line 81, in main model = BUILDER.build(cfg.model) File "/root/miniconda3/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build return self.build_func(cfg, *args, **kwargs,...

Artist2001
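When the machine cannot reach huggingface.co, one common approach is to point the config at a model directory that is already on disk and tell the Hugging Face libraries not to attempt network access. A minimal sketch, assuming the model files were downloaded beforehand; the helper name `use_local_model` is hypothetical. `HF_HUB_OFFLINE` is an environment variable read by `huggingface_hub`, and it should be set before those libraries are imported.

```python
import os


def use_local_model(model_dir: str) -> str:
    """Check that a local model directory looks usable and switch HF libs to offline mode."""
    config_file = os.path.join(model_dir, "config.json")
    if not os.path.isfile(config_file):
        raise FileNotFoundError(f"no config.json under {model_dir}")
    # Must be set before importing transformers / huggingface_hub.
    os.environ["HF_HUB_OFFLINE"] = "1"
    return os.path.abspath(model_dir)
```

The returned absolute path could then be used wherever the config currently names a Hub model ID, so that the loader reads from disk instead of resolving the name online.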

Which parameters need to be set when training on Huawei Ascend NPU cards?

1

I currently have Huawei NPU cards available for training and testing, but I am not sure which parameters need to be set.

apachemycat


[Bug] fix internlm2 flash attn

[Docs] Add RLHF content

The following error occurs when merging llama3; the same problem also appears when using zero3
