刘维克

Results: 10 comments of 刘维克

# I ran into the same problem and solved it by modifying the training launch command. - I hit the same error: ` 07/29 22:44:31 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry...

> # I ran into the same problem and solved it by modifying the training launch command. > * I hit the same error: ` 07/29 22:44:31 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current...

# DEBUG solved: I added ninja's location to `PATH`, and `deepspeed` now works; the problem is fully resolved. - I followed a nearly identical issue, [Training launch with deepspeed_zero2 fails #80](https://github.com/InternLM/xtuner/issues/80). The maintainer suggested:

> For now this does not look like insufficient GPU memory; I can start DeepSpeed training normally on two T4s.
> I suspect a broken DeepSpeed installation. Could you run `ds_report` and check for errors?
> If that command reports no problems, try running one of the official DeepSpeed example scripts, such as [DeepSpeed_CIFAR](https://github.com/microsoft/DeepSpeedExamples/tree/master/training/cifar), to verify that DeepSpeed can start.

I ran that official check script and found that `ninja` was not being detected. After adding the following in Jupyter:

```python
import os
os.environ['PATH'] += ':/home/aistudio/.local/bin'  # for ninja
os.environ['PATH'] += ':/home/aistudio/.local/lib/python3.10/site-packages/ninja/data/bin'
```

`deepspeed` training worked perfectly! Attached command: `!/home/aistudio/.local/bin/xtuner...
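As a quick sanity check (a minimal sketch; the two paths are specific to the AI Studio setup above and will differ on other machines), you can verify that `ninja` is discoverable after patching `PATH`:

```python
import os
import shutil

# Directories that contain the ninja binary in the AI Studio environment
# described above (assumed paths; adjust for your machine).
for extra in ('/home/aistudio/.local/bin',
              '/home/aistudio/.local/lib/python3.10/site-packages/ninja/data/bin'):
    if extra not in os.environ.get('PATH', ''):
        os.environ['PATH'] += ':' + extra

# shutil.which returns None when the executable cannot be found on PATH,
# which is exactly the condition that made DeepSpeed's JIT build fail.
print('ninja found at:', shutil.which('ninja'))
```

If this still prints `None`, DeepSpeed's extension build will keep failing, so fix the path before retrying the training command.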

> [@jhaggle](https://github.com/jhaggle) I think the problem is not the list comprehension by itself but the fact that `label` can also include elements with the `ignore_index` (default 255). This makes the...

`'RandomResize'` raises `RuntimeError: stack expects each tensor to be equal size, but got [1, 512, 512] at entry 0 and [1, 512, 527] at entry 5`, so I changed...
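The underlying failure is generic: `stack` requires every tensor in a batch to have the same shape, and a resize that preserves aspect ratio produces varying widths. A minimal NumPy illustration of the same collation failure (not the original mmseg code):

```python
import numpy as np

a = np.zeros((1, 512, 512))
b = np.zeros((1, 512, 527))  # resize kept the aspect ratio, so the width differs

# Collating the batch fails because the shapes disagree.
try:
    np.stack([a, b])
except ValueError as e:
    print('stack failed:', e)

# Cropping (or padding) every sample to a fixed size first makes collation work.
b_fixed = b[:, :, :512]
batch = np.stack([a, b_fixed])
print(batch.shape)  # (2, 1, 512, 512)
```

This is why pipelines usually follow `RandomResize` with a fixed-size `RandomCrop` (or padding) before batching.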

This is strange. From the information you provided, you have successfully started 12 `num_workers`, yet only two CPU threads are occupied. Does your server have...
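To rule out an under-provisioned or container-limited machine, a quick sketch for comparing `num_workers` against the CPUs the process can actually use (cgroup limits in containers can make `os.cpu_count()` misleading, so `sched_getaffinity` is checked where available):

```python
import os

num_workers = 12  # the DataLoader setting from the discussion above

logical_cpus = os.cpu_count()
try:
    # On Linux this reflects the CPUs this process is allowed to run on.
    usable_cpus = len(os.sched_getaffinity(0))
except AttributeError:  # sched_getaffinity is not available on macOS/Windows
    usable_cpus = logical_cpus

print(f'logical CPUs: {logical_cpus}, usable by this process: {usable_cpus}')
if num_workers > usable_cpus:
    print('num_workers exceeds usable CPUs; worker processes will contend for cores')
```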

Just copy `layoutlmv3-base-finetuned-publaynet/config.json` to `/content/output_dir/`; the code needs it.

`!pip install autogluon scikit-learn==1.5.2` works on __kaggle T4x2__. Ref: ['super' object has no attribute '__sklearn_tags__'](https://stackoverflow.com/questions/79290968/super-object-has-no-attribute-sklearn-tags)

I added `assigned_gt_index = paddle.cast(assigned_gt_index, dtype="int32")` before the line `assigned_gt_index = assigned_gt_index + batch_ind * num_max_boxes`. It works for `paddle3.0.0b1` both on `Ubuntu 22.04 cu188 RTX4090` and `WSL2...
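The pattern behind this fix is dtype alignment before arithmetic: offsetting per-image box indices into a flat batch index only works when both operands share a dtype. A NumPy sketch of the same idea (illustrative only, not the PaddleDetection code; the shapes and values are made up):

```python
import numpy as np

num_max_boxes = 4
# Per-image assigned gt indices for a batch of 2 images
# (int64 here, standing in for the mismatched dtype in the original code).
assigned_gt_index = np.array([[0, 1, 1, 3],
                              [2, 0, 3, 1]], dtype=np.int64)
batch_ind = np.arange(2, dtype=np.int32).reshape(-1, 1)

# Cast first so the addition happens in a single dtype, mirroring the fix above.
assigned_gt_index = assigned_gt_index.astype(np.int32)
flat_index = assigned_gt_index + batch_ind * num_max_boxes
print(flat_index)
# [[0 1 1 3]
#  [6 4 7 5]]
```

NumPy promotes mixed dtypes silently, but Paddle is stricter about dtype mismatches in elementwise ops, which is why the explicit `paddle.cast` is needed there.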

Starting from `>=0.1.21` ([[Bugs] fix dispatch bugs](https://github.com/InternLM/xtuner/commit/c2328a02531ed17a96aef1c82584118fe2bac6bf)), xtuner checks for `rope_theta` by default. Newer models such as the `internlm2` series include this parameter in their `config.json`, but older models such as `internlm-chat-7b` do not: I checked the latest tags of the `internlm-chat-7b` repos on both ModelScope and HF, and neither has it. If you use a recent xtuner with an older `internlm` model, you can manually add the line `"rope_theta": 1000000` to `config.json`. I have verified that training works after this change, though I do not understand what this parameter means or what other consequences the change might have.
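A minimal sketch for patching an old-style `config.json` in place (the file contents here are a made-up stand-in for an old InternLM config, and whether `1000000` is the correct theta for `internlm-chat-7b` is an assumption, as noted above):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical old-style config that predates the rope_theta parameter.
old_config = {'model_type': 'internlm', 'hidden_size': 4096}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / 'config.json'
    config_path.write_text(json.dumps(old_config))

    config = json.loads(config_path.read_text())
    # Add rope_theta only when the model's config lacks it; existing
    # values (as in internlm2 configs) are left untouched.
    config.setdefault('rope_theta', 1000000)
    config_path.write_text(json.dumps(config, indent=2))

    print(json.loads(config_path.read_text())['rope_theta'])  # 1000000
```

Using `setdefault` keeps the edit idempotent, so running it against a newer model that already defines `rope_theta` changes nothing.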