CoLLiE
Collaborative Training of Large Language Models in an Efficient Way
Hello, how can I make the saved model sharded instead of saving one huge checkpoint of tens of GB? I went through the parameters here, but the model always seems to be saved as a whole rather than in shards. Is there a way to do this?
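Setting CoLLiE's own saving options aside, the splitting the question asks about is just greedy packing of tensors into size-capped shards. A minimal sketch of that logic; the `sizeof` callback and the helper name are illustrative, not CoLLiE parameters:

```python
def shard_state_dict(state_dict, sizeof, max_shard_bytes):
    """Greedily pack tensors into shards no larger than max_shard_bytes.

    `sizeof(tensor)` should return the tensor's byte size (for torch
    tensors: tensor.numel() * tensor.element_size()). A single tensor
    larger than the limit still gets a shard of its own.
    """
    shards, current, current_size = [], {}, 0
    for name, tensor in state_dict.items():
        size = sizeof(tensor)
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = {}, 0
        current[name] = tensor
        current_size += size
    if current:
        shards.append(current)
    return shards

# Each shard can then be written separately, e.g. (requires torch):
# for i, shard in enumerate(shards):
#     torch.save(shard, f"pytorch_model-{i:05d}.pt")
```

This mirrors what `max_shard_size`-style sharded saving does elsewhere in the ecosystem: a shard index maps each parameter name to the file that holds it.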
![image](https://github.com/OpenLMLab/collie/assets/37737346/6af3780f-45c1-4a3f-8760-e871b0183400) This fails if `clip_grad_norm` is not None, so does adalomo simply not need `clip_grad_norm`?
Hi @KaiLv69, thanks for the writeup and implementation of AdaLomo! It looks like it is [missing](https://github.com/OpenLMLab/collie/blob/dev/collie/optim/adalomo.py) the `step` method that torch requires for use in other frameworks. Can you...
My guess is that the parallel size or batch size is set incorrectly (it does not divide the dataset size evenly), so some data may be processed more than once.
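The divisibility concern above can be checked up front before launching a run; a tiny sanity check (function and argument names are illustrative):

```python
def splits_evenly(num_samples, dp_size, batch_size):
    """True when the dataset divides evenly into global batches
    (data-parallel size x per-rank batch size), so no sample has to be
    dropped or duplicated to pad out the last batch."""
    return num_samples % (dp_size * batch_size) == 0
```

If this returns False, either trim the dataset, adjust the batch size, or rely on a sampler that explicitly drops the ragged tail.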
Using the data class `_ShardContainer` raises the error below; what could be the cause? (dev branch) `ValueError: mmap closed or invalid` ![image](https://github.com/OpenLMLab/collie/assets/111968302/fb31dbed-ff8c-40ca-86d0-5a5960e1b0fc)
Hi there! I saved a parallel state_dict (requires_grad True only) with 8 GPUs remotely. How can I load these state_dicts and save them as a single checkpoint locally? Thanks in advance. ``` collie_dp0_pp0_tp0.pt collie_zero_dp0_pp0_tp0.pt...
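Assuming the shards are ordinary state_dicts saved with `torch.save` and that any parameter replicated across ranks is identical in each copy, a minimal merge could look like this (the helper name is mine, not a CoLLiE API):

```python
def merge_state_dicts(shards):
    """Merge a list of state_dicts into one.

    Parameter names are assumed to be disjoint across shards; for
    replicated parameters that appear in several shards, the first
    occurrence wins.
    """
    merged = {}
    for shard in shards:
        for name, tensor in shard.items():
            merged.setdefault(name, tensor)
    return merged

# Usage (requires torch):
# import torch
# paths = ["collie_dp0_pp0_tp0.pt", ...]  # one file per parallel rank
# shards = [torch.load(p, map_location="cpu") for p in paths]
# torch.save(merge_state_dicts(shards), "merged.pt")
```

Note this simple concatenation only works when each rank holds whole tensors; ZeRO-partitioned shards (the `collie_zero_*` files) would first need their flattened partitions reassembled.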
As a subclass of `torch.optim.Optimizer`, could `collie.optim.Lomo` support `param_groups` by calling `super().__init__(params, defaults)`? That way we could fit more schedulers and use the per-parameter-options method to filter out some modules...
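What the request amounts to can be sketched as follows. This is illustrative only, not CoLLiE's code: the point is that routing parameters through `torch.optim.Optimizer.__init__` populates `self.param_groups`, which is what LR schedulers and per-parameter options rely on; the update rule below is a plain-SGD placeholder, not Lomo's fused-backward update.

```python
import torch

class LomoLike(torch.optim.Optimizer):
    """Sketch: registering params via the base class gives param_groups."""

    def __init__(self, params, lr=1e-3):
        defaults = dict(lr=lr)
        super().__init__(params, defaults)  # builds self.param_groups

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-group["lr"])  # SGD placeholder

# Per-parameter groups then work as with any torch optimizer:
# opt = LomoLike([{"params": model.body.parameters(), "lr": 1e-4},
#                 {"params": model.head.parameters()}], lr=1e-3)
```

Because schedulers like `torch.optim.lr_scheduler.StepLR` only read and write `optimizer.param_groups[i]["lr"]`, this change alone would make them compatible.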
Due to the Grouped-Query Attention introduced in LLaMA-2 70B ([llama issue](https://github.com/facebookresearch/llama/issues/384)), it cannot be loaded into the collie implementation of LLaMA. Hope LLaMA-2 70B can be supported in collie. Thanks ``` Traceback...
With 8 V100 GPUs, ZeRO-3 enabled, TP=1, PP=1, DP=8, loading the llama 70B model via `LlamaForCausalLM.from_pretrained` hits an OOM (host memory, not GPU memory; 512 GB of physical RAM). The cause is in the dev branch, base.py line 304: state_dict = {} if not is_zero3_enabled(config) or env.dp_rank == 0 \ or config.low_cpu_mem_usage or config.quantization_config.load_in_8bit \...
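The fix the report points at is the standard ZeRO-3 loading pattern: only one data-parallel rank should materialize the full checkpoint in host RAM, with weights scattered to the other ranks afterwards. A sketch of that gating predicate (function and argument names are mine, not CoLLiE's):

```python
def rank_loads_full_checkpoint(zero3_enabled, dp_rank):
    """Standard ZeRO-3 pattern: without ZeRO-3 every rank needs the full
    checkpoint; with ZeRO-3 only dp_rank 0 loads it and the partitions
    are distributed afterwards. Loading on all DP ranks is what
    multiplies a 70B checkpoint's host-RAM footprint by DP=8."""
    return (not zero3_enabled) or dp_rank == 0
```

With DP=8 and a roughly 140 GB fp16 checkpoint, loading on every rank would need over 1 TB of host memory, which matches the reported OOM on a 512 GB machine.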