QinLuo issues

Results: 30 issues of QinLuo

[BUG]: RuntimeError: value cannot be converted to type float without overflow

### 🐛 Describe the bug When using GeminiPlugin, I got a RuntimeError: `RuntimeError: value cannot be converted to type float without overflow` the full traceback: ``` Traceback (most recent call...

bug

[CLI]: Wandb finish hanging and 500 Server Error in debug-internal.log

### Describe the bug After executing `run.log({"a": 99.0, "c": 85.0, "custom_step": 1000}, step=None)` and subsequently closing it with `run.finish()`, the process hangs. The following warnings and upload progress messages are...

cli

How can I display the `hash` column first in aim-v4.x?

Now, the `hparams` are displayed at the beginning of the table, and the last column is `hash`, which looks like: ![image](https://github.com/aimhubio/aim/assets/1772912/56525273-4d02-4e8a-8484-9b7a8a828d07) One can go deeper by clicking the `hash` column: ![image](https://github.com/aimhubio/aim/assets/1772912/709f458a-60ef-4d40-a8e4-33a39b100ea8) and...

type / question

could we start UI from a remote repo: aim://host:port?

## ❓Question

type / question

[fsdp] impl save/load shard model/optimizer

## 📌 Checklist before creating the PR - [x] I have created an issue for this PR for traceability - [x] The title follows the standard format: `[doc/gemini/tensor/...]: A concise...

[FEATURE]: save/load sharded model/optimizer in TorchFSDPPlugin

### Describe the feature Saving and loading sharded models and optimizers is currently not implemented, so attempting it raises a `NotImplementedError`. How can one proceed to...

enhancement

[BUG]: RuntimeError: Failed to replace block_sparse_moe of type MixtralSparseMoeBlock with EPMixtralSparseMoeBlock with the exception: CUDA out of memory

### 🐛 Describe the bug When training the Mixture of Experts (MoE) model with code snippets in the application/ColossalMoE, I encountered Out of Memory (OOM) issues at the beginning. ```...

bug

[BUG]: bugs surfaced while training MoE(Mixtral)

### 🐛 Describe the bug With the main branch `applications/ColossalMoE`, I got the following error: ``` grad = grad.to(master_moe_param.dtype).to(master_moe_param.device) AttributeError: 'NoneType' object has no attribute 'to' ``` Start script: ``` NUM_GPU=2...

bug

[FEATURE]: Integrate GaLore into Colossalai Optimizer(Gemini/Hybrid)

### Describe the feature A recent paper titled "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection" (https://arxiv.org/pdf/2403.03507.pdf) demonstrates a remarkably memory-efficient approach to training large language models (LLMs)....

enhancement