
Results 109 comments of One

The same issue happened on the latest version, `v0.114.2`, when the host has the home directory mounted as a remote filesystem. It was also solved by setting `"remote.SSH.useExecServer": false`.

We've tested the following installation instructions and pip package. What error message did you encounter? Could you post it here?
```
conda create -y --name openchat
conda activate openchat
conda...
```

Thanks! I have tested the kernel and it does work. However, the padding elements may be uninitialized, resulting in NaN/inf in the forward and backward passes. Can we include a...

BTW, here is the code used for testing:
```python
from typing import Any

import torch
from tqdm import tqdm
from flash_attn import flash_attn_varlen_func


def test_flash_attn_padding(
    seed: int = 0,
    test_rounds:...
```
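Since the snippet above is cut off in this listing, here is a minimal self-contained sketch of that kind of finiteness check: pack a few variable-length sequences, run `flash_attn_varlen_func` forward and backward, and assert that nothing non-finite appears. The sequence lengths, head count, and dtype are illustrative assumptions, and this sketch does not reproduce the uninitialized-padding scenario itself:

```python
import torch
from flash_attn import flash_attn_varlen_func

torch.manual_seed(0)
device, dtype = "cuda", torch.bfloat16
n_heads, head_dim = 8, 64
seqlens = [5, 17, 3]  # a small ragged batch, packed without padding
total = sum(seqlens)

# cumulative sequence boundaries, as expected by the varlen kernel
cu_seqlens = torch.zeros(len(seqlens) + 1, dtype=torch.int32, device=device)
cu_seqlens[1:] = torch.cumsum(torch.tensor(seqlens, device=device), dim=0)
max_seqlen = max(seqlens)

def packed():
    return torch.randn(total, n_heads, head_dim, device=device, dtype=dtype,
                       requires_grad=True)

q, k, v = packed(), packed(), packed()
out = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens,
                             max_seqlen, max_seqlen, causal=True)
out.sum().backward()

# any NaN/inf contamination would show up as non-finite values here
for name, t in [("out", out), ("dq", q.grad), ("dk", k.grad), ("dv", v.grad)]:
    assert torch.isfinite(t).all(), f"non-finite values in {name}"
print("forward and backward outputs are finite")
```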

@Nerogar Thanks for your experiments! I'll try your implementation.

BTW, one possible alternative: the 16+8 optimizer (https://arxiv.org/pdf/2309.12381.pdf). It stores 8 extra mantissa bits, achieving the same model accuracy as an FP32 optimizer at the low cost of 16% more VRAM.

@AmericanPresidentJimmyCarter Thanks for your implementation! I saw the comment about unstable weight decay. Could you please try adding the weight decay to the update and then stochastically add the update...
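Something along these lines is what I have in mind; the helper names, hyperparameters, and the rounding helper below are illustrative placeholders, not the referenced implementation:

```python
import torch

def stochastic_round_to_bf16(x: torch.Tensor) -> torch.Tensor:
    """Round fp32 -> bf16 by adding random bits below the bf16 mantissa, then truncating."""
    bits = x.float().contiguous().view(torch.int32)
    noise = torch.randint(0, 1 << 16, bits.shape, dtype=torch.int32, device=x.device)
    # masking off the low 16 bits after the noisy add rounds each value to one of its
    # two bf16 neighbors with probability proportional to distance (ignores fp32 overflow edge cases)
    return ((bits + noise) & -65536).view(torch.float32).to(torch.bfloat16)

@torch.no_grad()
def apply_update_(p_bf16: torch.Tensor, update32: torch.Tensor, lr: float, weight_decay: float):
    # fold (decoupled) weight decay into the update in fp32 first ...
    w32 = p_bf16.float()
    step = update32 + weight_decay * w32
    # ... then stochastically round the combined result into the bf16 weight in one write
    p_bf16.copy_(stochastic_round_to_bf16(w32 - lr * step))

# toy usage
p = torch.randn(1000, dtype=torch.bfloat16)
apply_update_(p, update32=torch.randn(1000), lr=1e-3, weight_decay=0.1)
```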

Update: I've written a fused CUDA version of a 16+16 AdamW optimizer: https://github.com/imoneoi/bf16_fused_adam. With an extra 16-bit mantissa term, it is equivalent to fp32 master weights.
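To illustrate why the extra 16-bit mantissa term recovers full precision, here is a toy bit-level sketch in plain PyTorch (not the fused kernel itself; it assumes a little-endian platform):

```python
import torch

w32 = torch.randn(8, dtype=torch.float32)

halves = w32.view(torch.int16).view(-1, 2)      # per weight: [low 16 bits, high 16 bits]
mantissa_extra = halves[:, 0].contiguous()      # the extra 16-bit mantissa term
w_bf16 = halves[:, 1].contiguous().view(torch.bfloat16)  # the truncated-bf16 view of the weight

# stitching the two 16-bit halves back together reproduces the fp32 master weight exactly
recon = torch.stack([mantissa_extra, w_bf16.view(torch.int16)], dim=1).view(torch.float32).squeeze(1)
assert torch.equal(recon, w32)
```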

Hi @timoffex Here is the info about my wandb training code:
1. wandb version 0.18.1
2. 5-6 scalars per step every training iteration, same on each iteration, and 1-2 scalars...

BTW, what does `wandb.log` do? Is there any blocking operation? Is it related to disk read/write latency? I only observed periodic blocking in one environment, but it was fine in...
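In case it helps narrow this down, here is the kind of probe one could use to check whether the stalls coincide with `wandb.log` calls; the project name, logged values, and the 50 ms threshold are arbitrary placeholders:

```python
import time
import wandb

run = wandb.init(project="log-latency-probe")
for step in range(1000):
    t0 = time.perf_counter()
    wandb.log({"loss": 0.1, "lr": 1e-4}, step=step)
    dt = time.perf_counter() - t0
    if dt > 0.05:  # flag wandb.log calls that block for more than ~50 ms
        print(f"step {step}: wandb.log blocked for {dt * 1000:.1f} ms")
run.finish()
```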