Bitsandbytes issue
I'm using a slightly modified notebook (based on https://colab.research.google.com/drive/1mvwsIQWDs2EdZxZQF9pRGnnOvE86MVvR?usp=sharing) to finetune a Qwen2 model. Specifically, my installation instructions are:
#%%capture
!mamba install --force-reinstall aiohttp -y
!pip install -U "xformers<0.0.26" --index-url https://download.pytorch.org/whl/cu121
!pip install --upgrade "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"
# Temporary fix for https://github.com/huggingface/datasets/issues/6753
!pip3 install datasets==2.16.0 fsspec==2023.10.0 gcsfs==2023.10.0
!pip3 install -U wandb
When it comes to resuming my training, the trainer_stats = trainer.train(resume_from_checkpoint=True) cell runs into the following error:
==((====))== Unsloth - 2x faster free finetuning | Num GPUs = 1
\\ /| Num examples = 4,233,923 | Num Epochs = 3
O^O/ \_/ \ Batch size per device = 1 | Gradient Accumulation steps = 2
\ / Total batch size = 2 | Total steps = 6,350,883
"-____-" Number of trainable parameters = 40,370,176
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[24], line 1
----> 1 trainer_stats = trainer.train(resume_from_checkpoint=True)
File <string>:140, in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
File <string>:404, in _fast_inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
File /opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py:2387, in Accelerator.clip_grad_norm_(self, parameters, max_norm, norm_type)
2385 if parameters == [p for p in model.parameters()]:
2386 return model.clip_grad_norm_(max_norm, norm_type)
-> 2387 self.unscale_gradients()
2388 return torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)
File /opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py:2331, in Accelerator.unscale_gradients(self, optimizer)
2329 while isinstance(opt, AcceleratedOptimizer):
2330 opt = opt.optimizer
-> 2331 self.scaler.unscale_(opt)
File /opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:325, in unscale_(self, optimizer)
0 <Error retrieving source code with stack_data see ipython/ipython#13598>
RuntimeError: unscale_() has already been called on this optimizer since the last update().
I think something has gone wrong compatibility-wise. I've tried different versions of PyTorch, accelerate, transformers, and trl, but the issue persists.
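For reference, a quick way to dump the versions in play from a notebook cell (just a minimal sketch; the package list is my guess at what's relevant to this stack):

# Print the installed versions of the packages involved, to compare environments.
import importlib.metadata as md

for pkg in ("torch", "transformers", "accelerate", "trl",
            "bitsandbytes", "xformers", "unsloth", "datasets"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")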
Please advise
@StrangeTcy Did you set fp16 = True or bf16 = True in the trainer args?
PS: if these are Kaggle install instructions, there are updated ones here: https://www.kaggle.com/danielhanchen/kaggle-llama-3-2-1b-3b-unsloth-notebook
@danielhanchen
1. Yes, I did:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    # ...
    args = TrainingArguments(
        # ...
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        # ...
    ),
)
2. I'll try the updated install instructions & get back to you, thanks
ETA:
/opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2833: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint_rng_state = torch.load(rng_file)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[11], line 1
----> 1 trainer_stats = trainer.train(resume_from_checkpoint = True)
File <string>:140, in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
File <string>:425, in _fast_inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
File /opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py:159, in AcceleratedOptimizer.step(self, closure)
156 if self.scaler is not None:
157 self.optimizer.step = self._optimizer_patched_step_method
--> 159 self.scaler.step(self.optimizer, closure)
160 self.scaler.update()
162 if not self._accelerate_step_called:
163 # If the optimizer step was skipped, gradient overflow was detected.
File /opt/conda/lib/python3.10/site-packages/torch/amp/grad_scaler.py:454, in GradScaler.step(self, optimizer, *args, **kwargs)
448 self.unscale_(optimizer)
450 assert (
451 len(optimizer_state["found_inf_per_device"]) > 0
452 ), "No inf checks were recorded for this optimizer."
--> 454 retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
456 optimizer_state["stage"] = OptState.STEPPED
458 return retval
File /opt/conda/lib/python3.10/site-packages/torch/amp/grad_scaler.py:352, in GradScaler._maybe_opt_step(self, optimizer, optimizer_state, *args, **kwargs)
350 retval: Optional[float] = None
351 if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
--> 352 retval = optimizer.step(*args, **kwargs)
353 return retval
File /opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py:214, in patch_optimizer_step.<locals>.patched_step(*args, **kwargs)
212 def patched_step(*args, **kwargs):
213 accelerated_optimizer._accelerate_step_called = True
--> 214 return method(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:130, in LRScheduler.__init__.<locals>.patch_track_step_called.<locals>.wrap_step.<locals>.wrapper(*args, **kwargs)
128 opt = opt_ref()
129 opt._opt_called = True # type: ignore[union-attr]
--> 130 return func.__get__(opt, opt.__class__)(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py:484, in Optimizer.profile_hook_step.<locals>.wrapper(*args, **kwargs)
479 else:
480 raise RuntimeError(
481 f"{func} must return None or a tuple of (new_args, new_kwargs), but got {result}."
482 )
--> 484 out = func(*args, **kwargs)
485 self._optimizer_step_code()
487 # call optimizer step post hooks
File /opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py:291, in Optimizer8bit.step(self, closure)
288 self.init_state(group, p, gindex, pindex)
290 self.prefetch_state(p)
--> 291 self.update_step(group, p, gindex, pindex)
292 torch.cuda.synchronize()
293 if self.is_paged:
294 # all paged operation are asynchronous, we need
295 # to sync to make sure all tensors are in the right state
File /opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/bitsandbytes/optim/optimizer.py:569, in Optimizer2State.update_step(self, group, p, gindex, pindex)
567 state["max2"], state["new_max2"] = state["new_max2"], state["max2"]
568 elif state["state1"].dtype == torch.uint8 and config["block_wise"]:
--> 569 F.optimizer_update_8bit_blockwise(
570 self.optimizer_name,
571 grad,
572 p,
573 state["state1"],
574 state["state2"],
575 config["betas"][0],
576 config["betas"][1],
577 config["betas"][2] if len(config["betas"]) >= 3 else 0.0,
578 config["alpha"],
579 config["eps"],
580 step,
581 config["lr"],
582 state["qmap1"],
583 state["qmap2"],
584 state["absmax1"],
585 state["absmax2"],
586 config["weight_decay"],
587 gnorm_scale=gnorm_scale,
588 skip_zeros=config["skip_zeros"],
589 )
File /opt/conda/lib/python3.10/site-packages/bitsandbytes/functional.py:1843, in optimizer_update_8bit_blockwise(optimizer_name, g, p, state1, state2, beta1, beta2, beta3, alpha, eps, step, lr, qmap1, qmap2, absmax1, absmax2, weight_decay, gnorm_scale, skip_zeros)
1832 is_on_gpu([p, g, state1, state2, qmap1, qmap2, absmax1, absmax2])
1834 prev_device = pre_call(g.device)
1835 optim_func(
1836 get_ptr(p),
1837 get_ptr(g),
1838 get_ptr(state1),
1839 get_ptr(state2),
1840 ct.c_float(beta1),
1841 ct.c_float(beta2),
1842 ct.c_float(beta3),
-> 1843 ct.c_float(alpha),
1844 ct.c_float(eps),
1845 ct.c_int32(step),
1846 ct.c_float(lr),
1847 get_ptr(qmap1),
1848 get_ptr(qmap2),
1849 get_ptr(absmax1),
1850 get_ptr(absmax2),
1851 ct.c_float(weight_decay),
1852 ct.c_float(gnorm_scale),
1853 ct.c_bool(skip_zeros),
1854 ct.c_int32(g.numel()),
1855 )
1856 post_call(prev_device)
TypeError: must be real number, not NoneType
-- that's the error I'm getting now, with unsloth installed the new way. My guess is that something's wrong with the quantization, but it's hard to debug from within Kaggle.
@StrangeTcy Ok that looks like a bitsandbytes issue - will investigate
@StrangeTcy If you initially trained with bitsandbytes < 0.44 and then tried to resume training with 0.44+, this can happen. I would recommend trying again with bitsandbytes==0.43.3.
@danielhanchen I also saw this question come up on Discord. Will try to have that fixed in the next bitsandbytes patch release.
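For anyone hitting the same thing, pinning the version would look something like this (a sketch in the style of the install cell above; run it before resuming from the old checkpoint):

# Pin bitsandbytes to the pre-0.44 release recommended above before resuming training.
!pip install "bitsandbytes==0.43.3"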
@matthewdouglas interesting; I'll try that & report back, thanks
ETA: yes, apparently that works, thanks!
Thanks @matthewdouglas! :) Sorry about the issue @StrangeTcy