[fix] Add `parameters` arg to AdamWMini
Before submitting

- [x] Lint code. If there are lint issues, please format the code first.
  # Install and register `pre-commit` in the project folder
  pip install pre-commit && pre-commit install
  # Process previous code files separately
  pre-commit run --file XXXX.py
- [x] Add test cases into the `tests` folder. If there are codecov issues, please add test cases first.
PR types
Bug fixes
PR changes
APIs
Description
- Add a `parameters` argument to AdamWMini, for compatibility with the other optimizers in Paddle (a usage sketch follows below)
- Add test cases and include them in the llm tests
- Add a docstring for AdamWMini
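For reference, a minimal usage sketch of the change; the import path and constructor keywords below are assumptions based on the warning messages in the logs, not the PR's verbatim signature:

```python
import paddle
from paddlenlp.utils.optimizer import AdamWMini  # assumed import path

model = paddle.nn.Linear(16, 16)

# Preferred: pass named_parameters so Adam-mini can group blocks by name
# (embedding / output / Query-Key / Value / attention projection / MLP).
opt = AdamWMini(learning_rate=3e-5, named_parameters=model.named_parameters())

# With this PR, plain `parameters` is also accepted, matching the other Paddle
# optimizers; AdamWMini then warns that the block grouping may be incorrect.
opt = AdamWMini(learning_rate=3e-5, parameters=model.parameters())
```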
Local tests pass:
(venv39dev) ✘ shun@shun-B660M-Pro-RS ~/Documents/Projects/paddle/megemini/PaddleNLP/tests/utils/test_optimizers adam_mini_param_arg ✚ python -m unittest test_adamw_mini.py
/home/shun/venv39dev/lib/python3.9/site-packages/paddle/utils/cpp_extension/extension_utils.py:711: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
warnings.warn(warning_message)
W0625 18:21:21.107420 83151 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 12.2, Runtime API Version: 11.7
W0625 18:21:21.108110 83151 gpu_resources.cc:164] device: 0, cuDNN Version: 8.5.
W0625 18:21:21.767901 83151 gpu_resources.cc:306] WARNING: device: 0. The installed Paddle is compiled with CUDNN 8.9, but CUDNN version in your machine is 8.5, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
/home/shun/venv39dev/lib/python3.9/site-packages/paddle/base/dygraph/math_op_patch.py:183: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
return int(np.array(var))
[2025-06-25 18:21:22,070] [ INFO] -
Adam-mini found blocks:
[2025-06-25 18:21:22,071] [ INFO] - - 1 embedding layers
[2025-06-25 18:21:22,071] [ INFO] - - 1 output layers
[2025-06-25 18:21:22,071] [ INFO] - - 2 Query and Key layers
[2025-06-25 18:21:22,071] [ INFO] - - 1 Value layers
[2025-06-25 18:21:22,071] [ INFO] - - 1 Attention projection layers
[2025-06-25 18:21:22,071] [ INFO] - - 2 MLP layers
.[2025-06-25 18:21:22,333] [ WARNING] - Warning: `named_parameters` is None, AdamWMini will use `parameters` instead, which may be incorrect.
[2025-06-25 18:21:22,337] [ INFO] -
Adam-mini found blocks:
[2025-06-25 18:21:22,337] [ INFO] - - 1 embedding layers
[2025-06-25 18:21:22,337] [ INFO] - - 0 output layers
[2025-06-25 18:21:22,337] [ INFO] - - 0 Query and Key layers
[2025-06-25 18:21:22,337] [ INFO] - - 0 Value layers
[2025-06-25 18:21:22,337] [ INFO] - - 0 Attention projection layers
[2025-06-25 18:21:22,337] [ INFO] - - 14 MLP layers
[2025-06-25 18:21:22,337] [ WARNING] - Warning: No output layers found (ignore if using weight tying)
[2025-06-25 18:21:22,337] [ WARNING] - Warning: No Query/Key layers found
[2025-06-25 18:21:22,337] [ WARNING] - Warning: No Value layers found
[2025-06-25 18:21:22,337] [ WARNING] - Warning: No attention projection layers found
..
----------------------------------------------------------------------
Ran 3 tests in 1.658s
OK
Note: the test above requires modifying paddlenlp/trainer/trainer_utils.py to remove `from ..transformers import get_gpt_pp_schedule, get_llama_pp_schedule`; otherwise the following import error is raised:
E ImportError: cannot import name 'get_gpt_pp_schedule' from 'paddlenlp.transformers' (/home/shun/Documents/Projects/paddle/megemini/PaddleNLP/paddlenlp/transformers/__init__.py)
The change to the imports made in PR https://github.com/PaddlePaddle/PaddleNLP/pull/10759,
try:
    from .modeling_auto_pp import *
except (ImportError, ModuleNotFoundError):
    # Temporarily adapt to the release version of Paddle, which can be removed later.
    pass
causes this import error, at least in my environment ...
Related: https://github.com/PaddlePaddle/PaddleNLP/pull/10413
@DrownFish19 could you take a look? Thanks ~
Thanks for your contribution!
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 46.77%. Comparing base (06378bb) to head (1651c18).
:warning: Report is 83 commits behind head on develop.
Additional details and impacted files
@@            Coverage Diff            @@
##           develop   #10774    +/-   ##
=========================================
  Coverage    46.77%   46.77%
=========================================
  Files          802      802
  Lines       133646   133651      +5
=========================================
+ Hits         62508    62511      +3
- Misses       71138    71140      +2
Tested the llama test case on AI Studio and it failed. Current analysis:
- AdamWMini now initializes correctly; it no longer raises the `parameters` argument error.
- I reverted AdamWMini to the earliest version without block partitioning and it still raises the error below, so it is not yet clear what causes it.
- From the current run, the backward pass is where it breaks; that needs to be located first, and only then can we check whether the subsequent parameter updates are correct (AdamWMini normally works from `named_parameters`, so when only `parameters` is given I am not sure whether the block partitioning will later run into shape mismatches; see the sketch after this list ~).
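A simplified sketch of that concern (illustrative only, not the actual AdamWMini code): Adam-mini derives its blocks from parameter names, so anonymous `parameters` lose exactly the information the partition relies on, which matches the degraded block counts in the earlier log (only the embedding is still recognized; the rest falls into the default bucket).

```python
def group_blocks(named_params):
    """Illustrative name-based grouping in the spirit of Adam-mini (not the real implementation)."""
    blocks = {"embedding": [], "qk": [], "v": [], "attn_proj": [], "output": [], "mlp_or_default": []}
    for name, param in named_params:
        if "embed" in name:
            blocks["embedding"].append(param)
        elif "q_proj" in name or "k_proj" in name:
            blocks["qk"].append(param)
        elif "v_proj" in name:
            blocks["v"].append(param)
        elif "o_proj" in name:
            blocks["attn_proj"].append(param)
        elif "lm_head" in name:
            blocks["output"].append(param)
        else:
            blocks["mlp_or_default"].append(param)
    return blocks

# When only `parameters` is available there are no names to match against, so a
# fallback like the one below collapses everything into the default bucket, and the
# per-block moment bookkeeping may no longer line up with the intended partition.
def group_blocks_from_parameters(params):
    return {"mlp_or_default": list(params)}
```

The failing run on AI Studio: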
aistudio@jupyter-942478-8790893:~/PaddleNLP$ python -m pytest tests/llm/test_adamw_mini.py::FinetuneTest_0_llama::test_finetune
F [100%]
====================================================================== FAILURES =======================================================================
_________________________________________________________ FinetuneTest_0_llama.test_finetune __________________________________________________________
self = <tests.llm.test_adamw_mini.FinetuneTest_0_llama testMethod=test_finetune>
def test_finetune(self):
finetune_config = load_test_config(self.config_path, "finetune", self.model_dir)
finetune_config["dataset_name_or_path"] = self.data_dir
finetune_config["output_dir"] = self.output_dir
with argv_context_guard(finetune_config):
from run_finetune import main
> main()
tests/llm/test_adamw_mini.py:53:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
llm/run_finetune.py:478: in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
paddlenlp/trainer/trainer.py:991: in train
return self._inner_training_loop(
paddlenlp/trainer/trainer.py:1240: in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, step_control=step_control)
paddlenlp/trainer/trainer.py:2538: in training_step
self.scaler.scale(loss).backward()
../external-libraries/lib/python3.10/site-packages/decorator.py:235: in fun
return caller(func, *(extras + args), **kw)
../external-libraries/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py:40: in __impl__
return wrapped_func(*args, **kwargs)
../external-libraries/lib/python3.10/site-packages/paddle/base/framework.py:722: in __impl__
return func(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=False,
[85196.54687500]), grad_tensor = [], retain_graph = False
@framework.dygraph_only
def backward(
self: Tensor,
grad_tensor: Tensor | None = None,
retain_graph: bool = False,
) -> None:
"""
Run backward of current Graph which starts from current Tensor.
The new gradient will accumulate on previous gradient.
You can clear gradient by ``Tensor.clear_grad()`` .
Args:
grad_tensor(Tensor|None, optional): initial gradient values of the current Tensor. If `grad_tensor` is None,
the initial gradient values of the current Tensor would be Tensor filled with 1.0;
if `grad_tensor` is not None, it must have the same length as the current Tensor.
The default value is None.
retain_graph(bool, optional): If False, the graph used to compute grads will be freed. If you would
like to add more ops to the built graph after calling this method( :code:`backward` ), set the parameter
:code:`retain_graph` to True, then the grads will be retained. Thus, setting it to False is much more memory-efficient.
Defaults to False.
Returns:
None
Examples:
.. code-block:: python
>>> import paddle
>>> x = paddle.to_tensor(5., stop_gradient=False)
>>> for i in range(5):
... y = paddle.pow(x, 4.0)
... y.backward()
... print("{}: {}".format(i, x.grad))
0: Tensor(shape=[], dtype=float32, place=Place(cpu), stop_gradient=False,
500.)
1: Tensor(shape=[], dtype=float32, place=Place(cpu), stop_gradient=False,
1000.)
2: Tensor(shape=[], dtype=float32, place=Place(cpu), stop_gradient=False,
1500.)
3: Tensor(shape=[], dtype=float32, place=Place(cpu), stop_gradient=False,
2000.)
4: Tensor(shape=[], dtype=float32, place=Place(cpu), stop_gradient=False,
2500.)
>>> x.clear_grad()
>>> print("{}".format(x.grad))
Tensor(shape=[], dtype=float32, place=Place(cpu), stop_gradient=False,
0.)
>>> grad_tensor=paddle.to_tensor(2.)
>>> for i in range(5):
... y = paddle.pow(x, 4.0)
... y.backward(grad_tensor)
... print("{}: {}".format(i, x.grad))
0: Tensor(shape=[], dtype=float32, place=Place(cpu), stop_gradient=False,
1000.)
1: Tensor(shape=[], dtype=float32, place=Place(cpu), stop_gradient=False,
2000.)
2: Tensor(shape=[], dtype=float32, place=Place(cpu), stop_gradient=False,
3000.)
3: Tensor(shape=[], dtype=float32, place=Place(cpu), stop_gradient=False,
4000.)
4: Tensor(shape=[], dtype=float32, place=Place(cpu), stop_gradient=False,
5000.)
"""
if framework.in_dygraph_mode():
if in_profiler_mode():
record_event = profiler.RecordEvent(
"Gradient Backward", profiler.TracerEventType.Backward
)
record_event.begin()
if grad_tensor is not None:
assert isinstance(
grad_tensor, core.eager.Tensor
), "The type of grad_tensor must be paddle.Tensor"
assert (
grad_tensor.shape == self.shape
), f"Tensor shape not match, Tensor of grad_tensor [ {grad_tensor.name} ] with shape {grad_tensor.shape} mismatch Tensor [ {self.name} ] with shape {self.shape}"
if grad_tensor is None:
grad_tensor = []
else:
grad_tensor = [grad_tensor]
if _grad_scalar:
# When using amp with Fleet DistributedStrategy, we do loss scaling implicitly.
self = _grad_scalar.scale(self)
> core.eager.run_backward([self], grad_tensor, retain_graph)
E OSError: (External) OSError: (External) Exception: Not supported to retrieve a tensor saved by autograd multiple times that is no need to recompute.Please check your `keys_ignore_to_save`.
E
E At:
E /home/aistudio/PaddleNLP/paddlenlp/transformers/refined_recompute.py(369): inner_pack
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/nn/functional/common.py(2310): linear
E /home/aistudio/PaddleNLP/paddlenlp/transformers/deepseek_v2/fp8_linear.py(75): fp8_linear
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/nn/layer/common.py(223): forward
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/nn/layer/layers.py(1571): __call__
E /home/aistudio/PaddleNLP/paddlenlp/transformers/llama/modeling.py(687): forward
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/nn/layer/layers.py(1571): __call__
E /home/aistudio/PaddleNLP/paddlenlp/transformers/llama/modeling.py(1278): forward
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/nn/layer/layers.py(1571): __call__
E /home/aistudio/PaddleNLP/paddlenlp/transformers/llama/modeling.py(1639): custom_forward
E /home/aistudio/PaddleNLP/paddlenlp/transformers/refined_recompute.py(404): unpack
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/base/dygraph/tensor_patch_methods.py(371): backward
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/base/framework.py(722): __impl__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py(40): __impl__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/decorator.py(235): fun
E /home/aistudio/PaddleNLP/paddlenlp/trainer/trainer.py(2538): training_step
E /home/aistudio/PaddleNLP/paddlenlp/trainer/trainer.py(1240): _inner_training_loop
E /home/aistudio/PaddleNLP/paddlenlp/trainer/trainer.py(991): train
E /home/aistudio/PaddleNLP/./llm/run_finetune.py(478): main
E /home/aistudio/PaddleNLP/tests/llm/test_adamw_mini.py(53): test_finetune
E /opt/conda/envs/python35-paddle120-env/lib/python3.10/unittest/case.py(549): _callTestMethod
E /opt/conda/envs/python35-paddle120-env/lib/python3.10/unittest/case.py(591): run
E /opt/conda/envs/python35-paddle120-env/lib/python3.10/unittest/case.py(650): __call__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/unittest.py(351): runtest
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/runner.py(178): pytest_runtest_call
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_callers.py(121): _multicall
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_manager.py(120): _hookexec
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_hooks.py(512): __call__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/runner.py(246): <lambda>
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/runner.py(344): from_call
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/runner.py(245): call_and_report
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/runner.py(136): runtestprotocol
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/runner.py(117): pytest_runtest_protocol
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_callers.py(121): _multicall
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_manager.py(120): _hookexec
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_hooks.py(512): __call__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/main.py(367): pytest_runtestloop
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_callers.py(121): _multicall
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_manager.py(120): _hookexec
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_hooks.py(512): __call__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/main.py(343): _main
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/main.py(289): wrap_session
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/main.py(336): pytest_cmdline_main
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_callers.py(121): _multicall
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_manager.py(120): _hookexec
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_hooks.py(512): __call__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/config/__init__.py(175): main
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/config/__init__.py(201): console_main
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pytest/__main__.py(9): <module>
E /opt/conda/envs/python35-paddle120-env/lib/python3.10/runpy.py(86): _run_code
E /opt/conda/envs/python35-paddle120-env/lib/python3.10/runpy.py(196): _run_module_as_main
E
E [Hint: ret should not be null.] (at ../paddle/fluid/pybind/eager_utils.cc:2625)
E [operator < linear > error]
E
E At:
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/nn/functional/common.py(2310): linear
E /home/aistudio/PaddleNLP/paddlenlp/transformers/deepseek_v2/fp8_linear.py(75): fp8_linear
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/nn/layer/common.py(223): forward
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/nn/layer/layers.py(1571): __call__
E /home/aistudio/PaddleNLP/paddlenlp/transformers/llama/modeling.py(687): forward
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/nn/layer/layers.py(1571): __call__
E /home/aistudio/PaddleNLP/paddlenlp/transformers/llama/modeling.py(1278): forward
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/nn/layer/layers.py(1571): __call__
E /home/aistudio/PaddleNLP/paddlenlp/transformers/llama/modeling.py(1639): custom_forward
E /home/aistudio/PaddleNLP/paddlenlp/transformers/refined_recompute.py(404): unpack
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/base/dygraph/tensor_patch_methods.py(371): backward
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/base/framework.py(722): __impl__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/paddle/base/wrapped_decorator.py(40): __impl__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/decorator.py(235): fun
E /home/aistudio/PaddleNLP/paddlenlp/trainer/trainer.py(2538): training_step
E /home/aistudio/PaddleNLP/paddlenlp/trainer/trainer.py(1240): _inner_training_loop
E /home/aistudio/PaddleNLP/paddlenlp/trainer/trainer.py(991): train
E /home/aistudio/PaddleNLP/./llm/run_finetune.py(478): main
E /home/aistudio/PaddleNLP/tests/llm/test_adamw_mini.py(53): test_finetune
E /opt/conda/envs/python35-paddle120-env/lib/python3.10/unittest/case.py(549): _callTestMethod
E /opt/conda/envs/python35-paddle120-env/lib/python3.10/unittest/case.py(591): run
E /opt/conda/envs/python35-paddle120-env/lib/python3.10/unittest/case.py(650): __call__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/unittest.py(351): runtest
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/runner.py(178): pytest_runtest_call
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_callers.py(121): _multicall
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_manager.py(120): _hookexec
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_hooks.py(512): __call__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/runner.py(246): <lambda>
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/runner.py(344): from_call
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/runner.py(245): call_and_report
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/runner.py(136): runtestprotocol
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/runner.py(117): pytest_runtest_protocol
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_callers.py(121): _multicall
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_manager.py(120): _hookexec
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_hooks.py(512): __call__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/main.py(367): pytest_runtestloop
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_callers.py(121): _multicall
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_manager.py(120): _hookexec
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_hooks.py(512): __call__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/main.py(343): _main
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/main.py(289): wrap_session
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/main.py(336): pytest_cmdline_main
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_callers.py(121): _multicall
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_manager.py(120): _hookexec
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pluggy/_hooks.py(512): __call__
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/config/__init__.py(175): main
E /home/aistudio/external-libraries/lib/python3.10/site-packages/_pytest/config/__init__.py(201): console_main
E /home/aistudio/external-libraries/lib/python3.10/site-packages/pytest/__main__.py(9): <module>
E /opt/conda/envs/python35-paddle120-env/lib/python3.10/runpy.py(86): _run_code
E /opt/conda/envs/python35-paddle120-env/lib/python3.10/runpy.py(196): _run_module_as_main
E
E [Hint: ret should not be null.] (at ../paddle/fluid/pybind/eager_utils.cc:2672)
../external-libraries/lib/python3.10/site-packages/paddle/base/dygraph/tensor_patch_methods.py:371: OSError
---------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------
[2025-06-25 21:17:40,735] [ INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2025-06-25 21:17:40,735] [ DEBUG] - ============================================================
[2025-06-25 21:17:40,735] [ DEBUG] - Model Configuration Arguments
[2025-06-25 21:17:40,736] [ DEBUG] - paddle commit id : 129b5cec5427ca9d634f490f28263ad274aacdf8
[2025-06-25 21:17:40,736] [ DEBUG] - paddlenlp commit id : 06378bb4e591363e14f44e2e8b0b90f39ee5527e.dirty
[2025-06-25 21:17:40,736] [ DEBUG] - actscale_moving_rate : 0.01
[2025-06-25 21:17:40,736] [ DEBUG] - aistudio_repo_id : None
[2025-06-25 21:17:40,736] [ DEBUG] - aistudio_repo_license : Apache License 2.0
[2025-06-25 21:17:40,736] [ DEBUG] - aistudio_repo_private : True
[2025-06-25 21:17:40,736] [ DEBUG] - aistudio_token : None
[2025-06-25 21:17:40,736] [ DEBUG] - apply_hadamard : False
[2025-06-25 21:17:40,736] [ DEBUG] - apply_online_actscale_step : 200
[2025-06-25 21:17:40,736] [ DEBUG] - attention_probs_dropout_prob : 0.1
[2025-06-25 21:17:40,736] [ DEBUG] - continue_training : True
[2025-06-25 21:17:40,736] [ DEBUG] - flash_mask : False
[2025-06-25 21:17:40,736] [ DEBUG] - fp8_format_type : hybrid
[2025-06-25 21:17:40,736] [ DEBUG] - from_aistudio : False
[2025-06-25 21:17:40,736] [ DEBUG] - fuse_attention_ffn : None
[2025-06-25 21:17:40,736] [ DEBUG] - fuse_attention_qkv : None
[2025-06-25 21:17:40,736] [ DEBUG] - hadamard_block_size : 32
[2025-06-25 21:17:40,737] [ DEBUG] - hidden_dropout_prob : 0.1
[2025-06-25 21:17:40,737] [ DEBUG] - lokr : False
[2025-06-25 21:17:40,737] [ DEBUG] - lokr_dim : 8
[2025-06-25 21:17:40,737] [ DEBUG] - lokr_path : None
[2025-06-25 21:17:40,737] [ DEBUG] - lora : False
[2025-06-25 21:17:40,737] [ DEBUG] - lora_path : None
[2025-06-25 21:17:40,737] [ DEBUG] - lora_plus_scale : 1.0
[2025-06-25 21:17:40,737] [ DEBUG] - lora_rank : 8
[2025-06-25 21:17:40,737] [ DEBUG] - lora_use_mixer : False
[2025-06-25 21:17:40,737] [ DEBUG] - lorapro : False
[2025-06-25 21:17:40,737] [ DEBUG] - lorapro_scaling_factor : 2.0
[2025-06-25 21:17:40,737] [ DEBUG] - lorapro_x_mode : zero
[2025-06-25 21:17:40,737] [ DEBUG] - model_name_or_path : __internal_testing__/tiny-random-llama
[2025-06-25 21:17:40,737] [ DEBUG] - neftune : False
[2025-06-25 21:17:40,737] [ DEBUG] - neftune_noise_alpha : 5.0
[2025-06-25 21:17:40,737] [ DEBUG] - num_prefix_tokens : 128
[2025-06-25 21:17:40,737] [ DEBUG] - pissa : False
[2025-06-25 21:17:40,737] [ DEBUG] - prefix_path : None
[2025-06-25 21:17:40,737] [ DEBUG] - prefix_tuning : False
[2025-06-25 21:17:40,738] [ DEBUG] - qlora_weight_blocksize : 64
[2025-06-25 21:17:40,738] [ DEBUG] - qlora_weight_double_quant : False
[2025-06-25 21:17:40,738] [ DEBUG] - qlora_weight_double_quant_block_size: 256
[2025-06-25 21:17:40,738] [ DEBUG] - quant_input_grad : False
[2025-06-25 21:17:40,738] [ DEBUG] - quant_weight_grad : False
[2025-06-25 21:17:40,738] [ DEBUG] - reft : False
[2025-06-25 21:17:40,738] [ DEBUG] - rope_scaling_factor : 1.0
[2025-06-25 21:17:40,738] [ DEBUG] - rslora : False
[2025-06-25 21:17:40,738] [ DEBUG] - save_to_aistudio : False
[2025-06-25 21:17:40,738] [ DEBUG] - strategy_name : None
[2025-06-25 21:17:40,738] [ DEBUG] - strategy_type : None
[2025-06-25 21:17:40,738] [ DEBUG] - tokenizer_name_or_path : None
[2025-06-25 21:17:40,738] [ DEBUG] - use_fast_layer_norm : False
[2025-06-25 21:17:40,738] [ DEBUG] - use_long_sequence_strategies : False
[2025-06-25 21:17:40,738] [ DEBUG] - use_mora : False
[2025-06-25 21:17:40,738] [ DEBUG] - use_quick_lora : False
[2025-06-25 21:17:40,738] [ DEBUG] - vera : False
[2025-06-25 21:17:40,738] [ DEBUG] - vera_rank : 8
[2025-06-25 21:17:40,739] [ DEBUG] - weight_quantize_algo : None
[2025-06-25 21:17:40,739] [ DEBUG] -
[2025-06-25 21:17:40,739] [ DEBUG] - ============================================================
[2025-06-25 21:17:40,739] [ DEBUG] - Data Configuration Arguments
[2025-06-25 21:17:40,739] [ DEBUG] - paddle commit id : 129b5cec5427ca9d634f490f28263ad274aacdf8
[2025-06-25 21:17:40,739] [ DEBUG] - paddlenlp commit id : 06378bb4e591363e14f44e2e8b0b90f39ee5527e.dirty
[2025-06-25 21:17:40,739] [ DEBUG] - autoregressive : False
[2025-06-25 21:17:40,739] [ DEBUG] - chat_template : None
[2025-06-25 21:17:40,739] [ DEBUG] - dataset_name_or_path : ./tests/fixtures/llm/data/
[2025-06-25 21:17:40,739] [ DEBUG] - eval_with_do_generation : False
[2025-06-25 21:17:40,739] [ DEBUG] - greedy_zero_padding : False
[2025-06-25 21:17:40,739] [ DEBUG] - lazy : False
[2025-06-25 21:17:40,739] [ DEBUG] - max_length : 2048
[2025-06-25 21:17:40,739] [ DEBUG] - pad_to_max_length : False
[2025-06-25 21:17:40,739] [ DEBUG] - pad_to_multiple_of : None
[2025-06-25 21:17:40,739] [ DEBUG] - save_generation_output : False
[2025-06-25 21:17:40,740] [ DEBUG] - src_length : 1024
[2025-06-25 21:17:40,740] [ DEBUG] - task_name : None
[2025-06-25 21:17:40,740] [ DEBUG] - use_pose_convert : False
[2025-06-25 21:17:40,740] [ DEBUG] - zero_padding : False
[2025-06-25 21:17:40,740] [ DEBUG] -
[2025-06-25 21:17:40,740] [ DEBUG] - ============================================================
[2025-06-25 21:17:40,740] [ DEBUG] - Generation Configuration Arguments
[2025-06-25 21:17:40,740] [ DEBUG] - paddle commit id : 129b5cec5427ca9d634f490f28263ad274aacdf8
[2025-06-25 21:17:40,740] [ DEBUG] - paddlenlp commit id : 06378bb4e591363e14f44e2e8b0b90f39ee5527e.dirty
[2025-06-25 21:17:40,740] [ DEBUG] - top_k : 1
[2025-06-25 21:17:40,740] [ DEBUG] - top_p : 1.0
[2025-06-25 21:17:40,740] [ DEBUG] -
[2025-06-25 21:17:40,741] [ INFO] - The global seed is set to 42, local seed is set to 43 and random seed is set to 42.
[2025-06-25 21:17:40,741] [ WARNING] - Process rank: -1, device: gpu, world_size: 1, distributed training: False, 16-bits training: True
[2025-06-25 21:17:40,743] [ INFO] - Final model config: LlamaConfig {
"alibi": false,
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"context_parallel_degree": -1,
"dpo_config": null,
"dtype": "float16",
"eos_token_id": 2,
"hidden_size": 768,
"immediate_clear_past_key_value": false,
"initializer_range": 0.02,
"intermediate_size": 11008,
"long_sequence_init_args": {},
"long_sequence_strategy_name": null,
"long_sequence_strategy_type": null,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 8,
"num_hidden_layers": 2,
"num_key_value_heads": 8,
"pad_token_id": 0,
"paddlenlp_version": "3.0.0b4.post20250625",
"pipeline_parallel_degree": -1,
"recompute": true,
"refined_recompute": {
"attention_column_ln": 0,
"attention_row_ln": 0,
"flash_attn": -1,
"mlp_column_ln": 0,
"mlp_row_ln": 0
},
"rms_norm_eps": 1e-06,
"rope_scaling_factor": 1.0,
"rope_scaling_type": null,
"rope_theta": 10000.0,
"sep_parallel_degree": -1,
"seq_length": 2048,
"tensor_parallel_degree": -1,
"tensor_parallel_output": false,
"tie_word_embeddings": false,
"use_fast_layer_norm": false,
"use_flash_attention": true,
"use_flash_attention_for_generation": false,
"use_last_token_for_generation": false,
"use_long_sequence_strategies": false,
"vocab_size": 32000
}
[2025-06-25 21:17:40,743] [ INFO] - Creating model
[2025-06-25 21:17:40,743] [ INFO] - We are using <class 'paddlenlp.transformers.llama.modeling.LlamaForCausalLM'> to load '__internal_testing__/tiny-random-llama'.
[2025-06-25 21:17:40,744] [ INFO] - Loading weights file from cache at /home/aistudio/.paddlenlp/models/__internal_testing__/tiny-random-llama/model_state.pdparams
[2025-06-25 21:17:41,072] [ INFO] - Loaded weights file from disk, setting weights to model.
W0625 21:17:41.083987 83454 gpu_resources.cc:114] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.0, Runtime API Version: 11.8
[2025-06-25 21:17:45,086] [ INFO] - All model checkpoint weights were used when initializing LlamaForCausalLM.
[2025-06-25 21:17:45,087] [ INFO] - All the weights of LlamaForCausalLM were initialized from the model checkpoint at __internal_testing__/tiny-random-llama.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[2025-06-25 21:17:45,125] [ INFO] - Generation config file not found, using a generation config created from the model config.
[2025-06-25 21:17:45,139] [ INFO] - We are using <class 'paddlenlp.transformers.llama.tokenizer.LlamaTokenizer'> to load '__internal_testing__/tiny-random-llama'.
[2025-06-25 21:17:45,154] [ INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/__internal_testing__/tiny-random-llama/tokenizer_config.json
[2025-06-25 21:17:45,155] [ INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/__internal_testing__/tiny-random-llama/special_tokens_map.json
[2025-06-25 21:17:45,155] [ INFO] - load train
[2025-06-25 21:17:45,201] [ INFO] - load eval
[2025-06-25 21:17:45,208] [ INFO] - load test
[2025-06-25 21:17:45,208] [ INFO] - Trans the dataset text into token ids, please wait for a moment.
[2025-06-25 21:17:45,209] [ INFO] - The global seed is set to 42, local seed is set to 43 and random seed is set to 42.
[2025-06-25 21:17:45,270] [ INFO] - Using half precision
[2025-06-25 21:17:45,293] [ DEBUG] - ============================================================
[2025-06-25 21:17:45,293] [ DEBUG] - Training Configuration Arguments
[2025-06-25 21:17:45,294] [ DEBUG] - paddle commit id : 129b5cec5427ca9d634f490f28263ad274aacdf8
[2025-06-25 21:17:45,294] [ DEBUG] - paddlenlp commit id : 06378bb4e591363e14f44e2e8b0b90f39ee5527e.dirty
[2025-06-25 21:17:45,294] [ DEBUG] - _no_sync_in_gradient_accumulation: True
[2025-06-25 21:17:45,294] [ DEBUG] - adam_beta1 : 0.9
[2025-06-25 21:17:45,294] [ DEBUG] - adam_beta2 : 0.999
[2025-06-25 21:17:45,294] [ DEBUG] - adam_epsilon : 1e-08
[2025-06-25 21:17:45,294] [ DEBUG] - amp_custom_black_list : None
[2025-06-25 21:17:45,294] [ DEBUG] - amp_custom_white_list : None
[2025-06-25 21:17:45,294] [ DEBUG] - amp_master_grad : False
[2025-06-25 21:17:45,294] [ DEBUG] - auto_parallel_resume_form_hybrid_parallel: False
[2025-06-25 21:17:45,294] [ DEBUG] - autotuner_benchmark : False
[2025-06-25 21:17:45,295] [ DEBUG] - benchmark : False
[2025-06-25 21:17:45,295] [ DEBUG] - bf16 : False
[2025-06-25 21:17:45,295] [ DEBUG] - bf16_full_eval : False
[2025-06-25 21:17:45,295] [ DEBUG] - ckpt_quant_stage : O0
[2025-06-25 21:17:45,295] [ DEBUG] - context_parallel_degree : -1
[2025-06-25 21:17:45,295] [ DEBUG] - count_trained_tokens : False
[2025-06-25 21:17:45,295] [ DEBUG] - current_device : gpu:0
[2025-06-25 21:17:45,295] [ DEBUG] - data_parallel_config :
[2025-06-25 21:17:45,295] [ DEBUG] - data_parallel_degree : 1
[2025-06-25 21:17:45,295] [ DEBUG] - data_parallel_rank : 0
[2025-06-25 21:17:45,295] [ DEBUG] - dataloader_drop_last : False
[2025-06-25 21:17:45,295] [ DEBUG] - dataloader_num_workers : 0
[2025-06-25 21:17:45,295] [ DEBUG] - dataloader_shuffle : True
[2025-06-25 21:17:45,295] [ DEBUG] - dataset_batch_size : 1000
[2025-06-25 21:17:45,295] [ DEBUG] - dataset_kwargs : {}
[2025-06-25 21:17:45,296] [ DEBUG] - dataset_num_proc : None
[2025-06-25 21:17:45,296] [ DEBUG] - dataset_rank : 0
[2025-06-25 21:17:45,296] [ DEBUG] - dataset_text_field : text
[2025-06-25 21:17:45,296] [ DEBUG] - dataset_world_size : 1
[2025-06-25 21:17:45,296] [ DEBUG] - ddp_find_unused_parameters : None
[2025-06-25 21:17:45,296] [ DEBUG] - decay_steps : 0
[2025-06-25 21:17:45,296] [ DEBUG] - device : gpu
[2025-06-25 21:17:45,296] [ DEBUG] - disable_tqdm : True
[2025-06-25 21:17:45,296] [ DEBUG] - distributed_dataloader : False
[2025-06-25 21:17:45,296] [ DEBUG] - do_eval : True
[2025-06-25 21:17:45,296] [ DEBUG] - do_export : False
[2025-06-25 21:17:45,296] [ DEBUG] - do_predict : False
[2025-06-25 21:17:45,296] [ DEBUG] - do_train : True
[2025-06-25 21:17:45,296] [ DEBUG] - enable_auto_parallel : False
[2025-06-25 21:17:45,297] [ DEBUG] - enable_zero_cost_checkpoint : False
[2025-06-25 21:17:45,297] [ DEBUG] - eval_accumulation_steps : 16
[2025-06-25 21:17:45,297] [ DEBUG] - eval_batch_size : 8
[2025-06-25 21:17:45,297] [ DEBUG] - eval_packing : None
[2025-06-25 21:17:45,297] [ DEBUG] - eval_steps : None
[2025-06-25 21:17:45,297] [ DEBUG] - evaluation_strategy : IntervalStrategy.EPOCH
[2025-06-25 21:17:45,297] [ DEBUG] - expert_max_capacity : 4294967296
[2025-06-25 21:17:45,297] [ DEBUG] - expert_min_capacity : 1
[2025-06-25 21:17:45,297] [ DEBUG] - expert_parallel_degree : -1
[2025-06-25 21:17:45,297] [ DEBUG] - expert_tensor_parallel_degree : -1
[2025-06-25 21:17:45,297] [ DEBUG] - flash_device_save_steps : 0
[2025-06-25 21:17:45,297] [ DEBUG] - flatten_param_grads : False
[2025-06-25 21:17:45,297] [ DEBUG] - force_reshard_pp : False
[2025-06-25 21:17:45,298] [ DEBUG] - fp16 : True
[2025-06-25 21:17:45,298] [ DEBUG] - fp16_full_eval : False
[2025-06-25 21:17:45,298] [ DEBUG] - fp16_opt_level : O2
[2025-06-25 21:17:45,298] [ DEBUG] - fuse_sequence_parallel_allreduce: False
[2025-06-25 21:17:45,298] [ DEBUG] - gradient_accumulation_steps : 4
[2025-06-25 21:17:45,298] [ DEBUG] - greater_is_better : True
[2025-06-25 21:17:45,298] [ DEBUG] - hybrid_parallel_topo_order : pp_first
[2025-06-25 21:17:45,298] [ DEBUG] - ignore_data_skip : False
[2025-06-25 21:17:45,298] [ DEBUG] - ignore_load_lr_and_optim : False
[2025-06-25 21:17:45,298] [ DEBUG] - ignore_save_lr_and_optim : True
[2025-06-25 21:17:45,298] [ DEBUG] - label_names : None
[2025-06-25 21:17:45,298] [ DEBUG] - lazy_data_processing : True
[2025-06-25 21:17:45,298] [ DEBUG] - learning_rate : 3e-05
[2025-06-25 21:17:45,298] [ DEBUG] - load_best_model_at_end : True
[2025-06-25 21:17:45,298] [ DEBUG] - load_sharded_model : False
[2025-06-25 21:17:45,298] [ DEBUG] - local_process_index : 0
[2025-06-25 21:17:45,298] [ DEBUG] - local_rank : -1
[2025-06-25 21:17:45,299] [ DEBUG] - log_level : -1
[2025-06-25 21:17:45,299] [ DEBUG] - log_level_replica : -1
[2025-06-25 21:17:45,299] [ DEBUG] - log_on_each_node : True
[2025-06-25 21:17:45,299] [ DEBUG] - logging_dir : /tmp/tmp2wz24yqo/runs/Jun25_21-17-40_jupyter-942478-8790893
[2025-06-25 21:17:45,299] [ DEBUG] - logging_first_step : False
[2025-06-25 21:17:45,299] [ DEBUG] - logging_steps : 1
[2025-06-25 21:17:45,299] [ DEBUG] - logging_strategy : IntervalStrategy.STEPS
[2025-06-25 21:17:45,299] [ DEBUG] - logical_process_index : 0
[2025-06-25 21:17:45,299] [ DEBUG] - lr_end : 1e-07
[2025-06-25 21:17:45,299] [ DEBUG] - lr_scheduler_type : SchedulerType.LINEAR
[2025-06-25 21:17:45,299] [ DEBUG] - max_evaluate_steps : -1
[2025-06-25 21:17:45,299] [ DEBUG] - max_grad_norm : 1.0
[2025-06-25 21:17:45,299] [ DEBUG] - max_seq_length : 2048
[2025-06-25 21:17:45,299] [ DEBUG] - max_steps : -1
[2025-06-25 21:17:45,299] [ DEBUG] - metric_for_best_model : accuracy
[2025-06-25 21:17:45,300] [ DEBUG] - metrics_output_path : None
[2025-06-25 21:17:45,300] [ DEBUG] - min_lr : 0.0
[2025-06-25 21:17:45,300] [ DEBUG] - minimum_eval_times : None
[2025-06-25 21:17:45,300] [ DEBUG] - model_init_kwargs : None
[2025-06-25 21:17:45,300] [ DEBUG] - no_cuda : False
[2025-06-25 21:17:45,300] [ DEBUG] - no_recompute_layers : None
[2025-06-25 21:17:45,300] [ DEBUG] - num_cycles : 0.5
[2025-06-25 21:17:45,300] [ DEBUG] - num_train_epochs : 3.0
[2025-06-25 21:17:45,300] [ DEBUG] - offload_optim : False
[2025-06-25 21:17:45,300] [ DEBUG] - offload_recompute_inputs : False
[2025-06-25 21:17:45,300] [ DEBUG] - optim : OptimizerNames.ADAMW_MINI
[2025-06-25 21:17:45,300] [ DEBUG] - optimizer_name_suffix : None
[2025-06-25 21:17:45,300] [ DEBUG] - ordered_save_group_size : 0
[2025-06-25 21:17:45,300] [ DEBUG] - output_dir : /tmp/tmp2wz24yqo
[2025-06-25 21:17:45,301] [ DEBUG] - output_signal_dir : /tmp/tmp2wz24yqo
[2025-06-25 21:17:45,301] [ DEBUG] - overwrite_output_dir : False
[2025-06-25 21:17:45,301] [ DEBUG] - pad_token_id : 0
[2025-06-25 21:17:45,301] [ DEBUG] - past_index : -1
[2025-06-25 21:17:45,301] [ DEBUG] - pdc_download_ckpt : False
[2025-06-25 21:17:45,301] [ DEBUG] - pdc_download_timeout : 300
[2025-06-25 21:17:45,301] [ DEBUG] - per_device_eval_batch_size : 8
[2025-06-25 21:17:45,301] [ DEBUG] - per_device_train_batch_size : 4
[2025-06-25 21:17:45,301] [ DEBUG] - pipeline_parallel_config :
[2025-06-25 21:17:45,301] [ DEBUG] - pipeline_parallel_degree : -1
[2025-06-25 21:17:45,301] [ DEBUG] - pipeline_parallel_rank : 0
[2025-06-25 21:17:45,301] [ DEBUG] - power : 1.0
[2025-06-25 21:17:45,301] [ DEBUG] - pp_recompute_interval : 1
[2025-06-25 21:17:45,302] [ DEBUG] - prediction_loss_only : False
[2025-06-25 21:17:45,302] [ DEBUG] - process_index : 0
[2025-06-25 21:17:45,302] [ DEBUG] - recompute : True
[2025-06-25 21:17:45,302] [ DEBUG] - recompute_granularity : full
[2025-06-25 21:17:45,302] [ DEBUG] - recompute_use_reentrant : False
[2025-06-25 21:17:45,302] [ DEBUG] - refined_recompute : {'mlp_row_ln': 0, 'attention_row_ln': 0, 'attention_column_ln': 0, 'mlp_column_ln': 0, 'flash_attn': -1}
[2025-06-25 21:17:45,302] [ DEBUG] - release_grads : False
[2025-06-25 21:17:45,302] [ DEBUG] - remove_unused_columns : True
[2025-06-25 21:17:45,302] [ DEBUG] - report_to : ['visualdl']
[2025-06-25 21:17:45,302] [ DEBUG] - resume_from_checkpoint : None
[2025-06-25 21:17:45,302] [ DEBUG] - run_name : /tmp/tmp2wz24yqo
[2025-06-25 21:17:45,302] [ DEBUG] - save_on_each_node : False
[2025-06-25 21:17:45,302] [ DEBUG] - save_rng_states : True
[2025-06-25 21:17:45,302] [ DEBUG] - save_sharded_model : False
[2025-06-25 21:17:45,302] [ DEBUG] - save_sharding_stage1_model_include_freeze_params: False
[2025-06-25 21:17:45,303] [ DEBUG] - save_steps : 500
[2025-06-25 21:17:45,303] [ DEBUG] - save_strategy : IntervalStrategy.EPOCH
[2025-06-25 21:17:45,303] [ DEBUG] - save_tokenizer : True
[2025-06-25 21:17:45,303] [ DEBUG] - save_total_limit : 1
[2025-06-25 21:17:45,303] [ DEBUG] - scale_loss : 32768
[2025-06-25 21:17:45,303] [ DEBUG] - seed : 42
[2025-06-25 21:17:45,303] [ DEBUG] - sep_parallel_degree : -1
[2025-06-25 21:17:45,303] [ DEBUG] - sequence_parallel : False
[2025-06-25 21:17:45,303] [ DEBUG] - sequence_parallel_config :
[2025-06-25 21:17:45,303] [ DEBUG] - sharding : []
[2025-06-25 21:17:45,304] [ DEBUG] - sharding_comm_buffer_size_MB : -1
[2025-06-25 21:17:45,304] [ DEBUG] - sharding_degree : -1
[2025-06-25 21:17:45,304] [ DEBUG] - sharding_parallel_config :
[2025-06-25 21:17:45,304] [ DEBUG] - sharding_parallel_degree : -1
[2025-06-25 21:17:45,304] [ DEBUG] - sharding_parallel_mesh_dimension: dp
[2025-06-25 21:17:45,304] [ DEBUG] - sharding_parallel_rank : 0
[2025-06-25 21:17:45,304] [ DEBUG] - should_load_dataset : True
[2025-06-25 21:17:45,304] [ DEBUG] - should_load_sharding_stage1_model: False
[2025-06-25 21:17:45,304] [ DEBUG] - should_log : True
[2025-06-25 21:17:45,304] [ DEBUG] - should_save : True
[2025-06-25 21:17:45,304] [ DEBUG] - should_save_model_state : True
[2025-06-25 21:17:45,305] [ DEBUG] - should_save_model_with_tensor_fusion: False
[2025-06-25 21:17:45,305] [ DEBUG] - should_save_sharding_stage1_model: False
[2025-06-25 21:17:45,305] [ DEBUG] - skip_data_intervals : None
[2025-06-25 21:17:45,305] [ DEBUG] - skip_memory_metrics : True
[2025-06-25 21:17:45,305] [ DEBUG] - skip_profile_timer : True
[2025-06-25 21:17:45,305] [ DEBUG] - split_inputs_sequence_dim : True
[2025-06-25 21:17:45,305] [ DEBUG] - split_norm_comm : False
[2025-06-25 21:17:45,305] [ DEBUG] - ssa_group_size_ratio : 0.25
[2025-06-25 21:17:45,305] [ DEBUG] - tensor_parallel_config :
[2025-06-25 21:17:45,305] [ DEBUG] - tensor_parallel_degree : -1
[2025-06-25 21:17:45,305] [ DEBUG] - tensor_parallel_output : False
[2025-06-25 21:17:45,306] [ DEBUG] - tensor_parallel_rank : 0
[2025-06-25 21:17:45,306] [ DEBUG] - tensorwise_offload_optimizer : False
[2025-06-25 21:17:45,306] [ DEBUG] - to_static : False
[2025-06-25 21:17:45,306] [ DEBUG] - train_batch_size : 4
[2025-06-25 21:17:45,306] [ DEBUG] - unified_checkpoint : False
[2025-06-25 21:17:45,306] [ DEBUG] - unified_checkpoint_config :
[2025-06-25 21:17:45,306] [ DEBUG] - use_async_save : False
[2025-06-25 21:17:45,306] [ DEBUG] - use_expert_parallel : False
[2025-06-25 21:17:45,306] [ DEBUG] - use_flash_attention : True
[2025-06-25 21:17:45,306] [ DEBUG] - use_fused_dropout_add : False
[2025-06-25 21:17:45,306] [ DEBUG] - use_fused_linear : False
[2025-06-25 21:17:45,307] [ DEBUG] - use_fused_linear_cross_entropy: False
[2025-06-25 21:17:45,307] [ DEBUG] - use_fused_rms_norm : False
[2025-06-25 21:17:45,307] [ DEBUG] - use_fused_rope : False
[2025-06-25 21:17:45,307] [ DEBUG] - use_hybrid_parallel : False
[2025-06-25 21:17:45,307] [ DEBUG] - use_lowprecision_moment : False
[2025-06-25 21:17:45,307] [ DEBUG] - use_ssa : False
[2025-06-25 21:17:45,307] [ DEBUG] - virtual_pp_degree : 1
[2025-06-25 21:17:45,307] [ DEBUG] - wandb_api_key : None
[2025-06-25 21:17:45,307] [ DEBUG] - wandb_http_proxy : None
[2025-06-25 21:17:45,307] [ DEBUG] - warmup_ratio : 0.0
[2025-06-25 21:17:45,307] [ DEBUG] - warmup_steps : 30
[2025-06-25 21:17:45,307] [ DEBUG] - weight_decay : 0.0
[2025-06-25 21:17:45,308] [ DEBUG] - weight_name_suffix : None
[2025-06-25 21:17:45,308] [ DEBUG] - world_size : 1
[2025-06-25 21:17:45,308] [ DEBUG] - zcc_ema_interval : 1
[2025-06-25 21:17:45,308] [ DEBUG] - zcc_pipeline_hooks_capacity_usage: 0.6
[2025-06-25 21:17:45,308] [ DEBUG] - zcc_save_ema_coef : None
[2025-06-25 21:17:45,308] [ DEBUG] - zcc_workers_num : 3
[2025-06-25 21:17:45,308] [ DEBUG] -
[2025-06-25 21:17:45,308] [ INFO] - Starting training from resume_from_checkpoint : None
[2025-06-25 21:17:45,309] [ WARNING] - Warning: `named_parameters` is None, AdamWMini will use `parameters` instead, which may be incorrect.
[2025-06-25 21:17:45,310] [ INFO] - [timelog] checkpoint loading time: 0.00s (2025-06-25 21:17:45)
[2025-06-25 21:17:45,310] [ INFO] - ***** Running training *****
[2025-06-25 21:17:45,310] [ INFO] - Num examples = 20
[2025-06-25 21:17:45,310] [ INFO] - Num Epochs = 3
[2025-06-25 21:17:45,310] [ INFO] - Instantaneous batch size per device = 4
[2025-06-25 21:17:45,310] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 16
[2025-06-25 21:17:45,310] [ INFO] - Gradient Accumulation steps = 4
[2025-06-25 21:17:45,310] [ INFO] - Total optimization steps = 3
[2025-06-25 21:17:45,310] [ INFO] - Total num train samples = 60
[2025-06-25 21:17:45,311] [ DEBUG] - Number of trainable parameters = 104,599,296 (per device)
W0625 21:17:46.049122 83454 multiply_fwd_func.cc:76] got different data type, run type promotion automatically, this may cause data type been changed.
W0625 21:17:46.073673 83454 backward.cc:462] While running Node (MatmulGradNode) raises an EnforceNotMet exception
=============================================================== short test summary info ===============================================================
FAILED tests/llm/test_adamw_mini.py::FinetuneTest_0_llama::test_finetune - OSError: (External) OSError: (External) Exception: Not supported to retrieve a tensor saved by autograd multiple times that is no need to recomput...
1 failed in 10.86s
This Pull Request is stale because it has been open for 60 days with no activity.