Tao
Results: 2 comments of Tao
We hit the same error when running checkpointing for the 70B model with 64 processes. We have a limited number of hardware hosts, and each host has only 256 GB of memory (912 GB is required in total). It's very...
Hi @zhenghh04, the checkpointing datasize subcommand has no "args.mpi_params", but it is used in the function update_args: cli.py ` ## Line 330: def add_checkpointing_arguments(checkpointing_parsers): .... if _parser == run_benchmark: _parser.add_argument('--exec-type', '-et', type=EXEC_TYPE, choices=list(EXEC_TYPE),...
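To illustrate the kind of failure described in the comment above, here is a minimal sketch using only the standard `argparse` module. It is not the project's actual `cli.py`: the parser layout, the `run` and `datasize` subcommand names, the `--mpi-params` flag spelling, and the body of `update_args` are all assumptions made for illustration. It shows one possible way a shared post-processing function could tolerate a subcommand that never registered the option, by reading the attribute with a default instead of accessing it directly.

```python
import argparse


def build_parser():
    """Build a hypothetical two-subcommand parser (illustrative only)."""
    parser = argparse.ArgumentParser(prog="demo")
    sub = parser.add_subparsers(dest="command", required=True)

    # Only the "run" subcommand registers --mpi-params in this sketch.
    run = sub.add_parser("run")
    run.add_argument("--mpi-params", nargs="+", default=None)

    # "datasize" registers no MPI options, so its Namespace has no
    # mpi_params attribute at all.
    sub.add_parser("datasize")
    return parser


def update_args(args):
    """Normalize args shared across subcommands (hypothetical helper)."""
    # A direct args.mpi_params access would raise AttributeError for
    # "datasize"; getattr with a default avoids that.
    mpi_params = getattr(args, "mpi_params", None)
    args.mpi_params = list(mpi_params) if mpi_params else []
    return args
```

With this approach, `update_args(build_parser().parse_args(["datasize"]))` would leave `mpi_params` as an empty list rather than failing. Registering the option on every subcommand that reaches `update_args` would be an alternative fix; which one fits the real code base is for the maintainers to decide.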