Matthew Davidow
Tested by creating a multihost_job, simulating a maintenance event, and confirming that the logs are still visible after recovering from that event. Create the multihost_job: ``` python3 multihost_job.py --COMMAND="bash...
Tested on 2x VLP
We are seeing ``` DeprecationWarning: `product` is deprecated as of NumPy 1.25.0, and will be removed in NumPy 2.0. Please use `prod` instead. ```
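A minimal sketch of the rename the warning asks for. The `shape` tuple is a hypothetical input: the warning text does not say which call site triggers it, and it may originate in a dependency rather than in this repository.

```python
import numpy as np

# Hypothetical input; the real call site is not named in the warning.
shape = (2, 3, 4)

# Deprecated spelling (warns on NumPy >= 1.25, removed in NumPy 2.0):
#   num_elements = np.product(shape)

# Replacement recommended by the warning; it returns the same value.
num_elements = int(np.prod(shape))
print(num_elements)
```

If the deprecated call sits in a third-party package, upgrading that package is likely the cleaner fix than filtering the warning.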
For both `jax.profiler` (`profiler=xplane` in maxtext) and the GPU nsys profiler (`profiler=nsys` in maxtext), we upload the profile to the `base_output_directory` ([source](https://github.com/AI-Hypercomputer/maxtext/blob/0a919c19911ea2d99445e72a59e838f466b962c6/MaxText/pyconfig.py#L317)). Typically this directory is on GCS; it can also...
# Description Add support for using PP with DeepSeek, including the new `pipeline_parallel_layers` feature, which pipelines only a subset of layers. This change can help with SPMD pipelining...
# Description We saw extra memory usage that made it hard to fit PP=21 TP=4 for llama405B with fsdp_ag_once, which is surprising: memory usage should be dominated by AG...