maxtext
A simple, performant and scalable Jax LLM!
The code works fine on TPU but crashes on GPU with a strange error. (See the test logs below, captured while running train.py.)
Tested by creating a multihost_job, simulating a maintenance event, and confirming that the logs are still visible after recovering from the simulated maintenance event: Create the multihost_job ``` python3 multihost_job.py --COMMAND="bash...
The standalone data loader sets up the model and data iterator in the same way as the train_loop in train.py. The data loader then iterates through batches of data to log the step time of...
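A minimal sketch of such a timing loop (illustrative names only, not the actual MaxText implementation):

```python
import time

# Illustrative sketch: iterate over the data iterator and log per-step load time,
# mirroring what a standalone data loader does without running the model.
def data_load_loop(data_iterator, num_steps):
  for step in range(num_steps):
    start = time.time()
    batch = next(data_iterator)  # same iterator setup as train.py's train_loop
    print(f"step {step}: data load time {time.time() - start:.3f}s")
```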
* `--shm-size` is increased to `1g` for `docker run` on GPU because the default value of 64 MB may not be sufficient for some GPU configurations (e.g. A100-40gb-8)
* `--shm-size=1g`...
* Cloud monitoring prototype
* Checkpoint initialization metrics emitting
The assertion doesn't check that `determined_val` is an integer, and the function call is missing.
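A minimal sketch of the intended check, assuming a hypothetical `determine_val()` helper (names are illustrative, not from the MaxText codebase):

```python
def determine_val():
  """Hypothetical helper standing in for the real computation."""
  return 42

# The helper must actually be invoked, not just referenced, and the result type checked.
determined_val = determine_val()
assert isinstance(determined_val, int), f"expected int, got {type(determined_val)}"
```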
See if this is an improvement for your purposes. This PR modifies the multihost data put code to infer the global shapes and build `NamedSharding`s lazily at load time. This...
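A hedged sketch of what building `NamedSharding`s lazily at load time could look like (the function name, partition spec, and data layout are assumptions, not the PR's actual code):

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

def put_to_devices(local_batch: np.ndarray, mesh: Mesh, pspec: P = P("data")):
  """Illustrative only: infer the global shape from the per-host shard and build
  the NamedSharding when the batch is loaded, rather than ahead of time."""
  # Global batch dimension = per-host batch * number of participating processes.
  global_shape = (local_batch.shape[0] * jax.process_count(),) + local_batch.shape[1:]
  sharding = NamedSharding(mesh, pspec)  # built lazily, at load time
  local_devices = mesh.local_devices
  shards = np.split(local_batch, len(local_devices), axis=0)
  arrays = [jax.device_put(s, d) for s, d in zip(shards, local_devices)]
  return jax.make_array_from_single_device_arrays(global_shape, sharding, arrays)
```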
Starting the profile at step 0 results in an error if training is re-run from a checkpoint saved at step 0 or later. This change starts the profile at the end of...
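For context, a minimal sketch of gating the profiler on a configurable start step (the step numbers, trace directory, and training step are placeholders, not MaxText's config values):

```python
import jax
import jax.numpy as jnp

# Placeholder values; in a real run these would come from the training config.
profiler_start_step = 5
profiler_end_step = 10

@jax.jit
def train_step(x):
  return x * 2.0  # stand-in for the real training step

state = jnp.ones((4,))
for step in range(20):
  if step == profiler_start_step:
    jax.profiler.start_trace("/tmp/profile")  # placeholder output directory
  state = train_step(state)
  if step == profiler_end_step:
    jax.profiler.stop_trace()
```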