maxtext
A simple, performant and scalable Jax LLM!
Hello, we converted the paxml checkpoint and resumed training with the following config:
```
base_config: "base.yml"
tokenizer_path: "/dockerx/vocab/c4_en_301_5Mexp2_spm.model"
dataset_type: "tfds"
dataset_path: "/ckpts/c4_mlperf_dataset"
dataset_name: "en:3.0.4"
eval_dataset_name: "en:3.0.5"
split: "train2"
tokenize_eval_data: False
eval_data_column: ...
```
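As an illustrative aside (not part of the issue above), one quick way to sanity-check that the SentencePiece model named in `tokenizer_path` loads before resuming training is a short script like the following; the test string is arbitrary:
```
# Hedged sketch: verify the SentencePiece model referenced by tokenizer_path
# can be loaded and inspect its vocab size, which the model config must match.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="/dockerx/vocab/c4_en_301_5Mexp2_spm.model")
print(sp.get_piece_size())                      # vocab size of the tokenizer
print(sp.encode("hello world", out_type=str))   # tokenization spot check
```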
Enable deterministic training with preemption when using the tfds pipeline by checkpointing the data iterator. Creates a checkpoint handler for the data iterator that implements orbax.checkpoint.CheckpointHandler, similar to https://github.com/google/grain/blob/main/grain/_src/python/checkpoint_handlers.py. The handler utilizes tf.train.Checkpoint to...
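For illustration, here is a minimal sketch of what such a data-iterator handler can look like. This is not the actual MaxText implementation: the class name and file layout are made up, and it assumes the `save(directory, item)` / `restore(directory, item)` handler interface.
```
# Minimal sketch (not the MaxText code) of an orbax CheckpointHandler that
# saves and restores a tf.data iterator so input state survives preemption.
from etils import epath
import orbax.checkpoint as ocp
import tensorflow as tf


class TfDataIteratorCheckpointHandler(ocp.CheckpointHandler):
  """Checkpoints a tf.data iterator via tf.train.Checkpoint."""

  def save(self, directory: epath.Path, item):
    # tf.data iterators are trackable, so tf.train.Checkpoint can serialize
    # their internal position.
    ckpt = tf.train.Checkpoint(iterator=item)
    ckpt.write(str(directory / "iterator"))

  def restore(self, directory: epath.Path, item):
    # Restore the iterator state in place and return the same iterator object.
    ckpt = tf.train.Checkpoint(iterator=item)
    ckpt.read(str(directory / "iterator"))
    return item
```
Registered alongside the model-state handlers in an orbax CheckpointManager, a handler like this lets the input pipeline resume from the same position after a preemption, which is what makes the run deterministic.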
Here is the minimum config required to support Llama 3.1 405B GPU training.
Turning `enable_goodput_recording` and `monitor_goodput` on by default.

Tested:
- [x] GCE ~1k steps [run](https://screenshot.googleplex.com/6wFY2wqp8YHMmDC)
- [x] GKE ~1k steps [run](https://screenshot.googleplex.com/3TVyAVhSPx2KCFA)
- [x] Example [logs](https://pantheon.corp.google.com/logs/query;query=resource.type%3D%22k8s_container%22%0Aresource.labels.project_id%3D%22cloud-tpu-multipod-dev%22%0Aresource.labels.location%3D%22us-central2%22%0Aresource.labels.cluster_name%3D%22dishaw-xpk-test-3%22%0Aresource.labels.namespace_name%3D%22default%22%0Alabels.k8s-pod%2Fjobset_sigs_k8s_io%2Fjobset-name%3D%22dishaw-goodput-maxtext-job-12%22%20severity%3E%3DDEFAULT;storageScope=project;cursorTimestamp=2024-09-27T00:10:55.977598170Z;startTime=2024-09-25T16:06:01.570Z;endTime=2024-09-27T22:25:38.299Z?e=13803378&mods=allow_workbench_image_override&project=cloud-tpu-multipod-dev)
@aireenmei Referring you here because I think this issue is touched on in [#571](https://github.com/AI-Hypercomputer/maxtext/issues/571), where you write: ``` I did not implement the auto restart because some users may not want...
For both `jax.profiler` (`profiler=xplane` in maxtext) and the GPU nsys profiler (`profiler=nsys` in maxtext), we upload the profile to the `base_output_directory` ([source](https://github.com/AI-Hypercomputer/maxtext/blob/0a919c19911ea2d99445e72a59e838f466b962c6/MaxText/pyconfig.py#L317)). Typically this directory is in GCS, but it can also...
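As a rough sketch of what the xplane path amounts to (not the MaxText code itself), a `jax.profiler` trace is captured around the profiled steps and written under the run's output directory. The `gs://` path below is a placeholder, and whether it can be written to directly or needs a separate upload step depends on the environment:
```
# Hedged sketch: capture an xplane trace with jax.profiler and write it under
# a placeholder output directory, similar in spirit to profiler=xplane.
import jax
import jax.numpy as jnp

profile_dir = "gs://my-bucket/my-run/tensorboard"  # placeholder output location

jax.profiler.start_trace(profile_dir)
x = jnp.ones((1024, 1024))
y = (x @ x).block_until_ready()  # work that ends up in the captured trace
jax.profiler.stop_trace()
```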
This is a test issue.