rivershah
Setting the `steps_per_execution` parameter to a large value can speed up the training process on toy datasets (device utilization goes up), but it may cause problems when training on larger...
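For context, a minimal sketch of what I mean (the model, data, and the value 16 are illustrative, not from a real run):

```python
import numpy as np
import keras

model = keras.Sequential([keras.layers.Dense(1)])
model.compile(
    optimizer="adam",
    loss="mse",
    # Dispatch many train steps per host->device call. Large values help
    # device utilization on small datasets, but interact badly with
    # per-step callbacks/metrics on bigger runs.
    steps_per_execution=16,
)

x = np.random.rand(1024, 8).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=32, epochs=1)
```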
Distributed training is broken in Keras 3. Please take a look and increase the priority. I am pretty sure variable aggregation and synchronization are not applied correctly. On further digging I used `tf.Variable` and...
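For illustration, this is the kind of explicit synchronization/aggregation I would expect metric variables to carry (the strategy and values below are made up; only the `tf.Variable` arguments matter):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Metric-style accumulator: each replica writes its local copy, and
    # cross-replica reads are reduced with SUM.
    total = tf.Variable(
        0.0,
        trainable=False,
        synchronization=tf.VariableSynchronization.ON_READ,
        aggregation=tf.VariableAggregation.SUM,
    )

@tf.function
def step(value):
    total.assign_add(value)

strategy.run(step, args=(tf.constant(1.0),))
print(total.read_value())  # aggregated across replicas on read
```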
@fchollet This is a simple bug, but Keras 3 distribution metrics reporting is definitely broken because of it. Consequently, learning rate scheduling, early stopping, and other training steps are also...
We still need a decorator function in Keras that does `jax.jit`, `tf.function(jit_compile=True)`, etc. under the hood depending on the backend. In downstream code, ideally we don't have to write any backend...
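Something along these lines, as a sketch (the name `backend_jit` is made up; there is no such public Keras API today):

```python
import keras


def backend_jit(fn):
    """Compile `fn` with the active backend's native JIT, if any."""
    backend = keras.backend.backend()
    if backend == "jax":
        import jax
        return jax.jit(fn)
    if backend == "tensorflow":
        import tensorflow as tf
        return tf.function(fn, jit_compile=True)
    if backend == "torch":
        import torch
        return torch.compile(fn)
    return fn  # numpy / unknown backends: run eagerly
```

Downstream code would then decorate its `train_step` once and stay backend-agnostic.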
This is a Keras issue. Please see the imports and patching below:

```python
from keras.src.trainers import data_adapters


def patched_is_tf_dataset(x):
    if hasattr(x, "__class__"):
        for parent in x.__class__.__mro__:
            if parent.__name__ in (
                "DatasetV2",
                "DistributedDataset",
                ...
```
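Presumably the patch is then installed by rebinding the module attribute, e.g. `data_adapters.is_tf_dataset = patched_is_tf_dataset` (assuming the helper is exposed under that name), before calling `model.fit`.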
I noticed that if we allow Google Batch to pick the disk type automatically, the issue goes away. Specifying disk types for newer machines such as `n4-standard` or `a3-highgpu` breaks otherwise ```...
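As a rough sketch of the workaround with the Python client (the machine type is illustrative; the point is simply to leave `boot_disk` unset so Batch picks a compatible disk):

```python
from google.cloud import batch_v1


def make_allocation_policy(machine_type: str = "n4-standard-8") -> batch_v1.AllocationPolicy:
    instance_policy = batch_v1.AllocationPolicy.InstancePolicy(
        machine_type=machine_type,
        # No boot_disk specified: let Batch choose a disk type compatible
        # with the machine family instead of pinning a pd-* type.
    )
    instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate(
        policy=instance_policy
    )
    return batch_v1.AllocationPolicy(instances=[instances])
```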
This risks starvation. What is a graceful way to trigger fast failure / timeout, please? For example, we submit jobs on large GPU machines which can go without availability for...
Excellent, requesting that we please implement this
For some gpu machines I get events like this:

```
STATUS_CHANGED    2025-10-01T15:45:28.439698938Z  Job state is set from SCHEDULED to RUNNING for job projects/[PROJECT_NUMBER]/locations/us-central1/jobs/[JOB_ID].
OPERATIONAL_INFO  2025-10-01T15:37:50.018Z        VM in Managed Instance Group...
```
Please fix this. `CLAUDE.md` is not being respected