rivershah
Setting the `steps_per_execution` parameter to a large value can speed up the training process on toy datasets (device utilization goes up), but it may cause problems when training on larger...
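For context, a minimal sketch of what I mean (the model, data, and the value 16 are illustrative, not from a real run):

```python
import numpy as np
import keras

model = keras.Sequential([keras.layers.Dense(1)])
model.compile(
    optimizer="adam",
    loss="mse",
    # Dispatch many train steps per host->device call. Large values help
    # device utilization on small datasets, but interact badly with
    # per-step callbacks/metrics on bigger runs.
    steps_per_execution=16,
)

x = np.random.rand(1024, 8).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=32, epochs=1)
```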
Distributed training is broken in Keras 3. Please take a look and increase the priority. I am pretty sure variable aggregation and synchronization are not applied correctly. On further digging I used `tf.Variable` and...
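For illustration, this is the kind of explicit synchronization/aggregation I would expect metric variables to carry (the strategy and values below are made up; only the `tf.Variable` arguments matter):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Metric-style accumulator: each replica writes its local copy, and
    # cross-replica reads are reduced with SUM.
    total = tf.Variable(
        0.0,
        trainable=False,
        synchronization=tf.VariableSynchronization.ON_READ,
        aggregation=tf.VariableAggregation.SUM,
    )

@tf.function
def step(value):
    total.assign_add(value)

strategy.run(step, args=(tf.constant(1.0),))
print(total.read_value())  # aggregated across replicas on read
```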
@fchollet This is a simple bug, but Keras 3 distribution metrics reporting is definitely broken because of it. Consequently, learning rate scheduling, early stopping, and other training steps are also...
We still need a decorator function in Keras that does `jax.jit`, `tf.function(jit_compile=True)`, etc. under the hood depending on the backend. In downstream code, ideally we don't have to write any backend...
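Something along these lines, as a sketch (the name `backend_jit` is made up; there is no such public Keras API today):

```python
import keras


def backend_jit(fn):
    """Compile `fn` with the active backend's native JIT, if any."""
    backend = keras.backend.backend()
    if backend == "jax":
        import jax
        return jax.jit(fn)
    if backend == "tensorflow":
        import tensorflow as tf
        return tf.function(fn, jit_compile=True)
    if backend == "torch":
        import torch
        return torch.compile(fn)
    return fn  # numpy / unknown backends: run eagerly
```

Downstream code would then decorate its `train_step` once and stay backend-agnostic.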
This is a Keras issue. Please see the imports and patching below:

```python
from keras.src.trainers import data_adapters


def patched_is_tf_dataset(x):
    if hasattr(x, "__class__"):
        for parent in x.__class__.__mro__:
            if parent.__name__ in (
                "DatasetV2",
                "DistributedDataset",
                ...
```
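Presumably the patch is then installed by rebinding the module attribute, e.g. `data_adapters.is_tf_dataset = patched_is_tf_dataset` (assuming the helper is exposed under that name), before calling `model.fit`.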
I noticed that if we allow Google Batch to pick the disk type automatically, the issue goes away. Specifying disk types for newer machines such as `n4-standard` or `a3-highgpu` breaks otherwise ```...
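As a rough sketch of the workaround with the Python client (the machine type is illustrative; the point is simply to leave `boot_disk` unset so Batch picks a compatible disk):

```python
from google.cloud import batch_v1


def make_allocation_policy(machine_type: str = "n4-standard-8") -> batch_v1.AllocationPolicy:
    instance_policy = batch_v1.AllocationPolicy.InstancePolicy(
        machine_type=machine_type,
        # No boot_disk specified: let Batch choose a disk type compatible
        # with the machine family instead of pinning a pd-* type.
    )
    instances = batch_v1.AllocationPolicy.InstancePolicyOrTemplate(
        policy=instance_policy
    )
    return batch_v1.AllocationPolicy(instances=[instances])
```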
This risks starvation. What is a graceful way to trigger fast failure / timeout, please? For example, we submit jobs on large GPU machines which can go without availability for...
Excellent, requesting that we please implement this
For some gpu machines I get events like this:

```
STATUS_CHANGED    2025-10-01T15:45:28.439698938Z  Job state is set from SCHEDULED to RUNNING for job projects/[PROJECT_NUMBER]/locations/us-central1/jobs/[JOB_ID].
OPERATIONAL_INFO  2025-10-01T15:37:50.018Z        VM in Managed Instance Group...
```
Please fix this. `CLAUDE.md` is not being respected