Sam Stoelinga

Results 223 comments of Sam Stoelinga

Let's discuss the multiple resource profiles here: #428. I need to think that through some more. Thanks for explaining why and how you plan to use mutability of the model URL!...

It turned out to be more complex to implement when cacheProfile is used. However, since vLLM natively supports loading models from S3, that may be a good alternative in...

I'm hitting this and am unsure why it's happening. Any insights? Output of kubectl describe workload: ``` Status: Conditions: Last Transition Time: 2025-06-16T21:48:02Z Message: ClusterQueue cluster-queue is inactive Observed Generation: 1 Reason: Inadmissible...

Maybe I'm misunderstanding the code... but input_dispatcher is None for the fuji models, so wouldn't it default to InputDispatcher already? @markblee

Are you saying I should create a custom InputDispatcher and pass that instead? That may make sense. Looking into that further.

Would these be the right settings for fsdp=16 and model=16 gbs=16 on v6e-256? ``` # Usually left unset. Defaults to # max(feed_logical_batch_size * num_physical_feeds, jax.device_count()). global_physical_batch_size = 16 # The...
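The default quoted in the config comment above can be worked through with a small sketch. The helper name and the example values below are hypothetical and only illustrate the `max(...)` expression from the comment; this is not axlearn's actual implementation, and the device count is passed in explicitly rather than read from `jax.device_count()`.

```python
def default_global_physical_batch_size(
    feed_logical_batch_size: int,
    num_physical_feeds: int,
    device_count: int,
) -> int:
    """Hypothetical helper mirroring the default described in the config comment:
    max(feed_logical_batch_size * num_physical_feeds, device_count)."""
    return max(feed_logical_batch_size * num_physical_feeds, device_count)


# Assumed example values: a logical batch of 16 spread over 16 feeds
# (1 example per feed) on a slice with 256 devices.
default = default_global_physical_batch_size(
    feed_logical_batch_size=1,
    num_physical_feeds=16,
    device_count=256,
)
print(default)  # 256, because the device count exceeds 1 * 16
```

Under these assumed values, leaving the setting unset would give a physical batch of 256, whereas the question above sets it explicitly to 16.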

Currently in Fuji models this is set:
```
cfg.input = input_tf_data.Input.default_config().set(
    is_training=True,
    source=train_input_source,
    processor=config_for_function(input_tf_data.identity),
    batcher=config_for_function(input_tf_data.batch).set(
        global_batch_size=train_batch_size,
        prefetch_buffer_size=tf.data.AUTOTUNE,
        pad_example_fn=input_tf_data.default_pad_example_fn,
    ),
)
```
This is what's inside input_tf_data.batch function: ``` num_data_feeds =...

Just sharing for now since it's related. I also hit this error when trying fsdp=16, model=16 and gbs=128 on 256 chips: ``` Stack Summary (most recent call last): File "/usr/local/lib/python3.10/runpy.py",...

I think the error from CI is not related to my PR? https://github.com/apple/axlearn/actions/runs/14895076691/job/41907295505?pr=1163#step:8:9480 ``` #22 475.0 ==================================== ERRORS ==================================== #22 475.0 _______ ERROR collecting axlearn/open_api/metrics/code_contests_test.py ________ #22 475.0 ImportError while...

@markblee could you review again please? I've added the docstring and responded to your comment about why one of the XLA flags was removed intentionally.