Sevin Fide Varoglu

Results: 10 comments of Sevin Fide Varoglu

@baskaryan, @efriis, @eyurtsev, @hwchase17 please review

> Do you have an HLO for host offload to demonstrate this speed-up? Adding it to the benchmark suite could guard host offloading features against regressions in future development.

Added. I...

> Could you also add a unit test to demonstrate how the `dynamic_variable_tuple_indices` config is used under `FusionDynamicMemcpyRewriter`?

Added to `copy_test`, as `DynamicMemcpyFusion::GetMemcpyDescriptorForFusion` is in `copy.cc`.

> I have a high-level question: from the PR description and the benchmark performance data, it is not always true that the runtimes decrease; there are about 5 cases...

> Thanks for the explanation. In that case, I would think replacing that block of benchmark performance data with something similar to what you have just said would be better. The...

> Would you please split host_offload_utils* into its own PR? We can move forward with submitting that PR. Smaller PRs are preferred for many reasons, such as easier submission; we can...

> Did you remove the benchmark you added to this PR? We actually want to have such benchmarks.

Merged as a separate PR: https://github.com/openxla/xla/pull/34335

@qGentry Using JAX 0.4.35 with `XLA_FLAGS="--xla_gpu_graph_level=0 --xla_gpu_enable_triton_gemm=false --xla_gpu_enable_command_buffer= "` and `SCAN=False`, I'm seeing a failure:

```
Out of memory while trying to allocate 35701941112 bytes.
*** Check failure stack trace: ***
@...
```
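For reference, a minimal sketch of how the flags above can be set, assuming they are exported before JAX is imported; the model code and the `SCAN` switch come from the original script and are not reproduced here:

```python
import os

# XLA_FLAGS must be in the environment before jax is imported,
# otherwise the XLA client will not pick them up.
os.environ["XLA_FLAGS"] = (
    "--xla_gpu_graph_level=0 "
    "--xla_gpu_enable_triton_gemm=false "
    "--xla_gpu_enable_command_buffer="
)

import jax

print(jax.__version__)  # 0.4.35 in this report
```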

@qGentry Can you please set `XLA_CLIENT_MEM_FRACTION=0.95` and use `--xla_gpu_copy_insertion_use_region_analysis` in addition to your existing flags, and report back whether that resolves the issue?
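A minimal sketch of that setup, assuming the environment is configured before JAX is imported; the flag is simply appended to whatever `XLA_FLAGS` you already use:

```python
import os

# Let the XLA client reserve up to 95% of GPU memory.
os.environ["XLA_CLIENT_MEM_FRACTION"] = "0.95"

# Append the region-analysis flag to the existing XLA flags.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "")
    + " --xla_gpu_copy_insertion_use_region_analysis"
).strip()

import jax  # import only after the environment is set
```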

@qGentry The `xla_gpu_memory_limit_slop_factor` flag could also help in this case. The default value is 95, so you can experiment with lower values (90, 80, 70, etc.). You can find more info...
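As a sketch, one way to experiment with the slop factor is to sweep a few values from a driver script, one process per value so the flag is picked up when XLA initializes; `train.py` below is just a placeholder for your actual workload:

```python
import os
import subprocess

# Try progressively lower slop factors (the default is 95).
for slop in (90, 80, 70):
    env = dict(os.environ)
    env["XLA_FLAGS"] = (
        env.get("XLA_FLAGS", "")
        + f" --xla_gpu_memory_limit_slop_factor={slop}"
    ).strip()
    # "train.py" is a placeholder for the actual reproducer.
    subprocess.run(["python", "train.py"], env=env, check=False)
```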