Philip Pham

6 comments by Philip Pham

To clarify, you are using https://github.com/google-research/bigbird/blob/5f2a5aa7fbab23e32e0e0b41c5f0192f0c023e05/bigbird/core/attention.py#L637 with `attention_type = 'block_sparse'`? What's your sequence length?

I see. Does the memory used change with sequence length? I don't suppose you are using XLA? BigBird can be as much as 30% faster with `tf.function(jit_compile=True)`. It also produces...
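For reference, a minimal sketch of what I mean by turning on XLA, where `model` and `features` stand in for your BigBird model and input batch (not names from this thread), and the memory check assumes TF 2.5+ with a GPU named `"GPU:0"`:

```python
import tensorflow as tf

# Compile the forward pass with XLA; `model` and `features` are placeholders.
@tf.function(jit_compile=True)
def forward(features):
    return model(features, training=False)

# Rough way to see whether memory scales with sequence length:
# check current/peak device memory after calls at different lengths.
_ = forward(features)
print(tf.config.experimental.get_memory_info("GPU:0"))
```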

https://www.tensorflow.org/guide/profiler#memory_profile_tool may also be useful. The XLA memory viewer (https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm#memory_viewer) is better, but both are worth a look.
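Capturing a profile programmatically is just a start/stop pair around a few steps; the logdir path and `train_step` here are illustrative:

```python
import tensorflow as tf

tf.profiler.experimental.start("/tmp/bigbird_profile")
for step in range(5):
    train_step()  # stand-in for your existing training step
tf.profiler.experimental.stop()
# Then run `tensorboard --logdir /tmp/bigbird_profile` and open the
# Profile tab's memory_profile tool.
```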

A single Titan X doesn't have enough HBM. For our GPU setup, we had 8 V100s for a total of 128GB of HBM. For a single Titan X, I think...

I, too, would like to see type hints and argument names, but my preferred implementation would be different. I'd prefer something less intrusive in the minibuffer. I think what tide...

Maybe not quite the same thing, but similar in spirit: it would be nice if `pallas_call` could inherit replication rules for use with `shard_map`, so we don't have to...
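For context, this is roughly the explicit wrapping I mean today, as a sketch assuming a recent JAX with `jax.experimental.pallas` and `jax.experimental.shard_map`; the kernel, mesh axis name, and specs are illustrative, not from any particular issue:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

# Trivial Pallas kernel: add 1 to each element of its block.
def add_one_kernel(x_ref, o_ref):
    o_ref[...] = x_ref[...] + 1.0

def add_one(x):
    # interpret=True so the sketch also runs without a GPU/TPU backend.
    return pl.pallas_call(
        add_one_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        interpret=True,
    )(x)

# Today: wrap the pallas_call in shard_map and spell out the specs by hand.
mesh = Mesh(np.array(jax.devices()), ("i",))
sharded_add_one = shard_map(add_one, mesh=mesh, in_specs=P("i"), out_specs=P("i"))

x = jnp.arange(8 * 4, dtype=jnp.float32).reshape(8, 4)
print(jax.jit(sharded_add_one)(x))
```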