I've added bigbird's attention to my model, but not seeing a decrease in memory
I've replaced the attention layers in Enformer with those from bigbird, but the memory usage reported by tf.config.experimental.get_memory_info is still basically the same (within 1%). I'm wondering whether I also need to include code from the encoder or decoder to see a decrease in memory usage?
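For reference, a minimal sketch of measuring peak memory this way, assuming a single GPU; the dense layer and random batch below are just stand-ins, not the actual Enformer setup:

```python
import tensorflow as tf

model = tf.keras.layers.Dense(256)           # stand-in for the real model
inputs = tf.random.normal([1, 1536, 1536])   # stand-in batch

# Reset the stats, run one forward pass, then read back the peak usage.
tf.config.experimental.reset_memory_stats('GPU:0')
_ = model(inputs)
peak = tf.config.experimental.get_memory_info('GPU:0')['peak']
print(f'peak memory: {peak / 2**30:.2f} GiB')
```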
Thanks!
To clarify, you are using https://github.com/google-research/bigbird/blob/5f2a5aa7fbab23e32e0e0b41c5f0192f0c023e05/bigbird/core/attention.py#L637 with attention_type = 'block_sparse'?
What's your sequence length?
Correct, I'm using that class with block_sparse attention.
When the sequence enters the attention layer, its length is 1536.
I see. Does the memory used change with sequence length?
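For a rough sense of the expected difference at this length, here's a back-of-envelope estimate. The block size and block counts are the library defaults (block_size = 64, a 3-block sliding window, 2 global blocks, 3 random blocks) and are assumptions, not values taken from this thread:

```python
# Rough attention-score buffer size per head (assumed defaults above).
n, block = 1536, 64
full_entries = n * n                           # dense softmax scores: ~2.4M
sparse_entries = n * block * (3 + 2 + 3)       # block-sparse scores: ~0.8M
print(full_entries, sparse_entries, full_entries / sparse_entries)  # ratio ~3x
```

So at n = 1536 the attention scores only shrink by a few x per head; if most of the memory goes to activations outside attention, the overall footprint can look nearly unchanged.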
I don't suppose you are using XLA? BigBird can be as much as 30% faster with tf.function(jit_compile=True). It also produces better memory profiles that make it easier to debug.
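A minimal sketch, with a dense layer and random batch standing in for the actual model and data:

```python
import tensorflow as tf

model = tf.keras.layers.Dense(256)          # stand-in for the real model
inputs = tf.random.normal([1, 1536, 128])   # stand-in batch

# XLA-compile the forward pass; the first call traces and compiles.
@tf.function(jit_compile=True)
def forward(x):
    return model(x)

out = forward(inputs)
```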
Yes, the memory used increases with sequence length.
I'm not using XLA, and thanks for the tip!
https://www.tensorflow.org/guide/profiler#memory_profile_tool may also be useful. The XLA memory viewer (https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm#memory_viewer) is better, but both are worth trying.
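A minimal sketch of capturing a trace for the memory profile tool; the model, inputs, and log directory below are stand-ins:

```python
import tensorflow as tf

model = tf.keras.layers.Dense(256)          # stand-in for the real model
inputs = tf.random.normal([1, 1536, 128])   # stand-in batch

# Trace one forward pass, then open the log directory in TensorBoard
# and inspect allocations with the Memory Profile tool.
tf.profiler.experimental.start('/tmp/tb_logdir')
_ = model(inputs)
tf.profiler.experimental.stop()
```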