Results 24 issues of Kirill Golikov

For the x64 platform:
1. The size of every **CMemoryHandlers** object is decreased: 24 bytes --> 16 bytes
2. For any **CMemoryHandlers** that has been allocated, the size of its corresponding struct is decreased: 32 bytes --> 16...

Original article and code:
* https://arxiv.org/pdf/2009.14794.pdf
* https://github.com/google-research/google-research/blob/master/performer/fast_attention/tensorflow/fast_attention.py
* https://blog.research.google/2020/10/rethinking-attention-with-performers.html
* https://medium.com/analytics-vidhya/paper-explained-rethinking-attention-with-performers-b207f4bf4bc5
* https://www.youtube.com/watch?v=xJrKIPwVwGM

The idea behind eliminating unnecessary synchronizations for CUDA is that scalar constants can be passed to GPU computation kernels from host memory by value. It would be possible to replace...
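
As a rough illustration of passing a scalar by value, here is a minimal CUDA sketch. It is not taken from the NeoML code base: the kernel names `scaleByPointer` and `scaleByValue` and the helper `scaleOnDevice` are hypothetical, and error checking is omitted for brevity. The first kernel reads its factor through a device pointer, so the host would first have to copy the scalar into device memory, and a blocking `cudaMemcpy` is typically a synchronization point. The second kernel takes the factor as an ordinary kernel argument, so that extra copy is not needed.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Contrast case: the factor lives in device memory, so the host must
// copy it there before the launch.
__global__ void scaleByPointer(float* data, const float* factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= *factor;
    }
}

// The factor is a plain kernel argument, taken by value from host memory.
__global__ void scaleByValue(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Hypothetical helper: multiplies every element of `host` by `factor` on the GPU.
// Only the data buffer is copied; the scalar travels with the kernel launch.
void scaleOnDevice(std::vector<float>& host, float factor)
{
    const int n = static_cast<int>(host.size());
    if (n == 0) {
        return;
    }
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float* dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleByValue<<<blocks, threads>>>(dev, factor, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```

Calling `scaleOnDevice(values, 2.0f)` on a `std::vector<float>` doubles each element in place. Whether the actual change uses this exact pattern is an assumption; the sketch only shows the general mechanism described above.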

This PR should be **rejected**; the PR https://github.com/neoml-lib/neoml/pull/1070 should be used instead.

Please merge first:
* https://github.com/neoml-lib/neoml/pull/1045

Please merge beforehand:
* https://github.com/neoml-lib/neoml/pull/1049