Enrico Shippole
Hi @lucidrains, Here are the results for training the GPT2 model on an A100 (40 GB). This is a different A100 from the one I used before. I left everything the...
Hi Phil, I was wondering what your thoughts are on adding Flash Attention 2?

```python
n, device, h = x.shape[1], x.device, self.heads

# pre layernorm
x = self.norm(x)

# attention...
```
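For context, here is roughly the shape of the change I have in mind: a minimal sketch that routes the attention through PyTorch's `F.scaled_dot_product_attention`, which dispatches to a FlashAttention kernel (FlashAttention-2 in recent PyTorch builds) on supported GPUs. The module layout, dimensions, and projection names below are assumptions for illustration, not the repository's actual code.

```python
import torch.nn.functional as F
from torch import nn
from einops import rearrange

class Attention(nn.Module):
    # hypothetical module mirroring the snippet above; all dims are assumptions
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads
        self.norm = nn.LayerNorm(dim)
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.to_out = nn.Linear(inner_dim, dim, bias = False)

    def forward(self, x):
        n, device, h = x.shape[1], x.device, self.heads

        # pre layernorm
        x = self.norm(x)

        # project to queries, keys, values and split out the heads
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), (q, k, v))

        # fused attention - uses a flash kernel when one is available
        out = F.scaled_dot_product_attention(q, k, v, is_causal = True)

        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)
```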
Hello, Thank you for all of your great work. I am trying to download and process just the English dumps from CommonCrawl up to 2023. I have been running into...
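In case it is useful, this is the kind of entry point I am working from: a minimal sketch that lists the WET (extracted plain text) files for a single crawl, assuming the standard CommonCrawl layout on `data.commoncrawl.org`. The crawl label below is just one of the 2023 crawls, and language filtering would still need to happen downstream.

```python
import gzip
import io

import requests

# CC-MAIN-2023-06 is one of the early-2023 crawls; swap in whichever
# crawl labels you need (the full list is at https://index.commoncrawl.org/)
CRAWL = "CC-MAIN-2023-06"
BASE = "https://data.commoncrawl.org"

# each crawl publishes a gzipped list of the paths to its WET files
resp = requests.get(f"{BASE}/crawl-data/{CRAWL}/wet.paths.gz", timeout = 60)
resp.raise_for_status()

with gzip.open(io.BytesIO(resp.content), "rt") as f:
    wet_paths = [line.strip() for line in f]

# download individual segments from BASE + path; language identification
# (e.g. a fastText model) would be applied after extracting the records
url = f"{BASE}/{wet_paths[0]}"
print(f"{len(wet_paths)} WET files in {CRAWL}; first: {url}")
```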
Hi, Thank you for the great research. I am working on implementing the findings from this paper in a different setting using TRLX. Unfortunately, when matching hyperparameters for A2C with...
Hello, A peer of mine ran the benchmark script on an A100. Under what conditions should we see the most significant gain for the sparse 2:4 ("sparse 24") linear or activations?

```
...
```
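For reference, this is the kind of micro-benchmark I would expect to isolate the gain: a minimal sketch using PyTorch's prototype `to_sparse_semi_structured` API (PyTorch >= 2.1, fp16, Ampere or newer). The shapes and timing harness are my own assumptions, not the repository's benchmark script.

```python
import torch
from torch.sparse import to_sparse_semi_structured

# 2:4 semi-structured kernels want fp16/bf16 on an Ampere+ GPU, and the
# speedup mostly shows up at large, GEMM-friendly shapes like these
m, k, n = 4096, 4096, 4096
mask = torch.Tensor([0, 0, 1, 1]).tile((m, k // 4)).cuda().bool()
w_dense = torch.rand(m, k).half().cuda() * mask  # weights already in a 2:4 pattern
x = torch.rand(k, n).half().cuda()

w_sparse = to_sparse_semi_structured(w_dense)

def bench(fn, iters = 100):
    # simple CUDA-event timing with a warmup, in milliseconds per call
    start = torch.cuda.Event(enable_timing = True)
    end = torch.cuda.Event(enable_timing = True)
    for _ in range(10):
        fn()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

print("dense :", bench(lambda: torch.mm(w_dense, x)), "ms")
print("sparse:", bench(lambda: torch.mm(w_sparse, x)), "ms")
```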
Ring Attention should work with DeepSpeed Ulysses, correct? Are there any notable issues combining DeepSpeed's efficient sequence parallelism with such an attention mechanism? I do understand that flash attention works. https://github.com/zhuzilin/ring-flash-attention
Hi, Is there a file with the list of repositories (repos.txt) available for recreating the results in the Sourcegraph notebook?

> Once we have initialized our database, we...
Hi, I have been trying to make some progress on the backward kernel for training. Unfortunately, I am new to GPU programming and Triton, so I may be missing parts...
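To make sure I understand the wiring before tackling the real kernel, here is the pattern I am following: a minimal, self-contained sketch of a Triton kernel plugged into `torch.autograd.Function` as a backward pass, using a trivial ReLU gradient as a stand-in. None of the names here come from this repository.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def relu_bwd_kernel(grad_out_ptr, inp_ptr, grad_in_ptr, n_elements, BLOCK: tl.constexpr):
    # each program instance handles one BLOCK-sized slice of the flattened tensors
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    g = tl.load(grad_out_ptr + offs, mask = mask)
    x = tl.load(inp_ptr + offs, mask = mask)
    # dL/dx = dL/dy where x > 0, else 0
    tl.store(grad_in_ptr + offs, tl.where(x > 0, g, 0.0), mask = mask)

class TritonReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.relu(x)  # forward kept in plain torch for brevity

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        grad_in = torch.empty_like(x)
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        relu_bwd_kernel[grid](grad_out.contiguous(), x, grad_in, n, BLOCK = 1024)
        return grad_in

x = torch.randn(4096, device = "cuda", requires_grad = True)
TritonReLU.apply(x).sum().backward()
print(torch.allclose(x.grad, (x > 0).float()))  # sanity check against the analytic gradient
```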
Hi @taki0112, When running the MobileViT Python file I receive an error.

```python
v = MobileViT(
    image_size = (256, 256),
    dims = [96, 120, 144],
    channels = [16, 32, 48, 48, 64, 64, 80, ...
```
Hi Phil, I have been working with @tomaarsen of HF and @haileyschoelkopf of EAI on testing soft MoE. One issue that was occurring was that the tensors were not contiguous:

```
...
```
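For anyone hitting the same thing, a minimal sketch of the general symptom and the usual fix; the actual call site inside soft-moe differs, this just reproduces the error class in isolation.

```python
import torch

x = torch.randn(2, 8, 64)

# transposing returns a non-contiguous view of the same storage
y = x.transpose(1, 2)
print(y.is_contiguous())  # False

# .view() on a non-contiguous tensor raises a RuntimeError,
# e.g. "view size is not compatible with input tensor's size and stride"
# y.view(2, -1)  # would raise

# fix: materialize a contiguous copy before reshaping, or use
# .reshape(), which only copies when it has to
z = y.contiguous().view(2, -1)
z = y.reshape(2, -1)
```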