Nikita Balagansky
Hi, @marcusau! In this repository, I use the dataset for the masked language modelling task. The example in the `data` folder is just for fast testing in CI. **So, for now,...
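To make the expected input a bit more concrete, here is a generic illustration (not code from this repo) of what masked language modelling does to a plain-text example, using the standard Hugging Face collator; the model name is just a placeholder:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Generic MLM illustration: the dataset only needs raw text; labels are
# created on the fly by randomly masking ~15% of the tokens.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer("Distillation only needs a raw text corpus.")])
print(batch["input_ids"])  # some tokens are replaced with [MASK]
print(batch["labels"])     # -100 everywhere except the masked positions
```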
Well, could you tell me more about your task? What is the language of your dataset? What model are you going to use as a teacher? If you need your...
Okay, you can pass something like this: https://gist.github.com/elephantmipt/4287f5792a4c1e716d2f62db623646cf . Don't forget to specify the path to your dataset and the `text_field` in the config above. You can run it with `catalyst-dl run -C...
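As a side note, before launching `catalyst-dl run -C ...` it can be worth sanity-checking that the config values point to real data. A tiny hypothetical pre-flight check (the file name and column name below are placeholders; use whatever you put in the config):

```python
import pandas as pd

dataset_path = "data/my_corpus.csv"  # placeholder: the path you set in the config
text_field = "text"                  # placeholder: the `text_field` you set in the config

df = pd.read_csv(dataset_path)
assert text_field in df.columns, f"column {text_field!r} is missing in {dataset_path}"
print(df[text_field].head())
```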
Thank you for the detailed reply! I found that the smallest mamba-130m model uses 24 layers instead of 12, according to the [config](). Is this the case for the wikitext...
Thank you for clarifying! I found it confusing that the README file mentions a 12-layer model (https://github.com/state-spaces/mamba/blob/2ee7fd287a8f5c826af6f69ae3aad4682c4afd15/README.md?plain=1#L85), while on Hugging Face there is a 24-layer model.
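For anyone else hitting this, a quick way to see what the released checkpoint actually uses is to read its config straight from the Hub. A small sketch (the exact keys in `config.json` may differ, so treat this as an illustration):

```python
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download(repo_id="state-spaces/mamba-130m", filename="config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

print(cfg)  # the checkpoint's config reports 24 layers for the 130M model
```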
> The README mentions the double layer count right below the table, do you have a suggestion for a presentation that would be more clear?

I think 96ec4e4 solved all...
Hi, thank you for your interest! You can find pre-trained weights and simple generation code [here](https://github.com/elephantmipt/rebased_minimal). We are planning to merge it into this repo soon. Note that there is...
Hi, thank you for your interest. Sorry for the confusion: phi=x^2 is simplified notation. From my perspective, in the case of sim(q, k) = (q^Tk)^2 we have **phi(q) !=...
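To spell out the simplified notation a bit (my own illustration, not code from the repo): for sim(q, k) = (q^Tk)^2 the exact feature map is the flattened outer product phi(x) = vec(x x^T), not the elementwise square. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

phi = lambda x: np.outer(x, x).ravel()   # exact feature map for the quadratic kernel

sim_kernel  = (q @ k) ** 2               # sim(q, k) = (q^T k)^2
sim_feature = phi(q) @ phi(k)            # phi(q)^T phi(k)
sim_naive   = (q ** 2) @ (k ** 2)        # elementwise square (not the same thing)

print(np.isclose(sim_kernel, sim_feature))  # True
print(np.isclose(sim_kernel, sim_naive))    # False in general
```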
Hi, I've just finished training the small 124M model, and it seems that replacing `conv1d` with sliding window attention is orthogonal to Based/ReBased performance, as we achieve slightly better...
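For context on what sliding window attention means here, a minimal sketch of the attention mask (my illustration rather than the training code): each position attends only to the previous few tokens, which plays the same short-range mixing role as a depthwise conv1d.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to position j iff 0 <= i - j < window."""
    offset = torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]
    return (offset >= 0) & (offset < window)

print(sliding_window_mask(6, 3).int())  # banded, causal pattern
```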