MEGABYTE-pytorch
Implementation of MEGABYTE, Predicting Million-byte Sequences with Multiscale Transformers, in PyTorch
Hi there, the MEGABYTE paper uses bits-per-byte in Table 2 as its evaluation metric. This seems to differ from byte-level perplexity, since their number in arXiv and Code...
```python
self.to_kv = nn.Linear(dim, dim_head * 2, bias = False)
# expected
self.to_kv = nn.Linear(dim, inner_dim * 2, bias = False)
```
Is this a trick or a bug?
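One possible reading, offered as an assumption and not confirmed anywhere in this thread: projecting to `dim_head * 2` yields a single key/value head that every query head shares (the multi-query attention pattern). The sketch below uses hypothetical sizes and a hypothetical `to_q` layer, and only checks that the shapes work out under that reading; it is not the repository's code.

```python
import torch
from torch import nn

# Hypothetical sizes, chosen only for illustration.
batch, seq, dim, heads, dim_head = 2, 10, 64, 4, 16
inner_dim = heads * dim_head

x = torch.randn(batch, seq, dim)

to_q = nn.Linear(dim, inner_dim, bias=False)      # one query per head
to_kv = nn.Linear(dim, dim_head * 2, bias=False)  # a single shared key/value head

# Split queries into heads: (batch, heads, seq, dim_head)
q = to_q(x).view(batch, seq, heads, dim_head).transpose(1, 2)
# Keys and values have no head axis: each is (batch, seq, dim_head)
k, v = to_kv(x).chunk(2, dim=-1)

# Every query head attends over the same keys and values.
sim = torch.einsum('bhid,bjd->bhij', q, k) * dim_head ** -0.5
attn = sim.softmax(dim=-1)
out = torch.einsum('bhij,bjd->bhid', attn, v)

print(out.shape)
```

With `inner_dim * 2` instead, each head would get its own keys and values, at the cost of a larger projection. Which of the two the repository intends is for the maintainer to confirm.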
Hi there. I’ve run the training code in this repository for 25k out of the 100k batches and achieved a validation loss of around 1.28, or perplexity of 3.59. After...
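For reference, here is a minimal sketch of how a loss value relates to perplexity and to bits-per-byte. It assumes the reported loss is a mean cross-entropy in nats per byte (the usual convention for PyTorch's cross-entropy); the function names are hypothetical and not part of the repository.

```python
import math

def nats_to_perplexity(loss_nats: float) -> float:
    """Per-byte perplexity from a mean cross-entropy given in nats."""
    return math.exp(loss_nats)

def nats_to_bits_per_byte(loss_nats: float) -> float:
    """Convert nats per byte to bits per byte by dividing by ln(2)."""
    return loss_nats / math.log(2)

val_loss = 1.28  # validation loss quoted in the issue above
perplexity = nats_to_perplexity(val_loss)
bits_per_byte = nats_to_bits_per_byte(val_loss)

print(f"perplexity    = {perplexity:.4f}")
print(f"bits per byte = {bits_per_byte:.4f}")
```

Under that assumption the two metrics carry the same information: perplexity is `2 ** bits_per_byte`, so a bits-per-byte figure and a byte-level perplexity can be compared only after converting one into the other.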
Thank you so much for taking the time to share your code with me! I appreciate your generosity in helping me better understand the paper. I noticed that your code...
How do we translate the various model size parameters provided in the paper into the `max_seq_len` and `depth` tuple arguments when constructing the model?