MEGABYTE-pytorch issues

Evaluation metric bits-per-byte

1

Hi there, the Megabyte paper uses bits-per-byte as its evaluation metric in Table 2. It seems to differ from byte-level perplexity, since their number in arXiv and Code...
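For context on how the two metrics relate, here is a minimal sketch. It assumes the loss being converted is a mean cross-entropy per byte measured in nats; the issue excerpt does not say how the paper or this repository computes it, so treat that as an assumption. The function names are hypothetical and not part of the repository.

```python
import math


def nats_to_bits_per_byte(loss_nats: float) -> float:
    """Convert a mean per-byte cross-entropy from nats to bits.

    Assumption: loss_nats is averaged over bytes, not over tokens or patches.
    """
    return loss_nats / math.log(2)


def nats_to_perplexity(loss_nats: float) -> float:
    """Byte-level perplexity is the exponential of the per-byte loss in nats."""
    return math.exp(loss_nats)


def bits_per_byte_to_perplexity(bpb: float) -> float:
    """Equivalent route to perplexity when starting from bits-per-byte."""
    return 2.0 ** bpb
```

Under these assumptions the two metrics carry the same information: bits-per-byte is the base-2 logarithm of byte-level perplexity, so a gap between reported numbers would come from how the loss is averaged rather than from the metric itself.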

jxiw

Why does your Attention impl use kv dimension dim_head instead of inner_dim?

```python
self.to_kv = nn.Linear(dim, dim_head * 2, bias = False)

# expected
self.to_kv = nn.Linear(dim, inner_dim * 2, bias = False)
```

Is this a trick or a bug?
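One possible reading, which the issue text shown here does not confirm, is that keys and values are shared across all heads, as in multi-query attention. In that case a `dim_head * 2` output would be intentional. The sketch below only compares the weight counts of the two variants; the helper name is hypothetical and the example sizes are illustrative, not taken from the repository.

```python
def kv_projection_params(dim: int, dim_head: int, heads: int, shared_kv: bool) -> int:
    """Weight count of a bias-free linear layer that produces keys and values.

    shared_kv=True  -> one key/value head shared by every query head
                       (output size dim_head * 2, as in the quoted line).
    shared_kv=False -> separate key/value per head
                       (output size inner_dim * 2, as the reporter expected).
    """
    inner_dim = dim_head * heads
    out_features = dim_head * 2 if shared_kv else inner_dim * 2
    return dim * out_features


# Illustrative sizes only (assumption, not from the issue).
shared = kv_projection_params(dim=512, dim_head=64, heads=8, shared_kv=True)
per_head = kv_projection_params(dim=512, dim_head=64, heads=8, shared_kv=False)
print(shared, per_head, per_head // shared)
```

With a shared key/value head, the projection is smaller by a factor equal to the number of heads, which is the usual motivation for this design.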

Earthson

Training Results and Scaling

1

Hi there. I’ve run the training code in this repository for 25k out of the 100k batches and achieved a validation loss of around 1.28, or perplexity of 3.59. After...

MiscellaneousStuff

The patch embedder implementation is different from the original paper

4

Thank you so much for taking the time to share your code with me! I appreciate your generosity in helping me better understand the paper. I noticed that your code...

mikegreen7892003

translation of model sizes from paper to model definition

How do we translate the various model size parameters provided in the paper to the max_seq_len and depth tuple arguments when constructing the model?
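As a rough sketch of the arithmetic that question involves: the helper below assumes `max_seq_len` is a (number of patches, patch size) tuple and `depth` is a (global layers, local layers) tuple. That interpretation is an assumption, not something the issue confirms, and the helper name and example numbers are hypothetical rather than taken from the paper's tables.

```python
def megabyte_shape_args(
    context_bytes: int,
    patch_size: int,
    global_layers: int,
    local_layers: int,
) -> dict:
    """Build the two tuple arguments from paper-style size parameters.

    Assumption: the global model sees one position per patch and the local
    model sees one position per byte inside a patch.
    """
    if context_bytes % patch_size != 0:
        raise ValueError("context length must be a multiple of the patch size")
    return {
        "max_seq_len": (context_bytes // patch_size, patch_size),
        "depth": (global_layers, local_layers),
    }


# Hypothetical example values, for illustration only.
print(megabyte_shape_args(context_bytes=8192, patch_size=8, global_layers=12, local_layers=6))
```

The product of the two `max_seq_len` entries gives back the total context length in bytes, which is a quick check that a chosen configuration matches the intended sequence length.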

winglian
