Hi. I tried to give the context more information than just the average of the previous tokens. I added a table that computes the distance between neighbouring tokens and adds it to wei at the end, but I don't know if that makes sense. I also made a global train_mode variable: it is set to True during training and changed to False when generating.
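
For reference, nn.Module already carries a training flag that model.train() and model.eval() toggle, so a global may not be strictly needed. A minimal sketch of that pattern, using a hypothetical ToyHead class that is not part of the model below:

```python
import torch
import torch.nn as nn

class ToyHead(nn.Module):
    """Hypothetical stand-in for an attention head; it only shows the flag."""

    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        if self.training:
            # training-only branch (where a train_mode check would go)
            return x * self.scale + 1.0
        # inference branch
        return x * self.scale

head = ToyHead()
x = torch.zeros(2)

head.train()            # sets head.training = True (recursively on submodules)
train_out = head(x)

head.eval()             # sets head.training = False
eval_out = head(x)
```

The two calls go through different branches without any global state, and the flag propagates to every submodule of the model.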

These are the outputs after 3 x 5000 iterations. I ran three training rounds, and this is the result after the third one on Google Colab, because my laptop is 2-3 times slower than their machines. :)

step 4999: train loss 1.6128, val loss 1.7871

I'm a rookie in this field. Does it improve anything in the model at all?

For example, with these tokens the table holds each token minus the next one:

xs = torch.tensor([18., 47., 56., 57., 58.,  1., 15., 47.])

cm = torch.zeros((8))
cm[0] = xs[0]-xs[1]
cm[1] = xs[1]-xs[2]
cm[2] = xs[2]-xs[3]
cm[3] = xs[3]-xs[4]
cm[4] = xs[4]-xs[5]
cm[5] = xs[5]-xs[6]
cm[6] = xs[6]-xs[7]
cm # tensor([-29.,  -9.,  -1.,  -1.,  57., -14., -32.,   0.])
# the same thing without the manual indexing:
# cm[torch.arange(7)] = xs[torch.arange(0,7)] - xs[torch.arange(1,8)]
# tensor([-29.,  -9.,  -1.,  -1.,  57., -14., -32.,   0.])
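
The same table can also be built with plain slicing or with torch.diff. A small sketch on the example values above (the name cm_diff is only for illustration):

```python
import torch

xs = torch.tensor([18., 47., 56., 57., 58., 1., 15., 47.])

# each token minus the next one; the last slot has no neighbour and stays 0
cm = torch.zeros(8)
cm[:-1] = xs[:-1] - xs[1:]

# torch.diff computes xs[i+1] - xs[i], so the sign has to be flipped
cm_diff = torch.zeros(8)
cm_diff[:-1] = -torch.diff(xs)

print(cm)
print(torch.equal(cm, cm_diff))
```

Slicing avoids building index tensors with torch.arange and extends to batched input by slicing along the time dimension.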


Nor is defentain by his fouwl and speeceef:
What madam; light I come yours crother,
But acctey's sheath with is see;
Ask his tend, here such my brother menest as actiness, my nugleds, destired, as us as mind.
The flesse; and grace. or meheigness, apppecity-mean's young
contens youtful, patient, you are but,
His duke of bety that dead.
Lad, you lipp's sue, best feast man's break of the feel;
And here seet he is dost not minder hereignes?

Till behear, sit from teaths: I may were thee rude now.

Luce than he
voicess more conduce fiten:
Thy ham swoon. You from are what heart raim'n.
Low, War, as drengentaged, time fatrom what fight them griet remed justices.

I say, behow, so much down's gently as virtue,
Wlike deempade your penger, thy swoundship, war, face, what,
Wordespers a much'd of from
thee things then at mure pattal her,
As you set greaties, adoug-mue face I
Hom not bed mistrol!

First Servingman:
As was me noble, my answere reoten,
And moutster mockh'd nor agar
Do heaven, pomplarigeus again!
I robes, O, if sweet the hall hast
hatan's haste full sing then hasts. They capfite, noble eye
Ast can drawleign reessirtate, the reportation,
Whiced sea
Eving inteed to much liest as bear bands! I would, what coburpety than thou's merty thee hath;
Juliet's sepeak these love's hotnow,
I please the sain'st may see is ciest. I'll his husband smeen thee;
He last been it, that it then ure!

Show not thing's dryy:.
Inreal, when then I torchisoon the friends thet to can?

I, what Coast him think the ceque in prinhes,
Truth as plase,
That whit as where these that nappite the greation.

I'll womerch's like, noble,
As night my conscien'st; play speed the horn'd;
And know for your fucel'sim: as he do yes.
Now might now, by her kisnow be this lets; there are her;
Or or grast thou met's deed-indreat be his unbdound,
speak him do slaint, give lettain.

I wear the Mearage, and sheat, whipt the cucord.


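This is the modification:

if train_mode:
  # allocate on the same device as wei so this also runs on GPU
  cm = torch.zeros((block_size, block_size), device=wei.device)
  cm = cm - wei[:, :, torch.arange(0,cm.shape[0])]  # cm == -wei
  # rol is a rolled copy; it is not used after this block
  rol = cm[:, :, torch.arange(0,cm.shape[0])].roll(-1)
  rol[:,:,cm.shape[0]-1] = float("-inf")
  wei = wei + cm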
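
To see what this block does to wei, it can be run on its own. A minimal sketch, assuming random scores in place of the real q @ k^T and a small block_size of 8 (both chosen only for this check):

```python
import torch

block_size = 8                       # small value, only for this check
B, T = 2, block_size
torch.manual_seed(0)
wei = torch.randn(B, T, T)           # random scores standing in for q @ k^T

# the modification, copied as-is
cm = torch.zeros((block_size, block_size))
cm = cm - wei[:, :, torch.arange(0, cm.shape[0])]    # broadcasts to (B, T, T)
rol = cm[:, :, torch.arange(0, cm.shape[0])].roll(-1)
rol[:, :, cm.shape[0] - 1] = float("-inf")
new_wei = wei + cm

print(torch.equal(cm, -wei))         # is cm just the negated scores?
print(new_wei.abs().max().item())    # largest magnitude left in the scores
print(torch.isinf(cm).any().item())  # did the write to rol reach cm?
```

As far as I can tell, cm ends up equal to -wei, so wei + cm is zero everywhere and the softmax after the causal mask reduces to a uniform average over the previous tokens. rol is a separate copy (indexing with a tensor copies the data) and never feeds back into wei.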

Here's the code

import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------


# wget
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

train_mode = True

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()  # no gradients are needed for evaluation
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        # wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        # wei = F.softmax(wei, dim=-1) # (B, T, T)
        # wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        global train_mode
        # extra term added to the scores, training only
        if train_mode:
            # allocate on the same device as wei so this also runs on GPU
            cm = torch.zeros((block_size, block_size), device=wei.device)
            cm = cm - wei[:, :, torch.arange(0,cm.shape[0])]  # cm == -wei
            # roll (rol is a copy; it is not used after this block)
            rol = cm[:, :, torch.arange(0,cm.shape[0])].roll(-1)
            # change last column
            rol[:,:,cm.shape[0]-1] = float("-inf")
            # and move to 0 by exp
            wei = wei + cm
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        global train_mode
        train_mode = False
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            #print(logits, loss)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

Training loop

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss and update the weights
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


train_mode = False
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

