
UnboundLocalError: cannot access local variable 'batch_idx' where it is not associated with a value

Open tm17-abcgen opened this issue 1 year ago • 7 comments

When training, I sometimes get the error in the title. Here is the full error:

#> Starting...
nranks = 1       num_gpus = 1    device=0
{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "load_index_with_mmap": false,
    "index_path": null,
    "nbits": 2,
    "kmeans_niters": 4,
    "resume": false,
    "similarity": "cosine",
    "bsize": 32,
    "accumsteps": 1,
    "lr": 5e-6,
    "maxsteps": 500000,
    "save_every": 0,
    "warmup": 0,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": true,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "model_name": "HBOColbert",
    "query_maxlen": 32,
    "attend_to_mask_tokens": false,
    "interaction": "colbert",
    "dim": 128,
    "doc_maxlen": 256,
    "mask_punctuation": true,
    "checkpoint": "bert-base-german-cased",
    "triples": "german\/train_data_0\/triples.train.colbert.jsonl",
    "collection": "german\/train_data_0\/corpus.train.colbert.tsv",
    "queries": "german\/train_data_0\/queries.train.colbert.tsv",
    "index_name": null,
    "overwrite": false,
    "root": ".ragatouille\/",
    "experiment": "colbert",
    "index_root": null,
    "name": "2024-01\/07\/23.03.43",
    "rank": 0,
    "nranks": 1,
    "amp": true,
    "gpus": 1
}
Using config.bsize = 32 (per process) and config.accumsteps = 1
[Jan 07, 23:04:30] #> Loading the queries from german/train_data_0/queries.train.colbert.tsv ...
[Jan 07, 23:04:30] #> Got 80 queries. All QIDs are unique.

[Jan 07, 23:04:30] #> Loading collection...
0M 
Some weights of HF_ColBERT were not initialized from the model checkpoint at bert-base-german-cased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
#> LR will use 0 warmup steps and linear decay over 500000 steps.
[Jan 07, 23:04:32] #> Done with all triples!
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/infra/launcher.py", line 115, in setup_new_process
    return_val = callee(config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/training/training.py", line 146, in train
    ckpt_path = manage_checkpoints(config, colbert, optimizer, batch_idx+1, savepath=None, consumed_all_triples=True)
                                                               ^^^^^^^^^
UnboundLocalError: cannot access local variable 'batch_idx' where it is not associated with a value

This happens sometimes when I am using the trainer.train function:

trainer.train(batch_size=32,
            nbits=2, # How many bits will the trained model use when compressing indexes
            maxsteps=500000, # Maximum steps hard stop
            use_ib_negatives=True, # Use in-batch negatives to calculate loss
            dim=128, # How many dimensions per embedding. 128 is the default and works well.
            learning_rate=5e-6, # Learning rate. Small values in [3e-6, 3e-5] work best if the base model is BERT-like; 5e-6 is often the sweet spot
            doc_maxlen=256, # Maximum document length. Because of how ColBERT works, smaller chunks (128-256) work very well.
            use_relu=False, # Disable ReLU -- doesn't improve performance
            warmup_steps="auto", # Defaults to 10%
        )

This is part of the function where the issue lies (/colbert/training/training.py):

    for batch_idx, BatchSteps in zip(range(start_batch_idx, config.maxsteps), reader):
        if (warmup_bert is not None) and warmup_bert <= batch_idx:
            set_bert_grad(colbert, True)
            warmup_bert = None

        this_batch_loss = 0.0

        for batch in BatchSteps:
            with amp.context():
                try:
                    queries, passages, target_scores = batch
                    encoding = [queries, passages]
                except:
                    encoding, target_scores = batch
                    encoding = [encoding.to(DEVICE)]

                scores = colbert(*encoding)

                if config.use_ib_negatives:
                    scores, ib_loss = scores

                scores = scores.view(-1, config.nway)

                if len(target_scores) and not config.ignore_scores:
                    target_scores = torch.tensor(target_scores).view(-1, config.nway).to(DEVICE)
                    target_scores = target_scores * config.distillation_alpha
                    target_scores = torch.nn.functional.log_softmax(target_scores, dim=-1)

                    log_scores = torch.nn.functional.log_softmax(scores, dim=-1)
                    loss = torch.nn.KLDivLoss(reduction='batchmean', log_target=True)(log_scores, target_scores)
                else:
                    loss = nn.CrossEntropyLoss()(scores, labels[:scores.size(0)])

                if config.use_ib_negatives:
                    if config.rank < 1:
                        print('\t\t\t\t', loss.item(), ib_loss.item())

                    loss += ib_loss

                loss = loss / config.accumsteps

            if config.rank < 1:
                print_progress(scores)

            amp.backward(loss)

            this_batch_loss += loss.item()

        train_loss = this_batch_loss if train_loss is None else train_loss
        train_loss = train_loss_mu * train_loss + (1 - train_loss_mu) * this_batch_loss

        amp.step(colbert, optimizer, scheduler)

        if config.rank < 1:
            print_message(batch_idx, train_loss)
            manage_checkpoints(config, colbert, optimizer, batch_idx+1, savepath=None)

    if config.rank < 1:
        print_message("#> Done with all triples!")
        ckpt_path = manage_checkpoints(config, colbert, optimizer, batch_idx+1, savepath=None, consumed_all_triples=True)

        return ckpt_path  # TODO: This should validate and return the best checkpoint, not just the last one.

Thanks in advance for any ideas on how to fix this issue :)

tm17-abcgen avatar Jan 07 '24 22:01 tm17-abcgen

Solved it by just initializing batch_idx in the ColBERT library (in /colbert/training/training.py). Not sure if it's the right fix, but I'm not getting the error anymore with it: @okhat

start_batch_idx = 0
batch_idx = start_batch_idx  # ensure batch_idx is bound even if the training loop below never runs

# if config.resume:
#     assert config.checkpoint is not None
#     start_batch_idx = checkpoint['batch']

#     reader.skip_to_batch(start_batch_idx, checkpoint['arguments']['bsize'])

for batch_idx, BatchSteps in zip(range(start_batch_idx, config.maxsteps), reader):

tm17-abcgen avatar Jan 07 '24 22:01 tm17-abcgen

Since your query count is quite low, could this be a problem similar to the one described in https://github.com/stanford-futuredata/ColBERT/issues/118? Does it still occur if you use a lower batch size (without modifying the code in the upstream ColBERT)?
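
For reference, retesting with a smaller batch size only means changing the batch_size argument in the trainer.train call quoted above, with everything else unchanged (same trainer object and data as before):

trainer.train(batch_size=8, # lowered from 32 just to test whether the error still occurs
            nbits=2,
            maxsteps=500000,
            use_ib_negatives=True,
            dim=128,
            learning_rate=5e-6,
            doc_maxlen=256,
            use_relu=False,
            warmup_steps="auto",
        )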

bclavie avatar Jan 07 '24 22:01 bclavie

Setting the batch size to 4 or 8 also showed the same problem. What is weird is that the problem sometimes occurs and sometimes doesn't, unless batch_idx is manually initialized before the loop.

tm17-abcgen avatar Jan 07 '24 23:01 tm17-abcgen

This seems to be a bug in ColBERT.

If, for some reason, the for batch_idx, BatchSteps in zip(range(start_batch_idx, config.maxsteps), reader): loop is never executed, then batch_idx is never defined. That is what causes the UnboundLocalError: cannot access local variable 'batch_idx' where it is not associated with a value.
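
As a minimal standalone illustration of the failure mode (plain Python, no ColBERT involved): the loop variable is only bound if the loop body runs at least once.

def train(reader):
    # If reader yields nothing, the loop body never runs and batch_idx is never bound.
    for batch_idx, batch in zip(range(500000), reader):
        pass

    # Accessing batch_idx here then raises:
    # UnboundLocalError: cannot access local variable 'batch_idx' where it is not associated with a value
    return batch_idx + 1

train(reader=[])  # an empty reader reproduces the error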

filippo82 avatar Jan 08 '24 00:01 filippo82

Yeah, this often occurs if there are no valid triplets to train on.
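
A quick way to rule that out is to count the triples on disk before training. The snippet below is just a sketch (not part of RAGatouille or ColBERT); the path and batch size mirror the config printed above, and the "one full batch" threshold is an assumption about how the batcher handles very small datasets.

# Hypothetical pre-flight check, not a RAGatouille/ColBERT API.
triples_path = "german/train_data_0/triples.train.colbert.jsonl"
batch_size = 32

with open(triples_path) as f:
    n_triples = sum(1 for line in f if line.strip())

print(f"{n_triples} triples found in {triples_path}")
if n_triples < batch_size:
    print("Warning: fewer triples than one full batch -- the training loop may never run a single step.")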

JoshuaPurtell avatar Feb 27 '24 15:02 JoshuaPurtell

Here's a PR to ColBERT that I believe should alleviate this issue in many cases.

https://github.com/stanford-futuredata/ColBERT/pull/312

JoshuaPurtell avatar Feb 27 '24 16:02 JoshuaPurtell

Shouldn't this be easy to catch earlier up the call stack and emit a warning or error?
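
One possible shape for such a guard, run before the training subprocess is launched. This is only a sketch: the helper name is made up, it is not actual RAGatouille or ColBERT code, and the threshold is an assumption about when the batcher produces zero batches.

# Hypothetical upstream check; raises a clear error instead of letting the
# training loop die later with an UnboundLocalError.
def check_training_data(n_triples: int, batch_size: int, nranks: int = 1) -> None:
    min_required = batch_size * nranks  # assumed minimum for at least one training step
    if n_triples < min_required:
        raise ValueError(
            f"Got {n_triples} training triples, but batch_size * nranks = {min_required}; "
            "the training loop would likely never execute and no checkpoint would be saved."
        )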

gituser768 avatar Apr 08 '24 18:04 gituser768