RAGatouille
UnboundLocalError: cannot access local variable 'batch_idx' where it is not associated with a value
When training, I sometimes get the error in the title. Here is the full error:
#> Starting...
nranks = 1 num_gpus = 1 device=0
{
"query_token_id": "[unused0]",
"doc_token_id": "[unused1]",
"query_token": "[Q]",
"doc_token": "[D]",
"ncells": null,
"centroid_score_threshold": null,
"ndocs": null,
"load_index_with_mmap": false,
"index_path": null,
"nbits": 2,
"kmeans_niters": 4,
"resume": false,
"similarity": "cosine",
"bsize": 32,
"accumsteps": 1,
"lr": 5e-6,
"maxsteps": 500000,
"save_every": 0,
"warmup": 0,
"warmup_bert": null,
"relu": false,
"nway": 2,
"use_ib_negatives": true,
"reranker": false,
"distillation_alpha": 1.0,
"ignore_scores": false,
"model_name": "HBOColbert",
"query_maxlen": 32,
"attend_to_mask_tokens": false,
"interaction": "colbert",
"dim": 128,
"doc_maxlen": 256,
"mask_punctuation": true,
"checkpoint": "bert-base-german-cased",
"triples": "german\/train_data_0\/triples.train.colbert.jsonl",
"collection": "german\/train_data_0\/corpus.train.colbert.tsv",
"queries": "german\/train_data_0\/queries.train.colbert.tsv",
"index_name": null,
"overwrite": false,
"root": ".ragatouille\/",
"experiment": "colbert",
"index_root": null,
"name": "2024-01\/07\/23.03.43",
"rank": 0,
"nranks": 1,
"amp": true,
"gpus": 1
}
Using config.bsize = 32 (per process) and config.accumsteps = 1
[Jan 07, 23:04:30] #> Loading the queries from german/train_data_0/queries.train.colbert.tsv ...
[Jan 07, 23:04:30] #> Got 80 queries. All QIDs are unique.
[Jan 07, 23:04:30] #> Loading collection...
0M
Some weights of HF_ColBERT were not initialized from the model checkpoint at bert-base-german-cased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
#> LR will use 0 warmup steps and linear decay over 500000 steps.
[Jan 07, 23:04:32] #> Done with all triples!
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/infra/launcher.py", line 115, in setup_new_process
return_val = callee(config, *args)
^^^^^^^^^^^^^^^^^^^^^
File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/training/training.py", line 146, in train
ckpt_path = manage_checkpoints(config, colbert, optimizer, batch_idx+1, savepath=None, consumed_all_triples=True)
^^^^^^^^^
UnboundLocalError: cannot access local variable 'batch_idx' where it is not associated with a value
This happens sometimes when I am using the trainer.train function:
trainer.train(batch_size=32,
              nbits=2,                 # How many bits the trained model will use when compressing indexes
              maxsteps=500000,         # Maximum steps hard stop
              use_ib_negatives=True,   # Use in-batch negatives to calculate loss
              dim=128,                 # How many dimensions per embedding. 128 is the default and works well.
              learning_rate=5e-6,      # Learning rate. Small values (3e-6 to 3e-5) work best if the base model is BERT-like; 5e-6 is often the sweet spot.
              doc_maxlen=256,          # Maximum document length. Because of how ColBERT works, smaller chunks (128-256) work very well.
              use_relu=False,          # Disable ReLU -- doesn't improve performance
              warmup_steps="auto",     # Defaults to 10% of the total steps
              )
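For context, a quick data sanity check (a hypothetical snippet, not part of RAGatouille's API) is to count the examples in the triples file and compare them with the batch size, since a tiny or empty file would mean few or no training batches:

# Hypothetical pre-flight check (not part of RAGatouille): count the training
# examples in the triples file and compare against the configured batch size.
triples_path = "german/train_data_0/triples.train.colbert.jsonl"  # path from the config above
batch_size = 32

with open(triples_path) as f:
    num_triples = sum(1 for line in f if line.strip())

print(f"{num_triples} triples in {triples_path}")
if num_triples < batch_size:
    print("Warning: fewer triples than one batch -- training may not produce any batches.")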
This is part of the function where the issue lies (/colbert/training/training.py):
for batch_idx, BatchSteps in zip(range(start_batch_idx, config.maxsteps), reader):
    if (warmup_bert is not None) and warmup_bert <= batch_idx:
        set_bert_grad(colbert, True)
        warmup_bert = None

    this_batch_loss = 0.0

    for batch in BatchSteps:
        with amp.context():
            try:
                queries, passages, target_scores = batch
                encoding = [queries, passages]
            except:
                encoding, target_scores = batch
                encoding = [encoding.to(DEVICE)]

            scores = colbert(*encoding)

            if config.use_ib_negatives:
                scores, ib_loss = scores

            scores = scores.view(-1, config.nway)

            if len(target_scores) and not config.ignore_scores:
                target_scores = torch.tensor(target_scores).view(-1, config.nway).to(DEVICE)
                target_scores = target_scores * config.distillation_alpha
                target_scores = torch.nn.functional.log_softmax(target_scores, dim=-1)

                log_scores = torch.nn.functional.log_softmax(scores, dim=-1)
                loss = torch.nn.KLDivLoss(reduction='batchmean', log_target=True)(log_scores, target_scores)
            else:
                loss = nn.CrossEntropyLoss()(scores, labels[:scores.size(0)])

            if config.use_ib_negatives:
                if config.rank < 1:
                    print('\t\t\t\t', loss.item(), ib_loss.item())

                loss += ib_loss

            loss = loss / config.accumsteps

        if config.rank < 1:
            print_progress(scores)

        amp.backward(loss)

        this_batch_loss += loss.item()

    train_loss = this_batch_loss if train_loss is None else train_loss
    train_loss = train_loss_mu * train_loss + (1 - train_loss_mu) * this_batch_loss

    amp.step(colbert, optimizer, scheduler)

    if config.rank < 1:
        print_message(batch_idx, train_loss)

        manage_checkpoints(config, colbert, optimizer, batch_idx+1, savepath=None)

if config.rank < 1:
    print_message("#> Done with all triples!")

    ckpt_path = manage_checkpoints(config, colbert, optimizer, batch_idx+1, savepath=None, consumed_all_triples=True)

    return ckpt_path  # TODO: This should validate and return the best checkpoint, not just the last one.
Thanks in advance for any ideas on how to fix this issue :)
Solved it by just initializing batch_idx before the loop in the ColBERT library; not sure if it's the right fix, but I'm not getting the error anymore with that (in /colbert/training/training.py): @okhat
start_batch_idx = 0
batch_idx = start_batch_idx

# if config.resume:
#     assert config.checkpoint is not None
#     start_batch_idx = checkpoint['batch']
#     reader.skip_to_batch(start_batch_idx, checkpoint['arguments']['bsize'])

for batch_idx, BatchSteps in zip(range(start_batch_idx, config.maxsteps), reader):
Since your query count is quite low, this could be a problem similar to the one here: https://github.com/stanford-futuredata/ColBERT/issues/118. Does it still occur if you use a lower batch size (without modifying the code in upstream ColBERT)?
Setting the batch size to 4 or 8 also showed the same problem. What is weird is that the problem sometimes occurs and sometimes doesn't, until batch_idx is manually initialized before the loop.
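For context, some rough arithmetic (assuming roughly one training example per query and that only full batches count; both are assumptions, not confirmed ColBERT behaviour) suggests that even small batch sizes should yield several batches from 80 examples, which points more toward no valid triples surviving than toward the batch size itself:

# Rough arithmetic, assuming ~80 training examples (the log reports 80 queries)
# and that only full batches count -- both assumptions, not confirmed behaviour.
num_examples = 80
for bsize in (32, 8, 4):
    print(f"bsize={bsize}: at most {num_examples // bsize} full batches")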
This seems to be a bug in ColBERT. If for some reason the for batch_idx, BatchSteps in zip(range(start_batch_idx, config.maxsteps), reader): loop is never executed, then batch_idx is never defined, which causes the UnboundLocalError: cannot access local variable 'batch_idx' where it is not associated with a value.
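The pattern is easy to reproduce in isolation; here is a minimal sketch, independent of ColBERT, showing that a for loop over an empty iterable never binds its loop variable:

# Minimal reproduction of the same pattern: the loop variable is only bound
# if the loop body runs at least once.
def run(reader):
    for batch_idx, batch in enumerate(reader):
        pass  # training step would go here
    return batch_idx + 1  # fails if the loop never ran

print(run([("query", "pos", "neg")]))  # 1

try:
    run([])  # an empty reader, e.g. no valid triples
except UnboundLocalError as err:
    print(err)  # cannot access local variable 'batch_idx' where it is not associated with a value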
Yeah, this often occurs if there are no valid triplets to train on.
Here's a PR to ColBERT that I believe should alleviate this issue in many cases: https://github.com/stanford-futuredata/ColBERT/pull/312
Shouldn't this be easy to catch earlier up the call stack and emit a warning or error?
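For illustration, a guard along these lines could do that (a sketch only, not what the linked PR actually does): fail fast with a descriptive error when the reader yields nothing, instead of letting the UnboundLocalError surface later.

# Sketch of an early, explicit check (not the actual upstream fix): raise a
# clear error when the data loader produces no batches at all.
def train_loop(reader, maxsteps):
    batch_idx = -1
    for batch_idx, batch in zip(range(maxsteps), reader):
        pass  # existing training step would go here

    if batch_idx < 0:
        raise ValueError(
            "No training batches were produced -- check that the triples file "
            "is non-empty and has enough valid examples for the configured bsize."
        )
    return batch_idx + 1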