transformers How to correct TypeError: zip argument #1 must support iteration training in multiple GPU

I am doing a creating custom pytorch layer and model training using Trainer API function on top of Hugging face model.

When I run on single GPU, it trains fine. But when I train it on multiple GPU it throws me error.

TypeError: zip argument #1 must support iteration training in multiple GPU

Training Code

bert_model = BertForTokenClassification.from_pretrained( model_checkpoint,id2label=id2label,label2id=label2id)
bert_model.config.output_hidden_states=True


class BERT_CUSTOM(nn.Module):
    
    
    def __init__(self, bert_model,id2label,num_labels):
        
        
        
        super(BERT_CUSTOM, self).__init__()
        self.bert = bert_model
        self.config=self.bert.config
        self.dropout = nn.Dropout(0.25)
        self.classifier = nn.Linear(768, num_labels)
        self.crf = CRF(num_labels, batch_first = True)
        
    
    def forward(self, input_ids, attention_mask,  labels=None, token_type_ids=None):
        
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        sequence_output = torch.stack((outputs[1][-1], outputs[1][-2], outputs[1][-3], outputs[1][-4])).mean(dim=0)
        sequence_output = self.dropout(sequence_output)
        emission = self.classifier(sequence_output) # [32,256,21] logits
        
        if labels is not None:
            
            labels=labels.reshape(attention_mask.size()[0],attention_mask.size()[1])
            loss = -self.crf(log_soft(emission, 2), labels, mask=attention_mask.type(torch.uint8), reduction='mean')
            prediction = self.crf.decode(emission, mask=attention_mask.type(torch.uint8))
            return [loss, prediction]
                
        else:
            
            prediction = self.crf.decode(emission, mask=attention_mask.type(torch.uint8))
            prediction=[id2label[k] for k in prediction]
            return prediction

Training API

model = BERT_CUSTOM(bert_model, id2label,num_labels=len(label2id))
model.to(device)

args = TrainingArguments(
    "model",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=2,
    weight_decay=0.01,
    per_device_train_batch_size=32,
    fp16=True
    
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    tokenizer=tokenizer)

trainer.train()

Feb 12 '23 13:02 pratikchhapolika

Hi @pratikchhapolika, without a full traceback of the errors, we can't tell if this is a transfomers issue or a issue in your custom modeling code.

Feb 13 '23 15:02 ydshieh

Hi @pratikchhapolika, without a full traceback of the errors, we can't tell if this is a transfomers issue or a issue in your custom modeling code.

This is the only error I get @ydshieh

TypeError: zip argument #1 must support iteration training in multiple GPU

Feb 15 '23 06:02 pratikchhapolika

Hi @pratikchhapolika

What @ydshieh is requesting is the full output from your terminal printed out between the line of execution and the error e.g. something like this:

(ml) amyroberts:transformers $ python ../scripts/dummy_traceback.py 
Traceback (most recent call last):
  File "/Users/amyroberts/code/transformers/../scripts/dummy_traceback.py", line 12, in <module>
    foo()
  File "/Users/amyroberts/code/transformers/../scripts/dummy_traceback.py", line 7, in foo
    c = a / b
ZeroDivisionError: division by zero

This gives the full context of what lines of code have been called and exactly the line where the error is being triggered. Without this it's very difficult for us to be able to help debug this issue.

Mar 14 '23 20:03 amyeroberts

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Apr 08 '23 15:04 github-actions[bot]