transformers
model fine-tuning error
System Info
Hi all, I am fine-tuning a model on my dataset.
I need some help with running inference. I trained the model with a tokenizer without truncation, but I get the first error below at inference time. So I retrained the model with truncation enabled, as shown in the code, but then I hit a new error during training that only appeared after adding truncation to the tokenizer.
Now, if I try to train the network without truncation in the tokenizer, training does not work, and I need help understanding what is happening.
Who can help?
@ArthurZucker @sgugger
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Inference error:
RuntimeError: The size of tensor a (726) must match the size of tensor b (512) at non-singleton dimension 1
The function for classification is:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# find_device() and text_splitting() are my own helper functions (not shown here)

def classification_infer(data, model_path):
    # device
    device = find_device()
    # data preprocessing
    data[['notuseful', 'usefull']] = data['Descrizione'].apply(text_splitting)
    data = data.loc[~data['Classe'].isin([0, 1, 2])]
    # model loading
    model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=3)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    print("Model loaded")
    print("Classifying...")
    classifier = pipeline('text-classification', model=model, tokenizer=tokenizer, device=device)
    tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512, 'return_tensors': 'pt'}
    classifier_output = classifier(data['usefull'].tolist())
    print("Classification completed")
    data['Classe'] = [int(x['label']) for x in classifier_output]
    return data
```
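Note that in the snippet above, `tokenizer_kwargs` is built but never passed to the `classifier(...)` call, so inputs longer than 512 tokens are not truncated. A minimal sketch of how the truncation settings could be forwarded per call (assuming a transformers version whose text-classification pipeline accepts tokenizer keyword arguments; `return_tensors` is left out because the pipeline sets it itself):

```python
# Hedged sketch, not the original code: forward truncation settings to the pipeline call
# so long inputs are cut to the model's 512-token limit.
classifier_output = classifier(
    data['usefull'].tolist(),
    padding=True,
    truncation=True,
    max_length=512,
)
```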
Training function, with truncation active:
```python
import evaluate
import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# find_device() and text_splitting() are my own helper functions (not shown here)

def classification_train(data):
    # metric
    metric = evaluate.load('accuracy')

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    # Load tokenizer
    tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512, 'return_tensors': 'pt'}
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased", tokenizer_kwargs=tokenizer_kwargs)

    # preprocessing function
    def preprocess_function(data):
        return tokenizer(data['text'])

    # device
    device = find_device()
    # data preprocessing
    data[['notuseful', 'usefull']] = data['Descrizione'].apply(text_splitting)
    # dataset creation
    dataset = pd.DataFrame()
    dataset[['text', 'label']] = data.loc[data['Classe'].isin([0, 1, 2]), ['usefull', 'Classe']]
    dataset['label'] = dataset['label'].astype(int)
    # dataset balancing (at most 100 rows per class)
    dataset = dataset.groupby('label').head(100)
    dataset['text'] = dataset['text'].map(lambda x: x.lower())
    # dataset split
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)
    # Hugging Face datasets
    train = Dataset.from_pandas(train)
    train = train.map(preprocess_function, batched=True)
    test = Dataset.from_pandas(test)
    test = test.map(preprocess_function, batched=True)
    # Load model
    model = AutoModelForSequenceClassification.from_pretrained("dbmdz/bert-base-italian-cased", num_labels=3)
    model.to(device)
    # data collator
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    # training arguments
    training_args = TrainingArguments(
        output_dir='./model_weight',      # output directory
        num_train_epochs=2,               # total number of training epochs
        per_device_train_batch_size=16,   # batch size per device during training
        per_device_eval_batch_size=64,    # batch size for evaluation
        weight_decay=0.01,                # strength of weight decay
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model='eval_accuracy',
        greater_is_better=True,
        report_to="wandb",                # enable logging to W&B
        run_name="bert-base-italian-cased-fit-for-crm",  # name of the W&B run (optional)
        label_names=['0', '1', '2']
    )
    # trainer
    trainer = Trainer(
        model=model,                      # the instantiated 🤗 Transformers model to be trained
        args=training_args,               # training arguments, defined above
        train_dataset=train,              # training dataset
        eval_dataset=test,                # evaluation dataset
        data_collator=data_collator,      # data collator
        tokenizer=tokenizer,              # tokenizer
        compute_metrics=compute_metrics,  # the callback that computes metrics of interest
    )
    # train
    trainer.train()
```
Error during the evaluation step of the training function with truncation active:
IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed
Training function without truncation:
```python
# imports and helper functions are the same as in the previous snippet

def classification_train(data):
    # metric
    metric = evaluate.load('accuracy')

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")

    # preprocessing function
    def preprocess_function(data):
        return tokenizer(data['text'], truncation=False)

    # device
    device = find_device()
    # data preprocessing
    data[['notuseful', 'usefull']] = data['Descrizione'].apply(text_splitting)
    # dataset creation
    dataset = pd.DataFrame()
    dataset[['text', 'label']] = data.loc[data['Classe'].isin([0, 1, 2]), ['usefull', 'Classe']]
    dataset['label'] = dataset['label'].astype(int)
    # dataset balancing (at most 100 rows per class)
    dataset = dataset.groupby('label').head(100)
    dataset['text'] = dataset['text'].map(lambda x: x.lower())
    # dataset split
    train, test = train_test_split(dataset, test_size=0.2, random_state=42)
    # Hugging Face datasets
    train = Dataset.from_pandas(train)
    train = train.map(preprocess_function, batched=True)
    test = Dataset.from_pandas(test)
    test = test.map(preprocess_function, batched=True)
    # Load model
    model = AutoModelForSequenceClassification.from_pretrained("dbmdz/bert-base-italian-cased", num_labels=3)
    model.to(device)
    # data collator
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    # training arguments
    training_args = TrainingArguments(
        output_dir='./model_weight',      # output directory
        num_train_epochs=2,               # total number of training epochs
        per_device_train_batch_size=16,   # batch size per device during training
        per_device_eval_batch_size=64,    # batch size for evaluation
        weight_decay=0.01,                # strength of weight decay
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model='eval_accuracy',
        greater_is_better=True,
        report_to="wandb",                # enable logging to W&B
        run_name="bert-base-italian-cased-fit-for-crm",  # name of the W&B run (optional)
        label_names=['0', '1', '2']
    )
    # trainer
    trainer = Trainer(
        model=model,                      # the instantiated 🤗 Transformers model to be trained
        args=training_args,               # training arguments, defined above
        train_dataset=train,              # training dataset
        eval_dataset=test,                # evaluation dataset
        data_collator=data_collator,      # data collator
        tokenizer=tokenizer,              # tokenizer
        compute_metrics=compute_metrics,  # the callback that computes metrics of interest
    )
    # train
    trainer.train()
```
Expected behavior
I expect the training function to train the network with the tokenizer arguments, and I expect inference to work when I try to classify a text.
I have found the solution. The problem was the batch size for the evaluation, which was greater than the number of rows in the evaluation dataset. I have now set it to 16 and it works.
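For reference, a minimal sketch of the change described above (assuming `test` is the evaluation Dataset built in the training function; the other arguments are unchanged from the script):

```python
# Hedged sketch of the described fix: keep the eval batch size no larger than the eval split.
from transformers import TrainingArguments

eval_batch_size = min(16, len(test))  # `test` is the evaluation Dataset from the script above

training_args = TrainingArguments(
    output_dir='./model_weight',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=eval_batch_size,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)
```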
But now I have a problem with inference: I trained the model with the tokenizer's max_length set to 1024 (and the model configured the same way), but when I use the model weights for inference I keep receiving the tensor-size error:
RuntimeError: The size of tensor a (726) must match the size of tensor b (512) at non-singleton dimension 1
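A likely cause of the mismatch is that BERT-base checkpoints have `max_position_embeddings = 512`, and setting the tokenizer's `max_length` to 1024 does not enlarge the model's position embeddings. A small sketch for reading the checkpoint's limit and truncating to it (`model_path` and `some_text` are placeholders):

```python
# Hedged sketch: truncate to the position-embedding limit stored in the checkpoint's config,
# regardless of the max_length the tokenizer was saved with.
from transformers import AutoConfig, AutoTokenizer

model_path = "./model_weight"   # placeholder: path to the fine-tuned checkpoint
some_text = "example input"     # placeholder text

config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

max_len = config.max_position_embeddings  # 512 for BERT-base checkpoints
encoded = tokenizer(some_text, truncation=True, max_length=max_len, return_tensors='pt')
```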
Hey! Could you give more details on the exact trace you are getting? I have no idea where it comes from, so I can't really help; it could be a problem with loading the checkpoints or anything. Also, can you share a simple inference reproducing script? Thanks!
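For reference, a minimal inference reproduction might look like the sketch below (the checkpoint path and input text are placeholders, not from the original thread):

```python
# Minimal inference repro sketch with placeholder path and text.
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_path = "./model_weight"                               # placeholder: fine-tuned checkpoint
text = "a long input text that exceeds 512 tokens " * 200   # placeholder long input

model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=3)
tokenizer = AutoTokenizer.from_pretrained(model_path)
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

print(classifier(text))
```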
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.