setfit
ValueError: A column mapping must be provided when the dataset does not contain the following columns: {'text', 'label'}
I am getting this ValueError in trainer.evaluate() even though my dataset contains columns named 'text' and 'label'.
My dataset looks like this: Dataset({ features: ['text', 'reason', 'label'], num_rows: 2062 })
@LuketheDukeBates can you please help?
Hello @Cheril184,
Map the features using the column_mapping argument, which maps your column names onto the ones the trainer expects.
Please also share the full error for better understanding.
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    column_mapping={"text": "text", "label": "label"},
)
If the datasets contain the text and label columns, then a column mapping should not be needed.
I'm unable to reproduce your issue. I'm using the following dummy script in my attempt:
from setfit import SetFitModel, SetFitTrainer
from datasets import Dataset
ds = Dataset.from_dict({"text": ["a", "b", "c"], "reason": ["d", "e", "f"], "label": [0, 1, 2]})
print(ds)
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=ds,
    eval_dataset=ds,
    num_iterations=20,
)
trainer.train()
print(trainer.evaluate())
This outputs
Dataset({
features: ['text', 'reason', 'label'],
num_rows: 3
})
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
***** Running training *****
Num examples = 120
Num epochs = 1
Total optimization steps = 8
Total train batch size = 16
Iteration: 100%|██████████| 8/8 [00:04<00:00, 1.84it/s]
Epoch: 100%|██████████| 1/1 [00:04<00:00, 4.35s/it]
***** Running evaluation *****
{'accuracy': 1.0}
Which is to be expected.
Could you provide a way for us to reproduce your issue, i.e. via your own script or by modifying mine?
- Tom Aarsen
@tomaarsen I have a CSV file which I am loading with the HF load_dataset function. I can share my CSV files if you want.
That might help. Furthermore, I am interested in how you initialize the SetFitTrainer. Perhaps the issue lies there somewhere?
FWIW, one reason this might happen is that the default CSV loader script in datasets
(in version 2.8.0, at the time of writing) automatically assigns all the points to the training split if you don't specify one (unlike e.g. Tensorflow Datasets, where the default is no train-test split in the dataset):
import datasets
with open('example.csv', 'w') as example:
    example.writelines(["label,text\n", "1,good\n", "0,terrible\n"])
features = datasets.Features(
    {
        'label': datasets.ClassLabel(num_classes=2, names=['positive', 'negative']),
        'text': datasets.Value('string')
    })
dataset = datasets.load_dataset("csv", data_files="./example.csv", features=features)
print(dataset)
# Prints
# DatasetDict({
# train: Dataset({
# features: ['label', 'text'],
# num_rows: 2
# })
# })
If this is the case, just pass the training split dataset['train'] to the trainer!
EDIT: The original example here did not pass features to load_dataset, but the behavior of automatically assigning the rows to the training split remains the same.
@tomaarsen I'm also getting this issue - I attempted to fix it by doing 'column_mapping={"text": "text", "label": "label"}'. I'm still getting the error: ValueError: The following columns are missing from the dataset: {'text'}. Please provide a mapping for all required columns.
I am using the Dataset.from_pandas method.
@lbluett Does this dummy script work?
from setfit import SetFitModel, SetFitTrainer
import datasets
import pandas as pd
df = pd.DataFrame([
    {'text': 'a', 'label': 0},
    {'text': 'b', 'label': 1},
    {'text': 'c', 'label': 2},
])
features = datasets.Features(
    {
        'label': datasets.ClassLabel(num_classes=3, names=['a', 'b', 'c']),
        'text': datasets.Value('string')
    })
ds = datasets.Dataset.from_pandas(df, features=features)
print(ds)
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=ds,
    eval_dataset=ds,
    num_iterations=20,
)
trainer.train()
print(trainer.evaluate())
Also worth noting that from_pandas does not exhibit the same behavior as the CSV loading script, and will not set a training split (as expected):
import pandas as pd
import datasets
features = datasets.Features(
    {
        'label': datasets.ClassLabel(num_classes=2, names=['positive', 'negative']),
        'text': datasets.Value('string')
    })
df = pd.DataFrame([{'label': 1, 'text': 'good'}, {'label': 0, 'text': 'terrible'}])
print(datasets.Dataset.from_pandas(df, features=features))
# Prints
# Dataset({
# features: ['label', 'text'],
# num_rows: 2
# })
# note: no training split
I'm getting the same error. Would love to know how this can be solved.
EDIT It worked eventually. A bit of a long shot, but here's what I did:
1. I had train.csv and test.csv, each having 2 columns: "text" and "label"
2. train_df = pd.read_csv("train.csv") and test_df = pd.read_csv("test.csv")
3. train_text = train_df["text"] and train_label = train_df["label"]
4. train_data = Dataset.from_dict({"text": train_text, "label": train_label}) (note: pass the variable train_label, not the string "train_label")
5. Repeat steps 3 and 4 for test_data
6. Finally, pass train_data and test_data to SetFitTrainer
I managed to work around the issue by keeping my original column names and then using the column_mapping argument, for example: column_mapping={"Joined_description": "text", "Target Category": "label"}
I think this might be more of a problem with Datasets' interface than with the library itself. The case where Datasets automatically assigns a training split for CSV files should be possible to detect and give an appropriate warning/error for, but without code to reproduce the issues, it's hard to tell which other gotchas should be warned about.
@lbluett That's good to hear, still odd that the column mapping is necessary - does the script in the post you cited throw any errors when you run it?
@nazianafis Does it work if you first explicitly specify the features and then pick the training split? E.g.
features = datasets.Features(
    {
        'label': datasets.ClassLabel(num_classes=2, names=['positive', 'negative']),
        'text': datasets.Value('string')
    })
train_data = datasets.load_dataset('csv', data_files="./train.csv", features=features)['train']
# Not a typo: unless we specify a split, everything is assigned to 'train'
test_data = datasets.load_dataset('csv', data_files="./test.csv", features=features)['train']
If not, does it work if you substitute steps 3 through 5 with
train_data = datasets.Dataset.from_pandas(train_df[['text', 'label']])
test_data = datasets.Dataset.from_pandas(test_df[['text', 'label']])
Does it work if you first explicitly specify the features and then pick the training split?
@jaalu It worked. Thanks a lot!