MultiWOZ_Evaluation
Problems getting perfect metrics
I am testing the UBAR method on MultiWOZ 2.2 and wrote a parser that converts the generated sequence back into a state dict. First I loaded the ground-truth data with the datasets library and parsed it into a "predicted" dictionary, then I computed the metrics:
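from datasets import load_dataset
from mwzeval.metrics import Evaluator

# parse_state is my own parser (not shown here); it yields (key, value) pairs.

datasets = load_dataset("json", data_files={
    "train": "data/multiwoz/train/encoded.json",
    "valid": "data/multiwoz/dev/encoded.json",
    "test": "data/multiwoz/test/encoded.json",
})

predicted = {}
for d in datasets["test"]:
    # removesuffix drops the literal ".json"; rstrip(".json") strips a character set instead
    dialogue_id = d["id"].removesuffix(".json").lower()
    turns = []
    # Each chunk after "<sos_b>" holds one turn: belief span, then response
    for belief in d["text"].split("<sos_b>")[1:]:
        bs = parse_state(belief.split("<eos_b>")[0])
        response = belief.split("<sos_r>")[1].split("<eos_r>")[0]
        state = {"response": response, "state": {}}
        for k, v in bs:
            state["state"][k] = v
        turns.append(state)
    predicted[dialogue_id] = turns

e = Evaluator(bleu=True, success=True, richness=True)
results = e.evaluate(predicted)
print(results)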
>>> {'bleu': {'mwz22': 99.15078876656015},
     'success': {'inform': {'total': 93.0, 'train': 95.6, 'restaurant': 95.9, 'hotel': 95.9, 'taxi': 100.0, 'attraction': 96.0},
                 'success': {'total': 88.1, 'train': 89.1, 'restaurant': 90.2, 'hotel': 87.6, 'taxi': 90.8, 'attraction': 90.7}},
     'richness': {'entropy': 7.218217822046144, 'cond_entropy': 3.3791865228994378, 'avg_lengths': 14.094411285946826, 'msttr': 0.7501539942252144, 'num_unigrams': 1467, 'num_bigrams': 11614, 'num_trigrams': 25497},
     'dst': None}
I did not get 100% for inform and success, even though the "predictions" come straight from the ground truth.
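A side note on the ID normalization: str.rstrip(".json") removes any trailing characters from the set {".", "j", "s", "o", "n"}, not the literal suffix. It happens to work for IDs that end in a digit before ".json", but a suffix-based helper is safer. A minimal sketch, where normalize_dialogue_id is a hypothetical helper name and "session.json" is a made-up ID used only to show the difference:

```python
def normalize_dialogue_id(raw_id: str) -> str:
    """Lowercase the ID and drop a literal '.json' suffix if present."""
    raw_id = raw_id.lower()
    if raw_id.endswith(".json"):
        raw_id = raw_id[: -len(".json")]
    return raw_id


# rstrip treats its argument as a set of characters to strip.
print("session.json".rstrip(".json"))         # sessi
print(normalize_dialogue_id("session.json"))  # session

# For an ID ending in a digit, both approaches agree.
print("pmul4648.json".rstrip(".json"))         # pmul4648
print(normalize_dialogue_id("PMUL4648.json"))  # pmul4648
```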
Then I compared the parsed states with the gold states shipped with the package:
import json
from pprint import pprint

with open("venv/lib/python3.10/site-packages/mwzeval/data/gold_states.json") as fin:
    data = json.load(fin)

# Count the turns whose parsed state differs from the gold state
counter = 0
for key in predicted:
    for i, value in enumerate(predicted[key]):
        if value["state"] != data[key][i]:
            counter += 1
            print(key, i)
            pprint(value["response"])
            pprint(value["state"])
            pprint(data[key][i])

if not counter:
    print("100% matched")
The output says everything matched 100%. Is this the expected behavior, or am I missing something?
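One caveat about the check itself: it only walks over the dialogues and turns present in predicted, so it would not notice dialogues missing from predicted or gold turns beyond the predicted ones. A stricter comparison could look like the sketch below. It assumes, as in the loop above, that the gold file maps each dialogue ID to a list of per-turn state dicts; find_state_mismatches is a hypothetical helper name, not part of mwzeval.

```python
def find_state_mismatches(predicted, gold):
    """Return (dialogue_id, turn_index, reason) tuples for every discrepancy."""
    mismatches = []
    # Walk over the union of IDs so gaps on either side are reported.
    for key in sorted(set(predicted) | set(gold)):
        if key not in predicted:
            mismatches.append((key, None, "missing from predicted"))
            continue
        if key not in gold:
            mismatches.append((key, None, "missing from gold"))
            continue
        pred_turns, gold_turns = predicted[key], gold[key]
        if len(pred_turns) != len(gold_turns):
            mismatches.append((key, None, "turn count differs"))
        # Compare the turns that exist on both sides.
        for i, (pred_turn, gold_state) in enumerate(zip(pred_turns, gold_turns)):
            if pred_turn["state"] != gold_state:
                mismatches.append((key, i, "state differs"))
    return mismatches
```

Calling find_state_mismatches(predicted, data) and getting an empty list would confirm that both the coverage and the states agree.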