MultiWOZ_Evaluation

Problems getting perfect metrics

Open jadermcs opened this issue 2 years ago • 0 comments

I am testing the UBAR method on MultiWOZ 2.2 and wrote a parser that converts the generated sequence back into a state dict. First I loaded the ground-truth data with datasets and parsed it into a "predicted" dictionary, then I computed the metrics:

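from datasets import load_dataset
from mwzeval.metrics import Evaluator

# parse_state is my own parser; it yields (key, value) pairs for the state dict

datasets = load_dataset("json", data_files={
            "train": "data/multiwoz/train/encoded.json",
            "valid": "data/multiwoz/dev/encoded.json",
            "test": "data/multiwoz/test/encoded.json",
        })

predicted = {}

for d in datasets["test"]:
    # removesuffix drops the exact ".json" suffix; rstrip would strip a character set
    dialogue_id = d["id"].lower().removesuffix(".json")
    turns = []
    # every chunk after "<sos_b>" holds one turn: belief state, then response
    for belief in d["text"].split("<sos_b>")[1:]:
        bs = parse_state(belief.split("<eos_b>")[0])
        response = belief.split("<sos_r>")[1].split("<eos_r>")[0]
        state = {"response": response, "state": {}}
        for k, v in bs:
            state["state"][k] = v
        turns.append(state)
    predicted[dialogue_id] = turns

e = Evaluator(bleu=True, success=True, richness=True)
results = e.evaluate(predicted)
print(results)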

>>> {'bleu': {'mwz22': 99.15078876656015},
'success': {'inform': {'total': 93.0, 'train': 95.6, 'restaurant': 95.9, 'hotel': 95.9, 'taxi': 100.0, 'attraction': 96.0}, 'success': {'total': 88.1, 'train': 89.1, 'restaurant': 90.2, 'hotel': 87.6, 'taxi': 90.8, 'attraction': 90.7}},
'richness': {'entropy': 7.218217822046144, 'cond_entropy': 3.3791865228994378, 'avg_lengths': 14.094411285946826, 'msttr': 0.7501539942252144, 'num_unigrams': 1467, 'num_bigrams': 11614, 'num_trigrams': 25497}, 'dst': None}

I did not get 100% for inform and success, even though the predictions were built from the ground-truth data.
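One detail that can be ruled out when normalizing dialogue IDs: `str.rstrip(".json")` strips any trailing characters from the set `.`, `j`, `s`, `o`, `n` rather than the literal suffix. IDs that end in digits are unaffected, so this is probably not the cause here, but a suffix-based helper is safer. A minimal sketch, assuming Python 3.9+ for `str.removesuffix`; `normalize_id` is a hypothetical helper name, not part of mwzeval:

```python
def normalize_id(raw_id: str) -> str:
    """Lowercase a dialogue ID and drop an exact trailing '.json'."""
    return raw_id.lower().removesuffix(".json")


# rstrip treats its argument as a set of characters, not as a suffix
print("mission.json".rstrip(".json"))  # missi
print(normalize_id("mission.json"))    # mission
print(normalize_id("PMUL4462.json"))   # pmul4462
```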

Then I compared it with the golden states:

import json
from pprint import pprint

with open("venv/lib/python3.10/site-packages/mwzeval/data/gold_states.json") as fin:
    data = json.load(fin)

# count every turn whose parsed state differs from the gold state
counter = 0
for key in predicted:
    for i, value in enumerate(predicted[key]):
        if value["state"] != data[key][i]:
            counter += 1
            print(key, i)
            pprint(value["response"])
            pprint(value["state"])
            pprint(data[key][i])
if not counter:
    print("100% matched")

The output shows that the states match 100%. Is this the expected behavior, or am I missing something?
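The comparison above only walks the dialogues present in the predicted dictionary, so it would not notice dialogues that exist only in the gold file or a differing number of turns. A stricter variant could look roughly like the sketch below. `find_state_mismatches` is a hypothetical helper, and the toy data only assumes the shapes used above (predicted turns carry a "state" key, gold entries are lists of state dicts):

```python
def find_state_mismatches(predicted, gold):
    """Return (dialogue_id, turn_index, reason) tuples for every difference."""
    mismatches = []
    for dialogue_id in sorted(set(predicted) | set(gold)):
        if dialogue_id not in predicted:
            mismatches.append((dialogue_id, None, "missing from predicted"))
            continue
        if dialogue_id not in gold:
            mismatches.append((dialogue_id, None, "missing from gold"))
            continue
        pred_turns, gold_turns = predicted[dialogue_id], gold[dialogue_id]
        if len(pred_turns) != len(gold_turns):
            mismatches.append((dialogue_id, None, "turn count differs"))
        # zip stops at the shorter list, so this never indexes out of range
        for i, (turn, gold_state) in enumerate(zip(pred_turns, gold_turns)):
            if turn["state"] != gold_state:
                mismatches.append((dialogue_id, i, "state differs"))
    return mismatches


# Toy data for illustration only, not taken from the real dataset
toy_predicted = {
    "sng0073": [{"response": "ok", "state": {"taxi": {"leaveat": "17:15"}}}],
}
toy_gold = {
    "sng0073": [{"taxi": {"leaveat": "17:15"}}],
    "pmul4462": [{}],
}
print(find_state_mismatches(toy_predicted, toy_gold))
# [('pmul4462', None, 'missing from predicted')]
```

An empty list from this check would mean the state side is fully consistent, which would point at the responses or the evaluator's own matching rather than the parsed states.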

jadermcs, Sep 09 '22 19:09