
Evaluation results change every time the dataset is reloaded in the same way

Open · Hao-666 opened this issue 2 years ago · 10 comments

Describe the bug

When the training data is loaded again and the evaluation is re-run, the results change.

To Reproduce

import torch

from pykeen.evaluation import RankBasedEvaluator
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

TRIPLE = "triple.txt"
MODEL = "results_transR/trained_model.pkl"

tf = TriplesFactory.from_path(TRIPLE)
training, testing, validation = tf.split([.6, .2, .2], random_state=42)

results_transR = pipeline(
    training=training,
    testing=testing,
    validation=validation,
    model='transR',
    training_kwargs=dict(num_epochs=300, batch_size=512),
    random_seed=42,
)
results_transR.save_to_directory('results_transR')

model = torch.load(MODEL)
model.eval()

evaluator = RankBasedEvaluator()
evaluations = evaluator.evaluate(
    model=model,
    mapped_triples=testing.mapped_triples,
    batch_size=256,
    additional_filter_triples=[training.mapped_triples],
)

print(evaluations.get_metric('mean rank'))
print(evaluations.get_metric('mean reciprocal rank'))
print(evaluations.get_metric('adjusted mean rank'))
print(evaluations.get_metric('hits@10'))
print(evaluations.get_metric('hits@5'))
print(evaluations.get_metric('hits@3'))
print(evaluations.get_metric('hits@1'))

Output:

12.692307692307692
0.33840666188758706
0.11435037038346621
0.7346938775510204
0.5698587127158555
0.4403453689167975
0.14285714285714285

Then, when I loaded the dataset and ran the evaluation again as below, the results changed, and so on.

tf = TriplesFactory.from_path(TRIPLE)
training, testing, validation = tf.split([.6, .2, .2], random_state=42)
model = torch.load(MODEL)
model.eval()

evaluator = RankBasedEvaluator()
evaluations = evaluator.evaluate(
    model=model,
    mapped_triples=testing.mapped_triples,
    batch_size=256,
    additional_filter_triples=[training.mapped_triples],
)

print(evaluations.get_metric('mean rank'))
print(evaluations.get_metric('mean reciprocal rank'))
print(evaluations.get_metric('adjusted mean rank'))
print(evaluations.get_metric('hits@10'))
print(evaluations.get_metric('hits@5'))
print(evaluations.get_metric('hits@3'))
print(evaluations.get_metric('hits@1'))

Output:

9.840659340659341
0.3852897433398676
0.0886320254506893
0.7959183673469388
0.6467817896389325
0.5062794348508635
0.17974882260596547

Expected behavior

The evaluation results are the same each time.

Environment:

Key Value
OS nt
Platform Windows
Release 10
User Hao Liu
Time Fri Aug 6 11:15:25 2021
Python 3.7.10
PyKEEN 1.5.1-dev
PyKEEN Hash UNHASHED
PyKEEN Branch
PyTorch 1.9.0
CUDA Available? false
CUDA Version N/A
cuDNN Version N/A

— Hao-666, Aug 06 '21 03:08

I wrote out the testing triples using the code below each time I ran training, testing, validation = tf.split([.6, .2, .2], random_state=42) (i.e., twice).

with open(TESTING_GOLD_TRIPLE, 'a') as f:
    for head, relation, tail in testing.triples:
        f.write(f"{head}\t{relation}\t{tail}\n")

I found that the two sets of generated testing triples are different. Maybe that's the reason? But I have set the random state, so how come the generated testing triples differ?
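For now, a possible workaround (just a sketch; the file name testing_split.tsv is a placeholder) would be to persist the split once and rebuild the factory from the saved labels instead of re-splitting:

# Workaround sketch: save the label-based testing triples once, then rebuild
# the factory from the file in later sessions instead of calling tf.split().
import numpy as np
from pykeen.triples import TriplesFactory

np.savetxt("testing_split.tsv", testing.triples, fmt="%s", delimiter="\t")

# Later: reload with the SAME entity/relation mappings as the full factory,
# so the integer IDs line up with the trained model.
tf = TriplesFactory.from_path(TRIPLE)
testing_reloaded = TriplesFactory.from_labeled_triples(
    np.loadtxt("testing_split.tsv", dtype=str, delimiter="\t"),
    entity_to_id=tf.entity_to_id,
    relation_to_id=tf.relation_to_id,
)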

— Hao-666, Aug 06 '21 04:08

I further found that with one dataset there is no problem (i.e., the testing triples generated across multiple runs are identical), but for this other one the problem is as described above. Strange...

— Hao-666, Aug 06 '21 04:08

Hi @Hao-666,

this error might be connected to #499. @cthoyt is working on a fix in #500, so maybe he can help you here?

— mberr, Aug 06 '21 09:08

Couldn't this also be related to the rank-based evaluator having a random component to it?

Alternatively, I wonder if there's a reason why using a bigger dataset causes this issue

— cthoyt, Aug 06 '21 10:08

Couldn't this also be related to the rank-based evaluator having a random component to it?

The rank-based evaluator should be deterministic, except for numerical effects when evaluating the same triples in a different order. That should not change the results as heavily as observed above.

I think the cause lies in

I found that the two sets of generated testing triples are different.
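A quick way to verify this (just a sketch, reusing TRIPLE from the snippet above) is to checksum the testing split across two independent runs; differing digests confirm that the split is not reproducible despite the fixed random_state:

# Determinism check (sketch): hash the mapped testing triples produced by two
# independent splits of the same file and compare the printed digests.
import hashlib

from pykeen.triples import TriplesFactory

def split_checksum(path):
    tf = TriplesFactory.from_path(path)
    training, testing, validation = tf.split([.6, .2, .2], random_state=42)
    return hashlib.md5(testing.mapped_triples.numpy().tobytes()).hexdigest()

print(split_checksum(TRIPLE))  # run this twice; the digests should match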

— mberr, Aug 06 '21 10:08

I further found that with one dataset there is no problem (i.e., the testing triples generated across multiple runs are identical), but for this other one the problem is as described above. Strange...

Hi @cthoyt, I am not sure whether the rank-based evaluator has a problem or not, but I am sure that the splitting does: the two sets of generated testing triples are different, and I also tested two small datasets here.

— Hao-666, Aug 06 '21 10:08

When you reload the triples, did you maintain the same entity-to-ID and relation-to-ID mappings? You should check that they're the same.

— cthoyt, Aug 06 '21 11:08

When you reload the triples, did you maintain the same entity-to-ID and relation-to-ID mappings? You should check that they're the same.

Yes. I just repeated tf = TriplesFactory.from_path(TRIPLE). I think the mappings are the same?

— Hao-666, Aug 06 '21 11:08

Can you check? Dump the entity_to_id and relation_to_id dicts the first time you load it with:

import json

tf = TriplesFactory.from_path(TRIPLE)

with open('entities.json', 'w') as file:
    json.dump(tf.entity_to_id, file, indent=2)
with open('relations.json', 'w') as file:
    json.dump(tf.relation_to_id, file, indent=2)

Then compare those the second time

Update: when I did this, it's the same. This should be deterministic
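If you want to compare the dumps programmatically (a sketch; entities_run2.json is a placeholder name for the second run's dump):

# Comparison sketch: load the dumps from both runs and assert they match;
# a failing assertion would pinpoint a changed entity mapping.
import json

with open('entities.json') as f1, open('entities_run2.json') as f2:
    assert json.load(f1) == json.load(f2), "entity mappings differ between runs"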

— cthoyt, Aug 06 '21 11:08

Can you check? Dump the entity_to_id and relation_to_id dicts the first time you load it with:

import json

tf = TriplesFactory.from_path(TRIPLE)

with open('entities.json', 'w') as file:
    json.dump(tf.entity_to_id, file, indent=2)
with open('relations.json', 'w') as file:
    json.dump(tf.relation_to_id, file, indent=2)

Then compare those the second time

Update: when I did this, it's the same. This should be deterministic

Yes I also checked just now. The mappings are the same.

— Hao-666, Aug 06 '21 11:08