pykeen
Evaluation results change every time, even when the dataset is loaded in the same way
Describe the bug
When loading the training data again and then running the evaluation, the results change.
To Reproduce
import torch
from pykeen.triples import TriplesFactory
from pykeen.pipeline import pipeline
from pykeen.evaluation import RankBasedEvaluator
TRIPLE = "triple.txt"
MODEL = "results_transR/trained_model.pkl"
tf = TriplesFactory.from_path(TRIPLE)
training, testing, validation = tf.split([.6, .2, .2], random_state=42)
results_transR = pipeline(
    training=training,
    testing=testing,
    validation=validation,
    model='transR',
    training_kwargs=dict(num_epochs=300, batch_size=512),
    random_seed=42,
)
results_transR.save_to_directory('results_transR')
model = torch.load(MODEL)
model.eval()
evaluator = RankBasedEvaluator()
evaluations = evaluator.evaluate(
    model=model,
    mapped_triples=testing.mapped_triples,
    batch_size=256,
    additional_filter_triples=[training.mapped_triples],
)
print(evaluations.get_metric('mean rank'))
print(evaluations.get_metric('mean reciprocal rank'))
print(evaluations.get_metric('adjusted mean rank'))
print(evaluations.get_metric('hits@10'))
print(evaluations.get_metric('hits@5'))
print(evaluations.get_metric('hits@3'))
print(evaluations.get_metric('hits@1'))
Output:
12.692307692307692
0.33840666188758706
0.11435037038346621
0.7346938775510204
0.5698587127158555
0.4403453689167975
0.14285714285714285
Then, when I loaded the dataset and ran the evaluation again as below, the results changed, and so on and so forth.
tf = TriplesFactory.from_path(TRIPLE)
training, testing, validation = tf.split([.6, .2, .2], random_state=42)
model = torch.load(MODEL)
model.eval()
evaluator = RankBasedEvaluator()
evaluations = evaluator.evaluate(
    model=model,
    mapped_triples=testing.mapped_triples,
    batch_size=256,
    additional_filter_triples=[training.mapped_triples],
)
print(evaluations.get_metric('mean rank'))
print(evaluations.get_metric('mean reciprocal rank'))
print(evaluations.get_metric('adjusted mean rank'))
print(evaluations.get_metric('hits@10'))
print(evaluations.get_metric('hits@5'))
print(evaluations.get_metric('hits@3'))
print(evaluations.get_metric('hits@1'))
Output:
9.840659340659341
0.3852897433398676
0.0886320254506893
0.7959183673469388
0.6467817896389325
0.5062794348508635
0.17974882260596547
Expected behavior
The evaluation results should be the same each time.
Environment:
| Key | Value |
|---|---|
| OS | nt |
| Platform | Windows |
| Release | 10 |
| User | Hao Liu |
| Time | Fri Aug 6 11:15:25 2021 |
| Python | 3.7.10 |
| PyKEEN | 1.5.1-dev |
| PyKEEN Hash | UNHASHED |
| PyKEEN Branch | |
| PyTorch | 1.9.0 |
| CUDA Available? | false |
| CUDA Version | N/A |
| cuDNN Version | N/A |
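As an editorial aside (a workaround sketch, not part of the original report): re-evaluation can be made independent of whether the split is reproducible by persisting the ID-mapped split tensors next to the trained model and reloading them later, instead of calling split again. The file names below are hypothetical.

import torch

# Right after the original split, persist the ID-mapped tensors (hypothetical paths).
torch.save(training.mapped_triples, 'results_transR/training_mapped_triples.pt')
torch.save(testing.mapped_triples, 'results_transR/testing_mapped_triples.pt')

# In a later session, reload the exact same tensors instead of re-splitting.
testing_mapped = torch.load('results_transR/testing_mapped_triples.pt')
training_filter = torch.load('results_transR/training_mapped_triples.pt')

The evaluator can then be called with testing_mapped and [training_filter], so the metrics no longer depend on split behaving identically across sessions.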
I output the testing triples using the code below each of the two times I ran training, testing, validation = tf.split([.6, .2, .2], random_state=42).
# Append each testing triple as a tab-separated line.
with open(TESTING_GOLD_TRIPLE, 'a') as f:
    for head, relation, tail in testing.triples:
        f.write(f"{head}\t{relation}\t{tail}\n")
I found that the two sets of generated testing triples are different. Maybe that's the reason? But I have set the random state. How come the generated testing triples are different?
I further found that when I used this dataset, there was no problem (i.e., the testing triples generated across multiple runs are the same). But for this one, the problem is as described above. Strange...
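To isolate whether split itself is nondeterministic, a minimal check (my sketch, reusing TRIPLE from above) is to split twice in the same process and compare the ID-mapped tensors directly:

import torch
from pykeen.triples import TriplesFactory

tf1 = TriplesFactory.from_path(TRIPLE)
_, testing1, _ = tf1.split([.6, .2, .2], random_state=42)

tf2 = TriplesFactory.from_path(TRIPLE)
_, testing2, _ = tf2.split([.6, .2, .2], random_state=42)

# With a fixed random_state, a deterministic split must produce identical tensors.
print(torch.equal(testing1.mapped_triples, testing2.mapped_triples))

If this prints False, the nondeterminism is in the split rather than in the evaluator.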
Hi @Hao-666,
this error might be connected to #499. @cthoyt is working on a fix in #500, so maybe he can help you here?
Couldn't this also be related to the rank-based evaluator having a random component to it?
Alternatively, I wonder if there's a reason why using a bigger dataset causes this issue.
> Couldn't this also be related to the rank-based evaluator having a random component to it?
The rank-based evaluator should be deterministic, except for numerical effects when evaluating the same triples in a different order. That should not change results as heavily as observed above.

I think the cause lies in:

> I found that the two sets of generated testing triples are different.

> I further found that when I used this dataset, there was no problem (i.e., the testing triples generated across multiple runs are the same). But for this one, the problem is as described above. Strange...
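To rule the evaluator out explicitly, one can run it twice on identical inputs and compare a metric; here is a small sketch reusing model, testing, and training from the report above:

from pykeen.evaluation import RankBasedEvaluator

evaluator = RankBasedEvaluator()
first = evaluator.evaluate(
    model=model,
    mapped_triples=testing.mapped_triples,
    batch_size=256,
    additional_filter_triples=[training.mapped_triples],
)
second = evaluator.evaluate(
    model=model,
    mapped_triples=testing.mapped_triples,
    batch_size=256,
    additional_filter_triples=[training.mapped_triples],
)

# Identical inputs should yield (numerically) identical rank-based metrics.
print(first.get_metric('mean reciprocal rank'), second.get_metric('mean reciprocal rank'))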
Hi @cthoyt, I am not sure whether the rank-based evaluator has problems or not, but I am sure that the splitting results are problematic: I found that the two sets of generated testing triples are different, and I also tested two small datasets here.
When you reload the triples, did you maintain the same entity to id and relation to id mappings? You should check that they're the same
> When you reload the triples, did you maintain the same entity to id and relation to id mappings? You should check that they're the same
Yes. I just repeated tf = TriplesFactory.from_path(TRIPLE). I think the mappings are the same?
can you check? Dump the entity_to_id and relation_to_id dicts the first time you load it with:

import json

tf = TriplesFactory.from_path(TRIPLE)
with open('entities.json', 'w') as file:
    json.dump(tf.entity_to_id, file, indent=2)
with open('relations.json', 'w') as file:
    json.dump(tf.relation_to_id, file, indent=2)

Then compare those the second time.
Update: when I did this, it's the same. This should be deterministic.
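The same check can also be done in memory, without the JSON round-trip (a quick sketch):

from pykeen.triples import TriplesFactory

tf1 = TriplesFactory.from_path(TRIPLE)
tf2 = TriplesFactory.from_path(TRIPLE)

# Loading the same file twice should reproduce identical label-to-ID mappings.
print(tf1.entity_to_id == tf2.entity_to_id)
print(tf1.relation_to_id == tf2.relation_to_id)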
Yes, I also checked just now. The mappings are the same.