pykeen
Evaluation results change every time, even when the dataset is loaded in the same way
Describe the bug
When loading the training data again and then running the evaluation, the results change.
To Reproduce
import torch
from pykeen.triples import TriplesFactory
from pykeen.pipeline import pipeline
from pykeen.evaluation import RankBasedEvaluator
TRIPLE = "triple.txt"
MODEL = "results_transR/trained_model.pkl"
tf = TriplesFactory.from_path(TRIPLE)
training, testing, validation = tf.split([.6, .2, .2], random_state=42)
results_transR = pipeline(
    training=training,
    testing=testing,
    validation=validation,
    model='transR',
    training_kwargs=dict(num_epochs=300, batch_size=512),
    random_seed=42,
)
results_transR.save_to_directory('results_transR')
model = torch.load(MODEL)
model.eval()
evaluator = RankBasedEvaluator()
evaluations = evaluator.evaluate(
    model=model,
    mapped_triples=testing.mapped_triples,
    batch_size=256,
    additional_filter_triples=[training.mapped_triples],
)
print(evaluations.get_metric('mean rank'))
print(evaluations.get_metric('mean reciprocal rank'))
print(evaluations.get_metric('adjusted mean rank'))
print(evaluations.get_metric('hits@10'))
print(evaluations.get_metric('hits@5'))
print(evaluations.get_metric('hits@3'))
print(evaluations.get_metric('hits@1'))
Output:
12.692307692307692
0.33840666188758706
0.11435037038346621
0.7346938775510204
0.5698587127158555
0.4403453689167975
0.14285714285714285
Then, when I loaded the dataset and ran the evaluation again as below, the results changed, and so on and so forth.
tf = TriplesFactory.from_path(TRIPLE)
training, testing, validation = tf.split([.6, .2, .2], random_state=42)
model = torch.load(MODEL)
model.eval()
evaluator = RankBasedEvaluator()
evaluations = evaluator.evaluate(
    model=model,
    mapped_triples=testing.mapped_triples,
    batch_size=256,
    additional_filter_triples=[training.mapped_triples],
)
print(evaluations.get_metric('mean rank'))
print(evaluations.get_metric('mean reciprocal rank'))
print(evaluations.get_metric('adjusted mean rank'))
print(evaluations.get_metric('hits@10'))
print(evaluations.get_metric('hits@5'))
print(evaluations.get_metric('hits@3'))
print(evaluations.get_metric('hits@1'))
Output:
9.840659340659341
0.3852897433398676
0.0886320254506893
0.7959183673469388
0.6467817896389325
0.5062794348508635
0.17974882260596547
Expected behavior
The evaluation results should be the same each time.
Environment:
| Key | Value |
|---|---|
| OS | nt |
| Platform | Windows |
| Release | 10 |
| User | Hao Liu |
| Time | Fri Aug 6 11:15:25 2021 |
| Python | 3.7.10 |
| PyKEEN | 1.5.1-dev |
| PyKEEN Hash | UNHASHED |
| PyKEEN Branch | |
| PyTorch | 1.9.0 |
| CUDA Available? | false |
| CUDA Version | N/A |
| cuDNN Version | N/A |
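As an editorial aside (a workaround sketch, not part of the original report): re-evaluation can be made independent of whether the split is reproducible by persisting the ID-mapped split tensors next to the trained model and reloading them later, instead of calling split again. The file names below are hypothetical.

import torch

# Right after the original split, persist the ID-mapped tensors (hypothetical paths).
torch.save(training.mapped_triples, 'results_transR/training_mapped_triples.pt')
torch.save(testing.mapped_triples, 'results_transR/testing_mapped_triples.pt')

# In a later session, reload the exact same tensors instead of re-splitting.
testing_mapped = torch.load('results_transR/testing_mapped_triples.pt')
training_filter = torch.load('results_transR/training_mapped_triples.pt')

The evaluator can then be called with testing_mapped and [training_filter], so the metrics no longer depend on split behaving identically across sessions.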
I output the testing triples using the code below each of the two times I ran training, testing, validation = tf.split([.6, .2, .2], random_state=42).
# Append each testing triple as a tab-separated line.
with open(TESTING_GOLD_TRIPLE, 'a') as f:
    for head, relation, tail in testing.triples:
        f.write(f"{head}\t{relation}\t{tail}\n")
I found that the two sets of generated testing triples are different. Maybe that's the reason? But I have set the random state. How come the generated testing triples are different?
I further found that when I used this dataset, there was no problem (i.e., the testing triples generated across multiple runs are the same). But for this one, the problem is as described above. Strange...
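To isolate whether split itself is nondeterministic, a minimal check (my sketch, reusing TRIPLE from above) is to split twice in the same process and compare the ID-mapped tensors directly:

import torch
from pykeen.triples import TriplesFactory

tf1 = TriplesFactory.from_path(TRIPLE)
_, testing1, _ = tf1.split([.6, .2, .2], random_state=42)

tf2 = TriplesFactory.from_path(TRIPLE)
_, testing2, _ = tf2.split([.6, .2, .2], random_state=42)

# With a fixed random_state, a deterministic split must produce identical tensors.
print(torch.equal(testing1.mapped_triples, testing2.mapped_triples))

If this prints False, the nondeterminism is in the split rather than in the evaluator.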
Hi @Hao-666,
this error might be connected to #499. @cthoyt is working on a fix in #500, so maybe he can help you here?
Couldn't this also be related to the rank-based evaluator having a random component to it?
Alternatively, I wonder if there's a reason why using a bigger dataset causes this issue.
> Couldn't this also be related to the rank-based evaluator having a random component to it?
The rank-based evaluator should be deterministic, except for numerical effects when evaluating the same triples in a different order. That should not change results as heavily as observed above.

I think the cause lies in:

> I found that the two sets of generated testing triples are different.

> I further found that when I used this dataset, there was no problem (i.e., the testing triples generated across multiple runs are the same). But for this one, the problem is as described above. Strange...
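To rule the evaluator out explicitly, one can run it twice on identical inputs and compare a metric; here is a small sketch reusing model, testing, and training from the report above:

from pykeen.evaluation import RankBasedEvaluator

evaluator = RankBasedEvaluator()
first = evaluator.evaluate(
    model=model,
    mapped_triples=testing.mapped_triples,
    batch_size=256,
    additional_filter_triples=[training.mapped_triples],
)
second = evaluator.evaluate(
    model=model,
    mapped_triples=testing.mapped_triples,
    batch_size=256,
    additional_filter_triples=[training.mapped_triples],
)

# Identical inputs should yield (numerically) identical rank-based metrics.
print(first.get_metric('mean reciprocal rank'), second.get_metric('mean reciprocal rank'))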
Hi @cthoyt, I am not sure whether the rank-based evaluator has problems or not, but I am sure that the splitting results are problematic: I found that the two sets of generated testing triples are different, and I also tested two small datasets here.
When you reload the triples, did you maintain the same entity to id and relation to id mappings? You should check that they're the same
> When you reload the triples, did you maintain the same entity to id and relation to id mappings? You should check that they're the same
Yes. I just repeated tf = TriplesFactory.from_path(TRIPLE). I think the mappings are the same?
can you check? Dump the entity_to_id and relation_to_id dicts the first time you load it with:

import json

tf = TriplesFactory.from_path(TRIPLE)
with open('entities.json', 'w') as file:
    json.dump(tf.entity_to_id, file, indent=2)
with open('relations.json', 'w') as file:
    json.dump(tf.relation_to_id, file, indent=2)

Then compare those the second time.
Update: when I did this, it's the same. This should be deterministic.
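The same check can also be done in memory, without the JSON round-trip (a quick sketch):

from pykeen.triples import TriplesFactory

tf1 = TriplesFactory.from_path(TRIPLE)
tf2 = TriplesFactory.from_path(TRIPLE)

# Loading the same file twice should reproduce identical label-to-ID mappings.
print(tf1.entity_to_id == tf2.entity_to_id)
print(tf1.relation_to_id == tf2.relation_to_id)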
Yes, I also checked just now. The mappings are the same.