Lower results than in the paper for a model (probably doing something wrong)
Hello!
Thank you for a nice study and a nice repository! :D I am currently trying to re-use some of the hyperparameters from the study, e.g., those for ComplEx on the YAGO3-10 dataset. However, when using the config files with the current version of PyKEEN, I get the error that owa is not a valid option and that only ['lcwa', 'slcwa'] are. I saw that you renamed OWA to SLCWA, so I switched from OWA to SLCWA as instructed.
However, training locally with PyKEEN 1.9.0 and slcwa gives me very different results: on the validation set I get extremely low numbers, while the test set looks fairly decent (though still far from the reported metrics). For this specific run, I got the following for the corresponding metrics:
# Results from me re-running the best experiment config found below
'testing.both.realistic.inverse_harmonic_mean_rank': 0.3114,
'testing.both.realistic.hits_at_1': 0.2213,
'testing.both.realistic.hits_at_3': 0.3609,
'testing.both.realistic.hits_at_5': 0.419,
'testing.both.realistic.hits_at_10': 0.4864,
'validation.both.realistic.inverse_harmonic_mean_rank': 0.08471,
'validation.both.realistic.hits_at_1': 0.0234,
'validation.both.realistic.hits_at_3': 0.07252,
'validation.both.realistic.hits_at_5': 0.136,
'validation.both.realistic.hits_at_10': 0.2665,
# Results from benchmark database
'results.metrics.inverse_harmonic_mean_rank.both.realistic': 0.46196680766490855,
'results.metrics.hits_at_k.both.realistic.1': 0.3727418707346447,
'results.metrics.hits_at_k.both.realistic.3': 0.5171617824167001,
'results.metrics.hits_at_k.both.realistic.5': 0.5700521878763549,
'results.metrics.hits_at_k.both.realistic.10': 0.6230429546366921,
I attach my training script below; I am most likely doing something wrong or missing some specific setting that was updated in a more recent version of PyKEEN. Thanks again for a nice tool! :)
The results from the benchmark database can also be seen in the output further below.
Config file (originally this one):
{
  "metadata": {
    "best_trial_evaluation": 0.6191241462434712,
    "best_trial_number": 3,
    "git_hash": "UNHASHED",
    "version": "0.1.2-dev"
  },
  "pipeline": {
    "dataset": "yago310",
    "dataset_kwargs": {
      "create_inverse_triples": false
    },
    "evaluation_kwargs": {
      "batch_size": null
    },
    "evaluator": "rankbased",
    "evaluator_kwargs": {
      "filtered": true
    },
    "loss": "softplus",
    "model": "complex",
    "model_kwargs": {
      "automatic_memory_optimization": true,
      "embedding_dim": 256
    },
    "negative_sampler": "basic",
    "negative_sampler_kwargs": {
      "num_negs_per_pos": 32
    },
    "optimizer": "adam",
    "optimizer_kwargs": {
      "lr": 0.001723135381847608,
      "weight_decay": 0.0
    },
    "regularizer": "no",
    "training_kwargs": {
      "batch_size": 8192,
      "label_smoothing": 0.0,
      "num_epochs": 131
    },
    "training_loop": "owa"
  }
}
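For reference, this is roughly how I adapt such an old config for a current PyKEEN version before running it (a minimal sketch; the file name best_pipeline.json is just a placeholder, and my full script further below handles the details differently):
import json

from pykeen.pipeline import pipeline_from_config

# "best_pipeline.json" is a placeholder path for the config shown above.
with open("best_pipeline.json") as f:
    config = json.load(f)

# The training loop was renamed from OWA to sLCWA in newer PyKEEN versions.
if config["pipeline"]["training_loop"] == "owa":
    config["pipeline"]["training_loop"] = "slcwa"

# Newer model classes no longer accept this flag; my full script below moves it
# to the training loop and evaluator instead of simply dropping it.
config["pipeline"]["model_kwargs"].pop("automatic_memory_optimization", None)

result = pipeline_from_config(config)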
Running the following gives me great output metrics:
(kgvenv)filco:~/$ python3 ablation/search.py --dataset yago310 --model complex
============================== 0 ==============================
{'create_inverse_triples': False,
'dataset': 'yago310',
'evaluator': 'rankbased',
'hpo.metadata.title': 'HPO Over YAGO3-10 for ComplEx',
'hpo.optuna.direction': 'maximize',
'hpo.optuna.metric': 'hits@10',
'hpo.optuna.n_trials': 100,
'hpo.optuna.pruner': 'nop',
'hpo.optuna.sampler': 'random',
'hpo.optuna.storage': 'sqlite:////home/lauve/dataintegration/POEM_benchmarking_results/pykeen_experimental_results/ablation/config/adam/complex/yago310/random/owa/2020-05-21-02-47_1218c513-997d-483e-8d3f-3d6c144d8fdd/0001_yago310_complex/optuna_results.db',
'hpo.optuna.timeout': 86400,
'hpo.pipeline.dataset': 'yago310',
'hpo.pipeline.dataset_kwargs.create_inverse_triples': False,
'hpo.pipeline.evaluation_kwargs.batch_size': None,
'hpo.pipeline.evaluator': 'RankBasedEvaluator',
'hpo.pipeline.evaluator_kwargs.filtered': True,
'hpo.pipeline.loss': 'SoftplusLoss',
'hpo.pipeline.model': 'ComplEx',
'hpo.pipeline.model_kwargs.automatic_memory_optimization': True,
'hpo.pipeline.model_kwargs_ranges.embedding_dim.high': 8,
'hpo.pipeline.model_kwargs_ranges.embedding_dim.low': 6,
'hpo.pipeline.model_kwargs_ranges.embedding_dim.scale': 'power_two',
'hpo.pipeline.model_kwargs_ranges.embedding_dim.type': 'int',
'hpo.pipeline.negative_sampler': 'BasicNegativeSampler',
'hpo.pipeline.negative_sampler_kwargs_ranges.num_negs_per_pos.high': 50,
'hpo.pipeline.negative_sampler_kwargs_ranges.num_negs_per_pos.low': 1,
'hpo.pipeline.negative_sampler_kwargs_ranges.num_negs_per_pos.q': 1,
'hpo.pipeline.negative_sampler_kwargs_ranges.num_negs_per_pos.type': 'int',
'hpo.pipeline.optimizer': 'adam',
'hpo.pipeline.optimizer_kwargs.weight_decay': 0.0,
'hpo.pipeline.optimizer_kwargs_ranges.lr.high': 0.1,
'hpo.pipeline.optimizer_kwargs_ranges.lr.low': 0.001,
'hpo.pipeline.optimizer_kwargs_ranges.lr.scale': 'log',
'hpo.pipeline.optimizer_kwargs_ranges.lr.type': 'float',
'hpo.pipeline.regularizer': 'NoRegularizer',
'hpo.pipeline.stopper': 'early',
'hpo.pipeline.stopper_kwargs.delta': 0.002,
'hpo.pipeline.stopper_kwargs.frequency': 10,
'hpo.pipeline.stopper_kwargs.patience': 5,
'hpo.pipeline.training_kwargs.label_smoothing': 0.0,
'hpo.pipeline.training_kwargs.num_epochs': 1000,
'hpo.pipeline.training_kwargs_ranges.batch_size.high': 13,
'hpo.pipeline.training_kwargs_ranges.batch_size.low': 10,
'hpo.pipeline.training_kwargs_ranges.batch_size.scale': 'power_two',
'hpo.pipeline.training_kwargs_ranges.batch_size.type': 'int',
'hpo.pipeline.training_loop': 'owa',
'hpo.type': 'hpo',
'loss': 'softplus',
'metadata.best_trial_evaluation': 0.6191241462434712,
'metadata.best_trial_number': 3,
'metadata.git_hash': 'UNHASHED',
'metadata.version': '0.1.2-dev',
'metric': 'hits@10',
'model': 'complex',
'negative_sampler': 'basic',
'optimizer': 'adam',
'pipeline_config.metadata.best_trial_evaluation': 0.6191241462434712,
'pipeline_config.metadata.best_trial_number': 3,
'pipeline_config.metadata.git_hash': 'UNHASHED',
'pipeline_config.metadata.version': '0.1.2-dev',
'pipeline_config.pipeline.dataset': 'yago310',
'pipeline_config.pipeline.dataset_kwargs.create_inverse_triples': False,
'pipeline_config.pipeline.evaluation_kwargs.batch_size': None,
'pipeline_config.pipeline.evaluator': 'rankbased',
'pipeline_config.pipeline.evaluator_kwargs.filtered': True,
'pipeline_config.pipeline.loss': 'softplus',
'pipeline_config.pipeline.model': 'complex',
'pipeline_config.pipeline.model_kwargs.automatic_memory_optimization': True,
'pipeline_config.pipeline.model_kwargs.embedding_dim': 256,
'pipeline_config.pipeline.negative_sampler': 'basic',
'pipeline_config.pipeline.negative_sampler_kwargs.num_negs_per_pos': 32,
'pipeline_config.pipeline.optimizer': 'adam',
'pipeline_config.pipeline.optimizer_kwargs.lr': 0.001723135381847608,
'pipeline_config.pipeline.optimizer_kwargs.weight_decay': 0.0,
'pipeline_config.pipeline.regularizer': 'no',
'pipeline_config.pipeline.training_kwargs.batch_size': 8192,
'pipeline_config.pipeline.training_kwargs.label_smoothing': 0.0,
'pipeline_config.pipeline.training_kwargs.num_epochs': 131,
'pipeline_config.pipeline.training_loop': 'owa',
'pykeen_git_hash': 'UNHASHED',
'pykeen_version': '0.1.2-dev',
'regularizer': 'no',
'replicate': 0,
...
'results.metrics.hits_at_k.both.realistic.1': 0.3727418707346447,
'results.metrics.hits_at_k.both.realistic.10': 0.6230429546366921,
'results.metrics.hits_at_k.both.realistic.3': 0.5171617824167001,
'results.metrics.hits_at_k.both.realistic.5': 0.5700521878763549,
'results.metrics.inverse_harmonic_mean_rank.both.realistic': 0.46196680766490855,
...
'searcher': 'random',
'training_loop': 'owa'}
Version:
>>> pykeen.get_version()
'1.9.0'
import json
import os
import wandb
from pykeen.trackers import WANDBResultTracker, CSVResultTracker
from pykeen import pipeline
from pykeen import datasets
from utils import flatten_dict
import argparse
PROJECT_NAME = "Pykeen Knowledge Graph Embeddings"
DSETNAME2DSET = {
"kinships": "Kinships",
"fb15k": "FB15k",
"fb15k237": "FB15k237",
"wn18": "WN18",
"wn18rr": "WN18RR",
"yago310": "YAGO310",
}
def run_transductive(config: dict, use_wandb: bool):
if use_wandb:
print("Using wandb tracker")
wandb.init(
project=PROJECT_NAME,
entity=ENTITYNAME,
name=f"{config['pipeline']['model']}-{config['pipeline']['dataset']}",
)
tracker = WANDBResultTracker(
project=PROJECT_NAME,
entity=ENTITYNAME,
group=None,
settings=wandb.Settings(start_method="fork"),
)
# tracker.wandb.config.update(flatten_dict(config))
tracker.wandb.name = (
f"{config['pipeline']['model']}-{config['pipeline']['dataset']}"
)
else:
tracker = CSVResultTracker()
dataset = getattr(datasets, DSETNAME2DSET[config["pipeline"]["dataset"]])(
create_inverse_triples=config["pipeline"]["dataset_kwargs"]["create_inverse_triples"]
)
if "callbacks" not in config["pipeline"]["training_kwargs"]:
config["pipeline"]["training_kwargs"]["callbacks"] = ["evaluation-loop"]
if "callback_kwargs" not in config["pipeline"]["training_kwargs"]:
config["pipeline"]["training_kwargs"]["callback_kwargs"] = {
"prefix": "validation"
}
if "automatic_memory_optimization" in config["pipeline"]["model_kwargs"]:
optimize_memory = config["pipeline"]["model_kwargs"].pop(
"automatic_memory_optimization"
)
config["pipeline"]["training_loop_kwargs"] = {}
config["pipeline"]["training_loop_kwargs"][
"automatic_memory_optimization"
] = optimize_memory
config["pipeline"]["evaluator_kwargs"][
"automatic_memory_optimization"
] = optimize_memory
if config["pipeline"]["training_loop"] == "owa":
config["pipeline"]["training_loop"] = "slcwa" # Change to renamed training loop
config["pipeline"]["training_kwargs"]["callback_kwargs"][
"factory"
] = dataset.validation # Add validation dataset to callback kwargs
config["pipeline"]["result_tracker"] = tracker
pipeline_results = pipeline.pipeline_from_config(config)
if use_wandb:
tracker.log_metrics(
metrics=pipeline_results.metric_results.to_flat_dict(),
prefix="test",
)
tracker.wandb.finish()
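The entry point is cut off in the paste above; roughly, it parses a config path and a wandb flag and calls run_transductive (the argument names here are placeholders, not necessarily the exact ones I use):
if __name__ == "__main__":
    # Hypothetical entry point; argument names are placeholders.
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, required=True, help="Path to the pipeline config JSON.")
    parser.add_argument("--wandb", action="store_true", help="Log metrics to Weights & Biases.")
    args = parser.parse_args()
    with open(args.config) as f:
        config = json.load(f)
    run_transductive(config=config, use_wandb=args.wandb)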
Is there some change in the packages since it was last run that causes this mismatch, or am I perhaps using the package incorrectly? Thank you for your time, and thank you for your package!
To further update: I realized I had constrained my search to sLCWA runs only, so the run above does not correspond to the one presented in the paper (Table 18). However, switching to the config from the paper gives me 0.94 instead of 0.98 for ComplEx, which I think is good enough given that results can never be reproduced exactly. RotatE also seems to give decent results on Kinships now (0.98 hits@10). But I still do not know why my results are remarkably lower for the settings above.
The validation curves for the different models also look very strange, but I guess that is a consequence of the extensive hyperparameter tuning :)
Hi @Filco306
If you are looking at the validation curves generated by the EvaluationLoopTrainingCallback,
if "callbacks" not in config["pipeline"]["training_kwargs"]:
config["pipeline"]["training_kwargs"]["callbacks"] = ["evaluation-loop"]
if "callback_kwargs" not in config["pipeline"]["training_kwargs"]:
config["pipeline"]["training_kwargs"]["callback_kwargs"] = {
"prefix": "validation"
}
you may be missing filtering with the training triples, too. To do so, you would need to pass the additional key additional_filter_triples to callback_kwargs, i.e.,
config["pipeline"]["training_kwargs"]["callback_kwargs"] = {
"prefix": "validation",
"additional_filter_triples": dataset.training,
}
This is a bit hidden, since this parameter goes from the EvaluationLoopTrainingCallback.__init__ via kwargs through pykeen.evaluation.Evaluator.evaluate to pykeen.evaluation.evaluate 😅
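For illustration, the same argument can also be passed to the evaluator directly, outside of the training callback; a minimal sketch (using the small Nations dataset, just as an example) of filtered evaluation on the validation split:
from pykeen.datasets import get_dataset
from pykeen.evaluation import RankBasedEvaluator
from pykeen.pipeline import pipeline

# Train a small model, just to have something to evaluate.
dataset = get_dataset(dataset="nations")
result = pipeline(dataset=dataset, model="complex", training_kwargs=dict(num_epochs=5))

evaluator = RankBasedEvaluator(filtered=True)
metrics = evaluator.evaluate(
    model=result.model,
    mapped_triples=dataset.validation.mapped_triples,
    # Without these, known training triples remain in the candidate set
    # and push down the validation ranks.
    additional_filter_triples=[dataset.training.mapped_triples],
)
print(metrics.to_flat_dict())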
Hi there,
Thank you for your reply! :D I will re-run the experiment in question with your comment in mind and see if that fixes the results. If not, I'll get back to you :)
Thanks! :D
One more thing I noticed: https://pykeen.readthedocs.io/en/stable/api/pykeen.training.callbacks.EvaluationLoopTrainingCallback.html also needs the factory on which to evaluate, i.e.,
config["pipeline"]["training_kwargs"]["callback_kwargs"] = {
"prefix": "validation",
"factory": dataset.validation,
"additional_filter_triples": dataset.training,
}
Hi again @mberr ,
If I add what you write,
config["pipeline"]["training_kwargs"]["callback_kwargs"] = {
"prefix": "validation",
"factory": dataset.validation,
"additional_filter_triples": dataset.training,
}
I get the error:
File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/training/training_loop.py", line 378, in train
result = self._train(
File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/training/training_loop.py", line 734, in _train
callback.post_epoch(epoch=epoch, epoch_loss=epoch_loss)
File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/training/callbacks.py", line 438, in post_epoch
callback.post_epoch(epoch=epoch, epoch_loss=epoch_loss, **kwargs)
File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/training/callbacks.py", line 325, in post_epoch
result = self.evaluation_loop.evaluate(**self.kwargs)
File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/evaluation/evaluation_loop.py", line 196, in evaluate
return _evaluate(
File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/torch_max_mem/api.py", line 293, in inner
result, self.parameter_value[h] = wrapped(*args, **kwargs)
File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/torch_max_mem/api.py", line 193, in wrapper_maximize_memory_utilization
func(*bound_arguments.args, **p_kwargs, **bound_arguments.kwargs),
File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/evaluation/evaluation_loop.py", line 82, in _evaluate
loader = loop.get_loader(batch_size=batch_size, **kwargs)
File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/evaluation/evaluation_loop.py", line 149, in get_loader
return DataLoader(
TypeError: DataLoader.__init__() got an unexpected keyword argument 'additional_filter_triples'
Would you know what the issue is here? Note that I still get the warning WARNING:pykeen.evaluation.evaluation_loop:Enabled filtered evaluation, but not additional filter triples are passed., so it does not seem to be passed properly.
Okay, this seems to be a bug in EvaluationLoop, which does not properly forward this argument to instantiate the LCWAEvaluationDataset here.
I used this smaller snippet to reproduce your error
from pykeen.pipeline import pipeline
from pykeen.datasets import get_dataset
dataset = get_dataset(dataset="nations")
result = pipeline(
dataset=dataset,
model="mure",
training_kwargs=dict(
num_epochs=5,
callbacks="evaluation-loop",
callback_kwargs=dict(
frequency=1,
prefix="validation",
factory=dataset.validation,
additional_filter_triples=dataset.training,
),
),
)
EDIT: I opened a ticket here: https://github.com/pykeen/pykeen/issues/1213
Hello again @mberr ,
Thank you for this! Yes, I believe it is a bug. Thank you for flagging it!