Experiment Results Contain Random Rows
In my experiment, the results dataframe contains multiple rows for trial id 1, each with the same content as the following row, the only difference being the config. This causes problems: sometimes the best config is attributed to trial id 1, which then shows a config that did not actually achieve the best performance.
See this example: the true performance of trial id 1 is 81% (row 4), but trial id 1 also shows up in row 10 with the highest accuracy. I've added a simple example to reproduce this behavior.
from pathlib import Path

from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch

from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend import SageMakerBackend
from syne_tune.config_space import randint
from syne_tune.optimizer.schedulers.fifo import FIFOScheduler

entry_point = Path('examples') / "training_scripts" / "height_example" / "train_height.py"
assert entry_point.is_file(), 'File unknown'

mode = "min"
metric = "mean_loss"
instance_type = 'ml.c5.4xlarge'
instance_count = 1
instance_max_time = 999
n_workers = 20

config_space = {
    "steps": 1,
    "width": randint(0, 20),
    "height": randint(-100, 100),
}

backend = SageMakerBackend(
    sm_estimator=PyTorch(
        entry_point=str(entry_point),
        instance_type=instance_type,
        instance_count=instance_count,
        role=get_execution_role(),
        max_run=instance_max_time,
        py_version='py3',
        framework_version='1.6',
    ),
    metrics_names=[metric],
)

# Random search without stopping
scheduler = FIFOScheduler(
    config_space=config_space,
    searcher='random',
    mode=mode,
    metric=metric,
)

tuner = Tuner(
    trial_backend=backend,
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_wallclock_time=300),
    n_workers=n_workers,
)
tuner.run()
OK, I know you guys just hate it, but I'd still love to see some debug_log output. Then, we could see what that trial_id 1 is really doing over the course of the experiment.
I don't get one thing: your config space has height and width, but the table shows config_lr. I am also missing the steps attribute.
What I'd like to see is the results dataframe, which is just very different from that table. In the end, you create your plots and so on from the results, right?
The table refers to my original experiment. I created the example later and confirmed that the same happens but didn't take another screenshot. My original experiment had only one hyperparameter: lr
How is this table different from the results dataframe? This is load_experiment(tuner.name).results
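To make the duplication easy to spot in that dataframe, here is a minimal check (a sketch; it reuses tuner.name from the script above):
from syne_tune.experiments import load_experiment

df = load_experiment(tuner.name).results
# Any trial_id appearing more than once points at the duplicated rows
counts = df['trial_id'].value_counts()
print(counts[counts > 1])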
I have no problem with debug_log; I wasn't aware of it and wasn't sure how to use it properly. As you suggested, I activated it by passing it to the searcher and checked only the console output. Is this the intended use, or does it write more logs somewhere else? Beyond that, I couldn't spot anything suspicious. Trial id 1 actually finishes; nevertheless, more results are reported for it afterwards. I am currently running a much longer experiment. Let's see how frequently trial id 1 pops up.
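For reference, this is roughly how I passed it (a sketch; I'm assuming debug_log is accepted as a searcher option via search_options, and that its output goes through the standard logging module):
import logging

# debug_log writes via the logging module, so raise the log level first
logging.getLogger().setLevel(logging.INFO)

scheduler = FIFOScheduler(
    config_space=config_space,
    searcher='random',
    search_options={'debug_log': True},  # assumed option name
    mode=mode,
    metric=metric,
)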
Longer experiment: 287 rows in the table, 278 trials in total. Only trial id 1 occurs multiple times and it occurs within the first 24 rows (20 workers).
I don't know what load_experiment(...).results is doing. I'd recommend loading the CSV directly and checking what is going on. Thanks for raising this; we need to figure it out. Does this happen with the local backend as well? I've not been using the SageMaker backend much at all, so it may have quite some glitches.
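Something along these lines (a sketch; the path assumes syne-tune's default layout of ~/syne-tune/<tuner_name>/results.csv.zip, with the tuner name taken from your run):
import pandas as pd

# Substitute your own experiment folder name here
path = '~/syne-tune/train-height-2022-03-31-12-34-19-697/results.csv.zip'
df = pd.read_csv(path)
print(df['trial_id'].value_counts().head())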
That's basically what the function does, loading the results.csv.zip: https://github.com/awslabs/syne-tune/blob/main/syne_tune/experiments.py#L144 I've shared my results.csv (and I believe this one is for the example snippet above) internally.
I didn't face problems using only a single worker. I'll try many workers with the local backend and let you know.
I've started using the SM backend more frequently now. Fortunately, this is the first and only issue I've faced. Let's hope it is the last one as well.
Aaron is also using it more now, so let us figure this out!
Hi Martin, I just tried the example you gave and I do have a results.csv with one row per trial. Did you observe the issue with mainline? Can you confirm that the issue also happened with the script you gave?
I set up a new SM notebook instance, created a new conda environment, and ran the script above:
conda create -n test python=3.9.5 -y
conda activate test
pip install syne-tune
git clone https://github.com/awslabs/syne-tune.git
cd syne-tune
pip install -r requirements-ray.txt
python script.py
from syne_tune.experiments import load_experiment
load_experiment('train-height-2022-03-31-12-34-19-697').results['trial_id']
Again, multiple 1s showed up.
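To see what the duplicates look like side by side (a sketch, reusing the dataframe from above; the column names are assumed from the default results layout):
df = load_experiment('train-height-2022-03-31-12-34-19-697').results
# keep=False marks every occurrence of a duplicated trial_id
dups = df[df['trial_id'].duplicated(keep=False)]
print(dups.filter(regex='trial_id|config_|mean_loss'))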
I've run an experiment on a grid of 49 configurations, and all of them were evaluated during the search. The results table has 52 rows, 4 of which have trial id 1. Exactly 49 jobs were executed on SageMaker.
It is not limited to trial id 1: in addition to 1, I also saw duplicates for trial id 2.
We are working on finding a good setup to reproduce this issue as it happens sporadically.
I might have experienced this issue as well. Here is what happens for me. I am running an experiment with the SM backend, ASHA, and the lstm_wikitext2 benchmark. There are 10 seeds; 6 fail, 4 succeed.
In the 6 failed runs, this happens:
- trial_id 1 is successful and moves beyond 9 or 27 epochs
- the scheduler receives a report from trial_id 1 with resource=1, which leads to an exception
- in all cases, the next trial_id to report anything at resource=1 is always 10; "10" looks similar to "1" (?)
In the 4 successful runs, trial_id 1 is stopped at 1 or 3 epochs, at a time when trial_id 10 does not exist yet.
I'll dig a bit into this. If trial_ids are mixed up, we can detect this easily by passing the trial_id to the training function and reporting it back.
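A minimal sketch of that check (it assumes the trial_id gets injected into each trial's config, e.g. as an extra constant hyperparameter added just for debugging; the metric computation only loosely mirrors train_height.py):
from argparse import ArgumentParser
from syne_tune import Reporter

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--trial_id', type=int, required=True)
    parser.add_argument('--height', type=int, required=True)
    parser.add_argument('--width', type=int, required=True)
    args = parser.parse_args()

    report = Reporter()
    # Echo the trial_id back with every report; a mismatch between this value
    # and the trial_id the tuner attributes the report to would prove that
    # reports are being mixed up between trials.
    dummy_loss = args.height / 100 + args.width / 20
    report(mean_loss=dummy_loss, echoed_trial_id=args.trial_id)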
Hunch: the trial with id 10 is mistaken for trial_id 1. Maybe a path prefix is mismatched: in S3, a prefix query for "XYZ-1" also matches "XYZ-10", "XYZ-11", and so on (effectively "XYZ-1*"), unless you terminate the prefix as "XYZ-1/". I'll have a look.
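To illustrate the pitfall (a sketch with a hypothetical bucket and key layout, not syne-tune's actual code):
import boto3

s3 = boto3.client('s3')

# Prefix 'tuner/XYZ-1' also matches keys under XYZ-10/, XYZ-11/, ... (too broad)
broad = s3.list_objects_v2(Bucket='my-bucket', Prefix='tuner/XYZ-1')

# The trailing slash pins the listing to trial 1 only
exact = s3.list_objects_v2(Bucket='my-bucket', Prefix='tuner/XYZ-1/')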
Closing as #374 seems to have addressed the issue.