lfads-torch
Ray tune warnings and 'metric' not reported in result (ValueError)
Hi Andrew,
I'm trying to run the multisession example and get some warnings at first, followed by a ValueError. Can you please recommend any debugging steps? I am not familiar with Ray Tune for training.
I get these warnings first, and I'm not sure whether anything is broken because of them:
2024-01-26 11:26:57,399 INFO worker.py:1528 -- Started a local Ray instance.
C:\Users\anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\trainable\function_trainable.py:609: DeprecationWarning: checkpoint_dir in func(config, checkpoint_dir) is being deprecated. To save and load checkpoint in trainable functions, please use the ray.air.session API:
from ray.air import session
def train(config):
    # ...
    session.report({"metric": metric}, checkpoint=checkpoint)
For more information please see https://docs.ray.io/en/master/tune/api_docs/trainable.html
  warnings.warn(
2024-01-26 11:26:59,210 WARNING trial_runner.py:1604 -- You are trying to access _search_alg interface of TrialRunner in TrialScheduler, which is being restricted. If you believe it is reasonable for your scheduler to access this TrialRunner API, please reach out to Ray team on GitHub. A more strict API access pattern would be enforced starting 1.12s.0
Then this ValueError terminates the script. If I use any other metric that is present in the Result (for example, timestamp), the run gets past this step but eventually fails because tuning also depends on the 'cur_epoch' metric:
ValueError: Trial returned a result which did not include the specified metric(s) valid/recon_smth that tune.TuneConfig() expects. Make sure your calls to tune.report() include the metric, or set the TUNE_DISABLE_STRICT_METRIC_CHECKING environment variable to 1. Result: {'trial_id': 'dfd00_00000', 'experiment_id': 'e5eb8f5c73b546ee9bef65bb16997574', 'date': '2024-01-26_11-27-03', 'timestamp': 1706297223, 'pid': 95292, 'hostname': 'DESKTOP', 'node_ip': '127.0.0.1', 'done': True, 'config/datamodule': 'BMI_multisession_PCR', 'config/model': 'BMI_multisession_PCR', 'config/logger.wandb_logger.project': 'BMI', 'config/logger.wandb_logger.tags.0': 'BMI_multisession_PCR', 'config/logger.wandb_logger.tags.1': 'version_240126112654', 'config/model.lr_init': 0.001, 'config/model.dropout_rate': 0.3511779084499725, 'config/model.train_aug_stack.transforms.0.cd_rate': 0.5, 'config/model.kl_co_scale': 0.0001115416382089259, 'config/model.kl_ic_scale': 0.00010476283727212514, 'config/model.l2_gen_scale': 0.5024837234461056, 'config/model.l2_con_scale': 0.1221168826037272}
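A minimal sketch for checking which keys a trial result like the one above actually contains, in case it helps narrow things down. The helper name `reported_metrics` and the `AUTO_KEYS` set are assumptions made for illustration (the set is copied from the keys in the result above, not from Ray's source), and the result dict is abbreviated. The last line applies the environment variable named in the error message.

```python
import os

# Assumption: bookkeeping keys as they appear in the result above;
# this set is not taken from Ray's source code.
AUTO_KEYS = {"trial_id", "experiment_id", "date", "timestamp", "pid",
             "hostname", "node_ip", "done"}


def reported_metrics(result):
    """Return keys that are neither config entries nor bookkeeping keys."""
    return sorted(
        key for key in result
        if key not in AUTO_KEYS and not key.startswith("config/")
    )


# Abbreviated copy of the result from the error message.
result = {
    "trial_id": "dfd00_00000",
    "timestamp": 1706297223,
    "done": True,
    "config/model.lr_init": 0.001,
}

metrics = reported_metrics(result)
has_metric = "valid/recon_smth" in result
print("reported metrics:", metrics)
print("valid/recon_smth present:", has_metric)

# Workaround named in the error message. It has to be set before Tune starts,
# and it only silences the check; it does not make the trial report the metric.
os.environ["TUNE_DISABLE_STRICT_METRIC_CHECKING"] = "1"
```

For the result above, no metric keys are left once config and bookkeeping entries are filtered out, and 'done' is already True. That suggests the trial ended before it reported anything, rather than the metric simply being misnamed.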
This is a conda setup on a Windows system, so I had to change some config file paths from relative to absolute.
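A rough sketch of the kind of path change involved, assuming the paths are resolved in Python before being written into the configs. The function name `to_absolute_posix` and the example path are hypothetical and not part of lfads-torch. Forward slashes are used because Windows accepts them and they avoid backslash escaping in double-quoted YAML strings.

```python
from pathlib import Path


def to_absolute_posix(path, base="."):
    """Resolve `path` against `base` and return it with forward slashes."""
    p = Path(path)
    if not p.is_absolute():
        p = Path(base) / p
    # resolve() also works for paths that do not exist yet.
    return p.resolve().as_posix()


# Hypothetical relative config path, for illustration only.
abs_path = to_absolute_posix("configs/example.yaml")
print(abs_path)
```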