H-TSP: errors in train and evaluate

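How is low_level_load_path in train.py and config_ppo.yaml generated? In evaluate.py I set lower_model and upper_model, but I get the error "Encoder type cnn not supported!". I tried all four upper_model checkpoints; after loading, encoder_type is cnn rather than pixel. Is there a more detailed guide to training or evaluation?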

Jun 20 '24 03:06 laborer123

How is low_level_load_path in train.py and config_ppo.yaml generated?

low_level_load_path should point to a trained lower-level model. Alternatively, you can train a new lower-level model by following the readme in the rl4cop folder.

The error "Encoder type cnn not supported!": I tried all four upper_model checkpoints, and after loading encoder_type is cnn rather than pixel.

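This happens because we experimented with several different CNN architectures and eventually settled on one. A later renaming in the code made the encoder-type check incompatible with the trained checkpoints. The code has now been updated to work with those checkpoints. You can pull the latest code or manually edit the relevant check in h_tsp.py.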
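A minimal sketch of what such a compatibility check could look like, assuming the checkpoints store the legacy name cnn while the current code expects pixel. The helper name, the alias table, and the set of supported types are hypothetical and are not taken from h_tsp.py.

```python
# Hypothetical alias table: legacy checkpoint name -> current name.
LEGACY_ENCODER_ALIASES = {"cnn": "pixel"}
# Assumed set of names the current code accepts (illustrative only).
SUPPORTED_ENCODER_TYPES = {"pixel"}


def normalize_encoder_type(encoder_type: str) -> str:
    """Map a legacy encoder name to its current name, then validate it."""
    name = LEGACY_ENCODER_ALIASES.get(encoder_type, encoder_type)
    if name not in SUPPORTED_ENCODER_TYPES:
        raise NotImplementedError(f"Encoder type {encoder_type} not supported!")
    return name
```

Normalizing the name once, right after the checkpoint is loaded, would keep the rest of the code free of legacy names.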

If you have any other questions, feel free to ask.

Jun 20 '24 12:06 neo-pan

Thanks for the reply. I've now run into a new problem: evaluate.py runs fine, but train.py fails when low_level_model is trained as well.

1. Errors when training low_level_model at the same time. 1) The method called at https://github.com/Learning4Optimization-HUST/H-TSP/blob/ee91c49cc7bfd9fb76ae7e7f5a0631877c262675/h_tsp.py#L1299 was changed in newer pytorch_lightning, starting with version 1.5.6, from the original:

def attach_model_logging_functions(self, model):
    for callback in self.trainer.callbacks:
        callback.log = model.log
        callback.log_dict = model.log_dict

to the current:

def _attach_model_logging_functions(self) -> None:
    lightning_module = self.trainer.lightning_module
    for callback in self.trainer.callbacks:
        callback.log = lightning_module.log
        callback.log_dict = lightning_module.log_dict

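Even if I change it back to the old form, the line at https://github.com/Learning4Optimization-HUST/H-TSP/blob/ee91c49cc7bfd9fb76ae7e7f5a0631877c262675/rl4cop/train_path_solver.py#L416 raises "You are trying to self.log() but it is not managed by the Trainer control flow". It seems the low_level_model callbacks cannot simply be copied over. Is there a good way to handle this without going back to the old version? Some of the old libraries have very poor compatibility.

2) I roughly recreated the old environment from requirements. Some libraries appear to conflict, but it runs. However, I get this warning: /opt/conda/envs/opl/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py:415: UserWarning: You are trying to self.log() but the self.trainer reference is not registered on the model yet. This is most likely because the model hasn't been passed to the Trainer. So it seems low_level_model is still not being trained.

2. I'm training on a single 3090, and one epoch takes 10 hours. Is that speed normal?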

Jun 26 '24 06:06 laborer123

This logging problem has been fixed in newer pytorch-lightning versions (#13638). You can now use the following setter to set the trainer for low_level_model:

https://github.com/Lightning-AI/pytorch-lightning/blob/2524864b3cc8dae3552a4b5c3c819c2ce6f278af/src/lightning/pytorch/core/module.py#L223-L228
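The linked setter passes the trainer on to nested modules. The dependency-free sketch below imitates that pattern with stand-in classes (Trainer and Module here are not the Lightning classes, and add_child is a hypothetical helper) only to show the effect: assigning a trainer to the outer module also reaches the nested one.

```python
from typing import Iterator, List, Optional


class Trainer:
    """Stand-in for pl.Trainer; only object identity matters here."""


class Module:
    """Stand-in for pl.LightningModule with a propagating trainer setter."""

    def __init__(self) -> None:
        self._trainer: Optional[Trainer] = None
        self._children: List["Module"] = []

    def add_child(self, child: "Module") -> None:
        self._children.append(child)

    def children(self) -> Iterator["Module"]:
        return iter(self._children)

    @property
    def trainer(self) -> Optional[Trainer]:
        return self._trainer

    @trainer.setter
    def trainer(self, trainer: Optional[Trainer]) -> None:
        # Hand the same trainer to every nested module first.
        for child in self.children():
            child.trainer = trainer
        self._trainer = trainer


upper = Module()  # plays the role of the upper-level model
lower = Module()  # plays the role of low_level_model
upper.add_child(lower)
upper.trainer = Trainer()
```

In real Lightning, children() comes from torch.nn.Module, so the propagation would only reach low_level_model if it is registered as a submodule of the outer model.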

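As for training time, 10 hours per epoch does not seem normal. In my earlier experiments on a V100, one epoch took about 15 minutes. You could check which part takes the most time; it may be related to differences between versions.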
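One simple way to find the slow part is to accumulate wall-clock time per section. This is a generic sketch, not code from the repository; the timed helper and the labels step_a and step_b are placeholders for whatever parts of the training loop are suspected.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per labelled section.
timings = defaultdict(float)


@contextmanager
def timed(label: str):
    """Add the elapsed time of the wrapped block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] += time.perf_counter() - start


# Stand-in workload; in practice wrap the suspected hot spots instead.
for _ in range(3):
    with timed("step_a"):
        sum(i * i for i in range(10_000))
    with timed("step_b"):
        sum(range(1_000))

# Print sections from slowest to fastest.
for label, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {seconds:.4f}s")
```

CUDA kernels run asynchronously, so for GPU code the numbers are only meaningful if torch.cuda.synchronize() is called before each timestamp is taken.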

Jun 30 '24 11:06 neo-pan

I'm not familiar with this usage. I tried adding the following to both the HTSP_PPO class and the PathSolver class:

@pl.LightningModule.trainer.setter
    def trainer(self, trainer: Optional["pl.Trainer"]) -> None:
        for v in self.children():
            if isinstance(v, pl.LightningModule):
                v.trainer = trainer  # type: ignore[assignment]
        self._trainer = trainer

and rewriting https://github.com/Learning4Optimization-HUST/H-TSP/blob/ee91c49cc7bfd9fb76ae7e7f5a0631877c262675/h_tsp.py#L1299 as self.low_level_model.trainer._callback_connector._attach_model_logging_functions(). Neither worked. Could you give me some guidance?

With both the old and the new environment, training with the default configuration, an initial check shows that the time is spent here: https://github.com/Learning4Optimization-HUST/H-TSP/blob/ee91c49cc7bfd9fb76ae7e7f5a0631877c262675/h_tsp.py#L1114 The decoder part takes the longest, accumulated over its loop iterations: https://github.com/Learning4Optimization-HUST/H-TSP/blob/ee91c49cc7bfd9fb76ae7e7f5a0631877c262675/rl4cop/train_path_solver.py#L262 Next come get_fragment_knn and env.step: https://github.com/Learning4Optimization-HUST/H-TSP/blob/ee91c49cc7bfd9fb76ae7e7f5a0631877c262675/h_tsp.py#L754 https://github.com/Learning4Optimization-HUST/H-TSP/blob/ee91c49cc7bfd9fb76ae7e7f5a0631877c262675/rl4cop/train_path_solver.py#L257

Jul 01 '24 06:07 laborer123

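I found this way of assigning the trainer while searching for related issues; I had not used it in my own code before. After testing, I found that changing this part of the code is fairly cumbersome. To train normally, a simple workaround is to temporarily comment out the logging-related code in low_level_model during training (https://github.com/Learning4Optimization-HUST/H-TSP/blob/ee91c49cc7bfd9fb76ae7e7f5a0631877c262675/rl4cop/train_path_solver.py#L416C1-L433C10).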
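As an alternative to commenting lines out, the logging calls could be put behind a switch. The sketch below uses a stand-in class rather than the real PathSolver; the enable_logging flag, the _maybe_log helper, and the record_metrics method are hypothetical and not part of the H-TSP code.

```python
class PathSolverStandIn:
    """Stand-in for the lower-level model; not the real PathSolver."""

    def __init__(self, enable_logging: bool = True) -> None:
        self.enable_logging = enable_logging
        self.logged = {}

    def log(self, name: str, value: float) -> None:
        # In a LightningModule this would be the inherited self.log().
        self.logged[name] = value

    def _maybe_log(self, name: str, value: float) -> None:
        # Skip logging when this model is driven by another module's trainer.
        if self.enable_logging:
            self.log(name, value)

    def record_metrics(self, loss: float) -> None:
        self._maybe_log("train/loss", loss)


standalone = PathSolverStandIn()                  # trained on its own: logs
nested = PathSolverStandIn(enable_logging=False)  # used inside H-TSP: silent
standalone.record_metrics(0.5)
nested.record_metrics(0.5)
```

With a flag like this, standalone training of the lower-level model would keep its logs, while training inside the hierarchical model would skip them.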

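I also rechecked my earlier experiment records. To make it easier to record results, I had set epoch_size to 4, whereas the current default is epoch_size 200. So the longer training time is expected.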
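A rough check of this, assuming epoch time scales linearly with epoch_size and ignoring the hardware difference between the V100 and the 3090:

```python
minutes_at_size_4 = 15  # reported V100 epoch time with epoch_size = 4
old_epoch_size = 4
new_epoch_size = 200    # current default

# Linear-scaling assumption: time per epoch grows with epoch_size.
scale = new_epoch_size / old_epoch_size
estimated_hours = minutes_at_size_4 * scale / 60
print(f"{scale:.0f}x longer, about {estimated_hours:.1f} hours per epoch")
```

That estimate of roughly 12.5 hours is in the same range as the 10 hours reported on the 3090.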

Jul 04 '24 12:07 neo-pan
