`prepare_rotary_positional_embeddings` differs between training and inference
Hello, may I ask why `prepare_rotary_positional_embeddings` in lora_trainer.py is written as the following code:
```python
def prepare_rotary_positional_embeddings(
    self,
    height: int,
    width: int,
    num_frames: int,
    transformer_config: Dict,
    vae_scale_factor_spatial: int,
    device: torch.device,
) -> Tuple[torch.Tensor, torch.Tensor]:
    grid_height = height // (vae_scale_factor_spatial * transformer_config.patch_size)
    grid_width = width // (vae_scale_factor_spatial * transformer_config.patch_size)

    if transformer_config.patch_size_t is None:
        # CogVideoX 1.0: no temporal patching
        base_num_frames = num_frames
    else:
        # CogVideoX 1.5: ceil-divide the frame count by the temporal patch size
        base_num_frames = (num_frames + transformer_config.patch_size_t - 1) // transformer_config.patch_size_t

    freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
        embed_dim=transformer_config.attention_head_dim,
        crops_coords=None,
        grid_size=(grid_height, grid_width),
        temporal_size=base_num_frames,
        grid_type="slice",
        max_size=(grid_height, grid_width),  # note: the *current* grid, not the base grid
        device=device,
    )

    return freqs_cos, freqs_sin
```
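For concreteness, the grid arithmetic in this training path works out as follows (the numbers below are illustrative assumptions, not values from any real checkpoint config):

```python
# Illustrative values only (assumptions, not a real CogVideoX config)
height, width, num_frames = 480, 720, 13
vae_scale_factor_spatial, patch_size, patch_size_t = 8, 2, 2

grid_height = height // (vae_scale_factor_spatial * patch_size)    # 480 // 16 = 30
grid_width = width // (vae_scale_factor_spatial * patch_size)      # 720 // 16 = 45
base_num_frames = (num_frames + patch_size_t - 1) // patch_size_t  # ceil(13 / 2) = 7
print(grid_height, grid_width, base_num_frames)                    # 30 45 7
```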
Yet at inference time, in pipeline_cogvideox_image2video.py, it is the following code:
```python
def _prepare_rotary_positional_embeddings(
    self,
    height: int,
    width: int,
    num_frames: int,
    device: torch.device,
) -> Tuple[torch.Tensor, torch.Tensor]:
    grid_height = height // (self.vae_scale_factor_spatial * self.transformer.config.patch_size)
    grid_width = width // (self.vae_scale_factor_spatial * self.transformer.config.patch_size)

    p = self.transformer.config.patch_size
    p_t = self.transformer.config.patch_size_t

    base_size_width = self.transformer.config.sample_width // p
    base_size_height = self.transformer.config.sample_height // p

    if p_t is None:
        # CogVideoX 1.0: crop/interpolate the coordinates relative to the base grid
        grid_crops_coords = get_resize_crop_region_for_grid(
            (grid_height, grid_width), base_size_width, base_size_height
        )
        freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
            embed_dim=self.transformer.config.attention_head_dim,
            crops_coords=grid_crops_coords,
            grid_size=(grid_height, grid_width),
            temporal_size=num_frames,
            device=device,
        )
    else:
        # CogVideoX 1.5: build the embedding over the base grid, then slice to the current grid
        base_num_frames = (num_frames + p_t - 1) // p_t
        freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
            embed_dim=self.transformer.config.attention_head_dim,
            crops_coords=None,
            grid_size=(grid_height, grid_width),
            temporal_size=base_num_frames,
            grid_type="slice",
            max_size=(base_size_height, base_size_width),  # note: the *base* grid here
            device=device,
        )

    return freqs_cos, freqs_sin
```
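For the 1.0 branch, it may help to see what the crop helper actually returns. A minimal sketch, assuming `get_resize_crop_region_for_grid` is importable from the pipeline module (the import path and all sizes are illustrative assumptions):

```python
from diffusers.pipelines.cogvideo.pipeline_cogvideox import get_resize_crop_region_for_grid

# Hypothetical base latent grid: sample_height // p = 30, sample_width // p = 45
base_size_height, base_size_width = 30, 45

# Case 1: current grid equals the base grid -> the crop region is the full base grid,
# so the linspace coordinates reduce to integer positions 0..29 x 0..44.
print(get_resize_crop_region_for_grid((30, 45), base_size_width, base_size_height))
# expected: ((0, 0), (30, 45))

# Case 2: current grid differs from the base grid -> the region is aspect-resized and
# centered, so coordinates are interpolated rather than integer positions.
print(get_resize_crop_region_for_grid((20, 45), base_size_width, base_size_height))
# expected: ((5, 0), (25, 45))
```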
May I ask which of the two is correct, the training version or the inference version?
The code in the pipeline is somewhat verbose for historical reasons; we recommend using the training code as the reference. Although the implementations differ, the final results are the same.
May I ask: when you originally trained the pretrained weights of the current CogVideoX-5b-I2V and CogVideoX1.5-5B-I2V, did you use the training code or the inference code?
@OleehyO I have the same question. It looks like the training code's call to get_3d_rotary_pos_embed generates the positional embedding from the current height and width, whereas in the pipeline the 1.0 branch crops and interpolates against the base height and width, and the 1.5 branch appears to extrapolate from the base_size height and width. If the current height and width are not equal to the base ones, these three approaches should each produce different positional embeddings. Could you explain this?
@cdfan0627 Have you tested the impact of this difference?
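As a quick empirical check of the "results are the same" claim for the 1.5 branch: with grid_type="slice", get_3d_rotary_pos_embed builds the frequency grid over max_size and then slices it down to grid_size, so as long as the base grid is at least as large as the current grid, the training-style and pipeline-style calls should produce identical tensors. A minimal sketch (all sizes below are illustrative assumptions, not a real checkpoint config):

```python
import torch
from diffusers.models.embeddings import get_3d_rotary_pos_embed

grid_height, grid_width = 30, 45   # current latent grid (assumed)
base_h, base_w = 40, 60            # hypothetical sample_height // p, sample_width // p
common = dict(
    embed_dim=64,                  # typical attention_head_dim for CogVideoX
    crops_coords=None,
    grid_size=(grid_height, grid_width),
    temporal_size=7,
    grid_type="slice",
)

# Training-style call: max_size tied to the current grid
cos_a, sin_a = get_3d_rotary_pos_embed(max_size=(grid_height, grid_width), **common)
# Pipeline 1.5-style call: max_size tied to the base grid
cos_b, sin_b = get_3d_rotary_pos_embed(max_size=(base_h, base_w), **common)

print(torch.allclose(cos_a, cos_b), torch.allclose(sin_a, sin_b))  # expected: True True
```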