Details about CLIP fine-tuning and zero-shot text-guided editing
Hi,
Could you kindly provide more details on the setting for model fine-tuning with CLIP and the zero-shot text-guided expression editing procedure?
For model fine-tuning with CLIP, my understanding is that the same losses as in emotion adaptation are used, in addition to a CLIP loss, and that fine-tuning is performed on MEAD, where each training video is paired with a fixed text prompt for its emotion category (attached in the screenshot).
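To make my understanding concrete, here is a rough PyTorch sketch of how I imagine the CLIP term being added on top of the emotion-adaptation losses. Everything in it is my own guess rather than code from the repo: the prompt wording, `clip_emotion_loss`, `emotion_adaptation_losses`, and `lambda_clip` are hypothetical placeholders.

```python
# Hypothetical sketch of the CLIP fine-tuning loss (my guess, not from EAT_code).
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()

# Fixed text prompt per MEAD emotion category (the exact wording is my guess).
EMOTION_PROMPTS = {
    "angry": "an angry face",
    "happy": "a happy face",
    "sad": "a sad face",
    # ... one prompt per emotion category
}

# CLIP's own input normalization constants.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def clip_emotion_loss(generated_frames, emotion_label):
    """Cosine distance between CLIP embeddings of generated frames and the emotion prompt.

    generated_frames: (B, 3, H, W) tensor in [0, 1] from the generator.
    """
    text = clip.tokenize([EMOTION_PROMPTS[emotion_label]]).to(device)
    with torch.no_grad():  # the text branch carries no gradient
        text_feat = clip_model.encode_text(text).float()
    # CLIP's visual encoder expects normalized 224x224 inputs.
    frames = F.interpolate(generated_frames, size=(224, 224), mode="bilinear", align_corners=False)
    frames = (frames - CLIP_MEAN) / CLIP_STD
    img_feat = clip_model.encode_image(frames).float()
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (1.0 - (img_feat * text_feat).sum(dim=-1)).mean()

# total_loss = emotion_adaptation_losses + lambda_clip * clip_emotion_loss(frames, label)
```

Is this roughly how the fine-tuning is set up, or does the CLIP loss enter differently?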
For the zero-shot text-guided expression editing, I was wondering how the CLIP text feature is incorporated into the existing model structure (e.g., via a mapping from the CLIP feature to the latent code z, or to the emotion prompt?).
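To clarify what I mean by the first option, here is a purely hypothetical sketch of such a mapping; the class name `ClipToLatentMapper` and the dimensions are made up and are not from EAT_code.

```python
# Hypothetical illustration of "a mapping from the CLIP feature to the latent code z".
import torch
import torch.nn as nn

class ClipToLatentMapper(nn.Module):
    """Maps a CLIP text embedding to an emotion latent code z."""
    def __init__(self, clip_dim=512, latent_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, latent_dim),
        )

    def forward(self, text_feat):
        # text_feat: (B, clip_dim) CLIP text feature -> (B, latent_dim) latent code z
        return self.mlp(text_feat)
```

Is something along these lines used, or is the CLIP feature injected into the emotion prompt instead?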
Thank you in advance for your time and help!
Originally posted by @JamesLong199 in https://github.com/yuangan/EAT_code/issues/23#issuecomment-2110126521