DiffPortrait3D I am curious about how the Reference Net works.

Open cvipym opened this issue 1 year ago • 0 comments

Thank you for sharing your amazing code.

I have a question and would like to leave it here.

I was curious about how the Appearance Ref works.

So, I looked into how the "image_control" in the condition dictionary (represented as variable c in the code) works.

# inference.py
    for i in range(conditions.shape[0] // nSample):
        print("Generate Image {} in {} images".format(nSample * i, conditions.shape[0])) 
        inpaint = None
        if args.denoise_from_fea_map:
            fea_map_enc = infer_model.get_first_stage_encoding(infer_model.encode_first_stage(fea_condtion[i*nSample: i*nSample+nSample]))
            c = {"c_concat": [conditions[i*nSample: i*nSample+nSample]], "c_crossattn": [c_cross], "image_control": cond_img_cat, 'feature_control':fea_map_enc}
            if args.control_mode == "controlnet_important":
                uc = {"c_concat": [conditions[i*nSample: i*nSample+nSample]], "c_crossattn": [uc_cross]}
            else:
                uc = {"c_concat": [conditions[i*nSample: i*nSample+nSample]], "c_crossattn": [uc_cross], "image_control": cond_img_cat}
            c['wonoise'] = True
            uc['wonoise'] = True

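At this point, I discovered that in the p_sample_ddim function of the DDIMSampler_ReferenceOnly class, the entries of c['image_control'] are concatenated along dimension 1 into cond_image_start, which then becomes reference_image_noisy: unchanged when c['wonoise'] is set, or noised to timestep t via q_sample otherwise.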

def p_sample_ddim(
...
        if 'image_control' in c and c['image_control'] is not None:
            cond_image_start = torch.cat(c['image_control'], 1)
            # cond_image_start = self.model.get_first_stage_encoding(self.model.encode_first_stage(cond_image_hint))
            if c['wonoise']:
                reference_image_noisy = cond_image_start
            else:
                reference_image_noisy = self.model.q_sample(cond_image_start,t)
...
                model_uncond = self.model.apply_model(x_in, t_in, c_in, None, uc=True)
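
To make the two paths of that branch concrete, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). It assumes q_sample follows the standard DDPM forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. The helper make_reference and the toy alphas_cumprod schedule are hypothetical names for illustration, not code from the repository.

```python
import math
import random


def q_sample(x_start, t, alphas_cumprod, rng):
    """Standard DDPM forward noising (assumed form, not copied from the repo)."""
    a = alphas_cumprod[t]
    return [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for x in x_start
    ]


def make_reference(cond_image_start, t, wonoise, alphas_cumprod, rng):
    """Hypothetical helper mirroring the wonoise branch in p_sample_ddim."""
    if wonoise:
        # Reference is passed through clean, independent of the timestep.
        return cond_image_start
    # Otherwise the reference is noised to the same timestep t.
    return q_sample(cond_image_start, t, alphas_cumprod, rng)


# Toy schedule: alpha_bar shrinks as t grows, so later steps are noisier.
alphas_cumprod = [0.99 ** (i + 1) for i in range(1000)]
rng = random.Random(0)
latent = [0.5, -1.0, 2.0]  # stand-in for the encoded reference image

clean = make_reference(latent, 500, True, alphas_cumprod, rng)
noisy = make_reference(latent, 500, False, alphas_cumprod, rng)
```

Since inference.py sets c['wonoise'] = True, the excerpt above suggests the reference is used clean at every step, and the timestep only matters on the q_sample path.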

This reference_image_noisy is also an input to the function apply_model of the class LatentDiffusionReferenceOnly.

However, looking at the code, reference_image_noisy does not appear to be used: the body of apply_model never references it, and only x_noisy, t, and cond are passed on to self.model.

    def apply_model(self, x_noisy, t, cond, reference_image_noisy=None ,return_ids=False):
        if isinstance(cond, dict):
            # hybrid case, cond is expected to be a dict
            pass
        else:
            if not isinstance(cond, list):
                cond = [cond]
            key = 'c_concat' if self.model.conditioning_key == 'concat' else 'c_crossattn'
            cond = {key: cond}

        x_recon = self.model(x_noisy, t, **cond)

        if isinstance(x_recon, tuple) and not return_ids:
            return x_recon[0]
        else:
            return x_recon
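
One thing this excerpt does show is that cond is unpacked with **cond, so every key of the dictionary built in inference.py (including "image_control") is forwarded to self.model as a keyword argument. The sketch below illustrates only that mechanism. ToyWrapper is a hypothetical stand-in, and whether the repository's wrapped model actually consumes image_control this way is an assumption that would need to be checked against its forward method.

```python
class ToyWrapper:
    """Hypothetical stand-in for the object stored in self.model."""

    def __call__(self, x_noisy, t, c_concat=None, c_crossattn=None,
                 image_control=None, **extra):
        # Report which conditioning keys actually arrived as keywords.
        return {
            "image_control": image_control is not None,
            "extra": sorted(extra),
        }


def apply_model(model, x_noisy, t, cond, reference_image_noisy=None):
    # Same shape as the quoted method: reference_image_noisy is ignored,
    # but every key in cond is forwarded as a keyword argument.
    return model(x_noisy, t, **cond)


cond = {
    "c_concat": ["pose"],
    "c_crossattn": ["text"],
    "image_control": ["ref"],
    "wonoise": True,
}

with_ref = apply_model(ToyWrapper(), "x", 0, cond, reference_image_noisy="ignored")
without_ref = apply_model(ToyWrapper(), "x", 0, cond)
```

In this toy setup the result is identical with or without reference_image_noisy, while image_control still reaches the model through the dictionary. If the real code behaves the same way, the appearance reference would travel via cond rather than via the unused parameter.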

I am curious about how reference_image_noisy serves the role of an appearance reference.

Jul 16 '24 13:07 cvipym
