
How do camera viewpoints work at training and inference?

Open QiuhongAnnaWei opened this issue 1 year ago • 3 comments

Thanks for releasing the code!

I am trying to understand how the camera viewpoints are sampled and used, and I have a few questions:

  1. How exactly does the model take in the camera viewpoint? Is it the same as zero123's conditional latent-diffusion architecture, where the input view (x) and a relative viewpoint transformation (R, T) are used as conditioning information? If so, are you using the same conditioning encoder as zero123?

  2. The report says Zero123++ uses a fixed set of 6 poses (relative azimuth and absolute elevation angles) as the prediction target (see the pose sketch after this list).
     a. zero123 trains on a dataset of paired images and their relative camera extrinsics, {(x, x_(R,T), R, T)}. Is the equivalent notation for zero123++ {(x, x_(tiled 6 images), R_{1..6}, T_{1..6})}?
     b. Tying back to Q1, does this mean that instead of taking in (x) and (R, T) as conditioning input, zero123++ takes in (x_{1..6}) and (R_{1..6}, T_{1..6}) as conditioning input?

  3. I would like to explicitly pass in a randomly sampled camera viewpoint at inference time. Is that possible? I couldn't find the part of the code that would allow this.
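
For reference, here is how I read the report's description of the fixed pose set. The exact angle values, ordering, and sign convention below are my own assumption from the report text, not taken from the released code:

```python
# My reading of the report: elevations interleave between 30 deg (looking down)
# and -20 deg (looking up); azimuths are relative to the input view, starting at
# 30 deg and stepping by 60 deg. Ordering and sign convention are assumptions.
FIXED_POSES = [
    (30.0,  30.0),   # (elevation_deg, relative_azimuth_deg)
    (-20.0, 90.0),
    (30.0, 150.0),
    (-20.0, 210.0),
    (30.0, 270.0),
    (-20.0, 330.0),
]
```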

QiuhongAnnaWei avatar Oct 26 '23 16:10 QiuhongAnnaWei

If the answer to 3) is that it's not possible -- is the plan to release the training code + dataset so folks can make their own view set (e.g. a 360 orbit)?

avaer avatar Oct 26 '23 16:10 avaer

We do not explicitly use any camera pose input during training or inference. It is just that the designed output views have no ambiguity given the input image, so no pose conditioning is needed. Sampling twice with different camera parameters as input would not give consistent results, so we did not think it would be helpful.
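
For reference, a minimal inference sketch along the lines of the README usage; note that the only conditioning input is the image, and there is no pose argument:

```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler

# Load the Zero123++ custom pipeline from the Hugging Face Hub.
pipeline = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.1",
    custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16,
)
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipeline.scheduler.config, timestep_spacing="trailing"
)
pipeline.to("cuda")

cond = Image.open("input.png")  # single conditioning image; no camera pose is passed
result = pipeline(cond, num_inference_steps=75).images[0]  # tiled grid of the 6 fixed views
result.save("output.png")
```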

See #10 for comments on training code and camera pose conditioning.

eliphatfs avatar Oct 26 '23 17:10 eliphatfs

eliphatfs said:

We do not explicitly use any camera pose input during training or inference. It is just that the designed output views do not have any ambiguity given the input image so we do not need any.

=> By "not explicitly use any camera pose input during training",

do you mean that, as a training pair, you use

(cond_image_i, target_grid), i = 1, ..., 12, for each mesh?

Here target_grid consists of 6 images obtained by rendering a given mesh from the 6 camera positions with fixed absolute elevation angles and relative azimuth angles. cond_image_i refers to the image obtained by rendering the mesh from the i-th randomly chosen camera position. The number of cond_images, 12, is arbitrary.
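
In code terms, the pairing I have in mind would look roughly like the sketch below; render_fn, fixed_poses, and the sampling ranges are placeholders of mine, not the actual training code:

```python
import random

def make_training_pairs(mesh, render_fn, fixed_poses, num_cond_views=12):
    """Hypothetical pairing: render_fn(mesh, elevation, azimuth) -> image is a
    placeholder renderer; fixed_poses is the list of 6 (elevation, azimuth) targets."""
    # One fixed target grid per mesh, rendered at the 6 designed poses.
    target_grid = [render_fn(mesh, elev, azim) for elev, azim in fixed_poses]
    pairs = []
    for _ in range(num_cond_views):
        # Randomly chosen conditioning viewpoint; ranges are illustrative guesses.
        elev = random.uniform(-20.0, 45.0)
        azim = random.uniform(0.0, 360.0)
        pairs.append((render_fn(mesh, elev, azim), target_grid))
    return pairs
```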

moonryul avatar Sep 21 '24 07:09 moonryul