GaussianAnything

Finetuning on a different dataset (i23d and t23d)

Me-AU opened this issue 9 months ago

I want to finetune this model on my own two datasets. One goes from a raw single image to 3D meshes, and the other from text (my own feature embeddings) to 3D meshes. All meshes share the same topology, so they can easily be converted to point clouds if needed.

For this, I have been trying to reverse-engineer the required data layout from the dataset classes in GaussianAnything/datasets/g_buffer_objaverse.py, where class ChunkObjaverseDatasetDDPMgsI23D and class ChunkObjaverseDatasetDDPMgsT23D appear most relevant to my use case.

The structure I have understood so far is:

/path/to/dataset/  # `file_path`
├── SampleID_A/
│   └── view_00001.png
├── SampleID_B/
│   └── view_00001.png
└── ...
└── captions.json
/path/to/latents/  # `mv_latent_dir`
├── SampleID_A/
│   └── latent.npz
├── SampleID_B/
│   └── latent.npz
└── ...

Here,

  • view_00001.png is a raw RGB or grayscale image
  • latent.npz looks like the following (a minimal writer sketch follows this list):
{
'latent_normalized' : [...],  # pre-computed latent from the model's VAE; I am using a randomly generated (2, 16, 768) array here for testing right now
'query_pcd_xyz' : [...],  # 3D point cloud, an array of shape [N, 3] loaded from the 3D .obj file
}
  • captions.json looks like:
{
'SampleID_A' : 'Caption_A',
...
}
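For concreteness, here is a minimal sketch of how I would write this layout with NumPy and the json module. The folder names, key names, and array shapes are my assumptions from above with placeholder data only, not anything confirmed by the repo:

import json
import os

import numpy as np

dataset_root = '/path/to/dataset'   # `file_path` in the assumed layout
latent_root = '/path/to/latents'    # `mv_latent_dir` in the assumed layout

captions = {}
# (sample id, caption text, [N, 3] point cloud from the mesh) -- placeholder data
samples = [('SampleID_A', 'Caption_A', np.random.rand(4096, 3).astype(np.float32))]

for sample_id, caption, pcd_xyz in samples:
    # Dummy latent standing in for the VAE output; shape (2, 16, 768) as assumed above.
    latent_normalized = np.random.randn(2, 16, 768).astype(np.float32)

    sample_dir = os.path.join(latent_root, sample_id)
    os.makedirs(sample_dir, exist_ok=True)
    np.savez(os.path.join(sample_dir, 'latent.npz'),
             latent_normalized=latent_normalized,
             query_pcd_xyz=pcd_xyz)
    captions[sample_id] = caption

with open(os.path.join(dataset_root, 'captions.json'), 'w') as f:
    json.dump(captions, f, indent=2)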
  1. Is this correct? If not, what changes should I make / how should I structure my data, and which dataset class should I use?
  2. Secondly, for generating latent_normalized for each 3D mesh in my dataset, is running the bash shell_scripts/release/inference/vae-3d.sh command given in the 3D VAE Reconstruction section appropriate? I am confused because it mentions "encoding multi-view 3D renderings to point-cloud structured latent code", whereas in my case I would be encoding a mesh or point cloud into the latent embedding.
  3. Lastly, if I want to use text embeddings that I generate myself, where do I plug them into the code (instead of passing captions to the code)?

Can you please guide me on this? Thank you for the help on my previous issue too!

Me-AU · Mar 26 '25

Hi, I will lint the readme and upload all VAE & 3D diffusion training datasets this week for better reproduction. Will let you know soon~

NIRVANALAN · Mar 27 '25

I face the same problem. Have you successfully finetuned the model? I tried to read the source code g_buffer_objaverse.py to generate the input tensors as in the paper: image $H \times W \times 3$, Plücker embedding $P \in \mathbb{R}^{H \times W \times 6}$, normal map $N \in \mathbb{R}^{H \times W \times 3}$, and depth $D \in \mathbb{R}^{H \times W \times 6}$. However, I cannot find a way to generate the normal map $N$. Could you please give more description of c2w? Does it mean the following?

| R11 R12 R13 Tx |
| R21 R22 R23 Ty |
| R31 R32 R33 Tz |
| 0   0   0   1  |

Do you have any ideas or further details? @NIRVANALAN @Me-AU

kehuantiantang · Aug 11 '25

Hi, you mean loading the normal map? The normal map is directly loaded from the local disk and there is no need to generate it.

https://github.com/NIRVANALAN/GaussianAnything/blob/3cf3fdefc9d7e5a2f0088434033cf46cb7fc217b/datasets/g_buffer_objaverse.py#L3291

c2w means the camera-to-world matrix.
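Concretely, a 4x4 c2w of the form you wrote maps camera-frame points into the world frame: the top-left 3x3 block is the camera orientation and the last column is the camera position in world coordinates. A minimal NumPy illustration (generic, not code from this repo):

import numpy as np

# Hypothetical 4x4 camera-to-world matrix: [R | t] on top of [0 0 0 1].
c2w = np.eye(4, dtype=np.float32)

R = c2w[:3, :3]          # camera orientation (camera axes expressed in the world frame)
cam_origin = c2w[:3, 3]  # camera center (Tx, Ty, Tz) in world coordinates

# A point given in the camera frame, 1 unit along the camera's viewing axis ...
p_cam = np.array([0.0, 0.0, 1.0], dtype=np.float32)
# ... expressed in the world frame.
p_world = R @ p_cam + cam_origin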

You can run the VAE encoding script in the readme to see how the image, Plücker embedding, normal, and depth are loaded from the local disk.

NIRVANALAN · Aug 13 '25

Thank you so much for your reply. I would like to know whether it is possible to generate the normal map from only the camera extrinsics and intrinsics? In my case, it is hard to collect depth maps for my data. Also, is a normalization operation necessary for the Plücker map? Thank you so much for your suggestion.

kehuantiantang · Aug 14 '25

There is no need to normalize the Plücker map.
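For reference, a common way to build a 6-channel Plücker map from the intrinsics K and c2w is to store, per pixel, the unit ray direction d and the moment o x d (with o the camera origin). The sketch below is this generic formulation, not code from this repo, and it applies no further normalization to the resulting map; the pinhole/OpenCV-style pixel convention is an assumption:

import numpy as np

def plucker_ray_map(K, c2w, H, W):
    """Per-pixel 6-channel Plücker embedding [o x d, d] from intrinsics K and a 4x4 c2w."""
    # Pixel-center grid in image coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    # Back-project pixels to camera-frame ray directions (pinhole model assumed).
    dirs_cam = np.stack([(u - K[0, 2]) / K[0, 0],
                         (v - K[1, 2]) / K[1, 1],
                         np.ones_like(u)], axis=-1)
    # Rotate directions into the world frame and make them unit length.
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # Ray origin is the camera center; the moment is o x d.
    origin = np.broadcast_to(c2w[:3, 3], dirs_world.shape)
    moment = np.cross(origin, dirs_world)
    return np.concatenate([moment, dirs_world], axis=-1)  # (H, W, 6)

The channel order ([o x d, d] vs. [d, o x d]) varies between codebases, so check it against the repo's loader.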

If you do not include normals for VAE training, it should also be fine, in my experience.

NIRVANALAN · Aug 17 '25