[arXiv] D^3Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation
D3Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation
Website | Paper | Colab | Doc
Yixuan Wang1*, Zhuoran Li2, 3*, Mingtong Zhang1, Katherine Driggs-Campbell1, Jiajun Wu2, Li Fei-Fei2, Yunzhu Li1, 2
1University of Illinois Urbana-Champaign,
2Stanford University,
3National University of Singapore
https://github.com/WangYixuan12/d3fields/assets/32333199/a3fced3d-e827-4e7e-ad6a-e80889809fca
Try it in Colab!
In this notebook, we show how to build D3Fields and visualize the reconstructed mesh, mask fields, and descriptor fields. We also demonstrate how to track keypoints in a video.
Installation
We recommend Mambaforge instead of the standard Anaconda distribution for faster installation:
# create conda environment
mamba env create -f env.yaml
conda activate d3fields
# download pretrained models and data
bash scripts/download_ckpts.sh
bash scripts/download_data.sh
Visualization
python vis_repr.py # visualize the representation
python vis_tracking.py # visualize the tracking
Code Explanation
Fusion is the core class of D3Fields. It contains the following key functions:
- update: takes in the observation and updates the internal states.
- text_queries_for_inst_mask: queries the instance mask according to the text queries and thresholds.
- text_queries_for_inst_mask_no_track: similar to text_queries_for_inst_mask, but does not invoke the underlying XMem tracking module.
- eval: evaluates the associated features for arbitrary 3D points.
- batch_eval: for a large batch of points, evaluates them batch by batch to avoid out-of-memory errors.
The important attributes of Fusion are:
- curr_obs_torch: a dictionary containing the following keys:
  - color: multiview color images as np.uint8 BGR numpy arrays
  - color_tensor: multiview color images as float32 BGR torch tensors
  - depth: multiview depth images as float32 torch tensors, in meters
  - mask: multiview instance mask images as uint8 torch tensors of shape (V, H, W, num_inst)
  - consensus_mask_label: mask labels aggregated from all views, as a list of strings
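The batch-by-batch idea behind batch_eval can be sketched independently of the repository. The snippet below is a minimal illustration, not the Fusion implementation: eval_in_batches, eval_fn, toy_eval, and batch_size are hypothetical names, and a toy NumPy function stands in for the real feature query.

```python
import numpy as np


def eval_in_batches(points, eval_fn, batch_size=4096):
    """Apply eval_fn to (N, 3) points in chunks and concatenate the results.

    Hypothetical helper: it only illustrates chunked evaluation and does not
    reproduce the signature or outputs of Fusion.batch_eval.
    """
    outputs = []
    for start in range(0, len(points), batch_size):
        chunk = points[start:start + batch_size]  # at most batch_size points
        outputs.append(eval_fn(chunk))
    return np.concatenate(outputs, axis=0)


def toy_eval(pts):
    # Stand-in for a per-point feature query: distance to the origin.
    return np.linalg.norm(pts, axis=1, keepdims=True)


rng = np.random.default_rng(0)
points = rng.uniform(-0.4, 0.4, size=(10_000, 3))
features = eval_in_batches(points, toy_eval, batch_size=3000)
print(features.shape)  # one feature row per input point
```

Only one chunk of intermediate results is alive at a time, so peak memory is bounded by batch_size rather than by the total number of query points.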
Customized Dataset
To run D3Fields on your own dataset, follow these steps:
- Prepare dataset in the following structure:
dataset_name
├── camera_0
│   ├── color
│   │   ├── 0.png
│   │   ├── 1.png
│   │   ├── ...
│   ├── depth
│   │   ├── 0.png
│   │   ├── 1.png
│   │   ├── ...
│   ├── camera_extrinsics.npy
│   ├── camera_params.npy
├── camera_1
├── ...
camera_extrinsics.npy and camera_params.npy are defined as follows:
camera_extrinsics.npy: (4, 4) numpy array, the extrinsics of the camera, which transform a point from world coordinates to camera coordinates
camera_params.npy: (4,) numpy array, the camera parameters in the following order: fx, fy, cx, cy
- Prepare the PCA pickle file for the query texts. Find four images of the query text (e.g. mug) with clean backgrounds and centered objects. Change obj_type within scripts/prepare_pca.py and run it.
- Specify the workspace boundary as x_lower, x_upper, y_lower, y_upper, z_lower, z_upper.
- Run python vis_repr_custom.py, for example:
python vis_repr_custom.py --data_path data/2023-09-15-13-21-56-171587 --pca_path pca_model/mug.pkl --query_texts mug --query_thresholds 0.3 --x_lower -0.4 --x_upper 0.4 --y_upper 0.3 --y_lower -0.4 --z_upper 0.02 --z_lower -0.2
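To make the conventions in the steps above concrete, the sketch below back-projects a depth image into world coordinates using the (fx, fy, cx, cy) parameters and the world-to-camera extrinsics, then keeps only the points inside the workspace boundary. It is an illustration under stated assumptions, not code from the repository: depth_to_world and crop_to_workspace are hypothetical names, the depth values are assumed to already be in meters, and an undistorted pinhole camera is assumed.

```python
import numpy as np


def depth_to_world(depth, cam_params, world_to_cam):
    """Back-project an (H, W) depth image in meters to (H*W, 3) world points.

    cam_params is (fx, fy, cx, cy); world_to_cam is the (4, 4) extrinsics
    that maps world coordinates to camera coordinates, so its inverse is
    applied here. Hypothetical helper assuming an undistorted pinhole camera.
    """
    fx, fy, cx, cy = cam_params
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column, pixel row
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    cam_pts = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
    cam_to_world = np.linalg.inv(world_to_cam)
    return (cam_pts @ cam_to_world.T)[:, :3]


def crop_to_workspace(pts, x_lower, x_upper, y_lower, y_upper, z_lower, z_upper):
    """Keep only points inside the axis-aligned workspace boundary."""
    lower = np.array([x_lower, y_lower, z_lower])
    upper = np.array([x_upper, y_upper, z_upper])
    inside = np.all((pts >= lower) & (pts <= upper), axis=1)
    return pts[inside]


# Toy inputs chosen for illustration only: a flat surface 0.5 m in front of
# a camera whose frame is the world frame shifted by 0.5 m along z.
depth = np.full((4, 4), 0.5, dtype=np.float32)
cam_params = np.array([100.0, 100.0, 2.0, 2.0])
world_to_cam = np.eye(4)
world_to_cam[2, 3] = 0.5  # camera z = world z + 0.5

world_pts = depth_to_world(depth, cam_params, world_to_cam)
kept = crop_to_workspace(world_pts, -0.4, 0.4, -0.4, 0.3, -0.2, 0.02)
print(world_pts.shape, kept.shape)
```

If almost no points survive the crop on real data, the extrinsics may be inverted (camera-to-world instead of world-to-camera) or the depth unit may not be meters.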
Tips for debugging:
- Make sure the transformation is right by visualizing pcd within vis_repr_custom.py using Open3D.
- If the GPU is out of memory, run vis_repr_custom.py with smaller step. This will generate a more sparse voxel grid.
- Make sure Grounded SAM outputs reasonable results by checking curr_obs_torch['mask'] and curr_obs_torch['consensus_mask_label'] of the Fusion class.
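One way to act on the last tip is to summarize how many pixels each instance covers in each view. The sketch below uses a hypothetical helper, mask_coverage, that is not part of the repository. It assumes the mask has been converted to a NumPy array of shape (V, H, W, num_inst) with nonzero entries marking an instance, and that the label list has one entry per instance channel in the same order. Synthetic data and made-up labels stand in for real Fusion output.

```python
import numpy as np


def mask_coverage(mask, labels):
    """Return {label: per-view fraction of pixels covered} for a mask array.

    mask: (V, H, W, num_inst) array, nonzero where the instance is present.
    labels: list of num_inst strings (assumed to align with the last axis).
    """
    if mask.shape[-1] != len(labels):
        raise ValueError("number of mask channels and labels differ")
    frac = (mask > 0).mean(axis=(1, 2))  # shape (V, num_inst)
    return {label: frac[:, i] for i, label in enumerate(labels)}


# Synthetic stand-in for curr_obs_torch['mask'] converted to NumPy.
mask = np.zeros((2, 4, 4, 2), dtype=np.uint8)
mask[..., 0] = 1          # channel 0 covers everything
mask[:, :2, :2, 0] = 0    # except a 2x2 patch ...
mask[:, :2, :2, 1] = 1    # ... which belongs to channel 1
labels = ["background", "mug"]  # illustrative labels only

coverage = mask_coverage(mask, labels)
for label, per_view in coverage.items():
    print(label, per_view)
```

A queried object whose coverage is zero in every view, or close to the full image, is a hint that the text query or its threshold needs adjusting.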
Citation
If you find this repo useful for your research, please consider citing the paper:
@article{wang2023d3fields,
title={D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation},
author={Wang, Yixuan and Li, Zhuoran and Zhang, Mingtong and Driggs-Campbell, Katherine and Wu, Jiajun and Fei-Fei, Li and Li, Yunzhu},
journal={arXiv preprint arXiv:2309.16118},
year={2023}
}