gaussian-splatting
                        Why do all images have to be on GPU?
Why is it necessary to keep all images on the GPU? I recently made a simple modification to load images onto the GPU only when they are needed during training, and with it I can train with 8K images using only about 11GB of GPU memory.
For research purposes it may make sense to keep everything on the GPU to keep training time low, but for practical use, loading images on demand greatly reduces VRAM usage.
Are there any other reasons for keeping all images on the GPU?
I would be interested in using your method. Could you please share your changes and which lines you modified?
Please see the latest commit below.
I'm using a 16GB GPU, and while it was running out of memory with 8K images, it now seems to work with 5K images. However, training takes longer than the original. https://github.com/graphdeco-inria/gaussian-splatting/commit/2cb880a8980ff69c1e5dc0ab9c4f8d0cd75aa0a7
I was also seeing this. Did you find out whether it is really necessary to store the RGB images on the GPU? Some other projects that build on this work store the RGB images on the GPU but depth on the CPU, so I just want to confirm: will anything lead to wrong results (or similar) if I don't store them on the GPU?
Just sharing some thoughts about this issue. There is an input argument (--data_device cpu) that lets you use the CPU instead, storing the images in RAM. However, even with the CPU device, the readColmapCameras function still loads all images at once into RAM using Image.open and stores them in CameraInfo. I find this inconvenient when running a large number of images, for example 5k images.
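For reference, the eager loading happens in scene/dataset_readers.py. The excerpt below shows the current (unmodified) behavior; it is paraphrased from the public repo, the exact body differs between versions and the error handling for unsupported camera models is omitted. The marked line is what pulls every image into memory up front:

def readColmapCameras(cam_extrinsics, cam_intrinsics, images_folder):
    cam_infos = []
    for idx, key in enumerate(cam_extrinsics):
        extr = cam_extrinsics[key]
        intr = cam_intrinsics[extr.camera_id]
        height, width = intr.height, intr.width

        R = np.transpose(qvec2rotmat(extr.qvec))
        T = np.array(extr.tvec)

        if intr.model == "SIMPLE_PINHOLE":
            FovY = focal2fov(intr.params[0], height)
            FovX = focal2fov(intr.params[0], width)
        elif intr.model == "PINHOLE":
            FovY = focal2fov(intr.params[1], height)
            FovX = focal2fov(intr.params[0], width)

        image_path = os.path.join(images_folder, os.path.basename(extr.name))
        image_name = os.path.basename(image_path).split(".")[0]
        image = Image.open(image_path)  # <-- every image is opened here, so all of them sit in RAM

        cam_infos.append(CameraInfo(uid=intr.id, R=R, T=T, FovY=FovY, FovX=FovX,
                                    image=image, image_path=image_path, image_name=image_name,
                                    width=width, height=height))
    return cam_infos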
But during training, isn't only one image loaded onto the GPU at a time, not all of them? Or am I mistaken?
My understanding is that all images are loaded into VRAM (or RAM if using --data_device cpu) at once by readColmapCameras, but during the actual training loop we randomly sample ONE training image at a time in each iteration to perform the gradient updates to the Gaussians.
It just makes it slower if you don't load everything to VRAM/RAM. No impact on fidelity.
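For reference, the per-iteration sampling in train.py looks roughly like this (paraphrased from the repo, so variable names may differ slightly between versions):

# Pick a random training camera for this iteration
if not viewpoint_stack:
    viewpoint_stack = scene.getTrainCameras().copy()
viewpoint_cam = viewpoint_stack.pop(randint(0, len(viewpoint_stack) - 1))

# Render that single view and compare it to its ground-truth image.
# original_image was already placed on the data_device when the Camera was
# built, which is why all images occupy VRAM (or RAM) for the whole run.
render_pkg = render(viewpoint_cam, gaussians, pipe, background)
image = render_pkg["render"]
gt_image = viewpoint_cam.original_image.cuda()
Ll1 = l1_loss(image, gt_image)
loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
loss.backward()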
To change this in 3DGS you need to make these changes in scene/dataset_readers.py:
- Comment out image: np.array in CameraInfo's definition
class CameraInfo(NamedTuple):
    uid: int
    R: np.array
    T: np.array
    FovY: np.array
    FovX: np.array
    # image: np.array
    image_path: str
    image_name: str
    width: int
    height: int
- Remove the image loading part from readCamerasFromTransforms and hard-code/pre-process your data so you can still pass in the image width & height
def readCamerasFromTransforms(path, transformsfile, white_background, extension=".png"):
    cam_infos = []
    with open(os.path.join(path, transformsfile)) as json_file:
        contents = json.load(json_file)
        fovx = contents["angle_x"]
        frames = contents["frames"]
        for idx, frame in enumerate(frames):
           
            zfilled_idx = str(idx).zfill(6)
            cam_name = os.path.join(path, "frames") + f"/frame_{zfilled_idx}{extension}"
            # cam_name = str(idx)
            # NeRF 'transform_matrix' is a camera-to-world transform
            c2w = np.array(frame["transform_matrix"])
            # change from OpenGL/Blender camera axes (Y up, Z back) to COLMAP (Y down, Z forward)
            c2w[:3, 1:3] *= -1
            # get the world-to-camera transform and set R, T
            w2c = np.linalg.inv(c2w)
            R = np.transpose(w2c[:3,:3])  # R is stored transposed due to 'glm' in CUDA code
            T = w2c[:3, 3]
            image_path = os.path.join(path, cam_name)
            image_name = Path(cam_name).stem
            width = 2000   # hard-coded for this dataset; set these to your images' actual resolution
            height = 2000
            # bg is no longer needed here; background compositing now happens at load time in train.py
            fovy = focal2fov(fov2focal(fovx, width), height)
            FovY = fovy 
            FovX = fovx
            # TODO: Don't send img here... Open it when taking the loss/rendering.
            cam_infos.append(CameraInfo(uid=idx, R=R, T=T, FovY=FovY, FovX=FovX, 
                            image_path=image_path, image_name=image_name, width=width, height=height))
            
    return cam_infos
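If your dataset comes from COLMAP rather than a transforms.json file, the analogous change goes in readColmapCameras: drop the Image.open call and the image= field, and keep taking the size from the COLMAP intrinsics, which are already available (so nothing has to be hard-coded there). A minimal sketch of the changed lines, assuming the structure of the public repo:

        image_path = os.path.join(images_folder, os.path.basename(extr.name))
        image_name = os.path.basename(image_path).split(".")[0]
        # image = Image.open(image_path)   # no longer opened here
        cam_infos.append(CameraInfo(uid=intr.id, R=R, T=T, FovY=FovY, FovX=FovX,
                                    image_path=image_path, image_name=image_name,
                                    width=intr.width, height=intr.height))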
- In train.py, inside the training() function, change gt_image = viewpoint_cam.original_image.cuda() to
norm_data = np.array(Image.open(viewpoint_cam.image_path).convert("RGBA")) / 255.0
bg = np.array([1, 1, 1]) if dataset.white_background else np.array([0, 0, 0])
arr = norm_data[:, :, :3] * norm_data[:, :, 3:4] + bg * (1 - norm_data[:, :, 3:4])
# the loss expects a (3, H, W) float tensor in [0, 1] on the GPU
gt_image = torch.from_numpy(arr).permute(2, 0, 1).float().cuda()
(don't forget to import numpy and PIL in train.py; torch is already imported there. Also note that viewpoint_cam.image_path needs to exist on the Camera object; see the sketch below.)
This should be enough! (assuming you have the frames stored somewhere and not as an npy/compressed file)
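One thing the train.py snippet above assumes is that viewpoint_cam.image_path actually exists on the Camera object, which it does not in the stock repo: the CameraInfo fields also have to be threaded through utils/camera_utils.py and scene/cameras.py. A rough sketch of that, assuming the structure of the public repo (argument names may differ in your version, and resolution scaling is ignored here):

# utils/camera_utils.py -- loadCam no longer builds a ground-truth tensor,
# it just forwards the path and image size
def loadCam(args, id, cam_info, resolution_scale):
    return Camera(colmap_id=cam_info.uid, R=cam_info.R, T=cam_info.T,
                  FoVx=cam_info.FovX, FoVy=cam_info.FovY,
                  image_path=cam_info.image_path, image_name=cam_info.image_name,
                  width=cam_info.width, height=cam_info.height, uid=id)

# scene/cameras.py -- Camera stores the path instead of original_image
class Camera(nn.Module):
    def __init__(self, colmap_id, R, T, FoVx, FoVy, image_path, image_name,
                 width, height, uid, trans=np.array([0.0, 0.0, 0.0]), scale=1.0):
        super(Camera, self).__init__()
        self.uid = uid
        self.colmap_id = colmap_id
        self.R = R
        self.T = T
        self.FoVx = FoVx
        self.FoVy = FoVy
        self.image_path = image_path    # opened on demand in train.py
        self.image_name = image_name
        self.image_width = width
        self.image_height = height
        # the view/projection matrices (world_view_transform, projection_matrix,
        # full_proj_transform, camera_center) are built exactly as in the original class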