
Why do all images have to be on GPU?

Open inuex35 opened this issue 11 months ago • 7 comments

Why is it necessary to keep all images on the GPU? I recently made a modification that loads images onto the GPU only during training, and with that simple change I can train with 8K images using only about 11 GB of GPU memory.

For research papers it makes sense to keep them on the GPU to reduce training time, but for practical use, loading them on demand reduces VRAM usage.

Are there any other reasons for keeping images on the GPU?
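The idea can be sketched as a lazy-loading wrapper. This is a minimal, hypothetical sketch (the class and file names are made up, not the actual 3DGS code): keep only file paths in memory and decode a single ground-truth image when it is needed for the loss.

```python
# Minimal sketch of lazy image loading: store paths, not pixels.
import numpy as np

class LazyImage:
    def __init__(self, path):
        self.path = path  # only the path is held in memory

    def load(self):
        # In real code this would be PIL.Image.open(self.path);
        # here we fake a small HxWx3 image so the sketch is runnable.
        return np.zeros((4, 4, 3), dtype=np.float32)

cams = [LazyImage(f"frame_{i:06d}.png") for i in range(3)]
gt = cams[0].load()  # decoded only at use time, one image resident at once
print(gt.shape)
```

In the real code the `.load()` step would also move the decoded tensor to the GPU, which is what trades training speed for VRAM.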

inuex35 avatar Mar 23 '24 19:03 inuex35

I would be interested in using your method. Would you please share your changes and which lines you modified?

bennmann avatar Mar 24 '24 01:03 bennmann

Please see the latest commit below.

I'm using a 16 GB GPU; while it was running out of memory with 8K images before, it now works with 5K images. However, training takes longer than with the original code. https://github.com/graphdeco-inria/gaussian-splatting/commit/2cb880a8980ff69c1e5dc0ab9c4f8d0cd75aa0a7

inuex35 avatar Mar 24 '24 04:03 inuex35

I was also seeing this. Did you find out whether it is really necessary to store RGB images on the GPU? Some other projects that build on this work store RGB images on the GPU but depth on the CPU, so I just want to confirm: will anything lead to wrong results if I don't store them on the GPU?

robofar avatar Mar 26 '24 07:03 robofar

Just sharing some thoughts on this issue. There is an input argument (`--data_device cpu`) that makes the code use the CPU, storing the images in RAM instead of VRAM. However, even with this option, the `readColmapCameras` function still loads all images into RAM at once via `Image.open` and stores them in `CameraInfo`. I find this inconvenient when running a large number of images, for example 5k.
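The eager behaviour described above can be sketched like this (the function, sizes, and data are stand-ins, not the real `readColmapCameras` in `scene/dataset_readers.py`): every image is decoded up front and kept in the returned list, so memory grows with the dataset size before training even starts.

```python
# Sketch of the eager pattern: decode every image at read time and
# keep it in the camera list, regardless of --data_device.
import numpy as np

def read_cameras_eager(num_images, h=4, w=4):
    cam_infos = []
    for idx in range(num_images):
        # stands in for Image.open(image_path) in the real code
        image = np.zeros((h, w, 3), dtype=np.uint8)
        cam_infos.append({"uid": idx, "image": image})
    return cam_infos

cams = read_cameras_eager(5)
total_bytes = sum(c["image"].nbytes for c in cams)
print(len(cams), total_bytes)  # all images resident simultaneously
```

With 5k full-resolution images, that `total_bytes` figure is what exhausts RAM or VRAM long before the first iteration.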

yuzheng-cosmos avatar Mar 27 '24 13:03 yuzheng-cosmos

But during training, isn't only one image loaded on the GPU at a time, not all of them? If I am not mistaken?

MatteoMarengo avatar Jun 27 '24 09:06 MatteoMarengo

My understanding is that all images are loaded into VRAM (or RAM when using `--data_device cpu`) at once by `readColmapCameras`, but during the actual training loop we randomly sample ONE training image at a time in each iteration to perform the gradient updates on the Gaussians.
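That sampling pattern looks roughly like this. It is a simplified sketch of the viewpoint-stack logic in `train.py` (integers stand in for the actual Camera objects): pop a random camera from a copy of the list each iteration and refill the stack once it runs empty, so every image is visited once per pass.

```python
# Simplified version of the training-loop sampling: one random
# viewpoint per iteration, without replacement within a pass.
import random

cameras = list(range(6))  # stands in for the list of Camera objects
random.seed(0)

viewpoint_stack = []
picked = []
for iteration in range(12):
    if not viewpoint_stack:
        viewpoint_stack = cameras.copy()  # refill after a full pass
    cam = viewpoint_stack.pop(random.randrange(len(viewpoint_stack)))
    picked.append(cam)

# across two passes of 6, each camera is picked exactly twice
print(sorted(picked))
```

Note that even though only one image is *used* per iteration, the original code still keeps all of them resident, which is exactly the distinction this thread is about.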

voidJeff avatar Jul 02 '24 23:07 voidJeff

It just makes training slower if you don't load everything into VRAM/RAM; there is no impact on fidelity.

To change this in 3DGS you need to make the following changes in `scene/dataset_readers.py` and `train.py`:

1. Comment out `image: np.array` in `CameraInfo`'s definition:

```python
class CameraInfo(NamedTuple):
    uid: int
    R: np.array
    T: np.array
    FovY: np.array
    FovX: np.array
    # image: np.array
    image_path: str
    image_name: str
    width: int
    height: int
```
2. Remove the image loading part from `readCamerasFromTransforms` and hard-code/pre-process your data so you can still pass in the image width & height:

```python
def readCamerasFromTransforms(path, transformsfile, white_background, extension=".png"):
    cam_infos = []

    with open(os.path.join(path, transformsfile)) as json_file:
        contents = json.load(json_file)
        fovx = contents["angle_x"]

        frames = contents["frames"]

        for idx, frame in enumerate(frames):

            zfilled_idx = str(idx).zfill(6)
            cam_name = os.path.join(path, "frames") + f"/frame_{zfilled_idx}{extension}"

            # NeRF 'transform_matrix' is a camera-to-world transform
            c2w = np.array(frame["transform_matrix"])
            # change from OpenGL/Blender camera axes (Y up, Z back) to COLMAP (Y down, Z forward)
            c2w[:3, 1:3] *= -1

            # get the world-to-camera transform and set R, T
            w2c = np.linalg.inv(c2w)
            R = np.transpose(w2c[:3, :3])  # R is stored transposed due to 'glm' in the CUDA code
            T = w2c[:3, 3]

            image_path = os.path.join(path, cam_name)
            image_name = Path(cam_name).stem
            # hard-coded resolution; adjust to match your dataset
            width = 2000
            height = 2000

            fovy = focal2fov(fov2focal(fovx, width), height)
            FovY = fovy
            FovX = fovx

            # don't open the image here; it is loaded when computing the loss/rendering
            cam_infos.append(CameraInfo(uid=idx, R=R, T=T, FovY=FovY, FovX=FovX,
                            image_path=image_path, image_name=image_name, width=width, height=height))

    return cam_infos
```
3. In `train.py`, inside the training loop, change `gt_image = viewpoint_cam.original_image.cuda()` to:

```python
# viewpoint_cam.image_path already holds the full path to the frame
norm_data = np.array(Image.open(viewpoint_cam.image_path).convert("RGBA")) / 255.0

bg = np.array([1, 1, 1]) if dataset.white_background else np.array([0, 0, 0])

# composite the alpha channel over the background colour
arr = norm_data[:, :, :3] * norm_data[:, :, 3:4] + bg * (1 - norm_data[:, :, 3:4])

# gt_image must be a float tensor of shape (3, H, W) in [0, 1] on the GPU;
# a PIL Image has no .cuda(), so convert through torch instead
gt_image = torch.from_numpy(arr).float().permute(2, 0, 1).cuda()
```

(don't forget to import numpy, PIL's `Image`, and `torch` in `train.py`)

This should be enough! (assuming you have the frames stored as individual image files and not as an npy/compressed archive)

Aryan-Garg avatar Sep 16 '24 19:09 Aryan-Garg