
Large arrays fail to process, depending on system GPU [graphical mode]

InkLabApp opened this issue 4 years ago • 5 comments

What is wrong?

When using GPU.js in graphical mode, the maximum working size of the canvas is limited by the host system's GPU hardware. If the GPU does not support a large enough size, the program crashes instead of falling back to the CPU or partitioning the calculation.
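
I assume this limit corresponds to the device's WebGL maximum texture size, which can be queried directly; a quick check like the following (just a sketch for probing the value, not something gpu.js exposes) shows what a given machine will tolerate:

// assumed source of the limit: the WebGL MAX_TEXTURE_SIZE of this device
const gl = document.createElement('canvas').getContext('webgl2')
    || document.createElement('canvas').getContext('webgl')
if (gl) {
    console.log('MAX_TEXTURE_SIZE:', gl.getParameter(gl.MAX_TEXTURE_SIZE))
    console.log('MAX_RENDERBUFFER_SIZE:', gl.getParameter(gl.MAX_RENDERBUFFER_SIZE))
}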

Where does it happen?

It happens when using Google Chrome.

How do we replicate the issue?

The contents of the kernel to run are not important, just the size of the canvas. In my case I first noticed the issue using a canvas of size 4200x4200 (from the open-source webcomic Pepper and Carrot). I received the error message "Argument texture height and width of 4200 larger than maximum size of 4096 for your GPU". After experimenting, I found that this maximum size is different for each computer; another one of my machines had a limit of ~16000 instead. I have attached a sample file where the const variables 'width' and 'height' can be increased until the error appears.

large_canvas_bug_gpujs.html.zip
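
For anyone who doesn't want to download the attachment, a minimal reproduction along these lines (a sketch; the kernel body is arbitrary, only the output size matters) should hit the same kind of error:

// minimal sketch: a graphical kernel whose output exceeds the device limit
const gpu = new GPU()
const width = 4200   // raise these until the error appears
const height = 4200

const kernel = gpu.createKernel(function () {
    this.color(1, 0, 0, 1) // contents are irrelevant, only the size matters
})
    .setOutput([width, height])
    .setGraphical(true)

kernel() // fails on systems whose maximum size is below 4200
document.body.appendChild(kernel.canvas)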

How important is this (1-5)?

4 - I would say it is important to me because it completely and unpredictably prevents using gpu.js on some systems. A 4200-pixel canvas is not unreasonably large for modern systems to support.

Expected behavior (i.e. solution)

At this stage, a working though very inefficient solution would be to fall back to CPU computation on oversized canvases, since CPU fallback is indicated as a built-in feature of this library. The better solution, however, would be to partition the input canvas into slices based on the maximum supported canvas size and run the GPU-based calculation over each slice to build up the output.
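
As a stopgap until either of those exists, I believe the CPU mode can at least be forced manually when the requested size exceeds the device limit (a sketch, assuming the maximum has been probed as above; the 4096 is only an illustrative value):

// stopgap sketch: pick CPU mode by hand when the canvas is oversized
const maxSize = 4096 // assumed device limit, probe it per machine
const width = 4200
const height = 4200
const tooLarge = width > maxSize || height > maxSize
const gpu = new GPU({ mode: tooLarge ? 'cpu' : 'gpu' })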

Other Comments

O.o

InkLabApp commented May 03 '20

If possible with your use case, you can always make the canvas render at a smaller size than its DOM element: set the width and height of the canvas to the size you want to render at, and the width and height of the canvas's style to the size you want it displayed at. Also make sure all your kernels output the same smaller size.

For example in my own case I have this:

// canvas size = the smaller size we want to render at
// window size = the size of the entire window that we want to display at

const draw = gpu.createKernel(k.drawGpu)
    .setOutput(canvasSize).setGraphical(true)

<canvas width={canvasSize[0]} height={canvasSize[1]}
    style={{ position: "absolute", width: windowSize[0], height: windowSize[1] }} ... />

RobinKa commented May 04 '20

@RobinKa

Thank you, I appreciate your suggestion. I agree that it may be feasible, or in many cases desirable, to use a smaller canvas for real elements that are output to the browser DOM. However, my use case involves reading and writing OpenRaster files: when saving a file I need to create a merged image as part of the output, and per the spec I should not downsize that merged image. (This all happens offscreen, but it is still technically graphical mode.)

InkLabApp commented May 04 '20

Yeah, I agree with @InkLabApp on this! I faced a similar issue trying to use the GPU to average 300+ matrices of 1280x720 images. Just providing that much data to the kernel gives this error:

Argument width of 33260 larger than maximum size of 16384 for your GPU

I'm guessing each GPU will have a different maximum size (what even is this value?). For reference, I'm running an NVIDIA GTX 1070 Max-Q, so relatively high-end; I would assume lower-end GPUs could have lower values (whatever this value actually means or wherever it comes from).

So I ended up writing my own function to split the workload into smaller chunks that the GPU will be able to handle.

The algorithm is essentially divide and conquer: split the input into many small pieces the GPU will have no problem with. In my case that was chunks of 20 matrices, each matrix being 1280x720. (You must test which size is suitable for your use case; the max I can get away with is 72, but this also depends on matrix size. If you're processing 4K that number will be lower; 10-20 is very safe.)

Then, for each chunk (a chunk being an array of matrices), run the kernel function. In my case (300 matrices) that is a total of 15 kernel runs (300 matrices divided into chunks of 20 => 15 runs). You also have to be careful with any remaining matrices if your input is not evenly divisible by 20; for example, an input of 310 will need 16 kernel runs, with the last run only having 10 matrices to loop over.
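
In code, the number of runs is just a ceiling division:

const runs = Math.ceil(totalMats / chunkSize) // 300 / 20 -> 15, 310 / 20 -> 16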

The results from the kernel runs are collected into another array, which is then run through the kernel function once more to get the final average.

I'll put my code below for anyone Googling this who is interested in how I solved it. It's still missing one part though: if the array accumulated from the kernel runs is itself still too big to fit, the final run will fail. In that case some sort of recursive call is probably the solution.

@InkLabApp maybe you could do something like that in your use case? Divide the massive 4200x4200 image into many smaller images, run the kernel on those images, and then combine the results at the end on the CPU. It's worth trying to see if the GPU gives you the same performance improvements it gave me!
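
Roughly, a tiled version of the graphical case might look something like this untested sketch (the tile size, the placeholder kernel body and the drawImage stitching are all just illustrations; real code would compute each pixel at its global offset):

// untested sketch: run a graphical kernel per tile, stitch tiles on a 2D canvas
const tile = 2048 // must be <= the device's maximum size
const out = document.createElement('canvas')
out.width = 4200
out.height = 4200
const ctx = out.getContext('2d')

for (let ty = 0; ty < out.height; ty += tile) {
    for (let tx = 0; tx < out.width; tx += tile) {
        const w = Math.min(tile, out.width - tx)
        const h = Math.min(tile, out.height - ty)
        const kernel = gpu.createKernel(function (offsetX, offsetY) {
            // placeholder: a real kernel would sample the source image at
            // (offsetX + this.thread.x, offsetY + this.thread.y)
            this.color(offsetX / 4200, offsetY / 4200, 0, 1)
        })
            .setOutput([w, h])
            .setGraphical(true)
        kernel(tx, ty)
        ctx.drawImage(kernel.canvas, 0, 0, w, h, tx, ty, w, h)
    }
}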

Anyway, my point in all this is to show how inconvenient this is for programmers who expect it to just work out of the box. I think this should at the very least be documented somewhere, or there should be a utility that divides the workload provided to a kernel function efficiently (better than how I did it). Not to mention the major performance loss from having to switch contexts multiple times; I don't mind this, but others might.
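
Even a tiny generic helper shipped with the library would help; something along these lines (just a sketch of the idea, not an actual gpu.js API):

// sketch: split an array of kernel inputs into runs small enough for the device
function chunked(items, chunkSize) {
    const chunks = []
    for (let i = 0; i < items.length; i += chunkSize) {
        chunks.push(items.slice(i, i + chunkSize))
    }
    return chunks
}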

Pinging @robertleeplummerjr to look into this. Other than that though, excellent library once you get it working! What took approximately 60-100 seconds on the CPU now takes around 4-6 seconds on the GPU! Truly terrific stuff!

My workaround is below for whoever is interested, but note that it's in TypeScript and uses a couple of utility functions I made for readability, as well as a bunch of logging statements:


averageEdgesGPU(mats: number[][][], chunkSize: number = 20) {
        const totalMats: number = mats.length
        const height: number = mats[0].length
        const width: number = mats[0][0].length

        logD(`totalMats: ${totalMats}`)
        logD(`height: ${height}`)
        logD(`width: ${width}`)

        const kernelFunc: KernelFunction = function (mats: number[][][], length: number): number {
            const x = this.thread.x
            const y = this.thread.y!!
            let sum = 0
            for (let i = 0; i < length; i++) {
                sum += mats[i][y][x]
            }
            return sum / length
        }

        const kernel: IKernelRunShortcut = gpu.createKernel(kernelFunc)
            .setOutput([width, height])

        const chunksNumber = (totalMats / chunkSize).floor()

        const matsChunks: number[][][][] = []

        from(0).to(chunksNumber).forEach(chunk => {
            const matChunk: number[][][] = []
            from(0).to(chunkSize).forEach(mat => {
                const index = (chunk * chunkSize) + mat
                matChunk.push(mats[index])
            })
            matsChunks.push(matChunk)
        })

        logD(`Chunks before remainder are ${matsChunks.length}`)

        // Remainder mats in the last chunks

        logD(`Remaining ${totalMats % chunkSize}`)

        if (totalMats % chunkSize !== 0) {
            const lastChunk: number[][][] = []
            from(chunksNumber * chunkSize).to(totalMats).forEach(index => {
                lastChunk.push(mats[index])
            })
            matsChunks.push(lastChunk)
        }

        logD(`Chunks after remainder are ${matsChunks.length}`)

        const averagedMats: number[][][] = []

        matsChunks.forEach((chunk: number[][][]) => {
            averagedMats.push(kernel(chunk, chunk.length) as number[][])
        })

        // What if averagedMats is still too big??
        // Then we need to recurse probably :/

        const finalAveragedMat: number[][] = kernel(averagedMats, averagedMats.length) as number[][]

        logD(`Averaged Height: ${finalAveragedMat.length}`)
        logD(`Averaged Width: ${finalAveragedMat[0].length}`)

        const final: number[][][] = Array.init(finalAveragedMat.length, (yIndex) =>
            Array.init(finalAveragedMat[yIndex].length, (xIndex) =>
                Array.init(1, () => finalAveragedMat[yIndex][xIndex]))
        )

        logD(`Final Height: ${final.length}`)
        logD(`Final Width: ${final[0].length}`)

        const result = new Mat(final, CV_8UC1)

        logD(`Mat cols: ${result.cols}`)
        logD(`Mat rows: ${result.rows}`)

        return result
    }

basshelal commented Aug 20 '20

It's been a long time, could this be solved? This feature is very useful and important! https://github.com/gpujs/gpu.js/projects/1

Akimotorakiyu commented Jun 19 '21

Partitioning of the data should happen automatically if the device's output size is exceeded.

I'm working on a template matching algorithm that needs to compare a lot of pixels, and I exceed the maximum output size before I really start to benefit from running the thing on the GPU. I guess I will also be implementing my own partitioning system.

tinkertoe commented Jul 21 '22