OptimalTextures
Code not optimized for GPU
Hi @JCBrouwer, I've been playing with your code. It's really good!
The only issue is that it doesn't seem to be optimized for the GPU: average GPU utilization is only ~40%.
Do you have any suggestions regarding optimizing it for running on GPU?
Regards, Rahul Bhalley
Hi Rahul, thanks for your interest in the code! I've updated the PyTorch version and swapped over to torch.linalg.eigh like you suggested in #3.
In terms of the performance issues, I believe it's mainly due to the fact that the data is relatively small and so can't saturate the GPU. When using the multi-scale mode the image is first optimized at a smaller resolution and progressively upscaled to the final desired size.
This leads me to believe that the low utilization is primarily due to overhead of repeatedly launching many small CUDA kernels. To me this sounds like an ideal setting for torch's CUDA graphs API.
It might require a bit more detailed profiling to be sure that this is the issue though.
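To illustrate the idea, here's a minimal sketch of PyTorch's CUDA graph capture/replay API (the `iteration` function is just a stand-in for one small-kernel-heavy optimization step, not code from this repo), with an eager fallback when no GPU is available:

```python
import torch

def iteration(x):
    # stand-in for one optimization step made of many small kernels
    return (x @ x.T).relu() + 1.0

if torch.cuda.is_available():
    static_x = torch.randn(64, 64, device="cuda")
    # warm up on a side stream before capture, as the docs recommend
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            iteration(static_x)
    torch.cuda.current_stream().wait_stream(s)
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = iteration(static_x)
    # replay re-launches the whole captured kernel sequence at once,
    # amortizing the per-kernel host-side launch overhead
    static_x.copy_(torch.randn(64, 64, device="cuda"))
    g.replay()
    out = static_out
else:
    out = iteration(torch.randn(64, 64))

print(out.shape)
```

The catch is that captured shapes must stay fixed, so each resolution of the multi-scale loop would need its own graph.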
Alright, I did some quick profiling; it looks like it's not kernel launch overhead, but just host operations in general...
Woah! A ~90% speedup would make it really fast! I have a few questions:
- What does 'host_wait' mean? Is it the GPU waiting for the CPU to complete its task?
- If so, any guidance on how to track this down?
- What's the name of this profiling tool?
The plots are from Holistic Trace Analysis. 'host_wait' is indeed the GPU waiting for the CPU to give it work.
Looking a bit closer at the actual traces shows that drawing the random rotation dominates the time of each histogram matching iteration. Just replacing the `.item()` call in there with a `.clone()` helps a little, as it saves a round-trip to host memory, but overall utilization still isn't great. I also tried decorating the function with `@torch.jit.script`, but it didn't help much either. The trace of this function is still predominantly CPU operations even though the `device` is correctly specified as `'cuda'` as far as I can tell. I wonder if there's some way to vectorize this operation?
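For context, a toy illustration of the difference (not the repo's actual `random_rotation` code): `.item()` copies a scalar back to host memory and forces the GPU to synchronize, while `.clone()` keeps the value on-device so downstream kernels never have to wait on the CPU:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1000, device=device)

# .item() transfers the scalar to host memory, forcing a device sync;
# every call like this stalls the GPU pipeline
host_scalar = x.sum().item()     # a plain Python float on the CPU

# .clone() produces a 0-d tensor that stays on the device, so later
# ops can consume it without a device-to-host round trip
device_scalar = x.sum().clone()

print(type(host_scalar), device_scalar.device)
```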
Another small improvement is using the 'chol' histogram matching method instead of 'pca'. Doing a cholesky decomposition is quite a bit faster than running the eigenvalue solver.
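For reference, here's a generic sketch of Cholesky-based covariance matching (a hypothetical helper assuming features are rows of shape (N, C); not copied from `histmatch.py`). Two triangular factorizations and a triangular solve replace the eigenvalue solver:

```python
import torch

def chol_match(target_feat, source_feat, eps=1e-2):
    """Match the mean/covariance of target_feat to source_feat using
    Cholesky factors instead of an eigendecomposition (illustrative sketch)."""
    mu_t = target_feat.mean(0, keepdim=True)
    mu_s = source_feat.mean(0, keepdim=True)
    t = target_feat - mu_t
    s = source_feat - mu_s
    eye = torch.eye(t.shape[1], dtype=t.dtype, device=t.device)
    # eps * I keeps both factorizations numerically stable
    cov_t = t.T @ t / (t.shape[0] - 1) + eps * eye
    cov_s = s.T @ s / (s.shape[0] - 1) + eps * eye
    Lt = torch.linalg.cholesky(cov_t)
    Ls = torch.linalg.cholesky(cov_s)
    # map target's covariance onto source's: x -> x @ Lt^-T @ Ls^T
    whitened = torch.linalg.solve_triangular(Lt, t.T, upper=False)  # Lt^-1 t^T
    return (Ls @ whitened).T + mu_s
```

Since a Cholesky factorization is roughly a third the cost of a symmetric eigendecomposition and parallelizes well, this is a plausible explanation for the speedup.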
One last thing that helped quite a bit for me is to set `torch.backends.cudnn.benchmark = False`. This is because the implementation repeatedly cycles through forward passes at different resolutions, which forces the cudnn autotuner to re-run every time for just a single forward pass.
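In code it's a one-line global flag; the comment spells out the trade-off:

```python
import torch

# cudnn's autotuner benchmarks conv algorithms per input shape. With a new
# resolution on every pass, each benchmark run is paid for only once and the
# tuning cost never amortizes, so disable it when input shapes keep changing.
torch.backends.cudnn.benchmark = False
```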
I also tried cutting out some of the encode/decode steps which are happening at the beginning and end of each pass, but it seems like the feature inverters are actually separately trained for each depth they invert from, so this ruins the quality of results.
You can see some of the things I tried in this branch.
Alright, I've just merged a refactor which makes a few changes for better performance. I've got a few more ideas, but give this version a try and please let me know how it compares on your machine.
> Alright, I've just merged a refactor which makes a few changes for better performance. I've got a few more ideas, but give this version a try and please let me know how it compares on your machine.
I deeply apologize for not replying. I got a little sick right after opening this issue. I'll surely test it out & let you know. Thank you for doing all this. :)
Not sure how much you changed the code but my first script run fails to converge. I used my same previous arguments. Also tried changing seed. Now I'll just start from where you started (profiling the previous code) and then make changes slowly to the code.
Pass 0, size 256
Layer: relu5_1
Layer: relu4_1
Traceback (most recent call last):
File "/workspace/OptimalTextures/optex.py", line 283, in <module>
pastiche = texturizer.forward(pastiche, styles, content, verbose=True)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/workspace/OptimalTextures/optex.py", line 112, in forward
for _ in range(self.iters_per_pass_and_layer[p][l - 1]):
pastiche_feature = optimal_transport(pastiche_feature, style_features[l], self.hist_mode)
~~~~~~~~~~~~~~~~~ <--- HERE
if len(content_features) > 0 and l >= 2: # apply content matching step
File "/workspace/OptimalTextures/optex.py", line 168, in optimal_transport
rotated_style = style_feature @ rotation
matched_pastiche = hist_match(rotated_pastiche, rotated_style, mode=hist_mode)
~~~~~~~~~~ <--- HERE
pastiche_feature = matched_pastiche @ rotation.T # rotate back to normal
File "/workspace/OptimalTextures/histmatch.py", line 37, in hist_match
else: # mode == "sym"
eva_t, eve_t = torch.linalg.eigh(cov_t, UPLO="U")
~~~~~~~~~~~~~~~~~ <--- HERE
Qt = eve_t @ torch.sqrt(torch.diag(eva_t)) @ eve_t.T
Qt_Cs_Qt = Qt @ cov_s @ Qt
RuntimeError: linalg.eigh: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 301).
Ahh I see you're using the 'sym' hist_mode. I've been using 'chol' as it's quite a bit faster (and apparently doesn't have these convergence issues?)
One thing you can do to help with convergence is to increase the `eps` argument of `hist_match()`. I've just pushed another small update which is even a little faster on my machine (with `eps` bumped up quite a bit).
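The usual trick behind a larger `eps` (sketched here generically, not the exact `histmatch.py` code) is diagonal loading: adding `eps` times the identity to the covariance before factorizing it bounds the smallest eigenvalue away from zero, which is exactly what an ill-conditioned matrix with repeated eigenvalues needs:

```python
import torch

def regularized_eigh(feats, eps=1e-2):
    """Eigendecomposition of a feature covariance with diagonal loading
    (generic sketch of the eps trick, not the repo's exact code)."""
    centered = feats - feats.mean(0, keepdim=True)
    cov = centered.T @ centered / (feats.shape[0] - 1)
    # diagonal loading: shifts every eigenvalue up by eps, so even a
    # rank-deficient covariance stays safely positive definite
    cov = cov + eps * torch.eye(cov.shape[0], dtype=cov.dtype, device=cov.device)
    return torch.linalg.eigh(cov, UPLO="U")

# rank-1 features: the unregularized covariance would be singular
eva, eve = regularized_eigh(torch.randn(10, 4) @ torch.ones(4, 4))
print(eva.min())  # bounded below by roughly eps
```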
The profiler is still showing `random_rotation()` as the bottleneck, but I'm just not sure how to make it more efficient.
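One candidate (just an idea, not what `random_rotation()` currently does): draw the whole rotation in a couple of large device-side ops by QR-factorizing a Gaussian matrix, rather than composing it piece by piece on the host:

```python
import torch

def random_rotation_qr(dim, device="cpu", generator=None):
    """Random orthonormal matrix via QR of a Gaussian matrix (Haar-distributed
    after sign correction). A sketch of a possible vectorized replacement,
    not the repo's current implementation."""
    a = torch.randn(dim, dim, device=device, generator=generator)
    q, r = torch.linalg.qr(a)
    # fix column signs so the distribution is uniform over the orthogonal group
    q = q * torch.sign(torch.diagonal(r)).unsqueeze(0)
    return q

R = random_rotation_qr(64)
print(torch.allclose(R @ R.T, torch.eye(64), atol=1e-4))  # orthonormal
```

`torch.linalg.qr` runs entirely on the GPU when `device="cuda"`, so the only host work per iteration is a single kernel launch.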
> Ahh I see you're using the 'sym' hist_mode. I've been using 'chol' as it's quite a bit faster (and apparently doesn't have these convergence issues?)
Okay, I did try that, but the results are now inferior to those from your previous code (before I pinged you).
I used the same command for style transfer: `python optex.py --style style/lava-small.jpg --content content/rocket.jpg --content_strength 0.2 --hist chol --seed 0`.
Synthesis with previous source code
Synthesis with current modification you made
I am still reading the paper (started today) so I am far from understanding the code. But I will, soon.
> One thing you can do to help with convergence is to increase the `eps` argument of `hist_match()`. I've just pushed another small update which is even a little faster on my machine (with `eps` bumped up quite a bit).
How much time does it take, and at what resolution? For me, these took 36 s (previous code) and 34.5 s (current code). I didn't try multiple runs, so it's not an average time.
My bad, I missed swapping the if statement's condition when I reversed the for loop's direction.
For me the original code was taking about 30 seconds for the simple texture synthesis case and now is around 11 seconds on a 1080 ti.
I haven't been testing the style transfer case though (as is apparent from the error you just encountered). I guess I should write a little test suite...
> I haven't been testing the style transfer case though (as is apparent from the error you just encountered). I guess I should write a little test suite...
Interesting, on my side the texture was also synthesized correctly.
OMG, the texture is very heavy & large scale now.
Now, I'm also unable to push the resolution above 1024.
Pass 0, size 256
Layer: relu5_1
Layer: relu4_1
Layer: relu3_1
Layer: relu2_1
Layer: relu1_1
Pass 1, size 512
Layer: relu5_1
Layer: relu4_1
Layer: relu3_1
Layer: relu2_1
Layer: relu1_1
Pass 2, size 768
Layer: relu5_1
Layer: relu4_1
Layer: relu3_1
Layer: relu2_1
Layer: relu1_1
Pass 3, size 1024
Layer: relu5_1
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ │
│ /workspace/OptimalTextures/optex.py:283 │
│ in <module> │
│ │
│ 280 │ │ from time import time │
│ 281 │ │ │
│ 282 │ │ t = time() │
│ ❱ 283 │ │ pastiche = texturizer.forward(pastiche, styles, content, verbose=True) │
│ 284 │ │ print("Took:", time() - t) │
│ 285 │ │
│ 286 │ save_image(pastiche, args) │
│ /workspace/OptimalTextures/optex.py:116 │
│ in forward │
│ │
│ 113 │ │ │ │ │ │
│ 114 │ │ │ │ │ if len(content_features) > 0 and l <= 2: # apply content matching s │
│ 115 │ │ │ │ │ │ strength = self.content_strength / 2 ** (4 - l) # 1, 2, or 4 de │
│ ❱ 116 │ │ │ │ │ │ pastiche_feature += strength * (content_features[l] - pastiche_f │
│ 117 │ │ │ │ │
│ 118 │ │ │ │ if self.use_pca: │
│ 119 │ │ │ │ │ pastiche_feature = pastiche_feature @ style_eigvs[l].T # reverse pr │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: The size of tensor a (80) must match the size of tensor b (64) at non-singleton dimension 2
Hmmm, could you give the exact command you ran here? If I had to guess I'd say it's related to rounding errors in the multi-resolution resizing code. Are you using a non-square image?
For me the following is working fine on the current `main` branch.

`python optex.py --style style/lava-small.jpg --content content/rocket.jpg --content_strength 0.5 --size 1448`
> Hmmm, could you give the exact command you ran here? If I had to guess I'd say it's related to rounding errors in the multi-resolution resizing code. Are you using a non-square image? For me the following is working fine on the current `main` branch: `python optex.py --style style/lava-small.jpg --content content/rocket.jpg --content_strength 0.5 --size 1448`
Could be something wrong on my end if yours is working fine. I won't ping you again until I understand the whole paper and your code; I don't want to take up your time, and you might be busy elsewhere. :) Thanks for your help, btw.