nflows
Refactored rational-quadratic spline transforms to run faster
Hi,
I've noticed that the methods in `rational_quadratic.py` can easily be refactored to make them run ~25% faster.
The main change in `unconstrained_rational_quadratic_spline` is to avoid `masked_select`, which can be quite inefficient with dense masks since it requires assembling all the "unmasked" elements into a new tensor. To do a masked insert into a predefined zero tensor, it is generally cheaper to multiply the input tensor by the mask and add the result to the target tensor, as I've done in this PR.
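The multiply-and-add idea can be sketched as follows. This is a minimal illustration, not the actual code from `rational_quadratic.py`; the variable names and the dummy transform are hypothetical, and it assumes the transform is finite on the masked-out elements (so multiplying by the mask zeroes them out cleanly):

```python
import torch

# Hypothetical stand-ins for the spline's inputs and its interval mask.
inputs = torch.randn(4, 8)
inside = (inputs > -1.0) & (inputs < 1.0)  # dense boolean mask
transformed = inputs * 2.0  # stand-in for the spline transform

# Masked-select pattern: gather the masked elements into a new tensor,
# then scatter the results back into a zero tensor.
out_select = torch.zeros_like(inputs)
out_select[inside] = transformed[inside]

# Multiply-and-add pattern: purely elementwise, no gather/scatter.
out_mul = torch.zeros_like(inputs) + inside.to(inputs.dtype) * transformed

assert torch.allclose(out_select, out_mul)
```

With a dense mask, the elementwise version avoids materializing the irregularly shaped intermediate tensor that `masked_select` produces.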
I've also made a couple of changes in `rational_quadratic_spline` to how the `widths`, `heights`, `cumwidths`, and `cumheights` tensors are computed. The refactored implementation removes some redundant operations present in the original.
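To illustrate the kind of redundancy involved, here is a sketch based on the common pattern of building `cumwidths` from a cumulative sum and then recomputing `widths` as its differences (constants and shapes here are illustrative, not taken from the PR):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
unnormalized_widths = torch.randn(4, 10)
left, right = -1.0, 1.0
min_bin_width = 1e-3

# Normalize bin widths so they are positive and sum to one.
widths = F.softmax(unnormalized_widths, dim=-1)
widths = min_bin_width + (1 - min_bin_width * widths.shape[-1]) * widths

# Cumulative-sum pattern: build cumwidths, rescale to [left, right],
# then recompute widths as differences of cumwidths (a second pass).
cumwidths = F.pad(torch.cumsum(widths, dim=-1), pad=(1, 0), value=0.0)
cumwidths = (right - left) * cumwidths + left
widths_recomputed = cumwidths[..., 1:] - cumwidths[..., :-1]

# The second pass can be replaced by scaling the widths directly.
widths_direct = (right - left) * widths

assert torch.allclose(widths_recomputed, widths_direct, atol=1e-6)
```

Since the rescaling of `cumwidths` is affine, the differences are just the original widths scaled by `(right - left)`, so recomputing them from `cumwidths` is unnecessary.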
The rational-quadratic spline flow as used in the NSF paper runs about 25% faster with these changes. I think further improvements could be achieved if `searchsorted` were replaced with `torch.searchsorted` when run with the custom CUDA kernel, as described in #19, but I haven't touched that since it would affect the other spline flows too.
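For reference, the hand-rolled bucketing can be reproduced with a broadcasted comparison, and `torch.searchsorted` gives the same bin indices in a single native call. This is a sketch under the assumption that bin lookup is done sum-style; the variable names are illustrative:

```python
import torch

torch.manual_seed(0)
# Hypothetical sorted bin edges (one row per spline) and query points.
bin_locations = torch.sort(torch.rand(4, 9), dim=-1).values
inputs = torch.rand(4)

# Python-level bucketing: count how many bin edges each input has passed.
idx_sum = torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1

# Native lookup: right=True returns the count of edges <= each input.
idx_native = (
    torch.searchsorted(bin_locations, inputs[..., None], right=True)
    .squeeze(-1)
    - 1
)

assert torch.equal(idx_sum, idx_native)
```

The sum-based version materializes a full `inputs x bins` comparison tensor, whereas `torch.searchsorted` does a binary search per query.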
I suppose the other spline-flow methods can be refactored in a similar way. If you'd prefer, I can make the necessary changes to them in this PR as well.
Best, Vaidotas