Fixed the spectral normalization
Can you give a simple usage example for this, and/or a general idea of how it should be used?
Hi Mike,
This is the regularization technique described in this paper, https://arxiv.org/abs/1705.10941, and it is intended as a drop-in replacement for weight decay. The crucial difference from the popular weight decay is that it regularizes the Lipschitz constant of the final network, which seems to be important, for example, for training GANs: https://openreview.net/forum?id=B1QRgziT-&noteId=SJok1XB-f
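Concretely (this is my summary of the paper, so take the notation with a grain of salt): for each weight matrix W the penalty added to the loss is

R(W) = \frac{\lambda}{2}\,\sigma(W)^2, \qquad \nabla_W R(W) = \lambda\,\sigma(W)\,u_1 v_1^\top,

where \sigma(W) is the largest singular value of W and u_1, v_1 are the corresponding singular vectors. The implementation below estimates them cheaply with one power-iteration step per optimiser call.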
My latest implementation looks as follows:
"""
Spectral norm regularization as proposed in
Spectral Norm Regularization for Improving the Generalizability of Deep Learning, Yuichi Yoshida, Takeru Miyato, 2017
https://arxiv.org/pdf/1705.10941.pdf
"""
function spectral(p::Flux.Optimise.Param, λ::Real)
    # Only 2-D weight matrices have a spectral norm to regularize;
    # biases and higher-order tensors are left alone.
    if ndims(p.x) != 2
        return () -> nothing
    end
    n, m = size(p.x)
    # Running estimates of the leading left/right singular vectors,
    # refined by one power-iteration step per optimiser call.
    u = similar(p.x, n)
    u .= randn(n)
    v = similar(p.x, m)
    v .= randn(m)
    function ()
        u .= p.x * v           # u ← W v, so ‖u‖ ≈ σ when v is normalized
        v .= (u' * p.x)'       # v ← W' u, so ‖v‖ ≈ σ‖u‖
        σ = norm(v) / norm(u)  # estimate of the largest singular value
        v ./= norm(v)
        u ./= norm(u)
        # ∇_W (λ/2)σ(W)² = λ σ u v' for normalized singular vectors u, v
        p.Δ .+= λ * σ * u * v'
        nothing
    end
end
function spectralnorm(A, i = 1000)
    # Estimate the spectral norm (largest singular value) of A
    # by i steps of power iteration.
    n, m = size(A)
    u = similar(A, n)
    u .= randn(n)
    v = similar(A, m)
    v .= randn(m)
    for ii in 1:i
        v ./= norm(v)
        u ./= norm(u)
        u .= A * v       # u ← A v
        v .= (u' * A)'   # v ← A' u
    end
    norm(v) / norm(u)    # ≈ σ₁(A)
end
SpectralADAM(ps, η = 0.001; β1 = 0.9, β2 = 0.999, ϵ = 1e-8, λ = 0) =
    Flux.Optimise.optimiser(ps,
        p -> Flux.Optimise.adam(p; η = η, β1 = β1, β2 = β2, ϵ = ϵ),
        p -> spectral(p, λ),
        p -> Flux.Optimise.descent(p, 1))
# unit test: for a symmetric matrix, the spectral norm equals the
# largest absolute eigenvalue (eig is the Julia 0.6-era API)
# A = randn(5, 5)
# A = A + A'
# maximum(abs.(eig(A)[1])) - spectralnorm(A)   # should be ≈ 0
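To answer the usage question above: a minimal sketch (untested; it assumes the 2017-era Flux optimiser API that the code above targets, and the model, λ, and dummy batch are made up purely for illustration):

using Flux

m = Chain(Dense(10, 5, relu), Dense(5, 2))
loss(x, y) = Flux.mse(m(x), y)

# behaves like ADAM plus the spectral penalty on every 2-D weight
# matrix in the model; λ controls the regularization strength
opt = SpectralADAM(params(m), 0.001; λ = 0.01)

data = [(randn(10, 100), randn(2, 100))]  # one dummy batch
Flux.train!(loss, data, opt)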
That said, I confess that the results I am getting are very weird. I thought it would be good to have it in Flux, at least for the sake of completeness.
Best wishes, Tomas
@pevnak are you still interested in pursuing this? I see the authors released a follow-up paper at https://arxiv.org/abs/1802.05957.
Bump on this @pevnak. If this is too far in the rearview mirror, I'd suggest we open an issue and close this PR. That way it's clear what work is up for grabs.
This type of normalization doesn't seem to be used in current practice; I don't think it's worth opening an issue unless someone is interested in it.