
`norm` at zero

Open mcabbott opened this issue 4 years ago • 3 comments

From https://github.com/JuliaDiff/ForwardDiff.jl/issues/547, note that the rule for `norm` gives a zero gradient at x = 0. It might be preferable to pick something like a sub-gradient instead?

julia> using Zygote, ForwardDiff, LinearAlgebra

julia> for g in [Zygote.gradient, ForwardDiff.gradient]
       @show g
       for f in [norm, x -> sqrt(sum(abs2, x))]
         @show f
         @show g(f, [eps(),0])
         @show g(f, [0,eps()])
         @show g(f, [0,0])
       end
       end
g = Zygote.gradient
f = LinearAlgebra.norm
g(f, [eps(), 0]) = ([1.0, 0.0],)
g(f, [0, eps()]) = ([0.0, 1.0],)
g(f, [0, 0]) = ([0.0, 0.0],)   # rule from ChainRules
f = var"#17#18"()
g(f, [eps(), 0]) = ([1.0, 0.0],)
g(f, [0, eps()]) = ([0.0, 1.0],)
g(f, [0, 0]) = ([NaN, NaN],)   # with hand-written norm, 0/0
g = ForwardDiff.gradient
f = LinearAlgebra.norm
g(f, [eps(), 0]) = [1.0, 0.0]
g(f, [0, eps()]) = [0.0, 1.0]
g(f, [0, 0]) = [0.0, 1.0]      # this picks a sub-gradient?
f = var"#17#18"()
g(f, [eps(), 0]) = [1.0, 0.0]
g(f, [0, eps()]) = [0.0, 1.0]
g(f, [0, 0]) = [NaN, NaN]

mcabbott, Oct 08 '21 17:10

[0.0, 0.0] seems right to me, but maybe I am missing something important. Breaking symmetry and choosing either [1.0, 0.0] or [0.0, 1.0] seems icky. I guess we could do fill(inv(sqrt(length(x))), length(x)), though that is also an arbitrary choice, perturbing off along a "positive diagonal".
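
A minimal sketch of what returning that constant sub-gradient at zero could look like, using a hypothetical mynorm stand-in for the Euclidean norm of a real vector rather than touching the real ChainRules method:

using ChainRulesCore, LinearAlgebra

mynorm(x::AbstractVector{<:Real}) = norm(x)   # hypothetical stand-in, for illustration only

function ChainRulesCore.rrule(::typeof(mynorm), x::AbstractVector{<:Real})
    y = mynorm(x)
    function mynorm_pullback(ȳ)
        x̄ = iszero(y) ?
            ȳ .* fill(inv(sqrt(length(x))), length(x)) :   # arbitrary "positive diagonal" sub-gradient at 0
            ȳ .* (x ./ y)                                  # usual x / norm(x) away from zero
        return (NoTangent(), x̄)
    end
    return y, mynorm_pullback
end

# Zygote.gradient(mynorm, [0.0, 0.0])  # would then give ([0.7071..., 0.7071...],)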

oxinabox, Oct 12 '21 10:10

It seems to me that norm would often be used in an optimization problem where the optimum is achieved when norm(...) == 0, so the [0, 0] gradient makes sense to me. The only other way I can think of to get exactly a zero norm is if one initialized points such that a zero norm was formed exactly, which doesn't seem like our problem.

sethaxen, Oct 12 '21 11:10

The concern would be that if x == [0,0] weren't the optimum, you could get stuck there. And you needn't initialise there; you could, for instance, be adding some noise and restricting, like x_next = clamp.(x .+ randn.()./100, 0, 1).
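
A rough sketch of that failure mode (loss and descend here are hypothetical, purely for illustration): the optimum of this loss lies on the unit circle, yet plain gradient descent started at [0, 0] never moves, because the rule returns a zero gradient there.

using Zygote, LinearAlgebra

loss(x) = (norm(x) - 1)^2          # hypothetical loss: minimised on the unit circle, not at x == 0

function descend(x; steps = 100, lr = 0.1)
    for _ in 1:steps
        g = Zygote.gradient(loss, x)[1]
        x = x .- lr .* g           # plain gradient step
    end
    return x
end

descend([0.3, 0.0])   # moves out towards the unit circle, roughly [1.0, 0.0]
descend([0.0, 0.0])   # stays at [0.0, 0.0]: the zero gradient from the norm rule traps it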

Mathematically, the answer depends on the direction from which you approach this point, which could lead you to argue that no limit exists and the right answer is then NaN. But for optimisation, it's probably better to pick one?
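
(In convex-analysis terms, the subdifferential of the Euclidean norm at the origin is the whole closed unit ball,

\[ \partial \|\cdot\|_2(0) = \{\, g \in \mathbb{R}^n : \|g\|_2 \le 1 \,\}, \]

so [0, 0], [1, 0], [0, 1] and fill(inv(sqrt(n)), n) are all valid sub-gradients; the question is only which one is most useful to return.)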

That said, this hasn't bitten me, but it did come up in the linked ForwardDiff issue.

mcabbott, Oct 12 '21 14:10