sparsemax algorithm - overview
This is a known problem. The first O(n log(n)) algorithm for solving it involved sorting. However there are many other approaches today. The problem is know as:
- continuous quadratic knapsack problem
- singly linearly constrained quadratic programs
The article "Fast algorithm for singly linearly constrained quadratic programs with box-like constraints - 2015" local has an excellent overview of current solution strategies in the introduction.
The fundamentals, O(n log(n))
The first polynomial approach for solving this problem is in Validation of subgradient optimization - 1974 local. This considers the very strict problem where D = I and x >= 0. It's a rather detailed paper, so the specific algorithm is not super clear. It doesn't look like it is polynomial bounded.
This was then generalized more (any D and b >= x >= 0 by: A POLYNOMIALLY BOUNDED ALGORITHM FOR A SINGLY CONSTRAINED QUADRATIC PROGRAM - 1980 local. This has complexity O(log(n) n). This is a very simple algorithm that uses binary search O(log(n)) together with a semi-closed-form solution, in each iteration is calculates a sum C O(n). We could either paralize the binary search O(log(n/k) * n) = O(n) or paralize the sum O(log(n)^2). I think this also requires sorting, but it's not completely clear from the text, the algorithm we have seen may in fact just be a simplification of this.
Median approach, O(n)
This idea of binary search and the semi-closed-form solution, was then improved into linear complexity solvers. The first of such is AN O(N) ALGORITHM FOR QUADRATIC KNAPSACK-PROBLEMS local, which uses the median for searching (I was confused too, but there is an O(n) algorithm for finding the median.). This algorithm is more complex, but conceptually similar to the previous. Whether this is a good strategy for GPU implementation, likely depends on how the median algorithm works.
In fact there is a bunch of papers that uses "the median approach" for solving the problem. And similar binary-search approaches. Breakpoint searching algorithms for the continuous quadratic knapsack problem - 2008 local contains a detained overviews of these. Algorithm 6.1 and 6.2 are of particular interest, since they uses the 0 <= x <= \infty bound. I think they are O(log(n) n), since they don't use the median approach. But the article also discusses the median approaches. The advantages of 6.1 and 6.2 is that they reuse parts of the sum, calculated in each iteration.
The variable fixing methods, O(n^2) but no sorting
The variable fixing methods, finds in each iteration the solution of one variable. This takes O(n^2). While this is relatively bad, they don't depend on sorting or median finding. (did the first binary-search method depend on sorting?). The first paper on this is Disaggregation and resource allocation using convex knapsack problems with bounded variables - 1981.
This was later developed into Variable Fixing Algorithms for the Continuous Quadratic Knapsack Problem - 2008. In practice this algorithm shows to be faster than simple binary search algorithm but only a little faster than median approach. The algorithm is sequential and involves a few sums, which could be parallelized. The sequential part also appears to use a lower and upper bound, indicating that it too could be parallelized, but it is not obvious. This is a nice candidate for GPU because it doesn't require sorting or medians, but it very much comes down to how easy it is to parallelize the sequential part.
On a final remark SOLVING QUADRATIC PROGRAMMING PROBLEMS ON GRAPHICS PROCESSING UNIT - 2015 discuss constrained QP much more generally (interior point), but it's focused on the GPU. It is likely too general for our case, but they show that problem-parallel solving generally is better than data-parallel solving. Which is super nice for us, but if we can do data-parallel solving you can also make a hybrid problem-data-parallel solver, which they do not discuss.