add PD3O with tests
Description
Adding the PD3O algorithm with unit tests.
Implementation based on "A new primal-dual algorithm for minimizing the sum of three functions with a linear operator". With this algorithm we get two additional algorithms for free: PAPC and Davis-Yin (sometimes called PDDY). PDHG is a special case. We can also obtain Condat-Vu and AFBA; see details in the above paper. Finally, all of the above algorithms can be combined with stochastic (variance-reduced) algorithms.
Below are some results, presented two years ago (Workshop on modern image reconstruction algorithms and practices for medical imaging), using the stochastic framework just before the 1/n weight convention was removed.
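For illustration, a minimal usage sketch based on the test code later in this thread; the import path for `PD3O` is assumed to match the other CIL algorithms, and `data` stands for any suitable CIL `ImageData`:

```python
# Sketch only: solve min_x f(x) + g(x) + h(Ax) with f smooth,
# here 0.5*||x - data||^2 + indicator(x >= 0) + alpha*||grad x||_{2,1}.
from cil.optimisation.algorithms import PD3O   # assumed location of the new class
from cil.optimisation.functions import L2NormSquared, MixedL21Norm, IndicatorBox, ZeroFunction
from cil.optimisation.operators import GradientOperator

alpha = 0.1
f = 0.5 * L2NormSquared(b=data)   # smooth term
g = IndicatorBox(lower=0)         # proximable term
h = alpha * MixedL21Norm()        # term composed with the linear operator
K = GradientOperator(data.geometry)

algo = PD3O(f=f, g=g, h=h, operator=K, initial=0 * data)
algo.run(500)

# Special case: with f = ZeroFunction(), PD3O reduces to PDHG.
pdhg_like = PD3O(f=ZeroFunction(), g=g, h=h, operator=K, initial=0 * data)
```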
Related issues/links
Closes #1890
Checklist
- [ ] I have performed a self-review of my code
- [ ] I have added docstrings in line with the guidance in the developer guide
- [ ] I have updated the relevant documentation
- [ ] I have implemented unit tests that cover any new or modified functionality
- [ ] CHANGELOG.md has been updated with any functionality change
- [ ] Request review from all relevant developers
- [ ] Change pull request label to 'Waiting for review'
Contribution Notes
Please read and adhere to the developer guide and local patterns and conventions.
- [ ] The content of this Pull Request (the Contribution) is intentionally submitted for inclusion in CIL (the Work) under the terms and conditions of the Apache-2.0 License
- [ ] I confirm that the contribution does not violate any intellectual property rights of third parties
Thanks @epapoutsellis this looks great! We're not far from releasing 24.1, but we'll be planning the next release in a couple of weeks and we can add this to the workplan.
Thanks @epapoutsellis! It looks good. I added a few todos for myself around testing and documentation. Feel free to answer them if you have time before me but I will aim to look again around the beginning of July.
Hi @epapoutsellis - I am going through this in detail and have a few questions:
- Comparing the code to the paper, I have lost track of the $AA^\top$ term. Can you point me in the right direction in the code?
- How do we want to set the default step sizes? The paper suggests convergence if $\gamma$ and $\delta$ satisfy a condition in terms of $\beta = 1/f.L$ and $\lambda = \delta\gamma$ (see the note after this list), and thus I suggest `gamma = 2./F.L` and `delta = 1./(gamma*norm_op**2)`, but I don't know if you have any experience in practice.
- I presume we raise a warning rather than an error for testing purposes, so we can compare against PDHG?
- Do we want a check-convergence option, like we have for PDHG and FISTA?
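For reference, my reading of the step-size condition, pieced together from $\beta = 1/f.L$, $\lambda = \delta\gamma$ and the defaults tried further down this thread; the exact inequality is my reconstruction and should be checked against the paper:

$$\gamma < 2\beta = \frac{2}{L_f}, \qquad \lambda\,\|AA^\top\| = \delta\gamma\,\|A\|^2 \le 1,$$

where $L_f$ is `f.L`. This would motivate defaults of the form $\gamma \approx 2/L_f$ and $\delta = 1/(\gamma\,\|A\|^2)$, as tried in the experiments below.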
> - How do we want to set the default step sizes? The paper suggests convergence if $\gamma$ and $\delta$ satisfy a condition in terms of $\beta = 1/f.L$ and $\lambda = \delta\gamma$, and thus I suggest `gamma = 2./F.L` and `delta = 1./(gamma*norm_op**2)`, but I don't know if you have any experience in practice.
Actually, playing with this further, I considered:
def test_pd3o_vs_fista(self):
    # Reference solution: FISTA on 0.5*||x - data||^2 + alpha*TV with non-negativity
    alpha = 0.1
    G = alpha * TotalVariation(max_iteration=5, lower=0)
    F = 0.5 * L2NormSquared(b=self.data)
    algo = FISTA(f=F, g=G, initial=0 * self.data)
    algo.run(200)

    # Same problem split for PD3O: smooth f, proximable g, h composed with the gradient
    F1 = 0.5 * L2NormSquared(b=self.data)
    H1 = alpha * MixedL21Norm()
    operator = GradientOperator(self.data.geometry)
    G1 = IndicatorBox(lower=0)
    norm_op = operator.norm()
    # gamma, delta: candidate default step sizes, discussed below
    algo_pd3o = PD3O(f=F1, g=G1, h=H1, operator=operator, initial=0 * self.data,
                     gamma=gamma, delta=delta)
    algo_pd3o.run(500)

    np.testing.assert_allclose(algo.solution.as_array(), algo_pd3o.solution.as_array(), atol=1e-2)
    np.testing.assert_allclose(algo.objective[-1], algo_pd3o.objective[-1], atol=1e-2)
For the proposed default step sizes:
gamma = 2./F.L
delta = 1./(gamma*norm_op**2)
PD3O does not converge to the correct solution; however, with
gamma = 0.99*2./F.L
delta = 1./(gamma*norm_op**2)
it does converge. Any thoughts?
For the step size, it sounds like the "safety factor" of 0.99 is sensible.
For your question 1 on $AA^\top$ having gone missing: maybe I misunderstand, but I read the implemented code as based on equation (4) (the code comment also states this), which is reformulated from equation (3), and the $AA^\top$ appears only in eq. (3).
Not sure if this affects whether $AA^\top$ should be included in the step-size bound as well?
Eq. (4a) contains the gradient of l^*. I don't see that in the code?
> For your question 1 on $AA^\top$ having gone missing: maybe I misunderstand, but I read the implemented code as based on equation (4) (the code comment also states this), which is reformulated from equation (3), and the $AA^\top$ appears only in eq. (3).

Aaah - I was looking at the arXiv version of the paper, which had a different eq. (4). That makes more sense.
> Eq. (4a) contains the gradient of l^*. I don't see that in the code?

The paper considers optimising $f(x) + g(x) + (h \,\square\, l)(Ax)$, whereas we are considering an objective $f(x)+g(x)+h(Ax)$. I don't know enough about infimal convolutions to know which $l$ makes the two equivalent.
For reference, the link to the published version: https://link.springer.com/article/10.1007/s10915-018-0680-3
In the lines just below eq (4) it says l^* is considered equal to zero. So perhaps this implementation assumes l^*=0. @epapoutsellis can you confirm?
Re. question 3, whether to raise a warning: I think it makes sense to allow the algorithm to run with a ZeroFunction being passed. I think this is also consistent with allowing it in PDHG and FISTA? Have you checked whether the algorithm behaves sensibly with a ZeroFunction passed, including converging to the expected result? I think having a warning is okay, to recommend that the user switch to PDHG, but the algorithm should be allowed to run. If it runs and behaves as expected I'd also be okay with omitting the warning.
> For reference, the link to the published version: https://link.springer.com/article/10.1007/s10915-018-0680-3
> In the lines just below eq. (4) it says l^* is considered equal to zero. So perhaps this implementation assumes l^*=0. @epapoutsellis can you confirm?
I think that makes sense: if $l = \iota_{0}$ then $h \square l = h$, $l^* = 0$, and thus $\nabla l^* = 0$.
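Spelling this out, a short check using the definitions of the infimal convolution and the convex conjugate (with $\iota_{0}$ the indicator of $\{0\}$):

$$(h \,\square\, \iota_{0})(x) = \inf_y \big( h(x-y) + \iota_{0}(y) \big) = h(x), \qquad \iota_{0}^*(y) = \sup_x \big( \langle x, y\rangle - \iota_{0}(x) \big) = \langle 0, y\rangle = 0,$$

so $\nabla l^* = 0$ and the corresponding term drops out of eq. (4a).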
> Re. question 3, whether to raise a warning: I think it makes sense to allow the algorithm to run with a ZeroFunction being passed. I think this is also consistent with allowing it in PDHG and FISTA? Have you checked whether the algorithm behaves sensibly with a ZeroFunction passed, including converging to the expected result? I think having a warning is okay, to recommend that the user switch to PDHG, but the algorithm should be allowed to run. If it runs and behaves as expected I'd also be okay with omitting the warning.
Yes, one of the tests compares the behaviour of PDHG and PD3O with a zero function. I think in PDHG we have the strong-convexity additions, so it is probably worth the user using PDHG over PD3O in that case.
OK, great. Then I suggest keeping the warning: allow users to run PD3O with ZeroFunction to deliberately create the special-case algorithm and compare with our PDHG implementation, and continue to have the warning recommend that the user switch to PDHG for efficiency reasons.
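A minimal sketch of the kind of check discussed here; the helper name, message and placement are assumptions for illustration, not the implemented code:

```python
import warnings
from cil.optimisation.functions import ZeroFunction

def _check_smooth_term(f):
    # Sketch: warn, but do not raise, when f is a ZeroFunction,
    # since PD3O then reduces to PDHG and PDHG has safer defaults.
    if isinstance(f, ZeroFunction):
        warnings.warn("f is a ZeroFunction: PD3O reduces to PDHG in this case; "
                      "consider using PDHG instead.", UserWarning)
```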
BTW, a few places appear to use the number 0 instead of the letter O, i.e. "PD30" instead of "PD3O"; please search and replace.
Yes, I used the equations from https://doi.org/10.1007/s10915-018-0680-3.
- No need to cover the inf-conv case for now. Maybe an `InfimalConvolutionFunction` class could be useful in the future and could compute the primal-dual gap for the PD3O case. This is why I included only the primal objective in `update_objective`.
- Better to allow `f=ZeroFunction`, with a warning. In this case, the user can go back to PDHG, which is safer in terms of default step sizes.
- For the step sizes, better to have `0.99*2`, which is close to theory. In practice, it depends on the number of iterations used for the power method and on the projection operator (shape of ig, ag, etc.). "Better" step sizes are in https://arxiv.org/pdf/2201.00139, Table 2. This is also related to the previous point. In theory, you can use any parameter in $(0,\infty)$ when `f=0`. In practice, I noticed that you need to tune this parameter. Another option is to use an `epsilon/2*L2NormSquared()` instead, where `epsilon` is small.
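For illustration, a sketch of that last alternative; the value of `epsilon` is arbitrary here and would need tuning:

```python
from cil.optimisation.functions import L2NormSquared

# Instead of f = ZeroFunction(), use a small smooth quadratic so that f.L is
# finite and a default step-size rule gamma ~ 2/f.L can still be applied.
epsilon = 1e-3                       # assumed small value, needs tuning
f = (epsilon / 2) * L2NormSquared()  # epsilon/2 * ||x||^2, with f.L = epsilon
```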
> - For the step sizes, better to have `0.99*2`, which is close to theory.
Thanks Vaggelis, I think I have done this.
> In theory, you can use any parameter in $(0,\infty)$ when `f=0`. In practice, I noticed that you need to tune this parameter. Another option is to use an `epsilon/2*L2NormSquared()` instead, where `epsilon` is small.
I have set the same default as PDHG when `f=0`, i.e. `1.0/operator.norm()` for both `gamma` and `delta`.
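Putting the two cases together, a sketch of the default step-size logic as I read it from this thread; the function name and structure are placeholders, not necessarily the implemented code:

```python
from cil.optimisation.functions import ZeroFunction

def default_step_sizes(f, operator):
    # Sketch of the defaults discussed above.
    norm = operator.norm()
    if isinstance(f, ZeroFunction):
        # PDHG-like special case: gamma = delta = 1/||A||
        gamma = 1.0 / norm
        delta = 1.0 / norm
    else:
        # Smooth f: gamma just below 2/L, delta so that gamma*delta*||A||^2 <= 1
        gamma = 0.99 * 2.0 / f.L
        delta = 1.0 / (gamma * norm ** 2)
    return gamma, delta
```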
@epapoutsellis @jakobsj - please can I revive this PR. Do you have any more comments?
Jenkins is happy