ITKMinimalPathExtraction icon indicating copy to clipboard operation
ITKMinimalPathExtraction copied to clipboard

Gradient sometimes zero (which prevents a gradient decent optimizer from decenting)

Open romangrothausmann opened this issue 6 years ago • 46 comments

I have experienced many cases where I can extract a path with the IterateNeighborhoodOptimizer but where the GradientDescentOptimizer does not start to decent even though the speed function is continuous (based on a binary segmentation), apparently because the initial local gradient is calculated to be zero (even though start- and end-point, speed function etc are kept the same). I have a vague feeling that this thresholding: https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/6ae35b9d044ff2be22ce42776c64258be9252313/include/itkSingleImageCostFunction.hxx#L145-L152 to a hard coded value of 15.0: https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/6ae35b9d044ff2be22ce42776c64258be9252313/include/itkSingleImageCostFunction.hxx#L140 could be the culprit.

Another possible reason I could imagine is that the floating point precision might not suffice under some circumstances to calculate a non-zero gradient even if there is no extrema in the cost function at the current optimizer position.

I use this CLI for the testing: https://github.com/romangrothausmann/ITK-CLIs/blob/bfb1312142d505cacd6770e4d5acc23475290c8f/min-path_seg.cxx

romangrothausmann avatar Apr 12 '19 09:04 romangrothausmann

Where could I set the precision used for the gradient calculation or is it hard-coded somewhere to float or double?

romangrothausmann avatar Apr 12 '19 09:04 romangrothausmann

While testing that #62 works as expected, even setting DerivativeThreshold of 1e9 did not solve the issue reported here, that under some (unknown) circumstances no gradient decent optimizer starts to run, apparently because the computed gradient is zero even though the IterateNeighborhoodOptimizer does find lower values even with the same step-size. @thewtex Any idea what could cause this? Can you re-open the issue (I can't)?

romangrothausmann avatar Apr 16 '19 08:04 romangrothausmann

It seems float is used by default for the gradient calculation https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/2ff1ab58f30842ac57d76c8bc9462fcbc7523e67/include/itkPhysicalCentralDifferenceImageFunction.h#L41-L43 and to my understanding there is no direct way to override this default when instantiating SpeedFunctionToPathFilter nor by separate instantiating its SingleImageCostFunction, because the PhysicalCentralDifferenceImageFunction https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/2ff1ab58f30842ac57d76c8bc9462fcbc7523e67/include/itkSingleImageCostFunction.h#L103 cannot be re-set. However, the final default that gets used seems to come from the superclass CostFunctionTemplate< TInternalComputationValueType >::ParametersValueType https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/2ff1ab58f30842ac57d76c8bc9462fcbc7523e67/include/itkSingleImageCostFunction.h#L91 Is there any way to set this to double when using SpeedFunctionToPathFilter in a program (e.g. in min-path_seg) without modifying the code of this module by adding another template parameter for this? Could the limited precision be the source for this issue?

romangrothausmann avatar Apr 16 '19 08:04 romangrothausmann

@romangrothausmann we could also change the default to double if you find that it corrects behavior.

thewtex avatar Apr 23 '19 22:04 thewtex

Many thanks @thewtex for your reply. How can I specify to use double in this case? I.e. how can I specify the template parameter TInternalComputationValueType of the superclass when it is inherited https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/2ff1ab58f30842ac57d76c8bc9462fcbc7523e67/include/itkSingleImageCostFunction.h#L52-L53 I cannot find where the default is set for this. To my understanding that must happen somewhere, but there seems no default specification for template< typename TInternalComputationValueType >. Or what would be other hacky ways that would suffice for testing?

romangrothausmann avatar Apr 24 '19 08:04 romangrothausmann

It looks like the default is already double:

https://github.com/InsightSoftwareConsortium/ITK/blob/99aaeaf1a553866f1e36d4ec363d2a37e1377165/Modules/Numerics/Optimizers/include/itkCostFunction.h#L67

thewtex avatar Apr 24 '19 15:04 thewtex

Many thanks @thewtex for pointing that out. I was not aware of such a way to define a default and was only used to something like https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/2ff1ab58f30842ac57d76c8bc9462fcbc7523e67/include/itkPhysicalCentralDifferenceImageFunction.h#L43

@thewtex could you re-open the issue, because it does not seem to be resolved and since the default already is double I've currently no idea what else could be the source for the problem, do you?

romangrothausmann avatar Apr 25 '19 07:04 romangrothausmann

I am not sure -- unfortunately, I do not have time to dig into it at the moment.

thewtex avatar Apr 25 '19 11:04 thewtex

Edited the issue description, now mentioning that the speed image is based on a binary segmentation image (dilated by a parabolic SE in a simple case), so noise should not be the issue.

romangrothausmann avatar Apr 25 '19 11:04 romangrothausmann

How about using an inverse distance map generated from the segmentation for the speed image?

thewtex avatar Apr 25 '19 12:04 thewtex

Thanks @thewtex for that idea, which would be much quicker to generate than my current approach using a fast-marching map of the 3D skeleton (inside the segmentation) squeezed into the range [0; 1] by a sigmoid. However, this has the advantage that the speed along the center line is always close to 1 independent of the local extent of vessels (which vary by a factor of 1000 in our dataset).

Still, I think that another speed map would only represent a work-around for the issue here, because I would expect the local gradient not to equal zero if IterateNeighborhoodOptimizer does start to propagate at the start-point, even reaching the end-point.

romangrothausmann avatar Apr 25 '19 12:04 romangrothausmann

Just to mention that I use this CLI for the testing: https://github.com/romangrothausmann/ITK-CLIs/blob/bfb1312142d505cacd6770e4d5acc23475290c8f/min-path_seg.cxx

romangrothausmann avatar Apr 25 '19 12:04 romangrothausmann

Since the testing CLI uses linear interpolation, #63 seems not to be the source for this issue.

romangrothausmann avatar Apr 29 '19 10:04 romangrothausmann

The gradient calculation uses the image spacing https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/2ff1ab58f30842ac57d76c8bc9462fcbc7523e67/include/itkPhysicalCentralDifferenceImageFunction.hxx#L64 whereas the iterateNeighborhoodOptimizer uses its neighborhood size for the stepping https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/2ff1ab58f30842ac57d76c8bc9462fcbc7523e67/src/itkIterateNeighborhoodOptimizer.cxx#L191-L193 Could this be the source for the issue?

romangrothausmann avatar Apr 29 '19 10:04 romangrothausmann

I have tracked down what I believe to be the problem, and will attempt to generate tests to show it.

The root of the issue is here

The fast marching tools generate arrival time images that the gradient descent algorithm tracks through. When setting up the arrival image, all voxels are set to the maximum floating point value. If any of these voxels remain in the neighbourhood of the end point, the gradient calculation is messed up.

The default values are OK for images with unit spacing and unit speed functions, but break down for fractional spacing etc. It turns out that the units for TargetOffset are voxels, and thus is shouldn't depend on the tolerance setting.

TargetOffset should be 2. No need to depend on tolerance.

richardbeare avatar Apr 30 '19 05:04 richardbeare

Many thanks @richardbeare for looking into this. I had stumbled over https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/2ff1ab58f30842ac57d76c8bc9462fcbc7523e67/include/itkSpeedFunctionToPathFilter.hxx#L94 but wasn't sure whether FastMarching would expect voxel or pyhsical offset (and forgot to investigate furhter). Please go ahead an open a PR that removes the multiplication and I will try to test if that solves the issue.

romangrothausmann avatar Apr 30 '19 10:04 romangrothausmann

I have been trying to reproduce the behaviour in the existing tests and have lots of other problems. Going to take some work to sort it out.

I now think the TargetOffset should probably be:

marching->SetTargetOffset( 2.0 * maximumvoxelspacing); 

But will experiment further to sort it out.

The current tests do break when I modify the test images to have much smaller voxels, but part of the problem is the initial step size is way too large - the first point on the rendered path is nowhere near the correct starting point. It isn't giving the error that the gradient is initially zero. Perhaps there is something specific to the 3D case that caused that error.

richardbeare avatar Apr 30 '19 12:04 richardbeare

Thinking further - it seems to me that there is no approach using the TargetOffset that will guarantee that all pixels in the immediate neighbourhood have the arrival function evaluated. A guess can be made if the value of the speed function in the neighbourhood is known, but this sounds error prone. I think the most reliable option will be to fill the neighbourhood with dummy target points, which get used to terminate the fast marching step but aren't used by the path phase.

richardbeare avatar Apr 30 '19 23:04 richardbeare

Yet more thinking and playing around with tests of fractional voxels, and the existing test images don't trigger the problem we saw with Roman's data. However things are clearer to me now. If the backtracked path passes a voxel that isn't visited by the fast marching filter then there will be problems in the gradient calculation. This could happen if the face connected neighbourhood of the endpoint includes such points, or if the path transits the edge of a mask. In general the result will be an incorrect gradient of some sort.

The best option may be to modify the PhysicalCentralDifferenceImageFunction to ignore neighbours with the fast marching marker value, and use a one sided difference instead in those cases.

richardbeare avatar May 03 '19 13:05 richardbeare

Many thanks @richardbeare for the thorough investigations.

existing test images don't trigger the problem we saw with Roman's data.

I could imagine that it could be the small voxel size of my data (0.00015) or the fact that it is quite a large dataset (3499 x 3499 x 2796 to be precise). Passing a "voxel that isn't visited by the fast marching filter" is surely critical, and I have experienced that before, but then the gradient decent did start and just "got stuck" before reaching the end. However in this case the gradient decent does not even start, and I've checked that its close proximity is definitly not masked.

The best option may be to modify the PhysicalCentralDifferenceImageFunction to ignore neighbours with the fast marching marker value, and use a one sided difference instead in those cases.

That's a great idea, currently it used the left and the right value for the partial derivative https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/2ff1ab58f30842ac57d76c8bc9462fcbc7523e67/include/itkPhysicalCentralDifferenceImageFunction.hxx#L73-L74 This could lead to a very large gradient (even in the wrong direction) in case one side is not reached by the fast marching filter which then gets thresholded (even with a DerivativeThreshold of 1e9): https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/2ff1ab58f30842ac57d76c8bc9462fcbc7523e67/include/itkSingleImageCostFunction.hxx#L149-L152

Another way would be to extent the gradient decent code with that from the neighbourhood optimizer in case the gradient is found to be zero. This however would (I think) the introduction of a new optimizer, whereas @richardbeare idea would just touch the PhysicalCentralDifferenceImageFunction and nothing more.

@richardbeare how to decide which side to take in such cases? Would it be possible to mix them in case the border of the cost function is reached for one partial derivative to the right and for the other to the left? I guess in the end it is better to have an "odd" gradient than none.

romangrothausmann avatar May 04 '19 09:05 romangrothausmann

Ah, I hadn't noticed that threshold. It certainly explains why the starting derivative can be zero. I was assuming that a large starting gradient caused a jump into an illegal part of the image.

The other way in which the effect may happen, without the target point touching the mask, is due to termination of the fast marching stages. I am attempting to create this effect in some tests.

richardbeare avatar May 04 '19 10:05 richardbeare

Yes, I think, considering your findings, that the initial gradient computation leads to a very large gradient (possibly even in false direction) "due to termination of the fast marching" just at the start point, which then gets zeroed by that thresholding. A one-sided partial derivative is likely able to solve this issue. Currently, I cannot say why this problem is not always hit.

romangrothausmann avatar May 04 '19 19:05 romangrothausmann

Slightly more progress - I've managed to reproduce the situation where the initial gradient is zero. It happens when the values in the speed image are very low, leading to neighbours remaining unevaluated by fast marching. This leads to a very large gradient that gets set to zero by the magnitude check.

Checking for illegal neighbours before computing the gradient is definitely the way to address these problems. The question is how to fit it into the current framework. We could hard-code the value used to check for illegality, much along the lines of the current gradient threshold. It doesn't seem possible to plug different gradient calculators into the SingleImageCostFunction, thus we can't make a "safe" version of PhysicalCentralDifferenceImageFunction to plug in at user discretion.

The trickiest part of the problem is deciding on the design. Once the value is known the rest is simple, but messy looking.

richardbeare avatar May 06 '19 11:05 richardbeare

I've managed to reproduce the situation where the initial gradient is zero. It happens when the values in the speed image are very low, leading to neighbours remaining unevaluated by fast marching.

That's really great! Many thanks @richardbeare for finding that problem, I would not have expected it there. Odd though, that it can happen that some voxels remain unevaluated by fast marching. @thewtex is that a known issue of the current implementation of the fast marching filter? If so, should that be addressed or is it inevitable?

The question is how to fit it into the current framework. We could hard-code the value used to check for illegality, much along the lines of the current gradient threshold.

Well, I'd go for a hard-coded value only if it is definitely fixed, otherwise a hard-coded default that could possibly changed at user-level (like #62).

It doesn't seem possible to plug different gradient calculators into the SingleImageCostFunction, thus we can't make a "safe" version of PhysicalCentralDifferenceImageFunction to plug in at user discretion.

Hm, wouldn't it be possible with a change of the implementation of SingleImageCostFunction, making providing another PhysicalCentralDifferenceImageFunction an optional template parameter?

This leads to a very large gradient that gets set to zero by the magnitude check.

What I do wonder though: How comes that this applies to all partial derivative gradients? Or the other way round, why is not even a single partial derivative gradient OK (if the NN-opt runs)? Sure, the direction would not be appropriate, but at least that should at least make the gradient decent optimizer to advance one step.

romangrothausmann avatar May 07 '19 08:05 romangrothausmann

OK, Less simple than I thought. The PhysicalCentralDifferenceImageFunction uses the interpolator to determine image values one voxel away along each axis. My original thought was to check whether those neighbours were legal, and not use the interpolation result if they aren't. However the interpolator is using the entire neighbourhood of the neighbouring voxel, and thus will return a spurious value if one of those neighbours is illegal. e.g.

suppose we're checking the voxel to the "left" on the x dimension. First we check whether the voxel at the left index is legal. If it is, then we ask for the interpolation result. However the interpolator uses all neighbours of that voxel (unless we're lined up with the voxel centres).

Not sure how to address this. Ideal case would be to pass a mask to the interpolator to indicate which values are legal.

Perhaps a special interpolator is the way to go.

richardbeare avatar May 07 '19 10:05 richardbeare

I've implemented a safe interpolator and some tests. The code is in the SafeInterpolator branch of my repo.

As pointed out in #63, interpolators are used in two places, and the problem with large neighbouring values can happen in either. Currently I have modified PhysicalCentralDifference to use the new interpolator, but it would be better if we allowed it to be set via the cost function.

I expect that this change will allow better behaviour for speed functions through funny shaped masks, but I'd image that really rough speed functions are likely to continue to cause problems. Not sure there's any way around that.

There is a definite need for the user to set learning rates and termination values and derivative thresholds based on voxel size and speed function values, and we probably should provide more documentation about that as it is extremely difficult to figure out why errors occur.

May be that the regular step gradient descent is much more robust.

richardbeare avatar May 08 '19 02:05 richardbeare

The reason the fast marching leaves pixels unevaluated (besides the zero speed ones), is that it stops after reaching all the targets. There is a parameter that causes it to proceed slightly beyond the target, and this is set by default, but sometimes the speed can be low enough for the front to not reach the neighbouring pixel.

richardbeare avatar May 08 '19 03:05 richardbeare

I've implemented a safe interpolator and some tests. The code is in the SafeInterpolator branch of my repo.

Many thanks @richardbeare, I wanted to try and had a look but I can't find an implementation for itkLinearInterpolateSelectedNeighborsImageFunction, which is used here: https://github.com/richardbeare/ITKMinimalPathExtraction/commit/5e6dcb30da2066bddad89d4e3e3f6509d7c39387#diff-709a2cd1ccbb3cef49595927b0bdaa9dR112

suppose we're checking the voxel to the "left" on the x dimension. First we check whether the voxel at the left index is legal. If it is, then we ask for the interpolation result. However the interpolator uses all neighbours of that voxel (unless we're lined up with the voxel centres).

Hm, how about not interpolating but using the left voxel value (i.e. voxel center)? The code already uses the image spacing, so I'd say it already aims to take the value of the voxel center: https://github.com/InsightSoftwareConsortium/ITKMinimalPathExtraction/blob/2ff1ab58f30842ac57d76c8bc9462fcbc7523e67/include/itkPhysicalCentralDifferenceImageFunction.hxx#L64 That way we could avoid the need for using an interpolator and would indirectly resovle #63.

I'd image that really rough speed functions are likely to continue to cause problems.

Perhaps some form of smoothing the speed function would help there. At least that would be something for a user to inspect and take care of, i.e. would not need a modification of the current code. As far as I can see, for me it would be fine if the gradient decent starts running but stops before reaching the end. In that case I would know I would have to investigate the local situation including checking the speed function at that point. In the case here, the problem is that it is not the speed function but the cost function that is normally not directly accessible under common usage.

The reason the fast marching leaves pixels unevaluated (besides the zero speed ones), is that it stops after reaching all the targets.

I see. The way I understood your comment above was, that I thought you meant even on the way of the fast marching front some pixel would not be evaluated, even far off from the target or zero speed pixel. No evaluation for zero speed and not beyond the offset of reaching the target is clear.

romangrothausmann avatar May 08 '19 08:05 romangrothausmann

@romangrothausmann I've added yet another branch with a version allowing the interpolator to be changed by the user. The default is to keep the linear interpolator, but now the user can modify the cost function interpolator, and the cost function will provide that interpolator to the gradient function on initialization. There may be some problems if the interpolator gets changed after initialization, but I don't think that is the usual use case. I don't think these changes break any existing code, but see how you go.

richardbeare avatar May 10 '19 00:05 richardbeare

Tested with ITK @ https://github.com/romangrothausmann/ITK/commit/47768d88e02038cd89c9dc89208a5299b0cb5d65 using https://github.com/richardbeare/ITKMinimalPathExtraction/commit/2f54421df763e8121c5a3f828e630c85769c81c0 for ITKMinimalPathExtraction. Setting the special interpolator works (https://github.com/romangrothausmann/ITK-CLIs/commit/b6b9667392016c2f02797a1b65866b71201807aa). With that the gradient decent optimizer did start for our test case and got as close to the next way point as demanded by the termination value set. However, for the following run to the next target point the gradient decent optimizer again did not start to run due to a zero gradient. I could imagine that the last point from the grad. dec. opt. from the former run was still to far away from the way-point to fall into the valid region of the newly calculated cost function.

romangrothausmann avatar May 15 '19 10:05 romangrothausmann