geometry icon indicating copy to clipboard operation
geometry copied to clipboard

[similarity] add similarity distance

Open barendgehrels opened this issue 2 years ago • 6 comments

This is a new similarity distance, comparable to Hausdorff or Frechet.

It calculates the area between two lines and divides them by the line length. This is an average distance, measuring quite well how similar two lines are. It is less sensible to outliers (such as Hausdorff/Frechet) and not discrete.

For example this case (west1/west2):

image

They are quite similar. Distance impressions are: Frechet: 53.4426 Hausdorff: 53.4426 Average: 5.61308

How it works in detail:

  • it projects all points on the other line
  • and vice versa
  • it adds their own points too. Then both collections have an equal amount of points, and because of the projections, their locations are comparable (especially if they are already similar)
  • it checks if one of the lines should be reversed
  • then it calculates areas of all four-corner rings and sums them
  • and it divides by the shortest length

(See the comment below for an enhanced description)

Zoomed in: image

Apart from this, this PR also adds an experimental geojson_writer

If you like it, I'll also add a version for rings. For polygons/multi it is a bit more complex (especially if the number of objects are unequal), to be worked out.

The geojson_writer is really useful for visualizing results (or debug steps). But it is not completely finished either.

Complete output of the test program:

Simplex cases: 
Frechet: 0.1 Hausdorff: 0.1 Average: 0.1
Frechet: 0.509902 Hausdorff: 0.509902 Average: 0.1
Frechet: 1.58114 Hausdorff: 1.58114 Average: 0.114

Real cases: 
Frechet: 53.4426 Hausdorff: 53.4426 Average: 5.61308
Frechet: 81.0897 Hausdorff: 81.0897 Average: 3.98684

Reversed:
Frechet: 990.622 Hausdorff: 53.4426 Average: 5.61308
Frechet: 1199.04 Hausdorff: 81.0897 Average: 3.98684

Simplified (west): 
Frechet: 494.575 Hausdorff: 494.575 Average: 61.4674
Frechet: 475.107 Hausdorff: 475.107 Average: 24.5124
Frechet: 54.7065 Hausdorff: 54.7065 Average: 0.733346
Frechet: 23.9904 Hausdorff: 23.9904 Average: 0

Simplified (east): 
Frechet: 520.245 Hausdorff: 520.245 Average: 124.301
Frechet: 74.5731 Hausdorff: 74.5731 Average: 7.53191
Frechet: 74.5731 Hausdorff: 74.5731 Average: 0.511786
Frechet: 0 Hausdorff: 0 Average: 0

Not similar: 
Frechet: 8760.18 Hausdorff: 7737.14 Average: 4925.39
Frechet: 8771.21 Hausdorff: 7746.79 Average: 4848.01

Parts (TODO):
Frechet: 831.744 Hausdorff: 831.744 Average: 7.25765
Frechet: 1094.28 Hausdorff: 1094.28 Average: 199.363

Identical: 
Frechet: 0 Hausdorff: 0 Average: 0
Frechet: 0 Hausdorff: 0 Average: 0

Performance:
Frechet: 105 ms
Hausdorff: 78 ms
Average: 498 ms

Showing that:

  • hausdorff was not symmetric (I created an issue for that)
  • frechet gives different results for reversed cases
  • frechet and hausdorff are often, but not always, the same
  • the new method is much slower (we can use an index to enhance this)
  • because frechet/hausdorff are discrete, its behavior is not good when there are points in between (see simplex 2, or the simplified cases)
  • because frechet/hausdorff use a max distance, a spike large influence (simplex 3)

I also tried to give a "partial" result, which happens if one line is partly identical to the other line. This is possible, but not yet finished. It uses the projections and then recursively finds the best division point. But this is not part of the PR.

barendgehrels avatar Jul 12 '23 18:07 barendgehrels

A rephrased description, with a bit of AI help:

To compute the Average Similarity Distance (ASD) between two linestrings, let's consider linestring P and linestring Q. The ASD is derived by comparing the geometric characteristics of the two linestrings.

  1. Projection: Each point of linestring Q is projected onto linestring P, identifying the shortest distance for projection. This projection process creates a modified version of linestring P, denoted as P'. Similarly, each point of linestring P is projected onto linestring Q, resulting in a modified version of linestring Q called Q'.
  2. Augmentation: All points of linestring P are added to P', maintaining their original position, with an offset value of 0 and increasing segment distance. Similarly, all points of linestring Q are added to Q'.
  3. Alignment: P' and Q' are sorted based on their segment index and offset values in increasing order. This sorting ensures that both collections are comparable.
  4. Reversal: If necessary, one of the linestrings is reversed to achieve better alignment between corresponding points in P' and Q'. This step ensures consistency in the comparison.
  5. Area Calculation: The areas of all quadrilaterals, formed by connecting each point with its subsequent point in both P' and Q', are calculated. The sum of these areas is computed.
  6. Normalization: The total sum of areas is divided by the shortest length of the two linestrings P and Q to normalize the ASD value.

The resulting ASD value provides an average distance measure that captures the geometric similarity between the linestrings P and Q. By incorporating projection, augmentation, alignment, and area calculation steps, the ASD offers an enhanced understanding of the similarity between the two linestrings.

barendgehrels avatar Jul 13 '23 17:07 barendgehrels

Thanks, that's interesting!

AFAIU the CS-specific operations are closest points, side and area so technically this distance should work in all coordinate systems correct?

Do I understand correctly that you wanted to show a prototype to start a discussion and you plan to implement it with Boost.Geometry structure (strategy, dispatch, etc.) if we agree that this is a good addition to the library?

Is ASD an official name for this measure? Would giving the function a name more specific than similarity_distance be a good idea? Or do you think that this should be baseline, default similarity measure in the library?

awulkiew avatar Jul 21 '23 22:07 awulkiew

Thank you both for your comments! Cool! Yes, I think similarity_distance would be a good name, it also crossed my mind. I will come back to your comments in more detail next week on Thursday, probably.

barendgehrels avatar Jul 22 '23 13:07 barendgehrels

To clarify, I'm considering whether or not giving it such general name is a good idea. There are various measures of similarity after all. And AFAIU they are not interchangeable. Unless it would be possible to define some kind of standard of similarity allowing to generate comparable results between various algorithms? So the question is which one if any should be considered as the default one in the library? If it's not possible to answer this question then maybe this algorithm should have a specific name like the other ones.

awulkiew avatar Jul 22 '23 14:07 awulkiew

AFAIU the CS-specific operations are closest points, side and area so technically this distance should work in all coordinate systems correct?

Yes, correct, I added a call for geographic coordinate system as well.

Do I understand correctly that you wanted to show a prototype to start a discussion and you plan to implement it with Boost.Geometry structure (strategy, dispatch, etc.) if we agree that this is a good addition to the library?

Yes, indeed

Is ASD an official name for this measure? Would giving the function a name more specific than similarity_distance be a good idea? Or do you think that this should be baseline, default similarity measure in the library?

Okay, I now have average_distance - which is, in fact, what it is. And I think it's a better measure for similarity than hausdorff or frechet, because that is more a "max distance" which might have a large value while all other points are close or equal, apart from the problems because it is discrete. But this might be biased and have to be researched a bit more.

barendgehrels avatar Jul 27 '23 12:07 barendgehrels

Okay, I now have average_distance - which is, in fact, what it is.

Would it be possible to calculate the similarity of areal geometries (rings or polygons) this way, i.e. pass them into this algorithm and internally calculate the similarity of their boundaries? If it is then this name would not work for them because the average distance of two intersecting polygons would always be 0.

awulkiew avatar Sep 25 '23 12:09 awulkiew