gtfs-validator

Potential false positives for equal_shape_distance_diff_coordinates

Open isabelle-dr opened this issue 2 years ago • 1 comments

Problem

We hear from users that equal_shape_distance_diff_coordinates (which is currently an error) is often present in datasets that contain shapes, and the work needed to fix this issue in the datasets gives an incentive for users not to use shapes at all.

This rule was initially implemented in PR #1083, alongside two others:

  • decreasing_shape_distance: error, and
  • equal_shape_distance_same_coordinates: warning,

with the intention of validating against the shapes.txt reference:

Values must increase along with shape_pt_sequence; they must not be used to show reverse travel along a route.

What to do

Revisit whether the conditions that trigger equal_shape_distance_diff_coordinates should really be an error: talk to the community, and analyze production data. Consider lowering the severity to a warning and opening a discussion in the specification to make it clearer.

Additional Context

These three rules were initially created to replace the decreasing_or_equal_shape_distance notice because this rule was triggered by two things that deserved to be treated differently:

  1. shape_dist_traveled decreases between two consecutive shape points (which is a clear violation of the spec)
  2. shape_dist_traveled is equal between two consecutive shape points (also a violation but is not as big of a problem)

By digging deeper into number 2 above, we noticed that we were seeing two cases in production data:

  2.1 shape_dist_traveled is equal between two consecutive shape points and the lat/long coordinates are equal (which seems fine)
  2.2 shape_dist_traveled is equal between two consecutive shape points and the lat/long coordinates are not equal (which seems like a problem, but it could be caused by the scheduling software that rounds shape_dist_traveled when the two shape points are really close)

We went ahead and made our own interpretation of the specification based on what we saw in the production data: condition 2.1 would be a warning, whereas conditions 1 & 2.2 would be errors, which is slightly less strict than the spec that strictly mentions "must increase".
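The interpretation described above can be sketched in a few lines. This is only an illustration of the three-way split, not the validator's actual implementation: the `ShapePoint` type and `classify_pair` function are hypothetical names, and the comparison uses exact equality on the values as read from shapes.txt.

```python
from typing import NamedTuple, Optional, Tuple


class ShapePoint(NamedTuple):
    """One row of shapes.txt (hypothetical helper type for this sketch)."""
    lat: float
    lon: float
    dist: float  # shape_dist_traveled


def classify_pair(prev: ShapePoint, curr: ShapePoint) -> Optional[Tuple[str, str]]:
    """Return (notice_code, severity) for two consecutive points, or None."""
    if curr.dist < prev.dist:
        # Condition 1: distance decreases.
        return ("decreasing_shape_distance", "ERROR")
    if curr.dist == prev.dist:
        if (curr.lat, curr.lon) == (prev.lat, prev.lon):
            # Condition 2.1: same distance, same coordinates.
            return ("equal_shape_distance_same_coordinates", "WARNING")
        # Condition 2.2: same distance, different coordinates.
        return ("equal_shape_distance_diff_coordinates", "ERROR")
    return None  # distance strictly increases: no notice
```

For example, `classify_pair(ShapePoint(48.0, -124.0, 10.0), ShapePoint(48.0, -124.0, 10.0))` falls under condition 2.1 and yields the warning.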

isabelle-dr avatar Sep 19 '22 16:09 isabelle-dr

Thank you for reporting a bug. The issue has been placed in triage; the MobilityData team will follow up on it.

github-actions[bot] avatar Oct 03 '22 16:10 github-actions[bot]

Posting on the behalf of Marcy Jaffe with the National RTAP.

I'd like to offer training and recommend your MD Schedule Validator tools, if it might be possible to reconsider equal_shape_distance_diff_coordinates as a warning (vs. error).

When Google validates, it is a warning:

[Screenshot 2023-01-24 at 9 29 53 AM]

For MobilityData it is an error, which I will need to advise in my trainings that they can ignore, which might mean they want to ignore other errors, which I do not want:

[Screenshot 2023-02-02 at 4 33 01 PM]

It will be nearly impossible and offer very little benefit to go in for literally 100 rows of data and delete the nearby points. Riders will not have a much better experience, and I believe some agencies will not want to manage their GTFS! Might you please consider revising this advisory to a warning?

isabelle-dr avatar Feb 02 '23 21:02 isabelle-dr

I am tempted to downgrade this notice, and propose a modification to the spec from:

Values must increase along with shape_pt_sequence

to:

Values must not decrease along with shape_pt_sequence

@bdferris-v2 thoughts?

isabelle-dr avatar Feb 02 '23 21:02 isabelle-dr

Is it possible to configure the rules' severities when using it? (Without rebuilding the .jar/Docker image.)

For my use cases, I'd like to have the option to treat equal_shape_distance_diff_coordinates as an error, even if it is decided here that it should be a warning.

If this is possible already, why not let people who focus on small/rural GTFS providers re-configure equal_shape_distance_diff_coordinates to warning? I'm not saying that their perspective isn't relevant, but I think there is a trade-off to be made between small providers – who might not have the technical resources to produce high-quality GTFS feeds – and big metropolitan or even national providers – who should be encouraged to follow the spec (and its implicit intentions) rather strictly. Because I see this range of sophistication as inevitable, I'd rather opt for stricter defaults.

derhuerst avatar Feb 03 '23 20:02 derhuerst

@isabelle-dr do you have a GTFS data set that exhibits this issue?

KClough avatar Feb 18 '23 17:02 KClough

@KClough I have requested it! I agree with @derhuerst's point of view, this should stay an error in this validator (unless we change the spec).

Is it possible to configure the rules' severities when using it?

We are working on it 🙃

I have the impression that some vendor tools create this issue in a systemic way. It might be worth digging into how shapes.txt is created... An immediate action item might be to update the documentation to explain to users when it's acceptable to ignore this issue.

isabelle-dr avatar Mar 03 '23 15:03 isabelle-dr

Edit on what I've just said: I am not entirely sure error is the right severity, and I'd like to take a data-driven approach to figure this out. @KClough: can you get the list of all the datasets in the Mobility Database that trigger this notice? We can attempt to draw patterns based on what we see in production.

isabelle-dr avatar Mar 03 '23 15:03 isabelle-dr

@KClough here are three datasets that trigger this rule and also the equal_shape_distance_same_coordinates.

Interior Alaska https://www.dropbox.com/s/5es8jexp0qpmsxd/interiorak_google_transit.zip?dl=0

This feed also has warning https://www.dropbox.com/s/h16ny11hlln3k9h/centraltransit_google_transit.zip?dl=0

While this agency has related warning & error https://www.dropbox.com/s/8c8zjbp89fff0do/makah_google_transit.zip?dl=0

And here is Marcy's answer to my question: how do you create the shape files?

Rural agencies without GIS staff produce higher-quality GTFS with shapes generated using MyMaps, guided by the stops along the way, in the direction of travel (see this file):

https://www.google.com/maps/d/edit?mid=1WuMlxgYa-NCZLZAuyQdwFhDnq6gvsGk&usp=sharing

and then export the shape as KML , name the shape_id

At times a point is near another point along the route & voila: an error. I've tried to search for duplicate values and delete them, but with multiple routes there are too many values & this was not a quick fix.

isabelle-dr avatar Mar 03 '23 21:03 isabelle-dr

A comment from our slack channel on this issue

Because the shape is a GPS trace, is it possible to quantize the coordinates to the 1 m level, i.e. remove the second coordinate with the same dist?
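One way to read the second half of that suggestion is sketched below, under stated assumptions: the rows belong to a single shape_id, are already sorted by shape_pt_sequence, and have been parsed into dicts with a numeric `shape_dist_traveled`. The function name `drop_equal_distance_points` is hypothetical, and this is a feed clean-up idea rather than anything the validator does.

```python
def drop_equal_distance_points(points):
    """Keep the first point of every run that shares one shape_dist_traveled.

    `points` is a list of dicts for a single shape, sorted by shape_pt_sequence.
    """
    kept = []
    for point in points:
        if kept and point["shape_dist_traveled"] == kept[-1]["shape_dist_traveled"]:
            # Same distance as the last kept point: skip the later point.
            continue
        kept.append(point)
    return kept
```

Whether dropping such points is acceptable depends on the feed; for points that are a fraction of a meter apart the drawn shape should barely change.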

isabelle-dr avatar Mar 24 '23 18:03 isabelle-dr

It looks like a reasonable next step is to see how this user is generating shapes and whether the file can be cleaned up. I can commit to doing this in the next few weeks. @KClough, I'd still be interested to get the list of datasets that trigger this notice to have a closer look.

isabelle-dr avatar Mar 24 '23 18:03 isabelle-dr

@isabelle-dr Currently the notice data does not include the lat, long for the affected row. Adding that would be one way of getting all the data needed to test the Mobility Database data in the compiled reports that are run as part of the GitHub Actions, and we could then potentially use some JSON querying tools to check stats on that output. This could be done as a test PR if it's not desirable to add those fields into the official output.

Alternatively or additionally, the Cal-ITP project could potentially query this information for the feeds in their database, if we want to make a request to them, but would be limited to California data.

briandonahue avatar May 30 '23 21:05 briandonahue

After a discussion with @qcdyx, the strategy to solve this issue is:

  • we are assuming that a portion of these notices come from a precision issue of the software creating shape files: there are two very close shape points that have distinct lat/lon values, but the shape_dist_traveled field is the same.
  • pull the actualDistanceBetweenShapePoints field from all datasets from the Mobility Database that trigger equal_shape_distance_diff_coordinates.
  • plot it on a histogram with frequency on the y and shape_dist_traveled value on the x. Then, assess based on what we see:
    • do we have a clear threshold that has the majority of the values below it?
    • if so, would it be reasonable to consider values before the threshold as equal_shape_distance_same_coordinates (which is a warning)
    • if so: does this need a spec amendment?
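The assessment step could be prototyped without any plotting library, as in the rough sketch below. It bins whatever values are passed in and reports the share below a candidate threshold. The function names are hypothetical, and the sample values in the usage note are made up for illustration, not taken from the Mobility Database.

```python
import math
from collections import Counter


def bin_counts(values, bin_width):
    """Histogram as {bin_start: count}, with bins of width `bin_width`."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {round(i * bin_width, 6): counts[i] for i in sorted(counts)}


def share_below(values, threshold):
    """Fraction of values strictly below `threshold` (0.0 for empty input)."""
    if not values:
        return 0.0
    return sum(1 for v in values if v < threshold) / len(values)
```

With made-up distances such as `[0.1, 0.4, 0.6, 0.9, 1.2, 5.0]` (in meters), `bin_counts(..., 0.5)` groups them into half-meter bins and `share_below(..., 1.0)` gives the fraction under one meter, which is the kind of number needed to judge whether a clear threshold exists.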

isabelle-dr avatar Aug 02 '23 13:08 isabelle-dr

Do we have agreement on downgrading equal_shape_distance_diff_coordinates from error to warning?

Based on my observation, equal_shape_distance_diff_coordinates happens when two consecutive points are very close. For example, the actualDistanceBetweenShapePoints for the previous point (lat 48.36919, long -124.63073) and the current point (lat 48.36919, long -124.63074) is 0, so these two consecutive points have equal shape_dist_traveled but different lat/lon coordinates in shapes.txt. Based on the Haversine formula, the distance between these two points is approximately 0.5702 meters, which is very close to 0. The getDistance method that is used by the GTFS validator is from com.google.common.geometry.S2LatLng. It may do some internal rounding or have precision limitations, and might not handle very close points accurately.
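For anyone who wants to check such a pair independently, here is a minimal Haversine sketch. It assumes a spherical Earth with a mean radius of 6,371,000 m (an assumption of this example, not necessarily what the validator or S2 uses), so the exact figure depends on that choice; for the two points above it gives a sub-meter distance.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# The pair of shape points quoted in the comment above.
d = haversine_m(48.36919, -124.63073, 48.36919, -124.63074)
print(f"{d:.4f} m")
```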

I prefer downgrading equal_shape_distance_diff_coordinates to a warning rather than searching for substitute geometry libraries. @isabelle-dr @emmambd

qcdyx avatar Sep 25 '23 18:09 qcdyx

@qcdyx Hey Jingsi! The goal right now is to conduct analytics on when equal_shape_distance_diff_coordinates is triggered, in order to make a decision. It's too soon to decide about severity at the moment without doing an evaluation of the Mobility Database feeds that we use in acceptance tests, as specified in the next step heading here.

The getDistance method that is used by the GTFS validator is from com.google.common.geometry.S2LatLng. It may do some internal rounding or have precision limitations, and might not handle very close points accurately.

This is interesting! I believe up to this point we thought that the validator was using the actual shape_dist_traveled points defined by the feeds, not doing any additional interpretation. Let's talk about this more offline and then I'll circle back here to document next steps.

emmambd avatar Sep 25 '23 18:09 emmambd

An update on our approach on this PR: https://github.com/MobilityData/gtfs-validator/pull/1675#issuecomment-1955249386

The current approach is to implement a threshold of 1.11 m on distances between shape point pairs for the ERROR (to capture any "same" values that result from precision/rounding issues at 5 decimal places for lat/long values), and create a WARNING for any distances that are less than that.
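As a rough illustration of that rule (not the code from the PR), the sketch below applies the threshold to a pair that already has equal shape_dist_traveled and different coordinates. The function name is hypothetical, and treating a distance of exactly 1.11 m as an ERROR is an assumption, since the boundary case is not spelled out here. The last lines show where a figure of about 1.11 m comes from: 0.00001 degrees of latitude on a sphere with an assumed radius of 6,371,000 m.

```python
import math

THRESHOLD_M = 1.11  # threshold in meters, as stated in the comment above


def severity_for_equal_distance(actual_distance_m: float) -> str:
    """Severity for a pair with equal shape_dist_traveled and different coordinates.

    Assumption: a distance exactly at the threshold counts as ERROR.
    """
    return "ERROR" if actual_distance_m >= THRESHOLD_M else "WARNING"


# Length of 0.00001 degrees of latitude, assuming a 6,371,000 m Earth radius.
five_decimal_step_m = math.radians(1e-5) * 6_371_000.0
print(f"{five_decimal_step_m:.3f} m")
```

Under this sketch, two points that differ only in the fifth decimal place of one coordinate would be reported as a WARNING, while points several meters apart with the same shape_dist_traveled would remain an ERROR.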

We plan to include this in the upcoming release, and will only take next steps to make this threshold more permissive if we receive user feedback on it.

emmambd avatar Feb 20 '24 23:02 emmambd

I'm going to close this issue based on #1675 and re-open it if there is new community feedback after this release that indicates we should make the threshold more permissive.

emmambd avatar Mar 08 '24 20:03 emmambd