gtfs-editor
Improve pattern inference
Patterns are meant to generalize a set of trips. A separate pattern for a school tripper (a trip that makes one extra stop at a school at bell times) defeats that purpose. However, the current GTFS importer creates a pattern for each unique stop sequence. We can improve this by merging similar patterns, which only requires a similarity metric; Levenshtein distance or Damerau-Levenshtein distance would probably be appropriate. Of the two, I lean towards Levenshtein: intuitively, trips ABCD and ACBD are more different than ABCD and ABD (A, B, ... are stops). Levenshtein distance reflects this (2 versus 1), whereas Damerau-Levenshtein counts the transposition as a single edit and scores both pairs as 1.
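To make the comparison concrete, here is a minimal sketch of both metrics applied to stop sequences. The function names are hypothetical and not part of gtfs-editor, and the second function implements the restricted (optimal string alignment) variant of Damerau-Levenshtein, which is an assumption about which variant would be used.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance counting insertions, deletions and substitutions of stops."""
    prev = list(range(len(b) + 1))
    for i, stop_a in enumerate(a, start=1):
        curr = [i]
        for j, stop_b in enumerate(b, start=1):
            cost = 0 if stop_a == stop_b else 1
            curr.append(min(prev[j] + 1,          # delete stop_a
                            curr[j - 1] + 1,      # insert stop_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein: also allows swapping two adjacent stops."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent transposition counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(levenshtein("ABCD", "ACBD"), levenshtein("ABCD", "ABD"))    # 2 1
print(osa_distance("ABCD", "ACBD"), osa_distance("ABCD", "ABD"))  # 1 1
```

Strings stand in for stop sequences here; lists or tuples of stop IDs work the same way.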
It may make sense to scale the distance by the length of the trip, perhaps by the average length of the trips being compared. A single insertion in a three-stop trip is more significant than one in a twenty-stop trip.
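One possible way to do the scaling, sketched under the assumption that we divide by the mean of the two trip lengths (the function name `scaled_distance` is hypothetical):

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance counting insertions, deletions and substitutions of stops."""
    prev = list(range(len(b) + 1))
    for i, stop_a in enumerate(a, start=1):
        curr = [i]
        for j, stop_b in enumerate(b, start=1):
            cost = 0 if stop_a == stop_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def scaled_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Levenshtein distance divided by the average length of the two trips."""
    mean_len = (len(a) + len(b)) / 2
    if mean_len == 0:
        return 0.0  # two empty sequences are identical
    return levenshtein(a, b) / mean_len


# One extra stop "X" inserted into a three-stop trip and into a twenty-stop trip.
short_trip = ["A", "B", "C"]
long_trip = [f"S{i}" for i in range(20)]
short_scaled = scaled_distance(short_trip, ["A", "B", "X", "C"])
long_scaled = scaled_distance(long_trip, long_trip[:10] + ["X"] + long_trip[10:])
print(short_scaled > long_scaled)  # True
```

Both pairs have a raw distance of 1, but the scaled value is much larger for the short trip, which matches the intuition above.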
Once we upgrade to the new GTFS loader, we can use its pattern detection algorithms to find all of the unique stop sequences, and then calculate distances only between patterns rather than between individual trips.
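A rough sketch of that two-step idea follows. It does not use the new GTFS loader's API; the grouping step stands in for its pattern detection, and the function names, the input shape (trip ID mapped to a list of stop IDs) and the 0.3 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance counting insertions, deletions and substitutions of stops."""
    prev = list(range(len(b) + 1))
    for i, stop_a in enumerate(a, start=1):
        curr = [i]
        for j, stop_b in enumerate(b, start=1):
            cost = 0 if stop_a == stop_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def unique_stop_sequences(trips: Dict[str, List[str]]) -> Dict[Tuple[str, ...], List[str]]:
    """Group trip IDs by their exact stop sequence (one group per current pattern)."""
    patterns = defaultdict(list)
    for trip_id, stops in trips.items():
        patterns[tuple(stops)].append(trip_id)
    return dict(patterns)


def similar_pattern_pairs(patterns, max_scaled_distance: float = 0.3):
    """Return pattern pairs whose length-scaled distance is within the threshold."""
    pairs = []
    for p, q in combinations(patterns, 2):
        mean_len = (len(p) + len(q)) / 2
        scaled = levenshtein(p, q) / mean_len if mean_len else 0.0
        if scaled <= max_scaled_distance:
            pairs.append((p, q, scaled))
    return pairs


# Hypothetical feed: t3 is a school tripper with one extra stop "S".
trips = {
    "t1": ["A", "B", "C", "D"],
    "t2": ["A", "B", "C", "D"],
    "t3": ["A", "B", "S", "C", "D"],
    "t4": ["X", "Y", "Z"],
}
patterns = unique_stop_sequences(trips)
candidates = similar_pattern_pairs(patterns)
for p, q, scaled in candidates:
    print(p, q, round(scaled, 3))
```

Four trips collapse to three unique stop sequences, so only three pairwise comparisons are needed instead of six, and only the school tripper is flagged as a merge candidate for the main pattern.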