gtfsrouter

transfers.txt optional

Open polettif opened this issue 5 years ago • 26 comments

When I try to run extract_gtfs("sample-feed-fixed.zip") with this sample feed, I get the following error because transfers.txt is missing:

Error in extract_gtfs("sample-feed-fixed.zip") : 
  sample-feed-fixed.zip does not appear to be a GTFS file; it must minimally contain
  calendar, routes, stop_times, transfers, trips

While I'm aware that routing on a feed without transfers might not yield very usable results, transfers.txt is not a required file for a GTFS feed.
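A minimal sketch of a check that treats transfers.txt as optional. The helper name `check_gtfs_tables` is hypothetical and not part of gtfsrouter, and the required set is simply the list from the error message above with "transfers" removed, which is an assumption about what the package needs rather than a statement of the GTFS specification.

```r
# Hypothetical helper, not part of gtfsrouter: check the file names of a feed,
# treating "transfers" as optional rather than required.
check_gtfs_tables <- function (file_names) {
    tables <- tools::file_path_sans_ext (basename (file_names))
    # Assumed required set: the list from the error message, minus "transfers"
    required <- c ("calendar", "routes", "stop_times", "trips")
    missing <- setdiff (required, tables)
    if (length (missing) > 0) {
        stop ("Feed is missing required tables: ",
              paste (missing, collapse = ", "))
    }
    has_transfers <- "transfers" %in% tables
    if (!has_transfers) {
        warning ("This feed contains no transfers.txt")
    }
    invisible (has_transfers)
}

# File names as returned by, e.g., unzip ("feed.zip", list = TRUE)$Name
files <- c ("agency.txt", "calendar.txt", "routes.txt",
            "stop_times.txt", "stops.txt", "trips.txt")
res <- suppressWarnings (check_gtfs_tables (files))
```

With this shape, a missing transfers.txt only produces a warning, while a feed lacking one of the other tables still stops with an error.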

polettif avatar May 10 '19 15:05 polettif

Great catch, thanks!! I'll fix asap

mpadge avatar May 10 '19 19:05 mpadge

@polettif those commits fix the main issue, but note the following still occurs with your test feed:

library (gtfsrouter)
filename <- list.files (".", full.names = TRUE, pattern = "sample")
g <- extract_gtfs (filename)
#> Warning: This feed contains no transfers.txt
summary (g)
#> A gtfs object with the following tables and respective numbers of entries in each:
#>          agency  calendar_dates        calendar fare_attributes 
#>               1               1               2               2 
#>      fare_rules     frequencies          routes          shapes 
#>               4              11               5               0 
#>      stop_times           stops           trips 
#>              28               9              11
#> Note: This feed contains no transfers.txt table
gt <- gtfs_timetable (g)
#> Day not specified; extracting timetable for Thursday
#> Error in rcpp_make_timetable(gtfs$stop_times, stop_ids, trip_ids): _Map_base::at

Created on 2019-05-16 by the reprex package (v0.2.1)

And that error is because the feed is in some way malformed, not because there is no "transfers.txt".

mpadge avatar May 16 '19 08:05 mpadge

> And that error is because the feed is in some way malformed, not because there is no "transfers.txt".

Hm, it's basically Google's official sample feed (although fixed). The only thing making it malformed is an additional/custom column drop_off_time in stop_times.txt which isn't exactly standard but shouldn't impact the routing, IMO.

polettif avatar May 16 '19 15:05 polettif

Thanks, then I'll re-open as a prompt to investigate further ...

mpadge avatar May 16 '19 15:05 mpadge

Getting the same Error in rcpp_make_timetable(gtfs$stop_times, stop_ids, trip_ids) : _Map_base::at error when I try to run routing. Is there any particular table I should look at to fix this?

(This feed works fine on tidytransit, though)

sridharraman avatar Aug 12 '19 12:08 sridharraman

Hi Mark! An example of a valid GTFS feed without the transfers.txt file can be found in Vienna (download through this link).

Loading it with the gtfsrouter package goes without problem, and does indeed give a warning.

library(gtfsrouter)
packageVersion('gtfsrouter')
[1] ‘0.0.1.3’
download.file('https://go.gv.at/l9gtfs', 'gtfs.zip')
gtfs = extract_gtfs('gtfs.zip')
Warning message:
This feed contains no transfers.txt 
summary(gtfs)
A gtfs object with the following tables and respective numbers of entries in each:
        agency       calendar calendar_dates         routes         shapes 
             2             98           4305            186         165826 
         stops     stop_times          trips 
          4411        2046111         113219 
Note: This feed contains no transfers.txt table

However, routing with gtfs_route() fails. I don't completely understand the error message, but guess that it is because of the missing transfers.txt file.

gtfs = gtfs_timetable(gtfs, day = 3)
gtfs_route(gtfs, from = "Praterstern", to = "Seestadt", start_time = 12 * 3600)
Error in rcpp_csa(gtfs$timetable, gtfs$transfers, nrow(gtfs$stop_ids),  : 
  Index out of bounds: [index='from_stop_id'].

Is it right that the routing depends on the transfer file, or is there a way to route also without it (and accepting that the result is less usable)?

Full reproducible example:

download.file('https://go.gv.at/l9gtfs', 'gtfs.zip')
library(gtfsrouter)
gtfs = extract_gtfs('gtfs.zip')
gtfs = gtfs_timetable(gtfs, day = 3)
gtfs_route(gtfs, from = "Praterstern", to = "Seestadt", start_time = 12 * 3600)

Edit: this relates also to issue #16 , I see now

luukvdmeer avatar Oct 29 '19 15:10 luukvdmeer

Thanks for the Q and reprex @luukvdmeer, and nice to have you chiming in here. It should be possible to route without a transfers.txt file, even though results may likely be garbage, as you indicate. The package should definitely do better than the useless error message given above. Thanks for the reprex - I'll report back with a fix asap

mpadge avatar Oct 29 '19 19:10 mpadge

I tried to solve this by creating a really simple transfers table that just links each stop to itself, with the following code:

transfers = gtfs$stops[,1]

transfers$from_stop_id = gtfs$stops[,1]
transfers$to_stop_id = gtfs$stops[,1]
transfers$transfer_type = c(2)
transfers$min_transfer_time = c(180) #3 min of transfer time each stop
transfers$from_route_id = c('')
transfers$to_route_id = c('')
transfers$from_trip_id = c('')
transfers$to_trip_id = c('')

gtfs$transfers = transfers

Afterwards generating a route (e.g. @luukvdmeer 's route from Praterstern to Seestadt) works just fine. The result then looks like this:

route_name        trip_name        stop_name arrival_time departure_time
1          U2 Wien Seestadt      Praterstern     12:07:00       12:07:00
2          U2 Wien Seestadt     Messe-Prater     12:09:00       12:09:00
3          U2 Wien Seestadt           Krieau     12:10:00       12:10:00
4          U2 Wien Seestadt          Stadion     12:11:00       12:11:00
5          U2 Wien Seestadt      Donaumarina     12:13:00       12:13:00
6          U2 Wien Seestadt Donaustadtbrücke     12:14:00       12:14:00
7          U2 Wien Seestadt          Stadlau     12:16:00       12:16:00
8          U2 Wien Seestadt      Hardeggasse     12:17:00       12:17:00
9          U2 Wien Seestadt      Donauspital     12:19:00       12:19:00
10         U2 Wien Seestadt     Aspernstraße     12:20:00       12:20:00
11         U2 Wien Seestadt   Hausfeldstraße     12:22:00       12:22:00
12         U2 Wien Seestadt      Aspern Nord     12:24:00       12:24:00
13         U2 Wien Seestadt         Seestadt     12:26:00       12:26:00

But since this route doesn't really require transfers we still need to validate it with a route that requires transferring. I then tried it with a route between Hauptbahnhof and Seestadt which fails with the following error:

Error in gtfs_route1(gtfs_cp, start_stns, end_stns, start_time, include_ids,  : 
  No route found between the nominated stations

This might be because linking only stops with the same id is not really a good idea: there are different stops that have the same name but different ids, as shown by:

> rows = grep('.*Hauptbahnhof.*', gtfs$stops$stop_name, perl=T)
> gtfs$stops[rows,]
            stop_id          stop_name stop_lat stop_lon
 1: at:49:1349:0:14   Hauptbahnhof S+U 48.18613 16.37472
 2: at:49:1349:0:17   Hauptbahnhof S+U 48.18618 16.37455
 3:  at:49:1349:0:2   Hauptbahnhof S+U 48.18608 16.37475
 4: at:49:1349:0:24       Hauptbahnhof 48.18512 16.37392
 5: at:49:1349:0:26   Hauptbahnhof S+U 48.18603 16.37531
 6: at:49:1349:0:30   Hauptbahnhof S+U 48.18592 16.37498
 7:  at:49:1349:0:5       Hauptbahnhof 48.18544 16.37337
 8:  at:49:1349:0:6   Hauptbahnhof S+U 48.18613 16.37460
 9:  at:49:1358:0:1   Hauptbahnhof Süd 48.18442 16.37560
10:   at:49:905:0:1 Hauptbahnhof Ost S 48.18451 16.38060
11:   at:49:905:0:2 Hauptbahnhof Ost S 48.18452 16.38049
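One possible way around this, sketched below under stated assumptions, is to link all stops that share a stop_name rather than only each stop with itself. The helper `same_name_transfers` and the toy stops table are invented for illustration; the flat 180 seconds mirrors the value used in the code above.

```r
# Toy stops table; the ids and names are invented for illustration only.
stops <- data.frame (
    stop_id = c ("A:1", "A:2", "A:3", "B:1"),
    stop_name = c ("Central", "Central", "Central", "Harbour"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: all ordered pairs of stops that share a stop_name,
# including each stop paired with itself.
same_name_transfers <- function (stops, transfer_time = 180) {
    # Self-join on stop_name gives every from/to combination within a name
    pairs <- merge (stops, stops, by = "stop_name",
                    suffixes = c ("_from", "_to"))
    data.frame (
        from_stop_id = pairs$stop_id_from,
        to_stop_id = pairs$stop_id_to,
        transfer_type = rep (2, nrow (pairs)),
        min_transfer_time = rep (transfer_time, nrow (pairs)), # flat value
        stringsAsFactors = FALSE
    )
}

transfers <- same_name_transfers (stops)
nrow (transfers)
```

Exact name matching would still not link "Hauptbahnhof" with "Hauptbahnhof S+U" in the listing above, so a distance-based rule is probably needed as well.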

tuesd4y avatar Oct 31 '19 11:10 tuesd4y

The demo code in #24 from @stmarcin shows the possibility of constructing a workable "transfers.txt" file where none exists. The ability to do this should be incorporated as part of this issue. That code is very inefficient, but just has to be translated into a C++ version and the job will be done.

mpadge avatar Apr 17 '20 09:04 mpadge

Below is some first-cut full code for creating a "transfers.txt" file. The inefficiencies mentioned above have been overcome, but this still takes a long time to calculate (like several tens of minutes), mostly because it downloads the full street network and calculates pedestrian walking times between all pairs of potential transfer locations. The graph included below shows that these are only weakly related to direct (straight-line) distances, suggesting that street-routed walking times really should be used. The question is whether anybody would ever use a function that takes so long to run? On the other hand, there does not appear to be any other software at all that does this, and so it would likely be both useful and important.

This code extends from #24, and uses the Warsaw GTFS discussed there. (Note also that Warsaw is a spatially big city, so most cities would probably not take nearly as long.)

library (gtfsrouter)
library (dodgr)

gtfs <- extract_gtfs("warsaw-gtfs.zip") # pre-downloaded as in issue #24
#> Warning: This feed contains no transfers.txt
#> Warning in data.table::fread(f, integer64 = "character"): Detected 7 column
#> names but the data has 8 columns (i.e. invalid file). Added 1 extra default
#> column name for the first column which is guessed to be row names or an index.
#> Use setnames() afterwards if this guess is not correct, or fix the file write
#> command that created the file to create a valid file.
#> Warning in data.table::fread(f, integer64 = "character"): Discarded
#> single-line footer: <<"Bus shapes (under ODbL licnese): © OpenStreetMap
#> contributors",pl,0,0,1,1,"https://www.openstreetmap.org/copyright">>
stops <- gtfs$stops
d <- geodist::geodist (stops [, c ("stop_lon", "stop_lat")], measure = "haversine")
# join service numbers on to stop table, so we can select only those stops that
# are part of different services

# get service IDs for each stop:
stop_service <- gtfs$stop_times [, c ("trip_id", "stop_id")]
stop_service <- stop_service [!duplicated (stop_service), ]
stop_service$services <- gtfs$trips$service_id [match (stop_service$trip_id,
                                                       gtfs$trips$trip_id)]
stop_service$trip_id <- NULL
stop_service <- stop_service [which (!duplicated (stop_service)), ]

# list of transfer stations within defined distance limit, only including
# stops that are on different services
dist_limit <- 200 # in metres
transfers <- pbapply::pblapply (seq (nrow (gtfs$stops)), function (i) {
                         stopi <- gtfs$stops$stop_id [i]
                         nbs <- gtfs$stops$stop_id [which (d [i, ] < dist_limit)]
                         services <- unique (stop_service$services [stop_service$stop_id == stopi])
                         service_stops <- unique (stop_service$stop_id [stop_service$services %in% services])
                         nbs [which (!nbs %in% service_stops)]
     })
names (transfers) <- stops$stop_id
index <- which (vapply (transfers, function (i) length (i) > 0, logical (1)))
transfers <- transfers [index]
for (i in seq (transfers)) {
    transfers [[i]] <- cbind (from = names (transfers) [i],
                              to = transfers [[i]])
}
transfers <- data.frame (do.call (rbind, transfers), stringsAsFactors = FALSE)
transfers$from_lon <- stops$stop_lon [match (transfers$from, stops$stop_id)]
transfers$from_lat <- stops$stop_lat [match (transfers$from, stops$stop_id)]
transfers$to_lon <- stops$stop_lon [match (transfers$to, stops$stop_id)]
transfers$to_lat <- stops$stop_lat [match (transfers$to, stops$stop_id)]
transfers <- transfers [which (transfers$from != transfers$to), ]

# Get streetnet for pedestrian routing:
stopxy <- stops [, c ("stop_lon", "stop_lat")]
whw <- dodgr_streetnet_sc (pts = stopxy, quiet = FALSE)
net <- weight_streetnet (whw, wt_profile = "foot")
#> Loading required namespace: dplyr
v <- dodgr_vertices (net)
from_ids <- v$id [match_points_to_graph (v, transfers [, c ("from_lon", "from_lat")])]
to_ids <- v$id [match_points_to_graph (v, transfers [, c ("to_lon", "to_lat")])]

# Then get walking times between stops. The input are pairwise lists of
# origins and destinations, which the `dodgr_times` function does not (yet)
# handle, so set network distances to time values to return times instead of
# (default) distances. This will still include effects of waiting times at traffic lights,
# crossing large roads, and stuff like that.
net$d <- net$time
net$d_weighted <- net$time_weighted
system.time ({
    x <- dodgr_dists (net,
                      from = from_ids,
                      to = to_ids,
                      pairwise = TRUE)
})
#>     user   system  elapsed 
#> 4241.448    8.849  588.957

transfers$d_direct <- geodist::geodist (transfers [, c ("from_lon", "from_lat")],
                                        transfers [, c ("to_lon", "to_lat")],
                                        paired = TRUE,
                                        measure = "haversine")
transfers$d <- x [, 1]

# Visualisation:
mod <- summary (lm (transfers$d ~ transfers$d_direct))
r2 <- round (mod$r.squared, digits = 2)
slope <- round (mod$coefficients [2], digits = 2)
library (ggplot2)
theme_set (theme_minimal ())
ggplot (transfers, aes (x = d_direct, y = d)) +
    geom_point (col = "orange", pch = 1) +
    geom_smooth (formula = y ~ x, method = "lm") +
    xlab ("direct distance (m)") +
    ylab ("network-routed time (s)") +
    ggtitle (paste0 ("R2 = ", r2, "; slope = ", slope))
#> Warning: Removed 162 rows containing non-finite values (stat_smooth).
#> Warning: Removed 162 rows containing missing values (geom_point).


# Finally, generate the desired "transfers" table:
transfers <- data.table::data.table (from_stop_id = transfers$from,
                                     to_stop_id = transfers$to,
                                     transfer_type = 2,
                                     min_transfer_time = ceiling (transfers$d))
transfers
#>       from_stop_id to_stop_id transfer_type min_transfer_time
#>    1:       100101      1231m             2               223
#>    2:       100101    1231mE3             2               208
#>    3:       100101    1231mE4             2               192
#>    4:       100101    1231mE5             2               230
#>    5:       100103    1231mE3             2               153
#>   ---                                                        
#> 6372:      1140mE4      1140m             2               110
#> 6373:      1140mE4    1140mP2             2                94
#> 6374:      1140mE4    1140mE1             2               148
#> 6375:      1140mE4    1140mE2             2               173
#> 6376:      1140mE4    1140mE3             2                63

Created on 2020-04-17 by the reprex package (v0.3.0)

Any comments from any of those involved in this discussion (@polettif @luukvdmeer @sridharraman @tuesd4y @stmarcin @tbuckl) would be greatly appreciated. Even better, could you please copy-paste the following bit, and insert :+1: / :-1: / anything else to convey your opinions:


  • [ ] Would it be useful to offer the ability to construct "transfers.txt" tables for feeds that lack these?
  • [ ] Would that be useful even if it took > 10 minutes for a table to be processed?
  • [ ] Would it be better to just use direct distances, allowing a table to be constructed in < 1 minute or so, but with potentially inaccurate transfer times?

Note in regard to the last one that there is no standard describing how transfer times should be calculated, and the kinds of times calculated above are likely more accurate than those provided by most feeds anyway. Feeds typically just state multiples of 60 or 100 seconds, and the SD around the average slope in the above graph is 64 seconds, so around that value anyway, suggesting direct distances may be good enough?

The street network distance calculations alone take nearly 10 minutes here, with pre-processing of that network also taking several minutes, so the whole thing is likely up towards 20 minutes. Using direct distances, the whole table can be constructed in around 20 seconds.
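For comparison, the direct-distance variant can be sketched in base R alone. This is not the code used above: the helper names `haversine` and `direct_transfers` are hypothetical, the walking speed of 1.2 m/s is an assumption, and unlike the code above this sketch does not exclude stops that are on the same service.

```r
# Haversine distance in metres between paired lon/lat points (base R only).
haversine <- function (lon1, lat1, lon2, lat2, r = 6371000) {
    to_rad <- pi / 180
    dlat <- (lat2 - lat1) * to_rad
    dlon <- (lon2 - lon1) * to_rad
    a <- sin (dlat / 2) ^ 2 +
        cos (lat1 * to_rad) * cos (lat2 * to_rad) * sin (dlon / 2) ^ 2
    2 * r * asin (pmin (1, sqrt (a)))
}

# Hypothetical helper: pairs of distinct stops closer than d_limit metres,
# with transfer times derived from an assumed walking speed (m/s).
direct_transfers <- function (stops, d_limit = 200, walk_speed = 1.2) {
    n <- nrow (stops)
    idx <- expand.grid (from = seq_len (n), to = seq_len (n))
    idx <- idx [idx$from != idx$to, ]
    d <- haversine (stops$stop_lon [idx$from], stops$stop_lat [idx$from],
                    stops$stop_lon [idx$to], stops$stop_lat [idx$to])
    keep <- which (d < d_limit)
    data.frame (
        from_stop_id = stops$stop_id [idx$from [keep]],
        to_stop_id = stops$stop_id [idx$to [keep]],
        transfer_type = rep (2, length (keep)),
        min_transfer_time = ceiling (d [keep] / walk_speed),
        stringsAsFactors = FALSE
    )
}

# Three of the Vienna Hauptbahnhof stops listed earlier in this thread
stops <- data.frame (
    stop_id = c ("at:49:1349:0:14", "at:49:1349:0:17", "at:49:905:0:1"),
    stop_lon = c (16.37472, 16.37455, 16.38060),
    stop_lat = c (48.18613, 48.18618, 48.18451),
    stringsAsFactors = FALSE
)
tt <- direct_transfers (stops)
```

The full pairwise `expand.grid` is fine for a handful of stops but scales quadratically, so a real feed would need a distance matrix or spatial index instead.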

mpadge avatar Apr 17 '20 14:04 mpadge

Sorry, I missed your edit of the previous post @mpadge.

Very interesting comparison between the direct distance and network distance approach! I agree that the GTFS specification is very vague about how to 'interpolate' transfers when no transfers.txt file is present. From the specification:

> When calculating an itinerary, GTFS-consuming applications interpolate transfers based on allowable time and stop proximity. Transfers.txt specifies additional rules and overrides for selected transfers.

Personally, if I had to choose, I would prefer the faster solution with Euclidean distances. Either way the result will be an approximation. When the transfers.txt file is missing, I already accept that transfers will not be handled 100% correctly. I think I would then prefer a faster, easier solution over a more complicated one that might lead to a (slightly) better approximation in some cases. In other cases the Euclidean distance might even be better. In the Vienna GTFS, for example, the U-Bahn and S-Bahn tracks within the same building are modeled as two different stops. Will network routing with OpenStreetMap data then yield correct transfer times, or will it give a transfer time of 0?

However, I do think the network approach is very interesting, and as you said, there seems to be no one offering something like that yet. Therefore, I do think it is a useful option, even if it takes long. You only have to do this step when the GTFS files have been updated, right? Might it be possible to offer both options? For example by means of a distance_type argument that can be set to euclidean or network? The heavier dependencies like dodgr can then be listed under Suggests in the DESCRIPTION file (probably that was already the plan either way).
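A rough sketch of what such an interface could look like. The function `make_transfers` and its `distance_type` argument are hypothetical and not part of gtfsrouter; the point is only to show how an optional dependency listed under Suggests is usually guarded with `requireNamespace()`.

```r
# Hypothetical interface sketch; `make_transfers` and its arguments are not
# part of gtfsrouter.
make_transfers <- function (gtfs, distance_type = c ("euclidean", "network")) {
    # match.arg() picks the first value as default and rejects anything else
    distance_type <- match.arg (distance_type)
    if (distance_type == "network" &&
        !requireNamespace ("dodgr", quietly = TRUE)) {
        stop ("distance_type = 'network' requires the 'dodgr' package")
    }
    # The actual distance calculations are omitted from this sketch
    message ("Constructing transfers with ", distance_type, " distances")
    invisible (distance_type)
}

res <- make_transfers (list ())
```

The fast Euclidean option then works with no extra packages installed, and the network option only fails, with a clear message, when dodgr is absent.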

luukvdmeer avatar May 04 '20 07:05 luukvdmeer

Thanks @mpadge for the comparison (and sorry for the late reaction). Calculating network distances for transfers would be a nice option (and I agree with @luukvdmeer that it would be great to have it), but I think that Euclidean distances should be enough. There are more factors that determine transfer time besides network distance:

  • traffic lights and their duration;
  • delays related to crowds (it can be a real issue);
  • time spent validating your ticket (in particular when combined with the above);
  • transfers between floors in the case of metro and/or trains (e.g. in some stations in Madrid you may spend a couple of minutes between metro stops, even though you don't change your geographical position, nor is it reflected in the traveled network distance);
  • (imprecise) location of the stop/platform - in some cases it's the center of the platform, while you only need to reach its beginning, etc.

There is no way to be 100% precise. With Euclidean distances, the biggest issue arises with a river or other barrier (two stops within the threshold, but you cannot actually transfer between them because there is no bridge). So I would opt for the fastest option (which should give broadly comparable results). Having a network-based option (for cases like the example I described above) would be great.

stmarcin avatar Jun 01 '20 20:06 stmarcin

Hey @mpadge sorry for the delay here. Anyway, my thoughts below, for what it's worth.

> Any comments from any of those involved in this discussion (@polettif @luukvdmeer @sridharraman @tuesd4y @stmarcin @tbuckl) would be greatly appreciated. Even better, could you please copy-paste the following bit, and insert 👍 / 👎 / anything else to convey your opinions:

  • [YES] Would it be useful to offer the ability to construct "transfers.txt" tables for feeds that lack these?
  • [YES] Would that be useful even if it took > 10 minutes for a table to be processed?
  • [SURE] Would it be better to just use direct distances, allowing a table to be constructed in < 1 minute or so, but with potentially inaccurate transfer times?

For example, could you use this function to build a transfers.txt for the entire bay area using this aggregate feed and then share with a group like Seamless?

tbuckl avatar Jun 22 '20 22:06 tbuckl

I missed this as well, sorry. I haven't used feeds without any transfers yet, and when I need to add some missing transfers (which might simply be walks from the initial stop) I simply use Euclidean distances, with distance as transfer time. Walking speeds are generally around 1.2 m/s, but then again the actual distance on a network is also >1 times the Euclidean distance, so those factors basically cancel out. Stop pairs that are on the same trip are not considered.
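The cancellation can be checked with a few lines of arithmetic. The walking speed of 1.2 m/s is the value quoted above; the detour factor of 1.2 is an assumption chosen here for illustration, since the comment only says the factor is greater than 1.

```r
walk_speed <- 1.2     # m/s, the value quoted above
detour_factor <- 1.2  # assumed ratio of network to straight-line distance

d_direct <- c (50, 100, 200)  # straight-line distances in metres

# time (s) = network distance (m) / speed (m/s)
#          = straight-line distance * detour factor / speed
transfer_time <- d_direct * detour_factor / walk_speed
transfer_time
```

When the two factors are equal, the transfer time in seconds is numerically the same as the straight-line distance in metres; with a different detour factor the two would differ by the ratio detour_factor / walk_speed.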

> Would it be useful to offer the ability to construct "transfers.txt" tables for feeds that lack these? Would that be useful even if it took > 10 minutes for a table to be processed?

Generally I'd say it's useful and I guess process time is not that much of an issue because this is a step you generally (hopefully) only need to do once.

> Would it be better to just use direct distances, allowing a table to be constructed in < 1 minute or so, but with potentially inaccurate transfer times?

I'd say this should be the focus, as it doesn't depend on additional network data and can be calculated quickly and easily from the feed itself. Also, the method's shortcomings are immediately obvious and can be taken into consideration, whereas with a network approach results might seem perfect (especially with transfer times in seconds) and problems might be harder to spot and debug.

Just my two cents and I think a function to create transfers using an actual network would be a useful tool, regardless of (initial) process time. As others have said, making it easy to apply it only to a subset of "difficult" stops might be useful.

polettif avatar Jun 23 '20 07:06 polettif

Thanks for the feedback @tbuckl and @polettif. It shall be done! I'll likely implement a default with some kind of network_distances = FALSE, to enable the full network distances when and where desired. And your Q @tbuckl:

> For example, could you use this function to build a transfers.txt for the entire bay area using this aggregate feed and then share with a group like Seamless.

Yeah, absolutely. It would be even better if we could work together on that, so that you would effectively be providing an external test and validation of the procedures. That said, this would be a hell of a challenge because that is such a huge area. But I imagine it would be possible by sub-selecting smaller areas in which adjacent agencies overlap, calculating transfers tables for those, and then aggregating the whole thing at the end. It'll be an interesting case study. Maybe we could start a separate repo to document the process, so that the general procedures are (1) well documented; and (2) readily adapted by others.

I'll flag all relevant commits with this issue number, so you'll see progress as it happens by watching this space.

mpadge avatar Jun 23 '20 07:06 mpadge

That commit implements a new gtfs_transfer_table() function which works like this:

library (gtfsrouter)
library (dodgr)
if (!file.exists ("warsaw-gtfs.zip"))
    download.file("https://mkuran.pl/feed/ztm/ztm-latest.zip", "warsaw-gtfs.zip")

gtfs <- extract_gtfs("warsaw-gtfs.zip")
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> Warning: This feed contains no transfers.txt 
#>   A transfers.txt table may be constructed with the 'gtfs_transfer_table' function
#> ▶ Extracting GTFS feed
#> Warning in data.table::fread(flist[f], integer64 = "character", showProgress =
#> FALSE): Detected 7 column names but the data has 8 columns (i.e. invalid file).
#> Added 1 extra default column name for the first column which is guessed to be
#> row names or an index. Use setnames() afterwards if this guess is not correct,
#> or fix the file write command that created the file to create a valid file.
#> Warning in data.table::fread(flist[f], integer64 = "character", showProgress
#> = FALSE): Discarded single-line footer: <<"Bus shapes (under ODbL licnese):
#> © OpenStreetMap contributors",pl,0,0,1,1,"https://www.openstreetmap.org/
#> copyright">>
#> ✔ Extracted GTFS feed
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds

if (!file.exists ("warsaw-hw.Rds")) {
    net <- dodgr_streetnet_sc (pts = gtfs$stops [, c ("stop_lon", "stop_lat")])
    saveRDS (net, "warsaw-hw.Rds")
} else
    net <- readRDS ("warsaw-hw.Rds")

system.time (  
    tt_direct <- gtfs_transfer_table (gtfs, network_times = FALSE)
)
#> Loading required namespace: geodist
#> ▶ Finding neighbouring services for each stop
#> Loading required namespace: pbapply
#> ✔ Found neighbouring services for each stop
#> The following stages may take some time. Please be patient.
#>    user  system elapsed 
#>  18.285   0.606  18.330
system.time (  
    tt_network <- gtfs_transfer_table (gtfs, network = net, network_times = TRUE)
)
#> ▶ Finding neighbouring services for each stop
#> ✔ Found neighbouring services for each stop
#> ▶ Contracting street network ... 
#> Loading required namespace: dplyr
#> ✔ Contracted street network
#> ▶ Calculating transfer times between 864 pairs of stops
#> ✔ Calculated transfer times between 864 pairs of stops
#>    user  system elapsed 
#> 473.337   2.601 219.809

tt_direct
#>       from_stop_id to_stop_id transfer_type min_transfer_time
#>    1:       100101      1231m             2               172
#>    2:       100101    1231mE3             2               156
#>    3:       100101    1231mE4             2               134
#>    4:       100101    1231mE5             2               171
#>    5:       100103    1231mE3             2               162
#>   ---                                                        
#> 6372:      1140mE4      1140m             2               112
#> 6373:      1140mE4    1140mP2             2                99
#> 6374:      1140mE4    1140mE1             2               156
#> 6375:      1140mE4    1140mE2             2               161
#> 6376:      1140mE4    1140mE3             2                25
tt_network
#>       from_stop_id to_stop_id transfer_type min_transfer_time
#>    1:       100101      1231m             2               223
#>    2:       100101    1231mE3             2               208
#>    3:       100101    1231mE4             2               192
#>    4:       100101    1231mE5             2               230
#>    5:       100103    1231mE3             2               153
#>   ---                                                        
#> 6372:      1140mE4      1140m             2               110
#> 6373:      1140mE4    1140mP2             2                94
#> 6374:      1140mE4    1140mE1             2               148
#> 6375:      1140mE4    1140mE2             2               173
#> 6376:      1140mE4    1140mE3             2                63

Created on 2020-06-24 by the reprex package (v0.3.0)

Those are then the two sets of data plotted above, but with direct distances converted to times. The relationship looks exactly the same, but in this case the slope is 0.93, so direct times are on average 93% of network times. (And note that the network times can be slightly less, because they are assumed to begin and end at closest nodes of street network, and distance between these may indeed be less than between stated coordinates of GTFS feed.)

The code is pretty much as efficient as can be, I think, and results in transfer tables from direct distances taking < 20 seconds, and network distances (using pre-downloaded network) taking < 4 minutes. The area covered by the Warsaw GTFS feed really is huge, extending across 70km or so, so 4 minutes to calculate an accurate transfer table seems quite okay.


All involved here: Please try out this function and report back

Thanks! :+1:

I'll now re-open to ensure this is tested and documented. @tbuckl Maybe we could develop an initial vignette for the SeamlessBayArea case to demonstrate this functionality, and go from there?

mpadge avatar Jun 24 '20 10:06 mpadge

The GTFS for Berlin illustrates one interesting aspect of transfer tables:

library (gtfsrouter)
gtfs <- extract_gtfs ("vbb.zip") # local copy of full feed
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds

fxy <- gtfs$stops [match (gtfs$transfers$from_stop_id, gtfs$stops$stop_id), ]
fxy <- fxy [which (!duplicated (paste0 (fxy$stop_lon, "==", fxy$stop_lat))), ]
txy <- gtfs$stops [match (gtfs$transfers$to_stop_id, gtfs$stops$stop_id), ]
txy <- txy [which (!duplicated (paste0 (txy$stop_lon, "==", txy$stop_lat))), ]
d <- geodist::geodist (fxy, txy, paired = TRUE)
table (d)
#> d
#>     0 
#> 11902

And "transfers.txt" entries are only for stops at identical locations.

hist (gtfs$transfers$min_transfer_time [gtfs$transfers$min_transfer_time < 1000],
      xlab = "transfer time", main = NULL)

The stated transfer times are mostly quite short, although do extend up to 5 minutes or so. The following line uses the new function to create an equivalent table, with the min_transfer_time = 0 parameter causing all in-place transfers to be given a time of 0 (instead of the values in the original table as shown above).

nrow (gtfs$transfers)
#> [1] 95329

transfers <- gtfs_transfer_table (gtfs, d_limit = 200, min_transfer_time = 0)
#> ▶ Finding neighbouring services for each stop
#> Loading required namespace: pbapply
#> ✔ Found neighbouring services for each stop
#> ▶ Expanding to include in-place transfers
#> ✔ Expanded to include in-place transfers
nrow (transfers)
#> [1] 131606

Including transfers between nearby locations adds quite a few possibilities, but ...

hist (transfers$min_transfer_time, xlab = "transfer time", main = NULL)

Created on 2020-06-24 by the reprex package (v0.3.0)

Almost all transfers remain in-place, and the resultant transfer times are generally even shorter than the original values given for in-place transfers (compare scales of plots). While this would seem to suggest that transfer tables are actually best constructed with local knowledge, this is both a pointless observation for any attempt to develop a general means to derive these data, and also perhaps not particularly valid, because these patterns may be unique to Berlin. Certainly @tbuckl's example of the greater Bay Area relies on transferring between various service providers, and these could surely be presumed almost never to share stops at identical locations?

mpadge avatar Jun 24 '20 11:06 mpadge

@mpadge with the fragmented transit system in the bay, there are lots of bad transfers, and the timing between operators is occasionally not optimized. but one thing i'm concerned about is that schedules are massively disrupted in the US right now. for example, SF's transit system hasn't been running trains for months and won't resume for months. i'll take a quick look this weekend at the regional schedule from MTC/interline and see what the calendar file looks like before proceeding.

tbuckl avatar Jul 01 '20 04:07 tbuckl

First of all, thanks a lot for all your effort! This has already really been very helpful.

I also tried out the new implementation to create the transfers.txt

Tl;dr: my main findings were:

  • in general it works well
  • main issues: transfer times too short / not enough penalty for transfers
  • some transfers with too large a walking distance are not included

Conclusion: In my opinion the direct approach is enough to create the transfers.txt. To get more realistic transfer information and travel times, heuristics would help more than actual routing between stop locations (e.g. more transfer time is needed when changing to a Regio. Maybe even larger penalties for using the Regio within the city? Or: changing from U-/S-Bahn to bus needs larger transfer times). Also: reconsider the maximum radius for transferring (maybe more than 200 m?)
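A mode-based heuristic of this kind could be sketched as follows. Everything here is illustrative: the toy table, the `from_mode` / `to_mode` columns (which transfers.txt does not contain, and which would have to be derived from the routes serving each stop) and the penalty values are all assumptions.

```r
# Toy transfers table with the mode of the arriving and departing service.
# All ids, modes and times are invented for illustration.
transfers <- data.frame (
    from_stop_id = c ("s1", "s1", "s2"),
    to_stop_id = c ("s2", "s3", "s3"),
    from_mode = c ("ubahn", "ubahn", "bus"),
    to_mode = c ("ubahn", "bus", "regio"),
    min_transfer_time = c (60, 90, 120),
    stringsAsFactors = FALSE
)

# Hypothetical penalties in seconds, keyed by the mode being boarded.
penalty <- c (ubahn = 0, bus = 60, regio = 180)

# Add the penalty for the boarded mode on top of the distance-based time
transfers$min_transfer_time <- transfers$min_transfer_time +
    unname (penalty [transfers$to_mode])
transfers
```

A lookup keyed on the pair of modes (rather than only the boarded mode) would allow different penalties for, say, U-Bahn to bus and bus to U-Bahn.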

In detail: In my opinion the main advantage of a "network_distance" transfers approach would be to make sure that only reasonable transfers are included (excluding e.g. stops that are close in a straight line but have no connecting road, or a river in between, etc.) rather than getting perfectly accurate transfer times. As you said, those can probably only be achieved with domain knowledge.

Therefore I wanted to test how both approaches compare to the benchmark of the original transfers.txt and OTP.

I used the Berlin feed: from one start point I routed to all stops (as the implementation for multiple "to" stations in gtfs_route is still missing, I used the gtfs_isochrone function with 5 min intervals from 5 to 60 min).

Three variants:

  • original: original transfers.txt
  • direct: new implementation with direct distances
  • net: new implementation with routing to get transfer times

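library(gtfsrouter)
library(dplyr)
library(sf)    # st_read, st_as_sf, st_intersection, st_coordinates
library(dodgr) # dodgr_streetnet_sc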

gtfs_raw <- extract_gtfs ("gtfs.zip")

# within Berlin:
berlin_extent <- st_read('{
"type": "FeatureCollection",
"name": "extent_4326",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { }, "geometry": { 
"type": "Polygon", "coordinates": [ [ 
[ 13.00741692685955, 52.311211358823435 ], 
[ 13.036276188802013, 52.732549914571045 ],
[ 13.85849194732964, 52.70886262115453 ],
[ 13.821850448156555, 52.28778160972464 ],
[ 13.00741692685955, 52.311211358823435 ] ] ] } }
]
}
')
# convert stops to sf points and keep only those within the Berlin extent
stops <- gtfs_raw$stops %>%
  st_as_sf(coords = c("stop_lon", "stop_lat"), crs = 4326)

stops <- st_intersection(stops, berlin_extent) %>%
  select(.data$stop_name, .data$geometry) %>%
  distinct(.keep_all = TRUE)


#### get all traveltimes rounded to 5 mins for all stops reachable in 60 min
traveltimes_per_stop <- function(gtfs) {
  
  iso <- NULL
  
  for (i in seq(5, 60, 5)) {
    print(i)
    iso_temp <- gtfs_isochrone(gtfs, "Alexanderplatz", day = "Tuesday", 
                               start_time = 8 * 3600,
                               end_time = 8 * 3600 + (i * 60)) 
    
    # stops reached within i minutes, labelled with that time limit
    iso_temp_stations <- iso_temp$mid_points %>% rbind(iso_temp$end_points) %>% tibble::add_column (time = i)
    # is.null() rather than is.na(): is.na() on a data frame is not a single TRUE/FALSE
    if (!is.null(iso)) iso <- iso %>% rbind(iso_temp_stations) else iso <- iso_temp_stations
  }
  
  ttimes<- iso %>%
    group_by(stop_id) %>%
    summarise(time = min(time)) %>%
    left_join(gtfs$stops, by = "stop_id")
  
  return(ttimes)
  
}

#### original

gtfs_original <- gtfs_timetable (gtfs_raw, day = "Tuesday")
if (!file.exists ("berlin-original-ttimes.Rds")) {
  ttimes_original <- traveltimes_per_stop(gtfs_original)
  saveRDS (ttimes_original, "berlin-original-ttimes.Rds")
} else
  ttimes_original <- readRDS ("berlin-original-ttimes.Rds")


#### direct

if (!file.exists ("berlin-direct-ttimes.Rds")) {
  gtfs_direct <- rlang::duplicate(gtfs_raw)
  gtfs_direct$stops <- gtfs_direct$stops %>% select(stop_id, stop_lat, stop_lon, stop_name, parent_station)
  tt_direct <- gtfs_transfer_table (gtfs_direct, network_times = FALSE)
  gtfs_direct <- within(gtfs_direct, rm(transfers))
  gtfs_direct$transfers <- tt_direct
  gtfs_direct <- gtfs_timetable (gtfs_direct, day = "Tuesday")
  ttimes_direct <- traveltimes_per_stop(gtfs_direct)

  saveRDS (gtfs_direct, "berlin-gtfs_direct.Rds")
  saveRDS (ttimes_direct, "berlin-direct-ttimes.Rds")
} else {
  ttimes_direct <- readRDS ("berlin-direct-ttimes.Rds")
  gtfs_direct <- readRDS ("berlin-gtfs_direct.Rds")
}

### net
# note: the block as first posted reused the gtfs_original / tt_original names here;
# the gtfs_net / ttimes_net names and "berlin-net-ttimes.Rds" are an assumed correction
if (!file.exists ("berlin-net-ttimes.Rds")) {
  gtfs_net <- rlang::duplicate(gtfs_raw)
  gtfs_net$stops <- gtfs_net$stops %>% select(stop_id, stop_lat, stop_lon, stop_name, parent_station)
  net <- dodgr_streetnet_sc (pts = st_coordinates(stops))
  net_transfers <- gtfs_transfer_table (gtfs_net, network = net, network_times = TRUE)
  # replace the original transfers with the network-based ones
  gtfs_net <- within(gtfs_net, rm(transfers))
  gtfs_net$transfers <- net_transfers
  gtfs_net <- gtfs_timetable (gtfs_net, day = "Tuesday")
  ttimes_net <- traveltimes_per_stop(gtfs_net)
  
  saveRDS (net, "berlin-net.Rds")
  saveRDS (net_transfers, "berlin-net-transfers.Rds")
  saveRDS (ttimes_net, "berlin-net-ttimes.Rds")
} else {
  ttimes_net <- readRDS ("berlin-net-ttimes.Rds")
}

1. Comparison direct and net: with rounded travel times (to 5 min) there are almost no differences in the travel times.

[image]

[image]

The more significant difference is between the original transfers.txt and the new approach.

2. Comparison of direct and original:

There are stops that are reached faster with the original transfers.txt and others that are faster with the new direct implementation (negative: original is faster, positive: direct is faster). [image]

For the most part pretty similar, but on average the direct connections yield faster travel times (mean: 2.655). [image]
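A minimal sketch of how such a per-stop comparison could be computed, assuming two travel-time tables with `stop_id` and `time` columns like those returned by `traveltimes_per_stop()` above. The values below are invented purely for illustration and are not results from the Berlin feed.

```r
# Toy travel-time tables (minutes, rounded to 5); values are invented.
ttimes_original <- data.frame(
  stop_id = c("A", "B", "C", "D"),
  time = c(10, 25, 40, 60)
)
ttimes_direct <- data.frame(
  stop_id = c("A", "B", "C", "E"),
  time = c(10, 20, 45, 30)
)

# Keep only stops reached in both variants, then take original minus direct,
# so that negative values mean "original is faster" and positive values
# mean "direct is faster".
cmp <- merge(ttimes_original, ttimes_direct, by = "stop_id",
             suffixes = c("_original", "_direct"))
cmp$diff <- cmp$time_original - cmp$time_direct

print(cmp)
mean(cmp$diff)
```

Stops reached in only one of the two variants (here "D" and "E") drop out of the merge, so they would need to be inspected separately.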

Where do those differences come from? Looking into single routes shows that the direct implementation establishes transfers that are not listed in the transfers.txt.

E.g. from "Alexanderplatz" to Spandau "Neue Bergstr.": [image]

A transfer from the RE train to the RB10 is found by the direct implementation, which is not listed in the transfers.txt: [image]

[image]

On the other hand, transfers that are listed in the transfers.txt but have straight-line distances > 200 m are not listed as transfers in the new approach.
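To illustrate why a fixed radius drops such transfers, here is a rough sketch of a straight-line distance cut-off in base R. It is not the package's actual implementation; the `haversine_m()` helper, the `max_dist_m` name, and the stop coordinates are all hypothetical, and only the 200 m value comes from the discussion above.

```r
# Great-circle (haversine) distance in metres between lon/lat points.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical stops; coordinates are invented.
stops <- data.frame(
  stop_id = c("s1", "s2", "s3"),
  stop_lon = c(13.4000, 13.4010, 13.4100),
  stop_lat = c(52.5200, 52.5205, 52.5200)
)

max_dist_m <- 200 # straight-line cut-off in metres

# All ordered pairs of distinct stops, with their straight-line distance.
pairs <- expand.grid(from = seq_len(nrow(stops)), to = seq_len(nrow(stops)))
pairs <- pairs[pairs$from != pairs$to, ]
pairs$d <- haversine_m(stops$stop_lon[pairs$from], stops$stop_lat[pairs$from],
                       stops$stop_lon[pairs$to], stops$stop_lat[pairs$to])

# Only pairs within the cut-off become candidate transfers.
keep <- pairs$d <= max_dist_m
transfers <- data.frame(
  from_stop_id = stops$stop_id[pairs$from[keep]],
  to_stop_id = stops$stop_id[pairs$to[keep]],
  distance_m = round(pairs$d[keep])
)
print(transfers)
```

Any pair further apart than `max_dist_m` simply never appears in the table, regardless of whether the operator lists it as a transfer, which is the behaviour described above.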

E.g.: [image] [image]

[image]

3. Comparison to OTP

On average the new direct implementation is closer to the OTP results than the original transfers.txt (mean difference: 0.5), while the routing with the original transfers.txt yields longer travel times on average (-2.3).

Comparison OTP and direct:

[image] [image]

Comparison OTP and original: [image] [image]

Reasons for different times with OTP are usually:

  • OTP faster: transfers are made that are not seen by gtfsrouter
  • OTP slower: larger transfer times, so that some connections are not reached

I hope this was of some help. If you have any questions about that let me know. Also happy to share the code, I just didn't want this to get even longer and more confusing.

AlexandraKapp avatar Jul 06 '20 13:07 AlexandraKapp

Thanks so much for that really detailed consideration @AlexandraKapp - some very helpful insights there. I'd say there are at least two aspects that your analyses suggest could/should be addressed:

  1. More careful consideration of "parent station" relationships in constructing "transfers.txt", and
  2. Allowing a possibility of user-specified hierarchical transfer times based on agency ID values. This would enable, as you suggest, different transfer times from, for example, bus to train, or from local to regional trains. Could easily be done to default to min_transfer_time, but use specified values where given between pairs of agency IDs.
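A rough sketch of what point 2 could look like, assuming transfers have already been tagged with the agency serving each stop. The `overrides` table, the agency IDs, and the column names `from_agency` / `to_agency` are hypothetical and not part of the package; only the idea of falling back to `min_transfer_time` comes from the point above.

```r
min_transfer_time <- 120 # seconds; default for agency pairs without an override

# Hypothetical user-specified transfer times between pairs of agency IDs.
overrides <- data.frame(
  from_agency = c("bus_co", "rail_co"),
  to_agency = c("rail_co", "bus_co"),
  transfer_time = c(300, 240)
)

# Hypothetical transfers, already tagged with the agency at each end.
transfers <- data.frame(
  from_stop_id = c("s1", "s2", "s3"),
  to_stop_id = c("s2", "s3", "s1"),
  from_agency = c("bus_co", "rail_co", "bus_co"),
  to_agency = c("rail_co", "rail_co", "bus_co")
)

# Look up each (from, to) agency pair in the override table;
# pairs without a match get NA and fall back to the default.
key <- function(a, b) paste(a, b, sep = "->")
idx <- match(key(transfers$from_agency, transfers$to_agency),
             key(overrides$from_agency, overrides$to_agency))
transfers$min_transfer_time <- ifelse(is.na(idx), min_transfer_time,
                                      overrides$transfer_time[idx])
print(transfers)
```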

I'll dig into both of those as soon as i have a chance.

mpadge avatar Jul 07 '20 09:07 mpadge

The agency ID approach might only work to a limited degree, as the same agency might operate different modes of transport (e.g. BVG in Berlin has buses and U-Bahn). In Madrid, for example, the agency in the GTFS feed is the same for all modes. Maybe the route_type (routes.txt) could be used? Either to raise the min_transfer_time for any change of mode, or to implement specific times for specific mode changes (might be a bit tedious).
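The route_type idea could be sketched roughly as follows, in base R on toy tables. This is an illustration under assumptions, not existing package functionality: `mode_change_penalty` and all IDs are hypothetical, and a transfer counts as a "change of mode" here only when the two stops share no route_type at all. In GTFS, route_type 1 is subway/metro and 3 is bus.

```r
# Toy GTFS tables; all IDs are invented.
routes <- data.frame(route_id = c("r_bus", "r_ubahn"), route_type = c(3, 1))
trips <- data.frame(trip_id = c("t1", "t2"), route_id = c("r_bus", "r_ubahn"))
stop_times <- data.frame(trip_id = c("t1", "t1", "t2"),
                         stop_id = c("s1", "s2", "s3"))
transfers <- data.frame(from_stop_id = c("s1", "s2"),
                        to_stop_id = c("s2", "s3"),
                        min_transfer_time = c(120, 120))

# Set of route types serving each stop: stop_times -> trips -> routes.
st <- merge(merge(stop_times, trips, by = "trip_id"), routes, by = "route_id")
types_by_stop <- lapply(split(st$route_type, as.character(st$stop_id)), unique)

mode_change_penalty <- 180 # extra seconds when the mode changes (hypothetical)

# TRUE where the two stops of a transfer have at least one route type in common.
shares_mode <- mapply(
  function(from, to) {
    length(intersect(types_by_stop[[from]], types_by_stop[[to]])) > 0
  },
  as.character(transfers$from_stop_id),
  as.character(transfers$to_stop_id),
  USE.NAMES = FALSE
)

transfers$min_transfer_time <- transfers$min_transfer_time +
  ifelse(shares_mode, 0, mode_change_penalty)
print(transfers)
```

Specific times for specific mode changes would need a lookup table keyed on pairs of route types instead of the single penalty used here.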

let me know if I can help with anything :)

AlexandraKapp avatar Jul 07 '20 11:07 AlexandraKapp

another issue that popped up with the custom transfers.txt:

for some stops and start times, all stops in the feed are reported as reachable within 5 minutes. This only happens with the custom transfers.txt, not with the original one:

E.g. with the VBB Feed:

library(gtfsrouter)
library(dplyr)

gtfs <- extract_gtfs ("VBB_gtfs.zip")

# this works as expected:
i <- gtfs_isochrone(gtfs, "Boxhagener Platz", day = "Monday", start_time = 8*3600 + 5*60, end_time = 8*3600 + 10*60)

# isochrone with custom transfers
gtfs_cp <- data.table::copy(gtfs)
# workaround for bug: "find_xy_cols(obj) : Unable to determine longitude and latitude columns; perhaps try re-naming columns."
gtfs_cp$stops <- gtfs_cp$stops %>% select(stop_id, stop_lat, stop_lon, stop_name, parent_station)
transfers <- gtfsrouter::gtfs_transfer_table (gtfs_cp, network_times = FALSE)
gtfs_cp$transfers <- transfers

# this produces the implausible isochrone:
i_cust_trans <- gtfs_isochrone(gtfs_cp, "Boxhagener Platz", day = "Monday", start_time = 8*3600 + 5*60, end_time = 8*3600 + 10*60)

nrow(i$mid_points) # returns 6
nrow(i_cust_trans$mid_points) # returns 122233

# with 5 minutes earlier start time this works just fine:
i_cust_trans2 <- gtfs_isochrone(gtfs_cp, "Boxhagener Platz", day = "Monday", start_time = 8*3600, end_time = 8*3600 + 5*60)
nrow(i_cust_trans2$mid_points) # returns 13

AlexandraKapp avatar Jul 28 '20 09:07 AlexandraKapp

Thanks @AlexandraKapp, but unfortunately i can not reproduce that. And that reflects one important restriction of the GTFS format -- there is no versioning system, and no way of knowing which version you are using, and which version i am using. I'll update my version and see if i can reproduce it, but short of that, may not be able to do much. Maybe you could email me a link to your actual version (not to the site from where you obtained it, because that may have changed), and i could then see whether i could reproduce it?
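One way to at least check whether two people are working from the same feed file is to compare checksums of the zip archives. A minimal sketch using `tools::md5sum()` from base R follows; the temporary file only stands in for a real download such as "VBB_gtfs.zip".

```r
# Stand-in for a downloaded feed; in practice this would be the path to the zip.
feed_path <- tempfile(fileext = ".zip")
writeBin(charToRaw("placeholder feed contents"), feed_path)

# An MD5 checksum plus file size identifies the exact file being used.
checksum <- unname(tools::md5sum(feed_path))
size_bytes <- file.size(feed_path)
cat("md5:", checksum, "size (bytes):", size_bytes, "\n")
```

If both sides report the same checksum, the feeds are byte-identical and any difference in results has to come from elsewhere (package version, arguments, and so on).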

mpadge avatar Jul 28 '20 18:07 mpadge

Whoops, meant to close #38 with that commit, not this issue. Still have to check all of the stuff raised here to ensure it works before closing.

mpadge avatar Aug 13 '20 10:08 mpadge

This all seems good now, but I'll keep it open until I've written a new vignette describing the functionality of gtfs_transfer_table()

mpadge avatar Aug 13 '20 11:08 mpadge

Thank you to everybody who contributed to this issue, and in doing so really helped to improve the package a lot! The whole package is slowly being entirely reprogrammed from the inside out, starting with the new gtfs_traveltimes() function which got debugged extremely thoroughly by @AlexandraKapp, and has finally made it on to CRAN in v0.0.5. The gtfs_transfer_table() function has been fully implemented, but is barely documented at all in its current form. The dedicated vignette issue is now #85, with this issue staying open until that is complete. Finally, note the label on that issue of "Help Wanted" - please keep an eye on that, and contribute wherever and whenever any of you can! Thanks again! mark

mpadge avatar Jun 11 '21 09:06 mpadge