R package (gtfsr) to make working with GTFS (transit data) feeds easy

Open eamcvey opened this issue 8 years ago • 17 comments

GTFS is a standard format for transit data (routes, stops, schedules, etc.): https://developers.google.com/transit/gtfs/ (There is also a real-time version of GTFS; I'm considering it out of scope for now.) If it's easy to work with GTFS data in R, that will facilitate the creation of more sophisticated analysis tools for transit systems built on top of this package.

Get feed data into R:

  • pull in GTFS feed by agency via an API (transitfeeds.com)
  • or read in GTFS from a downloaded file
  • define a sensible object or objects with appropriate data types (spatial points for stops, date-times formatted, etc.)
  • put the feed data into said object(s)
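
A minimal sketch of the "read from a downloaded file" step, using base R only. The helper name `read_gtfs_tables` is hypothetical and not part of any existing package; it assumes the feed is either a .zip archive or a directory of extracted .txt files, and it returns a plain named list of data frames rather than a formal class.

```r
# Hypothetical helper: read every .txt table of a GTFS feed into a named list.
# `path` may be a .zip archive or a directory of already-extracted files.
read_gtfs_tables <- function(path) {
  if (grepl("\\.zip$", path, ignore.case = TRUE)) {
    exdir <- tempfile("gtfs_")
    utils::unzip(path, exdir = exdir)
    path <- exdir
  }
  files <- list.files(path, pattern = "\\.txt$", full.names = TRUE)
  tables <- lapply(files, function(f) {
    # Read every column as character so IDs such as "0012" keep leading zeros;
    # "UTF-8-BOM" drops a byte-order mark if the file starts with one.
    read.csv(f, colClasses = "character", stringsAsFactors = FALSE,
             fileEncoding = "UTF-8-BOM")
  })
  names(tables) <- tools::file_path_sans_ext(basename(files))
  tables
}
```

Columns could then be converted one at a time (for example `as.numeric(feed$stops$stop_lat)`), which keeps the type decisions explicit instead of leaving them to the CSV reader.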

Validate feed and assess data quality (which is often poor):

  • report any required variables that are missing
  • report variables that are included but not defined as required or optional
  • report variables whose contents are invalid (date-time that can't be formatted as date-time, etc.)
  • report on ID problems (ex. trip_id in one file (table) doesn't match trip_id in another)
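
Two of these checks can be sketched in base R, assuming the feed is held as a named list of data frames. The function names and the `required_fields` list are hypothetical, and the list is only an illustrative subset; the GTFS reference remains the authority on which fields are required.

```r
# Illustrative subset of required fields (an assumption, not the full spec).
required_fields <- list(
  routes     = c("route_id", "route_type"),
  trips      = c("route_id", "service_id", "trip_id"),
  stop_times = c("trip_id", "stop_sequence"),
  stops      = c("stop_id")
)

# One row per missing required field; zero rows if nothing is missing.
# A table that is absent from the feed counts as missing all of its fields.
missing_required <- function(feed, required = required_fields) {
  rows <- lapply(names(required), function(tbl) {
    if (is.null(feed[[tbl]])) {
      absent <- required[[tbl]]
    } else {
      absent <- setdiff(required[[tbl]], names(feed[[tbl]]))
    }
    data.frame(table = rep(tbl, length(absent)), field = absent,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# IDs used in a child table that never appear in the parent table.
orphan_ids <- function(child, parent, key) {
  setdiff(unique(child[[key]]), unique(parent[[key]]))
}
```

For example, `orphan_ids(feed$stop_times, feed$trips, "trip_id")` would list trip_ids that stop_times refers to but trips never defines.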

Provide convenience functions for common tasks:

  • map stops
  • select one route and map it
  • summarize a transit system: # of routes, etc.
  • more -- need to do some discovery work to figure out what these should be
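
As a rough sketch of the "summarize a transit system" item, assuming the feed is a named list of data frames with the standard GTFS table and ID column names (the function name `summarize_feed` is hypothetical):

```r
# Count distinct routes, trips, and stops in a feed held as a list of tables.
# A table that is missing from the list simply contributes a count of 0.
summarize_feed <- function(feed) {
  c(
    routes = length(unique(feed$routes$route_id)),
    trips  = length(unique(feed$trips$trip_id)),
    stops  = length(unique(feed$stops$stop_id))
  )
}
```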

Facilitate creation of a GTFS feed from within R:

  • not sure how high a priority this is?

eamcvey avatar Mar 04 '16 15:03 eamcvey

The ability to compare two versions of a gtfs feed from an agency and be shown the differences could be useful -- i.e. to see what changes a transit agency made.
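
A minimal sketch of what such a comparison could look like for a single table, assuming both versions are loaded as data frames and share a key column such as stop_id (the function name `diff_ids` is hypothetical):

```r
# Compare one table across two versions of a feed by its key column.
diff_ids <- function(old, new, key) {
  list(
    added   = setdiff(new[[key]], old[[key]]),  # present only in the new feed
    removed = setdiff(old[[key]], new[[key]])   # present only in the old feed
  )
}
```

This only detects rows that appear or disappear; spotting changed attributes (a renamed or moved stop, say) would need a join on the key followed by a column-by-column comparison.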

eamcvey avatar Mar 04 '16 20:03 eamcvey

I've been working at the intersection of R and GTFS for a bit. I'm glad I came across this! I would be happy to try and push to any part of this project or lead it myself if no one has time.

In Anchorage, AK we use a script that we run continuously during the working hours of our bus line, People Mover, to generate the protocol buffer that is needed for GTFS-RT.

https://github.com/codeforanchorage/api-realtime-bus

We run build_protobuf.R in the R directory continuously. I'm sure there are many inelegant things in how I wrote it, but I wanted to include it as an example of how we generate the GTFS-RT feed using Dirk E's RProtoBuf package. We are using a feed from the existing vendor service that calculates the delays by stop for us, which takes out some of the brainy part of our project.

Thanks for exposing transitfeeds.com to me. Before I was looking at the GTFS Exchange which is another resource.

I've made a little script to process stops.txt and shapes.txt into sp objects before pushing them to PostGIS, which I think is the best platform, imo. Open to anything though. https://gist.github.com/hansthompson/a3d2c710ac8e3584d58. The bits inside this gist that convert them to shapefiles using writeOGR could be useful though.

If PostGIS seems like a good way forward, I would need some help addressing the three concerns I see for this kind of conversion with GTFS that I list at the top of the Gist.

  1. It would be great to use some service to find the best state plane projection (or other) for accurate metric measure of distance instead of WGS84.
  2. I'm not sure what kind of time data type could be used for stop_times.txt that would account for the time of day while also allowing times that run around the clock past 24 hours. (PLEASE SOMEONE WHO KNOWS POSTGRES HELP!)
  3. I'm not yet so good at postgres admin stuff so how could this be created temporarily without admin privileges?
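
On the second concern, one common workaround is to store each time as the number of seconds since the start of the service day, so that a value like 25:10:00 is just a number larger than 86400. A minimal sketch in base R follows; the function name `gtfs_time_to_seconds` is hypothetical, and it assumes times arrive as "HH:MM:SS" strings. On the Postgres side, an integer column holding these seconds, or possibly an `interval` column, may be worth trying.

```r
# Convert GTFS "HH:MM:SS" strings, where HH may exceed 23, into seconds
# since the start of the service day. Malformed or missing values become NA.
gtfs_time_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 3 || anyNA(p)) return(NA_real_)
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3]
  }, numeric(1))
}
```

Keeping the values numeric also makes it easy to filter by time window or to compute travel times between consecutive stops by subtraction.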

On the note of checking errors within the GTFS feed, the google dash is pretty excellent if you want to throw a GTFS feed against it in testing mode. It would be nice to try it outside the google platform though. I would like a mapping function using leaflet for the testing that could show the routes and the expected positions of buses at a specified time during the day.

I'm also interested in getting network analysis involved to show the network (maybe in a given time window?), and also in showing the network analysis parameters spatially once it's done. Here's a pretty rough idea: http://akdata.org/misc/gtfs_network.html. I'm taking a course on network analysis currently and would love to make this an end-of-semester project that could be generalized to any GTFS feed.

Finally, perhaps outside the borders of this project, there is the idea of a delay analysis package that could take the GTFS and the real-time GPS data in some standard format and build a protobuf server to scale real-time updates for google anywhere there is (A) GTFS and (B) GPS on board.

hansthompson avatar Mar 09 '16 19:03 hansthompson

@rustyb has a package called GTFSr that might be a good resource to build off of as well.

https://github.com/rustyb/GTFSr

hansthompson avatar Mar 10 '16 00:03 hansthompson

@hansthompson Thanks, I'm checking out GTFSr! And thanks for all the information, I am digesting it. It would be great to be able to build on existing stuff.

eamcvey avatar Mar 10 '16 16:03 eamcvey

@hansthompson The link to the gist you provided appears to be broken (or I don't have access?)

eamcvey avatar Mar 10 '16 18:03 eamcvey

Sorry. I'm new to Gists. Try this one.

https://gist.github.com/hansthompson/a3d2c710ac8e3584d58c

hansthompson avatar Mar 10 '16 20:03 hansthompson

I can't get the GTFSr vignette to compile. If you get it working would you mind sharing a copy?

hansthompson avatar Mar 10 '16 21:03 hansthompson

Howdy Folks - Thanks for the interest in GTFSr and my apologies for not getting back to you sooner. GTFSr was a wee project for an R course in college.

I've a funny feeling I might not have the actual working version on github. Will dig it out on my machine and get it working again tomorrow.

rustyb avatar Mar 11 '16 11:03 rustyb

Just wanted to make a plug for a package I started for network analysis of GTFS this weekend.

https://github.com/hansthompson/gtfsnetwork

It will convert the GTFS files into an edge list and do some filtering by time and service id.
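
To illustrate the general idea (this is a sketch under assumptions, not the code in the gtfsnetwork repository), an edge list can be derived from stop_times by pairing each stop with the next stop on the same trip. The function name `stop_times_to_edges` is hypothetical, and it assumes stop_times is a data frame with trip_id, stop_id, and stop_sequence columns.

```r
# Build a directed edge list (from stop -> next stop) out of stop_times.
stop_times_to_edges <- function(stop_times) {
  # Sort so that consecutive rows are consecutive stops within each trip.
  st <- stop_times[order(stop_times$trip_id,
                         as.integer(stop_times$stop_sequence)), ]
  n <- nrow(st)
  # Keep only adjacent row pairs that belong to the same trip.
  same_trip <- st$trip_id[-n] == st$trip_id[-1]
  data.frame(
    from    = st$stop_id[-n][same_trip],
    to      = st$stop_id[-1][same_trip],
    trip_id = st$trip_id[-n][same_trip],
    stringsAsFactors = FALSE
  )
}
```

Filtering by time or service id can happen before this step by subsetting stop_times, so the edge construction itself stays the same.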

I'm not sure how to write packages for S4 objects though, so I just read in the files as separate data.frames. What are your thoughts on this, @eamcvey and @rustyb?

hansthompson avatar Mar 14 '16 19:03 hansthompson

@hansthompson Things like this network analysis are exactly what I hope would be built into/on top of the package I was envisioning. At minimum, the package should make it easy to get GTFS feeds, assess the quality of the data, save it in a useful gtfs object, and make it convenient to do the types of joins that would be most common. I have a start on some of these features that I'll put into a public repo by the end of the week. Then ideally getting the data to the starting point for network analysis is very easy, and you can focus on the network part.

eamcvey avatar Mar 23 '16 15:03 eamcvey

Cool. I'll look forward to it! What are your thoughts on an rmarkdown-like output of the feed validation, with charts that show when service ids run, maps of the stops, etc.?

hansthompson avatar Mar 25 '16 20:03 hansthompson

@eamcvey & @hansthompson Great discussion thus far. I would like to jump in too. I was wondering if the public repo that @eamcvey planned to create was ready. You could outline some specific tasks that we can start working on.

Emaasit avatar Mar 30 '16 01:03 Emaasit

Better late than never: the code I've started on is finally in a public repo (https://github.com/ropenscilabs/gtfsr). I've got the basic functionality to pull feeds from the transitfeeds.com API, put all the feed data into a list of data frames (not yet a class, because I'm not sure what level of validation there should be), and create a validation data frame as part of that list to start characterizing the data quality of the feed. There is more to be done on data validation (checking that the IDs in different data frames match up where they should, for example), thinking to do about what the gtfs object should look like (maybe adapting existing code referenced in this discussion), and lots that could be built on top of this. I have a driver file in the repo that I used to test things out, and there are some functions in there I wrote on the fly that should get formalized.

eamcvey avatar Mar 31 '16 03:03 eamcvey

4 Main Purposes of the Package

  1. provides API wrappers for popular public GTFS feed sharing sites,
  2. reads feed data into a gtfs data object,
  3. validates data quality,
  4. provides convenience functions for common tasks

Convenience functions for common tasks may include:

  1. how to calculate fares,
  2. how to search for trips,
  3. how to optimize feed data

Emaasit avatar Mar 31 '16 07:03 Emaasit

I started to put together my own package to handle the GTFS-realtime feeds - https://github.com/SymbolixAU/gtfsway

It uses the RProtoBuf package to load the .proto file in .onLoad(). Then the gtfs_realtime() function reads the binary result of a GTFS-realtime response (although at the time of writing it doesn't do anything with the data; I'm still working on it). For example, the real-time feed for South East Queensland can be downloaded with:

## south east Queensland
url <- "https://gtfsrt.api.translink.com.au/Feed/SEQ"
response <- httr::GET(url)

If you want, I can make this into a 'formal' function and issue a PR to incorporate it into gtfsr?

SymbolixAU avatar Jan 01 '17 22:01 SymbolixAU

I'm really glad to read this thread and see that more people are interested in using R to do network analysis of GTFS datasets. I hope to contribute more to the project in the future. For now, I'll share a similar initiative using Java, which can bring some useful insights. It was created by Tyler Green.

http://www.tyleragreen.com/blog/2017/03/graphing-transit-systems-part-ii-centrality/

rafapereirabr avatar Mar 14 '17 16:03 rafapereirabr

@eamcvey Nice work! Do you know if there are plans to bring that package to CRAN?

bbrewington avatar Mar 16 '17 22:03 bbrewington