Add tests for big feeds

Open barbeau opened this issue 7 years ago • 8 comments

Summary:

We need to make sure that as we add new rules, the validator can continue to run in real-time on production-sized feeds for major cities.

I posted a question on the GTFS-rt list asking for examples of very large feeds: https://groups.google.com/forum/#!topic/gtfs-realtime/mM8cQIIV_-Y

These have been suggested to me so far, with largest coming first:

  • Dutch feed (http://gtfs.openov.nl/ - apparently OpenTripPlanner instances with 24-32GB of memory are used for this)
    • GTFS - http://gtfs.openov.nl/gtfs-rt/gtfs-openov-nl.zip (~261MB)
    • TripUpdates - http://gtfs.openov.nl/gtfs-rt/tripUpdates.pb (~8.4MB)
    • VehiclePositions - http://gtfs.openov.nl/gtfs-rt/vehiclePositions.pb (~617K)
  • MBTA
    • GTFS - https://cdn.mbta.com/MBTA_GTFS.zip (~13.4MB)
    • TripUpdates - https://cdn.mbta.com/realtime/TripUpdates.pb (~8.6MB)
    • VehiclePositions - https://cdn.mbta.com/realtime/VehiclePositions.pb (~44KB)
  • SEQ (Translink)
    • GTFS - https://gtfsrt.api.translink.com.au/GTFS/SEQ_GTFS.zip (~28MB)
    • Combined (TripUpdates + VehiclePositions) feed - https://gtfsrt.api.translink.com.au/Feed/SEQ (~2.2MB)
  • BART (http://www.bart.gov/schedules/developers)
    • GTFS - http://www.bart.gov/sites/default/files/docs/google_transit_20170325_v3.zip (427KB)
    • TripUpdates - http://api.bart.gov/gtfsrt/tripupdate.aspx (3.1KB - it's small because only 1 stop_time_update per trip)
    • VehiclePositions - BART doesn't have this
  • NYC (but they are likely split by borough)
  • LA Metro (not publicly shared)
  • MTC for SF Bay Area (http://511.org/developers/list/apis/) (According to http://assets.511.org/pdf/nextgen/developers/Open_511_Data_Exchange_Specification_v1.0_Transit.pdf, it doesn't seem that you can pull out more than one agency at a time, so no feed that includes all bay area transit agencies exists)
  • CTA (Doesn't seem to be public? http://www.transitchicago.com/developers/)
  • HART
    • GTFS - http://gohart.org/google/google_transit.zip (~2KB)
    • TripUpdates - http://api.tampa.onebusaway.org:8088/trip-updates (~9KB)
    • VehiclePositions - http://api.tampa.onebusaway.org:8088/vehicle-positions (~9KB)

We should add some unit tests that do basic benchmarking to ensure we're not exceeding a given duration when processing feeds. I think 2 seconds may be reasonable, but we'll need to test. We'll also need to figure out how this works for CI, as Travis is significantly underpowered when compared to a typical desktop.
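A minimal sketch of what such a duration check could look like, assuming plain Java with no test framework. The class `FeedBenchmark`, its method names, and the `slackFactor` parameter (a way to loosen the limit on slower CI hosts) are hypothetical and not part of the validator; the stand-in workload would be replaced by one validation iteration on a saved feed.

```java
import java.time.Duration;

public class FeedBenchmark {

    /** Runs the task a few times untimed (JIT warm-up), then returns the slowest timed run. */
    public static Duration worstCase(Runnable task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long worstNanos = 0;
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.run();
            worstNanos = Math.max(worstNanos, System.nanoTime() - start);
        }
        return Duration.ofNanos(worstNanos);
    }

    /** True if observed <= budget * slackFactor (a slackFactor above 1 loosens the limit). */
    public static boolean withinBudget(Duration observed, Duration budget, double slackFactor) {
        return observed.toNanos() <= (long) (budget.toNanos() * slackFactor);
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption); a real test would validate one saved feed snapshot.
        Runnable standIn = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable");
            }
        };
        Duration worst = worstCase(standIn, 2, 5);
        Duration budget = Duration.ofSeconds(2); // the tentative 2-second figure from above
        System.out.println("worst=" + worst.toMillis() + "ms, within budget: "
                + withinBudget(worst, budget, 1.0));
    }
}
```

Taking the worst run rather than the mean is a deliberate choice here: a real-time validator has to keep up on every iteration, not just on average.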

barbeau avatar Apr 12 '17 15:04 barbeau

If I try to run the Dutch feed with -Xmx8g parameter on my machine (dual Xeon @ 2.5 GHz w/ 16GB RAM), I get this exception after it runs for a very long time (I left and came back an hour later):

javax.servlet.ServletException: org.glassfish.jersey.server.ContainerException: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:423)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:386)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:334)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:221)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:800)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
	at org.eclipse.jetty.server.Server.handle(Server.java:497)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:313)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:626)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:546)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.glassfish.jersey.server.ContainerException: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.glassfish.jersey.servlet.internal.ResponseWriter.rethrow(ResponseWriter.java:256)
	at org.glassfish.jersey.servlet.internal.ResponseWriter.failure(ResponseWriter.java:238)
	at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:486)
	at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:316)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:291)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1140)
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:403)
	... 17 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
	at java.lang.StringBuilder.<init>(StringBuilder.java:89)
	at org.onebusaway.csv_entities.DelimitedTextParser.parse(DelimitedTextParser.java:65)
	at org.onebusaway.csv_entities.CSVLibrary.parse(CSVLibrary.java:131)
	at org.onebusaway.csv_entities.CsvTokenizerStrategy.parse(CsvTokenizerStrategy.java:34)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:154)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:120)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:115)
	at org.onebusaway.gtfs.serialization.GtfsReader.run(GtfsReader.java:172)
	at org.onebusaway.gtfs.serialization.GtfsReader.run(GtfsReader.java:160)
	at com.conveyal.gtfs.validator.json.FeedProcessor.load(FeedProcessor.java:73)
	at com.conveyal.gtfs.validator.json.FeedProcessor.run(FeedProcessor.java:44)
	at edu.usf.cutr.gtfsrtvalidator.api.resource.GtfsFeed.postGtfsFeed(GtfsFeed.java:180)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
	at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
	at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:308)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)

So it looks like it's getting hung up in the static GTFS validation using the Conveyal gtfs-validator.
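To narrow this down without a full profiler, one option is to log heap usage around the static GTFS load. This is only a sketch using the standard `java.lang.Runtime` calls; the class `HeapReport` and its method names are hypothetical, and the numbers are approximate because the JVM may not have collected garbage when they are read.

```java
public class HeapReport {

    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use (reflects -Xmx), in MiB. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently in use (allocated minus free), in MiB. Approximate. */
    public static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("before load: used=" + usedHeapMiB() + " MiB of max=" + maxHeapMiB() + " MiB");
        // ... load the GTFS zip here (omitted; depends on the reader being used) ...
        System.out.println("after load:  used=" + usedHeapMiB() + " MiB of max=" + maxHeapMiB() + " MiB");
    }
}
```

Running this with `-Xmx8g` should confirm whether the heap limit is actually being picked up, and the before/after difference gives a rough idea of how much memory the static feed needs.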

If I run the Dutch GTFS-rt feed with random GTFS data (I used HART in Tampa), then it processes each GTFS-rt iteration in about 1.1 seconds.

Using MBTA data, it processes each GTFS-rt iteration in about 1.1 seconds as well.

barbeau avatar Apr 12 '17 19:04 barbeau

Here's a good list of GTFS-rt feeds from Transitfeeds.com: http://transitfeeds.com/search?q=gtfsrt

barbeau avatar May 03 '17 21:05 barbeau

Transitland issue for adding support for GTFS-rt feeds - https://github.com/transitland/transitland/issues/77.

barbeau avatar May 08 '17 19:05 barbeau

We could use the batch processor to benchmark feed processing times - see "Configuration options -> Batch processing" in the README: https://github.com/CUTR-at-USF/gtfs-realtime-validator#configuration-options

barbeau avatar Sep 19 '17 17:09 barbeau

@barbeau did you try running the out-of-memory dataset using a profiler?

skjolber avatar Mar 11 '18 10:03 skjolber

No, not yet.

barbeau avatar Mar 11 '18 21:03 barbeau

A good approach for this might be to graph performance on each PR instead of imposing hard limits via a unit test - that's what OpenTripPlanner is doing here: https://github.com/opentripplanner/OpenTripPlanner/pull/3783
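A rough sketch of the recording side of that idea, assuming each CI run appends one row per feed to a CSV file that a separate step later plots. This is not how OpenTripPlanner does it; the class `TimingLog`, the file name `timings.csv`, and the column layout (commit, feed, milliseconds) are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Locale;

public class TimingLog {

    /** Formats one measurement as "commit,feed,millis" with millisecond precision to 3 decimals. */
    public static String csvRow(String commit, String feed, long elapsedNanos) {
        return String.format(Locale.ROOT, "%s,%s,%.3f", commit, feed, elapsedNanos / 1_000_000.0);
    }

    /** Appends a row to the file, creating the file if it does not exist yet. */
    public static void append(Path file, String row) throws IOException {
        Files.write(file,
                (row + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // The commit id would normally come from the CI environment (assumption).
        String commit = args.length > 0 ? args[0] : "unknown";
        long start = System.nanoTime();
        // ... run one validation iteration here (omitted) ...
        long elapsed = System.nanoTime() - start;
        append(Paths.get("timings.csv"), csvRow(commit, "example-feed", elapsed));
    }
}
```

Because nothing fails on a slow run, this avoids flaky builds on underpowered CI hosts while still making regressions visible over time.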

barbeau avatar Dec 14 '21 15:12 barbeau

DELFI e.V. is a non-profit that aggregates the transit datasets of all the local transit authorities/providers to create a unified feed for Germany. Its official role is to publish NeTEx, as mandated by EU regulation.

But it also publishes a GTFS feed generated from the merged data, which is currently 333 MB in size. Its official site doesn't provide a direct & script-friendly URL for it (🙄), but @juliuste kindly mirrors it at https://de.data.public-transport.earth/gtfs-germany.zip.

Currently, it is not much larger than the Dutch feed, but it will likely grow over the coming months & years as missing regions and lots of stop/station & pathways.txt topologies are added.

Edit: Unfortunately, to my knowledge, there are no realtime feeds available right now.

derhuerst avatar Dec 14 '21 16:12 derhuerst