
Processing Large Batches of GTFS & Minor Trivialities

Open d-wasserman opened this issue 5 years ago • 8 comments

Hi Melinda,

Per our conversation, I am revisiting the repo again with an eye toward processing dozens of GTFS datasets. My general question is: what can we do to have the PreprocessGTFS tool "fail gracefully" as an option? My idea here is that the tool would just skip GTFS datasets it can't process and interpolate (in pandas) stop times in the CSVs where that is required. I am thinking of a boolean that would just be called "Best Attempt Pre-Processing" (or similar). Having to process 50+ GTFS datasets and deal with the manual steps for interpolation and checking is pretty painful.

  • Can we more efficiently handle interpolation in the pre-processing stage as an optional boolean for all import directories?

  • Does the db.commit() call need to move to a different position in order to "fail gracefully"?

  • Is it possible to create a "fake" calendar file that uses calendar_dates.txt to get a general sense of whether a typical weekday might work? Right now, if the calendar file is missing, that GTFS directory is simply not processed with the batch. I think we could make some assumptions from a calendar_dates.txt file if that is all a dataset has. Another idea: if a general Monday is requested but there is no calendar file, the script searches for the first Monday in calendar_dates.txt. I think this could be done with datetime objects; a sketch of the date search follows.
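
A minimal sketch of the date search I have in mind, assuming a standard calendar_dates.txt with service_id, date (YYYYMMDD), and exception_type columns (the helper name find_first_weekday is made up):

```python
import csv
from datetime import datetime

def find_first_weekday(calendar_dates_path, weekday_name):
    """Return the earliest added-service date in calendar_dates.txt that
    falls on the requested day of the week, or None if there isn't one."""
    matches = []
    with open(calendar_dates_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            if row["exception_type"].strip() != "1":  # 1 = service added
                continue
            date = datetime.strptime(row["date"].strip(), "%Y%m%d")
            if date.strftime("%A").lower() == weekday_name.lower():
                matches.append(date)
    return min(matches) if matches else None

# If a "general Monday" is requested but calendar.txt is missing:
first_monday = find_first_weekday("calendar_dates.txt", "Monday")
```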

Thoughts on this? I know we have discussed it before. I will volunteer code.

d-wasserman avatar Apr 23 '19 02:04 d-wasserman

Is this for BetterBusBuffers or for AddGTFS, or both?

If for AddGTFS, the new stuff in Pro will make it easier to catch a specific error message because it will have a real number. However, the problem of processing lots of datasets still stands. It wasn't practical in the first release to do anything fancy.

So, for your purposes, you just want it to use as many datasets as are good and tell you which ones were skipped because they were bad, is that correct? (And automatically correct the ones that were bad, but that's kind of a separate operation.)

Is there value in having some sort of append tool so you could correct the bad ones and then append them to the pile?

mmorang avatar May 20 '19 19:05 mmorang

I would like to say both, though I admit I don't know entirely what changes to GTFS processing were made there. However, there is a need to process datasets en masse across large scales. The database is not the issue; the inconsistencies are. Offline I can tell you more about the issues I ran into processing the entire state of California (I got close), but effectively my issues boil down to a lack of a calendar.txt and having to build ad-hoc processes for interpolation.

My idea is effectively:

  • The GTFS processing step should have the option to produce a "best first cut interpretation" of the data, so that:
  1. Stop times are interpolated if needed - there is no separate process; it is built into the database prep step (see the pandas sketch after this list).
  2. Calendar date files are read and processed to provide a "best guess" calendar.txt file if one is missing. If a dataset has neither a calendar_dates.txt file nor a calendar.txt file, the tool creates one with weekday-only service for all routes (though the data would have to be invalid GTFS for that to happen). The idea here is that a weekday option should always be available as a promise by the toolset.
  3. Any other problems you foresee should also be addressed.
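
For item 1, a rough pandas sketch of the kind of interpolation I mean, assuming each trip's first and last stops are timed (as the spec requires) and ignoring GTFS times past 24:00:00:

```python
import numpy as np
import pandas as pd

def hms_to_seconds(value):
    """Convert an HH:MM:SS string to seconds; blanks become NaN."""
    if not isinstance(value, str) or value.strip() == "":
        return np.nan
    h, m, s = value.split(":")
    return int(h) * 3600 + int(m) * 60 + int(s)

def seconds_to_hms(value):
    """Convert seconds back to an HH:MM:SS string."""
    value = int(round(value))
    return f"{value // 3600:02d}:{value % 3600 // 60:02d}:{value % 60:02d}"

def interpolate_stop_times(path):
    """Fill blank arrival/departure times by linear interpolation within
    each trip. Treats stops as evenly spaced between timed stops; a real
    version could weight by shape_dist_traveled instead."""
    df = pd.read_csv(path, dtype=str)
    df["stop_sequence"] = df["stop_sequence"].astype(int)
    df = df.sort_values(["trip_id", "stop_sequence"])
    for col in ("arrival_time", "departure_time"):
        secs = df[col].map(hms_to_seconds)
        secs = secs.groupby(df["trip_id"]).transform(
            lambda s: s.interpolate(limit_area="inside"))
        df[col] = secs.map(seconds_to_hms, na_action="ignore")
    return df
```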

d-wasserman avatar May 20 '19 20:05 d-wasserman

The new Pro stuff puts all the transit data in a set of file geodatabase tables. It does not use sqlite at all. I would like to make BetterBusBuffers use these tables at some point. Realistically, I probably am not going to make any major changes to the existing sqlize_csv code because that is old old old and not very good.

The new Pro stuff has a tool similar to PreprocessGTFS, but it converts the GTFS data into the new data model (set of file geodatabase tables). That tool will fail as soon as it hits a bad dataset. I realize this isn't ideal in many circumstances, but this was kind of a first cut for the first release, and we plan to revisit this in the future.

There are other issues at stake having to do with how well our new implementation will scale realistically anyway. I'm not sure that a single dataset for all of California is something we want to recommend doing. Instead, make several separate datasets and have your network service or script or whatever do some kind of brokering based on the extent of the inputs so it can select the correct database. This is how the ArcGIS Online services work behind the scenes right now - if your inputs are in North America, it picks the North America network dataset, but if they're in Europe, it picks the Europe network dataset. Anyway, we plan to collect feedback from users (like you) to find out what's really needed.

Everything you've said makes sense and deserves more thought. I like the idea of not failing the tool when it hits a bad dataset. The tool could just skip the bad one and throw a warning, and maybe there could be a boolean so the user can turn this behavior on and off. And it should fail if no datasets were successfully converted.

As for automagically interpolating stop_times or making a best-guess calendar...that's even more fancy. The best way I see to do that would be to make one big script wrapping several parts: a) run the tool and throw warnings for bad datasets, b) for each bad dataset, depending on what the warning was, do the interpolation or fix the calendar, and c) for each bad dataset, re-process it and append it to the original output (assuming there is some kind of append tool).
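
A skeleton of that wrapper might look like the following. Here convert and fix are hypothetical callables standing in for the real tool run and the per-warning repair step, not actual tool names:

```python
import warnings

def batch_convert(gtfs_folders, convert, fix=None):
    """Convert each GTFS dataset, warning and moving on when one fails,
    optionally repairing and re-processing the failures, and failing
    overall only if nothing converted."""
    succeeded, failed = [], []
    for folder in gtfs_folders:
        try:
            convert(folder)  # (a) run the tool on one dataset
            succeeded.append(folder)
        except Exception as err:
            warnings.warn(f"Skipping {folder}: {err}")
            failed.append((folder, err))
    if fix:
        for folder, err in failed:
            fix(folder, err)  # (b) interpolate, patch the calendar, etc.
            convert(folder)   # (c) re-process and append to the output
    if not succeeded:
        raise RuntimeError("No GTFS datasets were successfully processed.")
    return succeeded, failed
```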

mmorang avatar May 20 '19 21:05 mmorang

Question for you (only tangentially related): When doing analysis, for your purposes, do you need the ability to choose any valid date within the GTFS's valid period, or are you usually doing analysis for only a single date or a short date range? The reason I ask is that I'm wondering whether it would be valuable for some users to cut down on the size of the stored data by keeping only the data relevant to a particular date or small range of dates rather than the entirety of the GTFS. If I only care about running analysis on Mondays, I can remove all data relevant only to the other days of the week.
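
For example, pruning a feed down to Monday-only service could be as simple as this sketch (it assumes calendar.txt is present and ignores calendar_dates.txt exceptions):

```python
import pandas as pd

def monday_only(gtfs_dir):
    """Keep only service_ids active on Mondays, then drop the trips and
    stop times that no longer reference a surviving service or trip."""
    cal = pd.read_csv(f"{gtfs_dir}/calendar.txt", dtype=str)
    keep = set(cal.loc[cal["monday"] == "1", "service_id"])
    trips = pd.read_csv(f"{gtfs_dir}/trips.txt", dtype=str)
    trips = trips[trips["service_id"].isin(keep)]
    stop_times = pd.read_csv(f"{gtfs_dir}/stop_times.txt", dtype=str)
    stop_times = stop_times[stop_times["trip_id"].isin(set(trips["trip_id"]))]
    return trips, stop_times
```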

mmorang avatar May 20 '19 22:05 mmorang

Comment 1: Got it. I agree.

From my perspective, integrating multiple GTFS datasets should be your problem and not mine, but currently it is mine. That said, this is no different from how the tools currently operate.

In the case of BetterBusBuffers specifically, I think there is a need for an integrated database that scales at least to the state level. I am thinking about pre-screening for TOD (SB 827, for example) and even environmental impact exemptions such as SB 743 in CA. We know the technology can process at that scale in Pro (see the screenshot below), but it is the other issues I mentioned that are the problem. We can debate whether or not we could or should do the network analysis (should, cough), but I think the main bottlenecks are data inconsistency rather than database size.

[screenshot: pastedImage_2]

We can start with better failure handling, and work our way to interpolation ("figured out") and then calendar management (TBD).

Actually, that is kind of what my script for California did... I will send it your way once I get permission.

Comment 2: Generally, we typically need weekday vs. weekend service. We evaluate specific dates mainly to avoid double counting, but I think that should be solved at the developer end. Removing extraneous data could work in many cases.

d-wasserman avatar May 20 '19 22:05 d-wasserman

Okay, my current plan is a set of enhancements to the GTFS To Network Dataset Transit Sources tool in ArcGIS Pro for the 2.5 release. Enhancements:

  • The tool has an optional boolean parameter allowing you to automatically interpolate blank stop times.
  • The tool doesn't fail when it encounters a data issue but instead throws a warning explaining the failure and then moves on to processing the next dataset. The tool overall only fails if none of the datasets could be processed successfully.
  • The tool has an optional boolean parameter allowing you to append a GTFS dataset onto some existing data model tables. So, if you do a first pass of processing and it skips a dataset because of an error, you can correct the error and then append the data onto the already-processed results without having to re-process everything.

The only thing I would be handling automatically is the interpolation. You could determine whether your data is missing a calendar.txt file in a quick and easy pre-processing step and create a dummy calendar if needed. Also, if you run GTFS To Network Dataset Transit Sources in a script, you can grab the warnings returned, which have official numbers, and, based on the number, write some more code that deals with that specific problem. Maybe. A lot of data problems are hard to correct without a human. Like, if your lat/lon values have garbage in them, there's not really anything a computer can do.
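
Something like this quick sketch is all I mean by the pre-processing step. The service_id "dummy_weekday" and the fallback date range are placeholders; it pulls the real range from calendar_dates.txt when that file exists:

```python
import csv
import os

def ensure_calendar(gtfs_dir):
    """If calendar.txt is missing, write a weekday-only dummy entry
    spanning the dates observed in calendar_dates.txt."""
    cal_path = os.path.join(gtfs_dir, "calendar.txt")
    if os.path.exists(cal_path):
        return
    dates = []
    cd_path = os.path.join(gtfs_dir, "calendar_dates.txt")
    if os.path.exists(cd_path):
        with open(cd_path, newline="", encoding="utf-8-sig") as f:
            dates = [row["date"].strip() for row in csv.DictReader(f)]
    start, end = (min(dates), max(dates)) if dates else ("20190101", "20191231")
    with open(cal_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["service_id", "monday", "tuesday", "wednesday",
                         "thursday", "friday", "saturday", "sunday",
                         "start_date", "end_date"])
        writer.writerow(["dummy_weekday", 1, 1, 1, 1, 1, 0, 0, start, end])
```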

How does this proposal sound to you?

mmorang avatar Jul 01 '19 20:07 mmorang

I think this definitely addresses stop times, but it does not address the issue of "stale" or "bad" calendar files. My concern is that knowing there is a calendar mismatch issue is often very difficult at large scales. I could try to make something that does this, but some help in that arena would be good for files that cannot support general "weekdays". I think if a user picks a general weekday, the tool should pick the first matching day of the type specified in the calendar_dates.txt file (because a feed will have one file or the other), and then flash a warning that there was no calendar.txt file to identify typical weekdays. Letting us solve the problem is fine, but I think we should at least warn users when a calendar.txt file is not available for general weekday selections. Is this difficult to do? It seems like you could iterate through supported dates until one was classified as a Wednesday, etc.

d-wasserman avatar Jul 01 '19 21:07 d-wasserman

With the release of ArcGIS Pro 2.5, the bullet points in https://github.com/Esri/public-transit-tools/issues/135#issuecomment-507412482 have been implemented.

I have not addressed any of the calendar issues mentioned in previous comments.

mmorang avatar Mar 06 '20 21:03 mmorang