Processing a large file results in Out of Memory errors

Open KClough opened this issue 1 year ago • 7 comments

Describe the bug

When processing a large GTFS file, the desktop validator doesn't show any errors to the user.

When checking the system_errors.json file, the following error is present:

{
  "code": "thread_execution_error",
  "severity": "ERROR",
  "totalNotices": 1,
  "sampleNotices": [
    {
      "exception": "java.lang.OutOfMemoryError",
      "message": "Java heap space"
    }
  ]
}
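
A quick way to spot this failure without opening the report by hand is to scan it for the exception name. The following is a minimal Java sketch, not part of the validator: the class `SystemErrorsCheck` and its method are hypothetical, and it assumes the report is a file named system_errors.json in the working directory unless a path is passed as the first argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of the validator): flags an out-of-memory failure in a report. */
class SystemErrorsCheck {

    /** Returns true if the report text mentions java.lang.OutOfMemoryError. */
    static boolean mentionsOutOfMemory(String reportJson) {
        // A plain substring match keeps the sketch dependency-free; a real tool
        // would parse the JSON and inspect each notice's "exception" field.
        return reportJson.contains("java.lang.OutOfMemoryError");
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location: system_errors.json in the working directory.
        Path report = Path.of(args.length > 0 ? args[0] : "system_errors.json");
        String json = Files.readString(report);
        if (mentionsOutOfMemory(json)) {
            System.out.println("Validation ran out of heap space; retry with a larger -Xmx.");
        } else {
            System.out.println("No OutOfMemoryError found in " + report);
        }
    }
}
```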

Steps/Code to Reproduce

  1. Download the latest release of the desktop validator (v4.0.0 at time of writing)
  2. Install the validator
  3. Download this dataset
  4. Run the desktop validator against the downloaded dataset
  5. When complete, observe the system_errors.json file contains a java.lang.OutOfMemoryError.

Expected Results

The validator should complete without error as long as the system has enough memory.

Actual Results

The validator completes with an error.

Screenshots

No response

Files used

File is too large to upload. Linked here.

Validator version

4.0.0

Operating system

macOS Ventura 13.0, M1 Pro, 16 GB RAM

Java version

openjdk version "11.0.16.1" 2022-08-12

Additional notes

When testing the CLI validator with different JVM settings, here were my results:

  * -Xmx6g ❌
  * -Xmx7g ✅
  * -Xmx8g ✅
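
When comparing settings like these, it can help to confirm the heap limit the JVM actually applied. Below is a small Java sketch using the standard `Runtime.maxMemory()` call; the class name `HeapLimit` is made up for illustration, and the reported value may come out slightly below the requested `-Xmx` depending on the garbage collector in use.

```java
/** Hypothetical helper: prints the maximum heap this JVM will attempt to use. */
class HeapLimit {

    /** Maximum heap size in MiB, as reported by the running JVM. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMiB() + " MiB");
    }
}
```

Saved as HeapLimit.java, it can be launched directly on Java 11 with `java -Xmx7g HeapLimit.java`.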

KClough, Mar 20 '23 21:03

The official DELFI GTFS feed (available with a stable URL here) is even bigger (2.6 GB vs 1 GB, 36M vs 28M stop_times rows), so you might want to use that one for benchmarking.

derhuerst, Mar 21 '23 01:03

It seems like the validator needs less memory when running it with >1 threads (-t). 🤔

(I haven't looked into this properly, this is just my observation based on a few test runs.)
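
To turn an observation like this into numbers, peak heap usage and elapsed time can be recorded around a workload. The following is a rough Java sketch built on the standard `java.lang.management` API; the class `PeakHeapProbe` and the placeholder workload are hypothetical, and wiring it into the validator itself is left open. Summing per-pool peaks can slightly overstate the true simultaneous peak, so treat the figure as an upper bound.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Hypothetical probe: records peak heap usage and wall-clock time around a workload. */
class PeakHeapProbe {

    /** Sum of the peak usage recorded by each heap memory pool, in bytes. */
    static long peakHeapBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage peak = pool.getPeakUsage();
            if (pool.getType() == MemoryType.HEAP && peak != null) {
                total += peak.getUsed();
            }
        }
        return total;
    }

    /** Runs the workload and returns {peak heap bytes, elapsed milliseconds}. */
    static long[] measure(Runnable work) {
        // Reset each pool's peak to its current usage so earlier allocations do not count.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            pool.resetPeakUsage();
        }
        long start = System.nanoTime();
        work.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return new long[] {peakHeapBytes(), elapsedMs};
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for a validation run.
        long[] result = measure(() -> {
            long[][] rows = new long[1_000][];
            for (int i = 0; i < rows.length; i++) {
                rows[i] = new long[1_000];
            }
        });
        System.out.println("Peak heap: " + result[0] / (1024 * 1024) + " MiB, elapsed: " + result[1] + " ms");
    }
}
```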

derhuerst, May 22 '23 12:05

It might also be worthwhile comparing GTFS Validator's memory usage to GTFSVTOR's, in order to identify places where one of the two is inefficient. Related: https://github.com/mecatran/gtfsvtor/issues/31#issuecomment-643460883

derhuerst, May 22 '23 12:05

Hey all, I mentioned in the last GTFS Validators contributors meeting that I was thinking about reducing memory usage in the validator. I've got some thoughts (including some ideas inspired by GTFSVTOR) on what we should try, written up in the following doc:

[Public] GTFS Validator - Memory Reduction

Comments and feedback appreciated. Thanks!

bdferris-v2, Feb 05 '24 06:02

[Public] GTFS Validator - Memory Reduction

I'm not a Java dev, but the mentioned approach(es) sound reasonable!

However, as a consumer, I'd also be interested in the implications for the performance, particularly the wall clock time required to process a feed.

derhuerst, Feb 06 '24 17:02

Hi @bdferris-v2, I agree with the general idea. I wonder whether you considered a mechanism that keeps the current implementation and adds the new GTFS entities implementation as the "default". We could parameterize the implementations so that different approaches are easy to test while experimenting, and so that different implementations can be used depending on the feed. I can see some implementations offering better performance and others a smaller memory footprint.

davidgamez, Feb 13 '24 14:02

@davidgamez I'm not exactly sure how it would work yet, but I agree that it would be helpful to conditionally support implementations for both the existing data model and the newly proposed column-based model, for the reasons you outline. I think it will be a little trickier for the trip-pattern proposal, given the larger changes to the data model, but perhaps not impossible.
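
As a purely illustrative sketch of what supporting both layouts conditionally could look like (this is not taken from the design doc or the validator code base), a shared interface can hide whether rows are stored as objects or as per-field columns. Every name here (`StopTimeStore`, `RowStopTimeStore`, `ColumnStopTimeStore`, `StopTimeStores.create`, the `"row"`/`"column"` keys) is hypothetical, and only three stop_times fields are modeled, with arrival times assumed to be held as seconds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical factory: picks a storage layout by name, e.g. from a CLI parameter. */
class StopTimeStores {
    static StopTimeStore create(String layout) {
        switch (layout) {
            case "row": return new RowStopTimeStore();
            case "column": return new ColumnStopTimeStore();
            default: throw new IllegalArgumentException("Unknown layout: " + layout);
        }
    }

    public static void main(String[] args) {
        for (String layout : new String[] {"row", "column"}) {
            StopTimeStore store = create(layout);
            store.add("trip_1", 1, 8 * 3600);
            store.add("trip_1", 2, 8 * 3600 + 300);
            System.out.println(layout + ": " + store.size() + " rows, last arrival "
                    + store.arrivalSecs(1) + " s");
        }
    }
}

/** Hypothetical interface shared by both storage strategies. */
interface StopTimeStore {
    void add(String tripId, int stopSequence, int arrivalSecs);
    int size();
    String tripId(int row);
    int stopSequence(int row);
    int arrivalSecs(int row);
}

/** Row-based layout: one heap object per stop_times row. */
class RowStopTimeStore implements StopTimeStore {
    private static final class Row {
        final String tripId;
        final int stopSequence;
        final int arrivalSecs;

        Row(String tripId, int stopSequence, int arrivalSecs) {
            this.tripId = tripId;
            this.stopSequence = stopSequence;
            this.arrivalSecs = arrivalSecs;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    public void add(String tripId, int stopSequence, int arrivalSecs) {
        rows.add(new Row(tripId, stopSequence, arrivalSecs));
    }
    public int size() { return rows.size(); }
    public String tripId(int row) { return rows.get(row).tripId; }
    public int stopSequence(int row) { return rows.get(row).stopSequence; }
    public int arrivalSecs(int row) { return rows.get(row).arrivalSecs; }
}

/** Column-based layout: one array per field, so no per-row object is allocated. */
class ColumnStopTimeStore implements StopTimeStore {
    private String[] tripIds = new String[16];
    private int[] stopSequences = new int[16];
    private int[] arrivals = new int[16];
    private int size;

    public void add(String tripId, int stopSequence, int arrivalSecs) {
        if (size == tripIds.length) {
            // Double the capacity of every column when the arrays are full.
            int capacity = size * 2;
            tripIds = Arrays.copyOf(tripIds, capacity);
            stopSequences = Arrays.copyOf(stopSequences, capacity);
            arrivals = Arrays.copyOf(arrivals, capacity);
        }
        tripIds[size] = tripId;
        stopSequences[size] = stopSequence;
        arrivals[size] = arrivalSecs;
        size++;
    }
    public int size() { return size; }
    public String tripId(int row) { return tripIds[checkRow(row)]; }
    public int stopSequence(int row) { return stopSequences[checkRow(row)]; }
    public int arrivalSecs(int row) { return arrivals[checkRow(row)]; }

    private int checkRow(int row) {
        if (row < 0 || row >= size) throw new IndexOutOfBoundsException("row " + row);
        return row;
    }
}
```

Because callers only see `StopTimeStore`, the same validation code could be run against either layout while experimenting, which is the kind of side-by-side comparison discussed above.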

bdferris-v2, Feb 14 '24 19:02