
Proposal: enable direct deployment of validator as a web service

themightychris opened this issue 2 years ago • 10 comments

Feature request

Background

  • Organizations want to make use of the canonical validator in custom pipelines and user interfaces, but often lack Java expertise
  • MobilityData may want to host a public web-based validation user interface in the future
  • The dominant pattern for operationalizing scalable computing resources today is to publish a Docker container that exposes an HTTP service, which can be pooled behind a load balancer and horizontally scaled as needed (i.e., turning the number of replicas in the pool up and down)

Proposed solution

A relatively small amount of work could significantly improve the utility of the gtfs-validator repository:

  • Merge #1120, bringing the Docker container image published to ghcr.io in line with Docker best practices
  • Integrate an optional HTTP server into the validator jar (and the standard Docker container image published via ghcr.io), potentially based on the work in #1088 but excluding the frontend code and HTML responses. When the container is invoked in the form docker run ghcr.io/mobilitydata/gtfs-validator:v3.1.0 --web_server, it would expose two JSON API endpoints:
    • GET /ready: returns HTTP status 200 if the HTTP service is available and ready to accept validation requests
    • POST /validate: accepts a GTFS zip file and returns a JSON report
  • Add an openapi.yaml file to the root of the repository documenting the two available HTTP endpoints in OpenAPI 3.0 format
  • Provide an example Docker command line for running the ghcr.io-published Docker container image as a web service.
    • This provides a baseline reference that deployers can translate to any container-based deployment system
  • Publish a usable Helm chart for deploying the validator as a Kubernetes Deployment with horizontal pod autoscaling configured
    • This provides a more full-featured reference for a scalable deployment that can be point-and-click deployed to any managed Kubernetes provider
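
For illustration, the proposed openapi.yaml might sketch out roughly like this (summaries and schema details are placeholders, not a finalized contract):

```yaml
openapi: "3.0.3"
info:
  title: gtfs-validator web service
  version: "1.0.0"
paths:
  /ready:
    get:
      summary: Readiness check
      responses:
        "200":
          description: Service is ready to accept validation requests
  /validate:
    post:
      summary: Validate a GTFS zip file
      requestBody:
        required: true
        content:
          application/zip:
            schema:
              type: string
              format: binary
      responses:
        "200":
          description: JSON validation report
          content:
            application/json:
              schema:
                type: object
```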

Key outcomes

Organizations operationalizing the canonical validator as part of their infrastructure can:

  • utilize docker run ghcr.io/mobilitydata/gtfs-validator:v3.1.0 --web_server as a well-defined and stable runtime semantic
  • utilize the HTTP API documented in openapi.yaml as a well-defined and stable contract with internal services utilizing the validator
  • quickly bump the deployed validator version as MobilityData publishes new releases

Out of scope

This proposal does not suggest implementing within the validator repository:

  • a web-based user interface (but this work would be an enabling step for various paths to deploying a web-based user interface in the future)
  • any sort of access control (common means for deploying HTTP web services all provide for addressing that at a higher level e.g. via private networking or reverse proxies)
  • any sort of performance optimization (horizontal container scaling can serve as a kludge while potential performance improvements are pursued in parallel)

Alternatives

  • Organizations can build their own web services that use the Java APIs to bind with the validator's JAR file
    • Many organizations lack the Java development expertise necessary for leveraging Java interfaces directly
  • Organizations can build workflows that invoke the CLI and then collect its output files
    • This approach requires users to build a lot of repetitive glue code to invoke the validator within a network services environment and to make the validator available as a scalable computing service
    • The current CLI also does not follow POSIX input/output conventions; for use in custom pipelines, accepting input on STDIN and returning results over STDOUT would be preferable to executing the validator and then reading the results from persistent storage
  • Implement a web service outside the validator repository
    • This is a feasible alternative, but it would be ideal if the officially maintained ghcr.io container image was ready for direct deployment to modern container-based infrastructure like Kubernetes or Cloud Run. This would help consumers stay up to date with the latest validator by enabling them to move between validator versions seamlessly by having the first-party artifact deployed directly into their infrastructure behind a simple and well-known network HTTP API

Implementation

If this approach is acceptable to the maintainers, @JarvusInnovations will (funder approval pending) provide a ready-to-merge PR implementing all of the above

themightychris avatar Jun 07 '22 16:06 themightychris

+1

jsteelz avatar Jun 07 '22 19:06 jsteelz

I have a question about the proposed behavior of the web service: would the service be synchronous? Namely, when a validation request is submitted to the service, would the API produce a response only once validation is complete? Or are you imagining something asynchronous, such that a validation request returns a response immediately and the caller would have to periodically check back (or subscribe to some sort of push update) to see when validation is complete?

The meta question here is what happens for validation of large feeds or multiple feeds requested in quick succession. Any thoughts / concerns there?

bdferris-v2 avatar Jun 13 '22 21:06 bdferris-v2

@bdferris-v2 that's a great question

Does anyone have any hard data on the min/avg/max runtimes they see in the field? I'll check with the Cal-ITP team on what we see in CA feeds

For my part, I was envisioning synchronous, at least to start. Otherwise we have to build some work queue mechanics into the validator, which would be quite a bit more complexity. Whether that belongs inside the validator itself is a bigger question I personally would rather punt on for now, but if MobilityData product has a strong direction in mind that'd be a different story. A synchronous implementation won't require much/any overhead to implement, and an async/queue-based implementation could be added in a later effort without much/any rework.

My own intention is to create a Helm chart for a Kubernetes Deployment with an attached HorizontalPodAutoscaler which should handle the case of multiple feeds requested in quick succession decently, and provide that as a reference deployment. I don't know how well that would work in a pipeline application, but I expect it to work well enough for user-driven on-demand validations. Folks hitting it from pipelines could probably tune the HPA config via Helm values or use manual scaling to match their workloads and spread the work across enough workers via HTTP load balancing.
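
As a rough illustration of what that reference deployment could include (resource names and thresholds here are placeholders, not a finalized chart), the HorizontalPodAutoscaler piece might look like:

```yaml
# Illustrative HPA for a gtfs-validator Deployment; scales replicas
# between 1 and 10 based on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gtfs-validator
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gtfs-validator
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```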

A queue would certainly be ideal for this sort of use case though. Does anyone know of any pre-built queue managers in the Java ecosystem that could be integrated simply enough to make a strong case for offering as a built-in option?

themightychris avatar Jun 13 '22 22:06 themightychris

I think the feeds in the end_to_end_big.yml workflow are probably representative of some typical runtimes for large feeds. There I see average processing times of around 3 minutes per feed. That's obviously at the high-end, but it's a long time to be waiting around for a response. I'm curious what the largest feed would be in the Cal-ITP corpus. If you were running a merged feed for the entire Bay Area, for example, I could see it being comparable. It doesn't sound like you are planning on using this in a processing pipeline. Would this be in response to a user action instead? Something else? What happens in your upstream system while you are waiting?

Agreed that a queue-based approach with asynchronous responses adds more complexity, but probably more so on the API side? On the processing side, you could probably get by with a fixed thread pool executor (à la Executors.newFixedThreadPool(10)) to queue processing tasks. But I'm less clear what the API surface should look like.
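
A minimal sketch of that thread-pool idea, with a hypothetical ValidationQueue wrapper (the feed names and report strings are stand-ins for real validation work):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: a fixed thread pool caps how many validations run at
// once, while additional submissions wait in the executor's queue.
public class ValidationQueue {
    private final ExecutorService pool = Executors.newFixedThreadPool(10);

    // A synchronous HTTP handler would block on the returned Future;
    // an asynchronous one could hand back a job id instead.
    public Future<String> submit(String feedName) {
        return pool.submit(() -> {
            // ... run the actual GTFS validation here ...
            return "report for " + feedName;
        });
    }

    public void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        ValidationQueue queue = new ValidationQueue();
        List<Future<String>> jobs =
                List.of(queue.submit("feed-a.zip"), queue.submit("feed-b.zip"));
        for (Future<String> job : jobs) {
            System.out.println(job.get()); // blocks until each validation finishes
        }
        queue.shutdown();
    }
}
```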

Either way, I'm not generally opposed to this. If synchronous really works for you, then I'm agreed, it's simpler. In case there is any question, it'd be great if this functionality went into a different sub-project within the main Gradle project. A similar refactor is proposed for the CLI in #1188. I'd specifically propose to not integrate this functionality into the existing CLI, just to keep things simpler there.

bdferris-v2 avatar Jun 15 '22 17:06 bdferris-v2

I'd specifically propose to not integrate this functionality into the existing CLI, just to keep things simpler there.

+1 for this, if for no other reason than avoiding pulling in web server dependencies into the CLI JAR that aren't needed for the plain CLI

barbeau avatar Jun 15 '22 17:06 barbeau

This makes sense. Our initial idea was also to leverage Kubernetes Deployment to deploy the validator as a web service. I agree with the other aspects of your plan. Also, #1120 is almost ready to merge.

I think it is worth implementing a synchronous service first, as it can be done much faster than an asynchronous one and would solve the use case for the many datasets whose validation processing time is very short. However, I agree with @bdferris-v2 that larger feeds may be a problem. So implementing the synchronous service while planning the asynchronous one would be the best approach.

maximearmstrong avatar Jun 16 '22 23:06 maximearmstrong

As far as exposing the validator as a web service goes, it is perfectly acceptable for there to be long-running requests. It isn't uncommon for popular HTTP services to rely on them; for example, every push/pull to/from a Docker container registry happens over a long-running HTTP request. The Kubernetes deployment can be configured to allow this with a single setting that disables timeouts.
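
For example, with the widely used ingress-nginx controller, per-service timeouts can be relaxed via annotations (values here are illustrative; other ingress controllers and load balancers have equivalent knobs):

```yaml
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
```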

The user experience is entirely a UI concern; it's well within the purview of an HTML/JS UI to run the POST in the background and display an indicator that validation is in progress and can take some time. It's even less of a concern for other backend applications/services hitting the service.

For a first pass, it at least gets my team what we need for what we're trying to do.

As far as avoiding pulling web server dependencies into the JAR goes: if we take that route, do we then need to publish two separate Docker container images too? Is the added weight of a bare-bones HTTP server module worth the complexity of a split distribution? Are there any especially lightweight libraries we could use for low-level multipart/form-data request parsing and JSON response encoding that would make this concern negligible?
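
To get a sense of how small the HTTP layer could be, here is a sketch using the JDK's built-in com.sun.net.httpserver package, which needs no external dependencies at all. The class name and port are hypothetical, and the /validate handler just returns a stub JSON body; wiring in the real validator and multipart parsing is the actual work.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class ValidatorHttpSketch {
    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);

        // GET /ready: 200 with an empty body once the server is up.
        server.createContext("/ready", exchange -> {
            exchange.sendResponseHeaders(200, -1); // -1 = no response body
            exchange.close();
        });

        // POST /validate: stub that returns an empty JSON report.
        server.createContext("/validate", exchange -> {
            byte[] body = "{\"notices\":[]}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });

        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        start(8080);
        System.out.println("listening on :8080");
    }
}
```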

themightychris avatar Jun 17 '22 21:06 themightychris

Commenting on the "separate Docker containers" question:

I guess I would ask: is a split distribution really that complex? Storage is cheap enough these days that I don't think we are worried about the storage costs of two separate Docker images. So we are really talking about the complexity of maintaining two separate Docker build configurations. I concede that's not zero. But I balance that against the complexity of having a single Docker image that accepts dramatically different command-line args depending on how you want to use it. In my mind, having separate containers with separate documentation would be a lot clearer from a user perspective. I'm curious how you think about the tradeoffs involved here?

bdferris-v2 avatar Jul 20 '22 17:07 bdferris-v2

@bdferris-v2 a split distribution is definitely not too complex, I was just coming at it initially from the perspective of shaking up the existing codebase and processes as little as possible to target something easy to get merged.

With efforts like #1223 and #1120 moving forward since then, I think we've strengthened our muscles a lot for shaking things up as a community, so it's probably worth worrying less about minimizing churn and more about moving toward a better future state.

Putting aside longer-term goals like a scalable public web validator and a user-friendly frontend, I think the highest-value immediate goal should be making it so someone working on a project leveraging the validator can just add a 2-5 line service to their docker-compose.yaml file that could reference any of:

  • ghcr.io/mobilitydata/gtfs-validator-http:3.1.0
  • ghcr.io/mobilitydata/gtfs-validator-http:3
  • ghcr.io/cal-itp/gtfs-validator-http:stable
  • ghcr.io/themightychris/gtfs-validator-http:latest
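
For illustration, such a docker-compose.yaml service entry might look like this (the image path assumes the split distribution discussed above ships under a gtfs-validator-http name, and the port is hypothetical):

```yaml
services:
  gtfs-validator:
    image: ghcr.io/mobilitydata/gtfs-validator-http:3
    ports:
      - "8080:8080"
```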

Once we have a stable HTTP JSON spec defined with OpenAPI, we can infinitely evolve how we put it up behind an HTTP port, with a lot of room to extend it in verifiably backwards-compatible ways. Dedicating a unique Docker container image path to the HTTP interface now, as suggested with the split distribution, helps a lot there. Things built to that shape will have many paths to productionization and accommodate integration with lots of teams' dominant skillsets.

Frankly, I suspect we'd ultimately want to refactor this repo to look a lot more like https://github.com/MobilityData/gtfs-realtime-validator

It makes sense to me for the unit of each repo to be the specification like it is now, with multiple build products monorepo'd within that. I'd maybe just additionally split out the "official" web frontend to its own top-level subproject with the OpenAPI spec being the interface between it and the HTTP service jar. I don't know how much consensus there is around that though or how much refactoring makes sense to bite off right now and defer to @KClough to discuss that

I can see two immediate paths forward:

  1. Get started ASAP with the minimal set of changes to start enabling other projects to point at a canonical public Docker image and build against an OpenAPI-spec'd HTTP interface (this can be a separate Docker image path). Start discussing potential refactoring in parallel.

  2. Circle up technical stakeholders on a call first (our first regular one?) to see if there's some refactoring maybe everyone already wants to do that would make sense to take a bite out of in this first round of work. Maybe there's no getting around needing to do that before starting any of this.

themightychris avatar Aug 03 '22 18:08 themightychris

In my mind, having separate containers with separate documentation would be a lot clearer from a user perspective.

About the separate Docker images, I totally agree with this. ☝️

@themightychris, your plan makes sense. If @isabelle-dr also agrees, I'd say path/option 1 makes the most sense right now. Specifically, I see a scenario where we are able to deliver value more quickly this way, while we discuss how best to refactor the validator, if necessary.

maximearmstrong avatar Aug 03 '22 20:08 maximearmstrong

The proposal for this project is available here.

isabelle-dr avatar Nov 23 '22 13:11 isabelle-dr