
Form 526 Document Upload - Lighthouse status polling system


Summary

Implements a system for periodically polling the Lighthouse uploads/status endpoint to check on the status of any supplemental documents that were submitted to Lighthouse as part of a Form 526 Submission.

The purpose of this system is to surface successes and failures that happen in the Form 526 process after we hand them off to Lighthouse.

NOTE: THIS PR IS MEANT AS AN EARLY-STAGE CODE REVIEW; THERE ARE STILL SOME REMAINING ITEMS:

Remaining items:

  • [x] Full e2e testing and VCR cassette recording using documents in Lighthouse's QA environment; I still need to get the right combination of environment/credentials working for this.
  • [ ] Error handling/logging for non-200 responses from Lighthouse
  • [ ] Lighthouse has added a 'requestIdsNotFound' array to the metadata in case we request a document that doesn't exist/was deleted. Unsure yet how we want to handle that situation but worth discussing.
  • [ ] Adding the cron job to the job scheduler
  • [ ] Adding new file paths to the CODEOWNERS file
  • [ ] Fix db/schema diff

Overview

Following the Lighthouse Document Upload migration (paused pending this polling implementation), we will begin uploading supplemental forms and documents for Form 526 submissions to the new Lighthouse /uploads POST endpoint.

That endpoint returns a requestId which corresponds to the document in Lighthouse.

At the request of OCTO, Lighthouse has provided us with an additional endpoint, POST /uploads/status, which we can use to check on the progress of these documents after we upload them. After Lighthouse receives a document, it does the following:

  1. Forward the document on to VBMS
  2. If that succeeds, send the document to BGS as well.

The uploads/status endpoint takes an array of requestIds and returns the current status of all of those documents as they move through the Lighthouse > VBMS and Lighthouse > BGS steps. The status response looks like this:

{
  "data": {
    "statuses": [
      {
        "requestId": 600000001,
        "time": {
          "startTime": 1502199000,
          "endTime": 1502199000
        },
        "status": "IN_PROGRESS",
        "steps": [
          {
            "name": "BENEFITS_GATEWAY_SERVICE",
            "nextStepName": "BENEFITS_GATEWAY_SERVICE",
            "description": "string",
            "status": "NOT_STARTED"
          }
        ],
        "error": {
          "detail": "string",
          "step": "BENEFITS_GATEWAY_SERVICE"
        }
      }
    ]
  }
}

This polling architecture is designed to query that endpoint, track the successes and failures of the documents we upload, and save their state in our database for visibility. It is also designed to drive DataDog dashboards and alerts that reflect this information.

(both endpoints are documented here)

Design

I've split this polling behavior across a handful of components, each of which I will summarize here:

Lighthouse526DocumentUpload

  • Database model for representing a document we've uploaded to Lighthouse
  • Includes a number of validations/associations I thought made sense for this class but that may be a little defensive/overengineered for V1.
  • For example, all Veteran Upload document types must have an associated FormAttachment record, since this is where we store the S3 pointer for this document and we may need to retrieve it if it has issues.
  • This is where we store the requestId Lighthouse returns, as lighthouse_document_request_id
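
Roughly, this model could look like the following (a sketch only; the associations, column names, and document-type check are assumptions rather than the exact code in this PR):

# Illustrative sketch only; actual columns, associations, and state handling may differ
class Lighthouse526DocumentUpload < ApplicationRecord
  VETERAN_UPLOAD_DOCUMENT_TYPE = 'Veteran Upload'

  belongs_to :form526_submission
  belongs_to :form_attachment, optional: true

  validates :lighthouse_document_request_id, presence: true
  validates :document_type, presence: true
  # Veteran Upload documents must reference the FormAttachment holding their S3 pointer
  validates :form_attachment, presence: true, if: -> { document_type == VETERAN_UPLOAD_DOCUMENT_TYPE }
end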

BenefitsDocuments::Form526::UploadStatusUpdater

  • This is the primary service class for parsing Lighthouse metadata from the status endpoint and updating the status of Lighthouse526DocumentUpload records in our database.
  • Because the metadata is relatively intricate/nested, I wanted a single class that handles parsing it and reconciling updates with our records, to hide the nitty-gritty details from the wider polling flow.
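
A minimal sketch of that parsing/reconciling, assuming it receives one entry from the data.statuses array shown above; the terminal status values ('SUCCESS'/'FAILED') and column names are assumptions that should be confirmed against the Lighthouse docs:

# Sketch only; status values, column names, and error handling are assumptions
module BenefitsDocuments
  module Form526
    class UploadStatusUpdater
      def initialize(lighthouse_status, document_upload)
        @lighthouse_status = lighthouse_status # one entry from data.statuses
        @document_upload = document_upload     # a Lighthouse526DocumentUpload record
      end

      def update_status
        # Skip logging/saving if nothing has changed since the last poll
        return if @lighthouse_status == @document_upload.last_status_response

        Rails.logger.info("Lighthouse 526 document upload status: #{@lighthouse_status.to_json}")
        @document_upload.update!(last_status_response: @lighthouse_status)

        case @lighthouse_status['status']
        when 'SUCCESS'
          @document_upload.update!(status: 'completed', lighthouse_processing_ended_at: processing_ended_at)
        when 'FAILED'
          @document_upload.update!(
            status: 'failed',
            error_message: @lighthouse_status.dig('error', 'detail'),
            lighthouse_processing_ended_at: processing_ended_at
          )
        end
      end

      private

      def processing_ended_at
        # The example response uses epoch seconds for time.endTime
        end_time = @lighthouse_status.dig('time', 'endTime')
        Time.zone.at(end_time) if end_time
      end
    end
  end
end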

BenefitsDocuments::Form526::UpdateDocumentsStatusService

  • This is the service class that orchestrates the polling behavior
  • This interacts at a high level with BenefitsDocuments::Form526::UploadStatusUpdater to log status updates on documents to DataDog via StatsD
  • This data will be used to drive dashboards on DataDog.
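
At a high level, the orchestration might look like this (the class interface is illustrative; the StatsD increments are sketched separately in the Behavior Summary below):

# Orchestration sketch; the interface and matching logic are illustrative
module BenefitsDocuments
  module Form526
    class UpdateDocumentsStatusService
      # document_uploads: a collection of Lighthouse526DocumentUpload records
      # lighthouse_response_body: the parsed JSON hash from POST /uploads/status
      def self.call(document_uploads, lighthouse_response_body)
        statuses = lighthouse_response_body.dig('data', 'statuses') || []

        statuses.each do |status|
          # Assumes request IDs are stored with the same type Lighthouse returns
          document_upload = document_uploads.find do |upload|
            upload.lighthouse_document_request_id == status['requestId']
          end
          next unless document_upload

          UploadStatusUpdater.new(status, document_upload).update_status
        end
      end
    end
  end
end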

Lighthouse::Form526DocumentUploadPollingJob

  • The actual cron job that runs
  • Includes our obligatory retry exhaustion handling
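
A structural sketch of the job, mostly to show the Sidekiq wiring and the retries-exhausted hook; the retry count, metric key, and delegated method name are illustrative:

# Structural sketch; retry count, metric key, and the delegated call are assumptions
module Lighthouse
  class Form526DocumentUploadPollingJob
    include Sidekiq::Worker

    sidekiq_options retry: 5

    sidekiq_retries_exhausted do |msg, exception|
      error_message = exception&.message || msg['error_message']
      Rails.logger.error("Lighthouse::Form526DocumentUploadPollingJob retries exhausted: #{error_message}")
      StatsD.increment('worker.lighthouse.form526_document_upload_polling.exhausted')
    end

    def perform
      # Document selection, the Lighthouse request, and record updates are sketched
      # in the Behavior Summary below; delegated here via a hypothetical method name
      BenefitsDocuments::Form526::UpdateDocumentsStatusService.call_for_pending_documents
    end
  end
end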

BenefitsDocuments::Form526::DocumentsStatusPollingService

  • Configuration for the status endpoint polling, modeled on the mobile team's approach
  • Note this includes changes to BenefitsDocuments::Configuration, which is the mobile team's client wrapper for the Lighthouse Benefits Documents API (so we don't have to reinvent the wheel with their complex auth flow)
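
A very rough sketch of issuing the status request through the shared configuration; the relative path, payload shape, and connection helper are assumptions about both the Lighthouse API and BenefitsDocuments::Configuration, not confirmed details:

# Rough sketch; endpoint path, payload shape, and configuration interface are assumptions
module BenefitsDocuments
  module Form526
    class DocumentsStatusPollingService
      STATUS_ENDPOINT = 'uploads/status' # assumed relative path

      def self.call(lighthouse_document_request_ids)
        config = BenefitsDocuments::Configuration.instance
        # Assumes the shared configuration exposes a Faraday connection that parses JSON responses
        response = config.connection.post(STATUS_ENDPOINT) do |request|
          request.body = { requestIds: lighthouse_document_request_ids }.to_json
        end
        response.body
      end
    end
  end
end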

Behavior Summary

  1. When we upload a document, create a Lighthouse526DocumentUpload record and save the lighthouse_document_request_id they give us.
  2. Mark this document's state as pending
  3. Every day, run the Lighthouse::Form526DocumentUploadPollingJob. This will poll all pending documents that haven't been polled in the past 24 hours
  4. For each document Lighthouse gives us a status update on:

If the status has changed since the last time we polled this document:

  • Log the status metadata from Lighthouse to the Rails logger
  • Save the same data in the last_status_response field on the Lighthouse526DocumentUpload

If Lighthouse tells us the Document is complete (meaning it was sent to both VBMS and BGS):

  • Change the Lighthouse526DocumentUpload state to completed
  • Save the time Lighthouse told us processing ended
  • Increment a StatsD complete metric with the type of document (example for BDD Instructions: api.form_526.lighthouse_document_upload_processing_status.bdd_instructions.complete)

If Lighthouse tells us the document failed somewhere:

  • Change the Lighthouse526DocumentUpload state to failed
  • Save the time Lighthouse told us processing ended
  • Save the error message to the Lighthouse526DocumentUpload record
  • Increment a failed StatsD metric with the type of document AND the ["error"]["step"] in the Lighthouse status (see above), e.g. api.form_526.lighthouse_document_upload_processing_status.bdd_instructions.failed.claims_evidence (a sketch of the key construction follows this list)
  • Since Lighthouse could always change the step names, we have made the end of that metric key dynamic
  • We will likely create a failure alert for this in DataDog
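
To make the dynamic key construction concrete, a small sketch of how the failure metric might be incremented (the prefix matches the examples above; the document-type/step normalization is an assumption):

# Sketch of the failure metric increment; normalization details are assumptions
STATSD_DOCUMENT_PREFIX = 'api.form_526.lighthouse_document_upload_processing_status'

def increment_failure_metric(document_type, error_step)
  # e.g. 'BDD Instructions' -> 'bdd_instructions', 'CLAIMS_EVIDENCE' -> 'claims_evidence'
  document_key = document_type.downcase.tr(' ', '_')
  step_key = error_step.downcase
  StatsD.increment("#{STATSD_DOCUMENT_PREFIX}.#{document_key}.failed.#{step_key}")
end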

Regardless of whether the document completed, failed or is still in progress:

  • Update the document's status_last_polled_at timestamp to reflect the fact we just polled this.

  • This will ensure whenever the job runs it only polls documents that haven't been polled in the last 24 hours.
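
Pulled together, the per-run selection and timestamp behavior described above could look roughly like this (scope, batch size, and the service calls are illustrative; update_all is discussed under Requested Feedback below):

# Sketch of the per-run polling loop; names and batch size are illustrative
pending_documents = Lighthouse526DocumentUpload
                      .where(status: 'pending')
                      .where('status_last_polled_at IS NULL OR status_last_polled_at < ?', 24.hours.ago)

pending_documents.find_in_batches(batch_size: 100) do |batch|
  request_ids = batch.map(&:lighthouse_document_request_id)

  # Hypothetical call into the polling service that wraps POST /uploads/status
  response_body = BenefitsDocuments::Form526::DocumentsStatusPollingService.call(request_ids)
  BenefitsDocuments::Form526::UpdateDocumentsStatusService.call(batch, response_body)

  # update_all skips validations/callbacks; acceptable here since only a timestamp changes
  Lighthouse526DocumentUpload.where(id: batch.map(&:id)).update_all(status_last_polled_at: Time.zone.now)
end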

  • (Is this work behind a feature toggle (flipper)?) No - this code is isolated until we start creating Lighthouse526DocumentUpload records in the Lighthouse migration, which will be behind a feature flag

  • (Summarize the changes that have been made to the platform) See above

  • (If bug, how to reproduce) N/A

  • (What is the solution, why is this the solution?) See above

  • (Which team do you work for, does your team own the maintenance of this component?) I'm on the Benefits and Disabilities team

  • (If introducing a flipper, what is the success criteria being targeted?) N/A

Testing done

Extensive unit tests here; working on end-to-end tests using the QA environment Lighthouse has provided.

Acceptance criteria

  • [ ] I fixed|updated|added unit tests and integration tests for each feature (if applicable).
  • [ ] No error nor warning in the console.
  • [ ] Events are being sent to the appropriate logging solution
  • [ ] Documentation has been updated (link to documentation)
  • [ ] No sensitive information (i.e. PII/credentials/internal URLs/etc.) is captured in logging, hardcoded, or specs
  • [ ] Feature/bug has a monitor built into Datadog or Grafana (if applicable)
  • [ ] If app impacted requires authentication, did you login to a local build and verify all authenticated routes work as expected
  • [ ] I added a screenshot of the developed feature

Requested Feedback

I'd recommend reviewing the design and behavior summary above before looking at the code; there are a lot of unit tests, which should give an indication of all the edge cases I anticipate, so they may help guide your review of the overall structure.

Questions I would have for a reviewer:

  1. Is there anything missing from the "remaining items" list above you can think of?
  2. Do you think the validations and callbacks around Lighthouse526DocumentUpload are a little overengineered for a V1 release?
  3. Something weird happened with my db/schema diff when I migrated recent schema changes - perhaps my Postgres version is different from the one in master? It made several unrelated changes that appear to relate to database configs. I'll try to figure this out on my own unless you have some insight.
  4. Note my use of update_all to bulk update the status_last_polled_at field on a batch of Lighthouse526DocumentUpload records in Form526DocumentUploadPollingJob is a no-no - update_all skips Rails validations, so it's generally frowned upon. As this is just updating a timestamp field with no other ramifications, I'm making this tradeoff for performance gains. Lmk if you disagree.
