
Add fully async import/reimport

Open · rc-mattschwager opened this issue · 3 comments

Is your feature request related to a problem? Please describe

When calling the import/reimport API endpoint with large (30k+ findings) payloads it takes a long time for the server to respond. Response time is highly dependent on configuration and hardware, but I see processing times of ~10min. The processing time itself is not a problem, but leaving a client connection open for 10min+ is prone to network hiccups and timeouts, so it's difficult to know when processing completes.

Describe the solution you'd like

I've reviewed the existing async import functionality (https://github.com/DefectDojo/django-DefectDojo/pull/5553) and the relevant code. As mentioned in the PR, this is more of a semi-async implementation. I've enabled DD_ASYNC_FINDING_IMPORT and set DD_ASYNC_FINDING_IMPORT_CHUNK_SIZE to large values like 1000. The response still takes quite a long time because, as noted in the initial PR:

a successful import waits for all workers to finish before returning a successful response.

In this feature request I would propose a fully async import/reimport based on client polling. Something along the lines of the following (see the client-side sketch after the list):

  • Fully-async mode is enabled (globally through an environment variable, per-request through an API parameter, etc.)
  • The client sends an async, large import/reimport request to the server
  • The server:
    • Parses the request
    • Returns success to the client if parsing succeeded, else failure
    • Fires off the processing task(s) to the workers
  • The client can poll the processing task until it's complete
    • Polling the test on test.percent_complete seems fine for an initial implementation
    • Some kind of transaction/task ID could be used for polling in the future
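
To make that flow concrete, here is a rough client-side sketch (the `async` flag is hypothetical, the host is a placeholder, and it assumes `percent_complete` is exposed on the tests endpoint, which may not be the case today):

```python
# Hypothetical client flow: submit the import, then poll the test.
import time
import requests

BASE = "https://defectdojo.example.com/api/v2"  # assumed host
HEADERS = {"Authorization": "Token <api-key>"}

resp = requests.post(
    f"{BASE}/import-scan/",
    headers=HEADERS,
    data={"engagement": 1, "scan_type": "Burp Scan", "async": True},  # "async" is hypothetical
    files={"file": open("scan.xml", "rb")},
    timeout=60,
)
resp.raise_for_status()            # parsing succeeded; processing continues on workers
test_id = resp.json()["test"]      # assumes the response includes the test id

while True:
    test = requests.get(f"{BASE}/tests/{test_id}/", headers=HEADERS, timeout=60).json()
    if test.get("percent_complete", 100) >= 100:
        break
    time.sleep(30)                 # poll every 30s until the import completes
```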

Describe alternatives you've considered

Using DD_ASYNC_FINDING_IMPORT and DD_ASYNC_FINDING_IMPORT_CHUNK_SIZE as mentioned above. I've also tried chunking the findings in the client and sending batched requests to the findings endpoint directly. This is an okay solution, but it's not as ergonomic as sending a single payload to the import/reimport endpoint. It is also likely to be slower because the server/worker can perform bulk processing or other DB tricks, whereas the client does not have these options available.
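
For reference, the client-side batching workaround looks roughly like this (a sketch only: the endpoint path and batch semantics are assumed, and the real findings endpoint may accept only one finding per POST rather than a list):

```python
# Hypothetical client-side batching against the findings endpoint.
import requests

BASE = "https://defectdojo.example.com/api/v2"
HEADERS = {"Authorization": "Token <api-key>"}
BATCH_SIZE = 500

findings = parse_report("scan.json")  # hypothetical local parser returning dicts
for i in range(0, len(findings), BATCH_SIZE):
    resp = requests.post(
        f"{BASE}/findings/",
        headers=HEADERS,
        json=findings[i:i + BATCH_SIZE],
        timeout=60,
    )
    resp.raise_for_status()
```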

Additional context

N/a

rc-mattschwager avatar Jul 15 '22 16:07 rc-mattschwager

@rc-mattschwager why are you setting the chunk size so high? That is essentially defeating the purpose of the async functionality.

For example, if you have a report with 800 findings, and you set the chunk size to 1000, all you are doing is moving the processing to a worker node to run the entire thing synchronously.

With the default chunk size of 100, those 800 findings can be processed async across 8 workers. You would certainly see a reduction in the overall import time.
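
For illustration, the arithmetic (not DefectDojo's actual code):

```python
# How chunk size maps a report onto async worker tasks.
findings = list(range(800))   # a report with 800 parsed findings
chunk_size = 100              # the default DD_ASYNC_FINDING_IMPORT_CHUNK_SIZE
chunks = [findings[i:i + chunk_size] for i in range(0, len(findings), chunk_size)]
assert len(chunks) == 8       # eight tasks spread across workers

chunk_size = 1000             # larger than the report itself
chunks = [findings[i:i + chunk_size] for i in range(0, len(findings), chunk_size)]
assert len(chunks) == 1       # one task: effectively synchronous, just on a worker
```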

Maffooch avatar Aug 05 '22 14:08 Maffooch

Ahh, thanks for the tip. I assumed larger chunk sizes were better, not worse. However, we are dealing with reports of 30k+ findings, so a chunk size of 1000 is still sufficiently low to spread the work amongst multiple workers.

But even with that, the issue here is the very long-lived API connection waiting on a success response, not the overall import time. Importing many tens of thousands of findings will surely take some time. Instead of waiting on a complete, successful import before receiving a response, it would be useful to have a fully async mode where the client can enable an async import, immediately get a response that the request was well-formed and accepted, then use a polling API endpoint to monitor the progress of the import.

rc-mattschwager avatar Aug 19 '22 14:08 rc-mattschwager

Wow, that is a huge report! That explains the need for a faster 200 code then. What tool are you using? Surely some sort of infra scanning.

The hesitation with sending the 'OK' right after parsing the scan result file is that the only information we would have at the time of the 200 code is that the parser ingested the report without error. If the report contains information that no longer fits the db model the way it did when the parser was written, it would be very difficult to catch that error. For example, say something in a finding was not stored correctly, but the reference is still saved to a given test. When a user goes to view that finding, they will likely get a 500 and would not be able to fix it without going to the admin page (which is dangerous in itself).
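
One possible guardrail, sketched only (this is not existing DefectDojo code, and the field names are assumptions): run model-level validation on each parsed finding before returning the 200, so "the parser succeeded" also implies "the data fits the db model":

```python
# Sketch: validate parsed Finding instances before dispatching to workers.
from django.core.exceptions import ValidationError

def validate_findings(findings):
    """Return (title, errors) pairs for findings that would not store cleanly."""
    errors = []
    for finding in findings:
        try:
            finding.full_clean(exclude=["test"])  # model validation without saving
        except ValidationError as exc:
            errors.append((finding.title, exc.message_dict))
    return errors
```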

I love the idea of quickly returning the OK code so pipelines don't have to wait to continue, but implementing it with the right guardrails in place would be super tricky.

Maffooch avatar Aug 19 '22 16:08 Maffooch