
Refactor storage of blobs into more manageable chunks

Open EnigmaCurry opened this issue 8 years ago • 1 comment

Test artifacts are stored on the frontend in a single table, with each artifact in a single row. This has worked fine for small tests, but it will not scale to larger data sets because each artifact is stored as a single blob.

Current schema:

CREATE TABLE test_artifacts (test_id timeuuid, artifact_type text, description text, artifact blob, PRIMARY KEY (test_id, artifact_type));

Current update path:

UPDATE test_artifacts SET description = ?, artifact = ? WHERE test_id = ? AND artifact_type = ?;

I propose that we split the blobs into 10MB (configurable?) chunks (a rough schema sketch follows the list below). This would fix two foreseeable issues:

  • Cassandra timeouts and commit log segment size issues.
  • Timeouts in the websocket code where it's waiting for Cassandra.
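
To make the proposal concrete, here is a rough sketch of what a chunked layout could look like, written as the CQL strings model.py might define. All table, column, and constant names below are illustrative assumptions, not a final design:

# Sketch of a chunked layout (illustrative names, not a final design).
# Each chunk becomes its own row in a wide partition, so a large artifact
# turns into many small writes instead of one oversized blob.
PROPOSED_CHUNK_TABLE = """
CREATE TABLE test_artifact_chunks (
    test_id timeuuid,
    artifact_type text,
    chunk_index int,
    chunk blob,
    PRIMARY KEY ((test_id, artifact_type), chunk_index)
);
"""

# Per-artifact metadata: the SHA accumulated so far and a flag that is only
# set once every chunk has been stored; the frontend checks the flag before
# displaying the artifact.
PROPOSED_METADATA_TABLE = """
CREATE TABLE test_artifacts_meta (
    test_id timeuuid,
    artifact_type text,
    description text,
    chunk_size int,
    num_chunks int,
    sha text,
    complete boolean,
    PRIMARY KEY (test_id, artifact_type)
);
"""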

Necessary changes:

  • model.py:update_test_artifact - should be changed to accept a portion of the file and a chunk number (see the server-side sketch after this list). The DB table should carry metadata recording the accumulated SHA of the data stored so far, so that the chunks can be verified; this SHA accumulator should be updated as new data is stored. There should also be a field indicating that the file is completely stored in the DB, and the frontend should check it every time it displays the file (the file should never appear unless it is 100% complete).
  • client code - the client currently sends the entire test artifact in one session, and if the upload fails partway through, the next reconnect starts again from the beginning. It should instead send the artifact in chunks (using the same chunk size as the DB) and, if an upload is aborted, resume after the last successfully stored chunk. Before resuming, the client should calculate the SHA of its data up to that chunk and send it to the server so both sides can verify they are in sync (see the client-side sketch below).
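
A minimal sketch of what the model.py side could look like, assuming a new chunk-aware entry point (update_test_artifact_chunk is a hypothetical name, and in-memory dicts stand in for the Cassandra tables):

import hashlib

CHUNK_SIZE = 10 * 1024 * 1024  # 10MB; could come from the frontend config

# In-memory stand-ins for the proposed Cassandra tables, just for the sketch:
_chunks = {}  # (test_id, artifact_type) -> {chunk_index: bytes}
_meta = {}    # (test_id, artifact_type) -> metadata dict

def update_test_artifact_chunk(test_id, artifact_type, chunk_index, chunk,
                               is_last_chunk, description=None):
    """Store one chunk and update the accumulated SHA for this artifact."""
    key = (test_id, artifact_type)
    chunks = _chunks.setdefault(key, {})
    if chunk_index != len(chunks):
        raise ValueError("expected chunk %d, got %d" % (len(chunks), chunk_index))
    chunks[chunk_index] = chunk
    # Accumulated SHA of everything stored so far, recomputed from the stored
    # chunks so a resumed upload can be verified against it:
    sha = hashlib.sha256()
    for i in range(len(chunks)):
        sha.update(chunks[i])
    _meta[key] = {"description": description,
                  "sha": sha.hexdigest(),
                  "num_chunks": len(chunks),
                  "complete": is_last_chunk}

def artifact_is_complete(test_id, artifact_type):
    """The frontend checks this before ever displaying an artifact."""
    return _meta.get((test_id, artifact_type), {}).get("complete", False)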

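And a minimal sketch of the client side under the same assumptions; send_chunk and get_resume_state are hypothetical stand-ins for whatever the websocket protocol ends up providing:

import hashlib
import os

CHUNK_SIZE = 10 * 1024 * 1024  # must match the chunk size used by the DB

def upload_artifact(path, test_id, artifact_type, send_chunk, get_resume_state):
    """Send an artifact in CHUNK_SIZE pieces, resuming a prior upload if possible.

    get_resume_state(test_id, artifact_type) -> (next_chunk_index, server_sha)
    send_chunk(test_id, artifact_type, chunk_index, data, is_last) -> None
    """
    total_size = os.path.getsize(path)
    num_chunks = max(1, (total_size + CHUNK_SIZE - 1) // CHUNK_SIZE)
    next_index, server_sha = get_resume_state(test_id, artifact_type)
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        # Re-hash the portion the server already has and make sure we agree
        # with its accumulated SHA before resuming.
        for _ in range(next_index):
            sha.update(f.read(CHUNK_SIZE))
        if next_index and sha.hexdigest() != server_sha:
            raise RuntimeError("local data does not match server SHA; restart upload")
        # Send the remaining chunks, updating the running SHA as we go.
        for index in range(next_index, num_chunks):
            data = f.read(CHUNK_SIZE)
            sha.update(data)
            send_chunk(test_id, artifact_type, index, data, index == num_chunks - 1)
    return sha.hexdigest()
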
EnigmaCurry · Nov 03 '15 15:11