upload instrumentation
At One Concern, in addition to using the sidecar within Argo workflows, we distribute datamon to the desktop with brew.
Frequently, data scientists need to "ingest," as we say, data into the Argo workflows comprising, for instance, the flood simulation pipeline(s), without running a pre-packaged ingestor workflow. Sometimes there's a 500 error, or `bundle upload` or `bundle mount new` fails for one reason or another. This task proposes to begin addressing that pain point, which is already solved in part by the fact that duplicate blobs (2k chunks) aren't uploaded twice (see the sketch below).
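For context, here's a minimal sketch of that dedup behavior, assuming content-addressed chunks; `BlobStore`, `uploadChunk`, and the choice of SHA-256 are illustrative stand-ins, not datamon's actual types or hashing:

```go
package dedup

import (
	"crypto/sha256"
	"encoding/hex"
)

// BlobStore stands in for whatever backs blob storage (e.g. a GCS bucket).
type BlobStore interface {
	Has(key string) (bool, error)
	Put(key string, data []byte) error
}

// uploadChunk writes a chunk only when its content hash isn't already
// stored, so retrying a failed upload mostly skips work already done.
func uploadChunk(store BlobStore, chunk []byte) (key string, skipped bool, err error) {
	sum := sha256.Sum256(chunk)
	key = hex.EncodeToString(sum[:])
	exists, err := store.Has(key)
	if err != nil || exists {
		return key, exists, err
	}
	return key, false, store.Put(key, chunk)
}
```

Because chunk keys are derived from content, a retried `bundle upload` only pays for the chunks that never made it to the cloud.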
Specifically, the idea is to instrument (via Go in the binary, shell script as in the sidecar, or Python, for which bindings exist in #393, unmerged only because of documentation requirements) the paths from desktop to cloud (`bundle upload`, `bundle mount new`, etc.) to provide:
- metrics and usage statistics to improve datamon
- progress indicators, logging, and a smoother experience for data scientists
- any and all additional tracing, timing, and output formatting to ease backpressure on this known issue (a rough sketch of one such wrapper follows)
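As a sketch of what the Go-side instrumentation might look like, an `io.Reader` wrapper can count and log bytes as an upload streams them, keeping progress reporting and timing out-of-band from the upload path itself; `progress.Reader` is purely illustrative and doesn't exist in datamon:

```go
package progress

import (
	"fmt"
	"io"
	"os"
	"time"
)

// Reader wraps the stream feeding an upload, counting bytes and logging
// throughput about once a second.
type Reader struct {
	r       io.Reader
	total   int64 // expected bytes, for the progress line
	read    int64
	start   time.Time
	lastLog time.Time
}

func NewReader(r io.Reader, total int64) *Reader {
	now := time.Now()
	return &Reader{r: r, total: total, start: now, lastLog: now}
}

func (p *Reader) Read(buf []byte) (int, error) {
	n, err := p.r.Read(buf)
	p.read += int64(n)
	if time.Since(p.lastLog) >= time.Second || err == io.EOF {
		secs := time.Since(p.start).Seconds()
		fmt.Fprintf(os.Stderr, "uploaded %d/%d bytes in %.1fs (%.1f MB/s)\n",
			p.read, p.total, secs, float64(p.read)/1e6/secs)
		p.lastLog = time.Now()
	}
	return n, err
}
```

Because the wrapper satisfies plain `io.Reader`, it slots in front of whatever reader the upload path already consumes, which is part of what keeps such a patch orthogonal to the rest of datamon.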
This'd be a great starter issue because it's not cloud-specific (minor changes would allow a fork that syncs your local disk, you being the user/programmer, with arbitrary filesystem-like things), and the proposed patch is mostly out-of-band/orthogonal to the rest of the datamon implementation.
I should also mention that there is an alternate approach to the same essential use case, adding additional data sources from the desktop, in #413. The idea in that proposal, again, is to allow arbitrary first miles into the cluster, then let the web scheduler fully digest the data into datamon, DRY style.