datamon
Datamon bundle list should include labels
Right now datamon bundle list returns something like this:
datamon bundle list --repo flood-postgres
Using config file: /home/developer/.datamon/datamon.yaml
1N9KQjinEjRGtKovxksVG6biS3c , 2019-06-25 18:37:37.934833707 +0000 UTC , First version of flood pg data
1N9jaJkPxbnZyPlz5HL16N3C5xg , 2019-06-25 22:04:28.307925983 +0000 UTC , Greatly prune flood PG data to reduce backup time
1NCMOUOqdxqkrXMCIkqmZXbjIIZ , 2019-06-26 20:23:12.978399405 +0000 UTC , Backup from deployed data
1OCQ2W2lcIoRWQnBvk8r5D20J6B , 2019-07-18 19:41:31.151859934 +0000 UTC , Update flood alert thresholds
1OCf7RUcyfgecxrMUcwuko8Loex , 2019-07-18 21:45:31.68074457 +0000 UTC , Update flood alert thresholds
1OCg1ZgwBgdx0Oz3K1gjdGAqRuq , 2019-07-18 21:52:57.59154874 +0000 UTC , Update flood alert thresholds
I would like to see it return something like this:
datamon bundle list --repo flood-postgres
Using config file: /home/developer/.datamon/datamon.yaml
1N9KQjinEjRGtKovxksVG6biS3c , v1.0.0, 2019-06-25 18:37:37.934833707 +0000 UTC , First version of flood pg data
1N9jaJkPxbnZyPlz5HL16N3C5xg , v1.0.1, 2019-06-25 22:04:28.307925983 +0000 UTC , Greatly prune flood PG data to reduce backup time
1NCMOUOqdxqkrXMCIkqmZXbjIIZ , <no label>, 2019-06-26 20:23:12.978399405 +0000 UTC , Backup from deployed data
1OCQ2W2lcIoRWQnBvk8r5D20J6B , <no label>, 2019-07-18 19:41:31.151859934 +0000 UTC , Update flood alert thresholds
1OCf7RUcyfgecxrMUcwuko8Loex , <no label>, 2019-07-18 21:45:31.68074457 +0000 UTC , Update flood alert thresholds
1OCg1ZgwBgdx0Oz3K1gjdGAqRuq , v1.0.2, 2019-07-18 21:52:57.59154874 +0000 UTC , Update flood alert thresholds
skimmed over some of git to understand what needs to be done here, although i'm erring toward a semi-naive implementation:
list labels will get label data and bundle data, put the labels in a map of slices by bundle id, then the listing of bundles will get the label(s) out of the map.
sounds like a plan, @kerneltime ?
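A minimal sketch of the plan above, in Python rather than datamon's own code, to make the join concrete. The function names are hypothetical, and the label-listing field order (label first, bundle ID second) is an assumption taken from the stopgap script later in this thread, not from datamon's documented output.

```python
from collections import defaultdict


def labels_by_bundle(label_lines):
    """Map bundle ID -> list of labels.

    Assumes each line looks like 'label , bundleID , ...' (hypothetical
    format, mirroring what the stopgap script in this thread parses).
    """
    index = defaultdict(list)
    for line in label_lines:
        label, bundle_id = (field.strip() for field in line.split(",")[:2])
        index[bundle_id].append(label)
    return index


def attach_labels(bundle_lines, label_lines):
    """Return (bundle_id, labels, original_line) for every bundle line.

    Bundles without labels get an empty list, so nothing is dropped.
    """
    index = labels_by_bundle(label_lines)
    rows = []
    for line in bundle_lines:
        # The bundle ID is the first comma-separated field of the listing.
        bundle_id = line.split(",", 1)[0].strip()
        rows.append((bundle_id, index.get(bundle_id, []), line))
    return rows
```

The lookup per bundle is a dictionary access, so the cost is dominated by fetching the two listings, not by the join itself.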
@galvare2 what to do in the case of multiple labels referring to the same bundle?
although we're not using the exact same format, git
commit 72c11afbbbb0ea7ec8edaf8601d203977e5ea7f6 (tag: 0.5, d20190625-storageputparamtype--wip)
Merge: b6ee99e 90cb6e6
Author: ransomw1c <[email protected]>
Date: Mon Jun 24 12:51:11 2019 -0700
suggests parentheses, so
1NCMOUOqdxqkrXMCIkqmZXbjIIZ , <no label>, 2019-06-26 20:23:12.978399405 +0000 UTC , Backup from deployed data
1N9KQjinEjRGtKovxksVG6biS3c , (v1.0.0), 2019-06-25 18:37:37.934833707 +0000 UTC , First version of flood pg data
1OCg1ZgwBgdx0Oz3K1gjdGAqRuq , (v1.0.2; latest), 2019-07-18 21:52:57.59154874 +0000 UTC , Update flood alert thresholds
is one sketch that accounts for zero, one, or two labels. note that the parenthesized list uses a different delimiter than the outer list...
... except we indeed don't want to roll our own serialization format. so perhaps something more like
1NCMOUOqdxqkrXMCIkqmZXbjIIZ , <no label> , 2019-06-26 20:23:12.978399405 +0000 UTC , Backup from deployed data
1N9KQjinEjRGtKovxksVG6biS3c , v1.0.0 , 2019-06-25 18:37:37.934833707 +0000 UTC , First version of flood pg data
1OCg1ZgwBgdx0Oz3K1gjdGAqRuq , v1.0.2;latest , 2019-07-18 21:52:57.59154874 +0000 UTC , Update flood alert thresholds
where we continue to use CSV and have a separate delimiter for labels?
note to self: the above will require validation (and coercion for backward-compatibility) of labels to ensure that the labels themselves don't contain the delim char.
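A rough sketch of what that validation and coercion could look like, written in Python for illustration only. The reserved characters follow the `v1.0.2;latest` format above; the function names and the `_` replacement character are assumptions, not anything datamon defines.

```python
LABEL_DELIM = ";"   # separates labels inside one field
FIELD_DELIM = ","   # separates fields in the listing
NO_LABEL = "<no label>"


def validate_label(label):
    """Reject a new label that would be ambiguous in the listing output."""
    for ch in (LABEL_DELIM, FIELD_DELIM):
        if ch in label:
            raise ValueError(f"label {label!r} contains reserved character {ch!r}")
    return label


def coerce_label(label, replacement="_"):
    """Rewrite an already-stored label so it contains no reserved characters.

    The replacement character is an arbitrary choice for this sketch.
    """
    for ch in (LABEL_DELIM, FIELD_DELIM):
        label = label.replace(ch, replacement)
    return label


def label_field(labels):
    """Serialize zero or more labels into a single listing field."""
    if not labels:
        return NO_LABEL
    return LABEL_DELIM.join(validate_label(label) for label in labels)
```

One caveat with coercion: two distinct stored labels can collapse to the same string (for example `a;b` and `a,b`), so a real implementation would need to decide how to handle collisions.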
The choice to include labels when listing bundles should be optional. There is a performance implication for it. The current model in place allows us to have confidence that a bundle (json) once written is never updated except for the labels. Any performance improvements at scale will need that to change or a significant engineering spend. @galvare2 is it fair to say that when listing bundles you only care about the ones that have a label? An alternative is to only list the bundles that have labels in the format laid out above by you and @ransomw1c. I would like to better understand why you want it this way and if there is a way to meet that need without introducing code that will work slower than the time it takes to list bundles (no "join"). There are some other features I have been thinking of that will allow queries and richer enumerations to work at scale but I do not think that is an urgent need. Let's talk more next week.
The choice to include labels when listing bundles should be optional. There is a performance implication for it.
i agree
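To illustrate the opt-in idea, here is a small Python sketch in which the label listing is only fetched when the caller asks for it. `list_bundles`, `fetch_bundles`, `fetch_labels`, and `include_labels` are all hypothetical names; nothing here corresponds to an existing datamon flag or function.

```python
def list_bundles(fetch_bundles, fetch_labels, include_labels=False):
    """Return (bundle_id, labels) pairs.

    fetch_bundles() returns an iterable of bundle IDs.
    fetch_labels() returns an iterable of (label, bundle_id) pairs and is
    only called when include_labels is True, so the default listing pays
    no extra cost. labels is None when the join was skipped.
    """
    bundles = list(fetch_bundles())
    if not include_labels:
        return [(bundle_id, None) for bundle_id in bundles]

    by_bundle = {}
    for label, bundle_id in fetch_labels():
        by_bundle.setdefault(bundle_id, []).append(label)
    return [(bundle_id, by_bundle.get(bundle_id, [])) for bundle_id in bundles]
```

Returning `None` rather than an empty list when the join is skipped keeps "labels not requested" distinguishable from "bundle has no labels".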
other features I have been thinking of that will allow queries
to describe, iirc: the plan here is to maintain local (and remote?) indices via a db like badger to support more performant join-like (and otherwise) queries on metadata.
i've perused some of git regarding tags, the analog of datamon labels, and it relies on a "reflog," where refs are the internal unifying abstraction for branch tips and tags. haven't fully grokked all implementation details, yet the reflog is distinct from the full-on indexing as far as i can tell so far.
why you want it this way
i arrived at this suggestion as a way to get the functionality implemented immediately without making this issue dependent on the indexing decision-making. moreover, reading git suggests that, in case of not using the reflog, the less performant lookup solution is what git, probably the most significant prior art for datamon as it exists currently, does (mildly foggy on this latter claim).
i agree that there are workarounds. here's a stopgap Zsh script to do the non-performant lookup implementation:
#! /bin/zsh
# Stopgap join of `bundle list` and `label list` output, keyed on bundle ID.
# cd $DATAMON_REPO && make build-datamon-mac
BIN=out/datamon.mac
repo_name=
while getopts r:l:b: opt; do
  case $opt in
    (r)
      repo_name="$OPTARG"
      ;;
    (\?)
      print Bad option, aborting.
      exit 1
      ;;
  esac
done
(( OPTIND > 1 )) && shift $(( OPTIND - 1 ))
if [[ -z "$repo_name" ]]; then
  repo_name='ransom-datamon-test-repo'
fi
# Read the bundle listing; the bundle ID is the first comma-separated field.
typeset -a bundle_list_lines
typeset -a bundleIDs
typeset -A bundleIDsToListLines
$BIN bundle list --repo "$repo_name" 2>&1 | \
  grep -v '^Using config file' | \
  while read -r bundle_list_line; do
    bundle_list_lines=($bundle_list_line $bundle_list_lines)
    bundleID=$(print -r -- "$bundle_list_line" | cut -d',' -f 1 | tr -d ' ')
    bundleIDs=("$bundleID" $bundleIDs)
    bundleIDsToListLines[$bundleID]=$bundle_list_line
  done
# Read the label listing (label first, bundle ID second) and collect the
# labels of each bundle into one ';'-separated string.
typeset -A bundleIDsToLabels
$BIN label list --repo "$repo_name" 2>&1 | \
  grep -v '^Using config file' | \
  while read -r label_list_line; do
    label=$(print -r -- "$label_list_line" | cut -d',' -f 1 | tr -d ' ')
    bundleID=$(print -r -- "$label_list_line" | cut -d',' -f 2 | tr -d ' ')
    if [[ -z ${bundleIDsToLabels[$bundleID]} ]]; then
      bundleIDsToLabels[$bundleID]="$label"
    else
      existingLabelList=${bundleIDsToLabels[$bundleID]}
      bundleIDsToLabels[$bundleID]="$label;$existingLabelList"
    fi
  done
# Print every bundle line with its labels (possibly empty) appended.
for bundleID in $bundleIDs; do
  bundle_list_line=${bundleIDsToListLines[$bundleID]}
  labelList=${bundleIDsToLabels[$bundleID]}
  print -r -- "$bundle_list_line , $labelList"
done
@kerneltime this request is more of a "nice to have" than an actual need, as i can get the information from the other cli commands. I guess the use case was just to be able to see all information about bundles and labels for a repo in one place, including bundles with no labels, for example in order to verify that uploads I ran went through correctly. But this should be considered a low priority because label list does almost everything that is needed for this anyway.