datamon bundle list should include labels

Right now, datamon bundle list returns something like this:

datamon bundle list --repo flood-postgres
Using config file: /home/developer/.datamon/datamon.yaml
1N9KQjinEjRGtKovxksVG6biS3c , 2019-06-25 18:37:37.934833707 +0000 UTC , First version of flood pg data
1N9jaJkPxbnZyPlz5HL16N3C5xg , 2019-06-25 22:04:28.307925983 +0000 UTC , Greatly prune flood PG data to reduce backup time
1NCMOUOqdxqkrXMCIkqmZXbjIIZ , 2019-06-26 20:23:12.978399405 +0000 UTC , Backup from deployed data
1OCQ2W2lcIoRWQnBvk8r5D20J6B , 2019-07-18 19:41:31.151859934 +0000 UTC , Update flood alert thresholds
1OCf7RUcyfgecxrMUcwuko8Loex , 2019-07-18 21:45:31.68074457 +0000 UTC , Update flood alert thresholds
1OCg1ZgwBgdx0Oz3K1gjdGAqRuq , 2019-07-18 21:52:57.59154874 +0000 UTC , Update flood alert thresholds

I would like to see it return something like this:

datamon bundle list --repo flood-postgres
Using config file: /home/developer/.datamon/datamon.yaml
1N9KQjinEjRGtKovxksVG6biS3c , v1.0.0, 2019-06-25 18:37:37.934833707 +0000 UTC , First version of flood pg data
1N9jaJkPxbnZyPlz5HL16N3C5xg , v1.0.1, 2019-06-25 22:04:28.307925983 +0000 UTC , Greatly prune flood PG data to reduce backup time
1NCMOUOqdxqkrXMCIkqmZXbjIIZ , <no label>, 2019-06-26 20:23:12.978399405 +0000 UTC , Backup from deployed data
1OCQ2W2lcIoRWQnBvk8r5D20J6B , <no label>, 2019-07-18 19:41:31.151859934 +0000 UTC , Update flood alert thresholds
1OCf7RUcyfgecxrMUcwuko8Loex , <no label>, 2019-07-18 21:45:31.68074457 +0000 UTC , Update flood alert thresholds
1OCg1ZgwBgdx0Oz3K1gjdGAqRuq , v1.0.2, 2019-07-18 21:52:57.59154874 +0000 UTC , Update flood alert thresholds

Jul 18 '19 22:07 galvare2

skimmed over some of git to understand what needs to be done here, although i'm erring toward a semi-naive implementation:

the bundle listing will fetch both label data and bundle data, put the labels in a map of slices keyed by bundle id, then look up each bundle's label(s) in that map while printing.

sounds like a plan, @kerneltime ?

Jul 19 '19 18:07 ransomw1c

@galvare2 what to do in the case of multiple labels referring to the same bundle?

although we're not using the exact same format, git

commit 72c11afbbbb0ea7ec8edaf8601d203977e5ea7f6 (tag: 0.5, d20190625-storageputparamtype--wip)
Merge: b6ee99e 90cb6e6
Author: ransomw1c <[email protected]>
Date:   Mon Jun 24 12:51:11 2019 -0700

suggests parentheses, so

1NCMOUOqdxqkrXMCIkqmZXbjIIZ , <no label>, 2019-06-26 20:23:12.978399405 +0000 UTC , Backup from deployed data
1N9KQjinEjRGtKovxksVG6biS3c , (v1.0.0), 2019-06-25 18:37:37.934833707 +0000 UTC , First version of flood pg data
1OCg1ZgwBgdx0Oz3K1gjdGAqRuq , (v1.0.2; latest), 2019-07-18 21:52:57.59154874 +0000 UTC , Update flood alert thresholds

is one sketch that accounts for zero, one, or two labels. note that the parenthesized list uses a different delimiter than the outer list...

... except we indeed don't want to roll our own serialization format. so perhaps something more like

1NCMOUOqdxqkrXMCIkqmZXbjIIZ , <no label> , 2019-06-26 20:23:12.978399405 +0000 UTC , Backup from deployed data
1N9KQjinEjRGtKovxksVG6biS3c , v1.0.0 , 2019-06-25 18:37:37.934833707 +0000 UTC , First version of flood pg data
1OCg1ZgwBgdx0Oz3K1gjdGAqRuq , v1.0.2;latest , 2019-07-18 21:52:57.59154874 +0000 UTC , Update flood alert thresholds

where we continue to use CSV and have a separate delimiter for labels?
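
to make the two-level delimiting concrete, here is a hedged sketch of how a consumer might read the label column back out of one such line. `labels_of_line` is a hypothetical helper, and it assumes the label column is always the second comma-separated field.

```shell
#!/bin/sh
# Sketch only: labels_of_line is a hypothetical helper, not part of datamon.
# Prints the labels found in one listing line, one label per output line.
# Prints nothing when the label column holds the "<no label>" placeholder.
labels_of_line() {
    printf '%s\n' "$1" | awk '
        function trim(s) { gsub(/^ +| +$/, "", s); return s }
        {
            # outer list: comma-separated, label column is field 2
            n = split($0, f, ",")
            if (n < 2) next
            col = trim(f[2])
            if (col == "<no label>") next
            # inner list: labels separated by ";"
            m = split(col, l, ";")
            for (k = 1; k <= m; k++) print trim(l[k])
        }'
}
```

for example, `labels_of_line '1OCg1ZgwBgdx0Oz3K1gjdGAqRuq , v1.0.2;latest , 2019-07-18 21:52:57.59154874 +0000 UTC , Update flood alert thresholds'` would print `v1.0.2` and `latest` on separate lines.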

note to self: the above will require validation (and coercion for backward-compatibility) of labels to ensure that the labels themselves don't contain the delim char.
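
a minimal sketch of what that validation and coercion could look like, assuming `;` is the label delimiter and `,` the outer one. both function names are hypothetical, and replacing delimiters with `_` is an arbitrary choice made only for illustration.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical, not datamon functions.

# Succeeds only for a non-empty label that contains neither the label
# delimiter ";" nor the CSV delimiter ",".
label_is_valid() {
    case "$1" in
        '' | *';'* | *','*) return 1 ;;
        *) return 0 ;;
    esac
}

# Coerces an existing label by replacing each delimiter character with "_".
coerce_label() {
    printf '%s\n' "$1" | tr ';,' '__'
}
```

one caveat with coercing this way: two distinct old labels such as `a;b` and `a_b` would collapse to the same name, so a real migration would need to detect collisions.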

Jul 19 '19 18:07 ransomw1c

The choice to include labels when listing bundles should be optional. There is a performance implication for it. The current model in place allows us to have confidence that a bundle (json) once written is never updated except for the labels. Any performance improvements at scale will need that to change or a significant engineering spend. @galvare2 is it fair to say that when listing bundles you only care about the ones that have a label? An alternative is to only list the bundles that have labels in the format laid out above by you and @ransomw1c. I would like to better understand why you want it this way and if there is a way to meet that need without introducing code that will work slower than the time it takes to list bundles (no "join"). There are some other features I have been thinking of that will allow queries and richer enumerations to work at scale but I do not think that is an urgent need. Let's talk more next week.

Jul 20 '19 11:07 kerneltime

> The choice to include labels when listing bundles should be optional. There is a performance implication for it.

i agree

> other features I have been thinking of that will allow queries

to describe, iirc: the plan here is to maintain local (and remote?) indices via a db like badger to support more performant join-like (and otherwise) queries on metadata.

i've perused some of git regarding tags, the analog of datamon labels, and it relies on a "reflog," where refs are the internal unifying abstraction for branch tips and tags. i haven't fully grokked all the implementation details, yet the reflog is distinct from full-on indexing as far as i can tell so far.

> why you want it this way

i arrived at this suggestion as a way to get the functionality implemented immediately without making this issue dependent on the indexing decision-making. moreover, reading git suggests that, when the reflog is not used, the less performant lookup is what git does (mildly foggy on this latter claim), and git is probably the most significant prior art for datamon as it exists currently.

i agree that there are workarounds. here's a stopgap Zsh script that does the non-performant lookup:

#! /bin/zsh

# cd $DATAMON_REPO && make build-datamon-mac

BIN=out/datamon.mac


repo_name=

while getopts r:l:b: opt; do
    case $opt in
        (r)
            repo_name="$OPTARG"
            ;;
        (\?)
            print Bad option, aborting.
            exit 1
            ;;
    esac
done
(( OPTIND > 1 )) && shift $(( OPTIND - 1 ))

if [[ -z $repo_name ]]; then
    repo_name='ransom-datamon-test-repo'
fi

typeset -a bundleIDs
typeset -A bundleIDsToListLines
# zsh runs the last stage of a pipeline in the current shell, so the arrays
# filled inside the while loop are still set after the pipeline ends.
$BIN bundle list --repo $repo_name 2>&1 | \
    grep -v '^Using config file' | \
    while read -r bundle_list_line; do
        # the bundle ID is the first comma-separated field
        bundleID=$(print -r -- "$bundle_list_line" | cut -d',' -f 1 | tr -d ' ')
        # prepending reverses the order in which bundle list printed the bundles
        bundleIDs=("$bundleID" $bundleIDs)
        bundleIDsToListLines[$bundleID]=$bundle_list_line
    done

typeset -A bundleIDsToLabels
$BIN label list --repo $repo_name 2>&1 | \
    grep -v '^Using config file' | \
    while read -r label_list_line; do
        # this assumes label list prints the label first and the bundle ID second
        label=$(print -r -- "$label_list_line" | cut -d',' -f 1 | tr -d ' ')
        bundleID=$(print -r -- "$label_list_line" | cut -d',' -f 2 | tr -d ' ')
        if [[ -z ${bundleIDsToLabels[$bundleID]} ]]; then
            bundleIDsToLabels[$bundleID]="$label"
        else
            # several labels on one bundle are joined with ';'
            bundleIDsToLabels[$bundleID]="$label;${bundleIDsToLabels[$bundleID]}"
        fi
    done

# append the (possibly empty) label list to each bundle line
for bundleID in $bundleIDs; do
    bundle_list_line=${bundleIDsToListLines[$bundleID]}
    labelList=${bundleIDsToLabels[$bundleID]}
    print -r -- "$bundle_list_line , $labelList"
done

Jul 22 '19 18:07 ransomw1c

@kerneltime this request is more of a "nice to have" than an actual need, as i can get the information from the other cli commands. I guess the use case was just to be able to see all information about bundles and labels for a repo in one place, including bundles with no labels, for example in order to verify that uploads I ran went through correctly. But this should be considered a low priority because label list does almost everything that is needed for this anyway.

Jul 24 '19 19:07 galvare2