if feat(src): add support for appending to existing outputs

Types of changes

[x] Enhancement (project structure, spelling, grammar, formatting)
[ ] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
[ ] My code follows the code style of this project.
[x] My change requires a change to the documentation.
[ ] I have updated the documentation accordingly.
[x] I have added tests to cover my changes.
[x] All new and existing tests passed.

A description of the changes proposed in the Pull Request

Closes #845.

A few areas to discuss/agree:

Behaviour for group by / aggregation - I may not fully understand the implications for these types of transformations yet.
Does the way the argument is passed around make sense
Is it ok this living as part of the core compute logic, as opposed to one of the builtins or other extension points?

Aug 04 '24 08:08 jamescrowley

Hi @jamescrowley - just back from vacation today - looking forward to checking this out tomorrow. Thanks for your efforts!

Aug 06 '24 15:08 jmcook1186

Hi @jamescrowley great point about aggregation and time-sync. I think the only behaviour that makes sense is to re-run the aggregation / time-sync over the new data and overwrite whatever is already present.

The "right" way to use append would probably be for the user to toggle time-sync and aggregation on some final manifest file after all the append actions have been taken, but in cases where a user is appending to an already-synced-or-aggregated manifest, the time-sync and/or aggregation should be repeated with the new input data, with the results overriding the original.

re groupby - in the latest IF (main branch, will be in the next release today/tomorrow) it's actually called regroup and executed as part of the IF execution flow rather than as a plugin (it's done automatically if you use if-run with no other commands, but you can isolate just that function using if-run --regroup with the necessary config in the manifest). Each time regroup is executed is flattens all the input data and regroups again from that 1D array, so I think that should just work even with --append...

It's ok having this as an amendment to the core execution logic - I don't think it makes sense as a plugin/builtin or other extension.

Aug 07 '24 15:08 jmcook1186

also tagging in @narekhovhannisyan for visibility

Aug 07 '24 15:08 jmcook1186

👍 I’ll rebase and add some tests to try and capture that behaviour

Aug 09 '24 08:08 jamescrowley

@jmcook1186 I'm working through regroup as there's a few things to suss out there, but in the meantime, some questions re documenting the aggregation behaviour w/ tests:

It appears the only place I can capture the combined compute & aggregate behaviour is in the integration tests, is that right?
append is currently an argument to if-run, which means it is not loaded from the manifest, and thus I'm unable to write a manifest that tests this behaviour for the integration tests.

As I see it the options are: (a) don't add any tests to explicitly capture this (or express them only in terms of stand-alone compute or aggregate behaviour) (b) adding some tests as part of the jest test suite that hooks up the compute & aggregate steps (c) we figure out a way to pass the append argument through the manifest-based integration tests (d) something else?

Thanks!

Aug 17 '24 10:08 jamescrowley

Hi @jamescrowley - looping in @narekhovhannisyan who can give better answers to these questions!

Aug 17 '24 19:08 jmcook1186

@jmcook1186 @narekhovhannisyan also FYI on the regroup, I think the behaviour that makes most sense is to regroup outputs in addition to inputs when in append mode - otherwise we can't sensibly add the new outputs if the existing outputs aren't grouped. I'll push up some commits once I've ironed out the kinks but let me know if you object on principal and think I should go another way!

Aug 19 '24 18:08 jamescrowley

I don't object on principle - do you have an opinion @narekhovhannisyan?

Aug 20 '24 08:08 jmcook1186

I've pushed proposed changes.

Regroup is now fed outputs alongside inputs and merges those, when the append flag is set. I did some refactoring prior to the changes to make those easier.
I've kept the 'append' preservation of inputs independent within both the compute and regroup steps, as this felt clearer (here and here).
I've not copied the outputs with a deep clone like inputs - this didn't seem to be necessary but let me know if I've missed something subtle.
I'm not entirely happy with the tests - in that it's not always clear the specific behaviour we're actually looking for in the expected results (open to suggestions there), but for now just tried to make the test name descriptive.
There are currently no integration tests covering this (see issues previously flagged).

Aug 20 '24 11:08 jamescrowley

Hey @jmcook1186, @jamescrowley. sorry for late reply. Will check the threaд and will get back with the results.

Aug 20 '24 12:08 narekhovhannisyan

Hi @jamescrowley sorry it's taking a while to get this reviewed - we've got some somewhat urgent features to ship to support a demo in a couple of weeks, but I'm keen to get into this asap. I'm planning to check out your branch and start looking into it later today.

Cheers

Aug 28 '24 14:08 jmcook1186

Hi @jamescrowley - i checked out the branch yesterday and am having trouble making the append feature work - could you please provide an example manifest and run command combo that works for you so I can try to replicate locally?

Sep 03 '24 08:09 jmcook1186

@jmcook1186 I was just running the second example in the original ticket (with original inputs, and modified mock observations for a later time), with the compute/group updated for the latest syntax

name: append
description: >-
  a complete pipeline that starts with mocked CPU utilization data and outputs
  operational carbon in gCO2eq
initialize:
  plugins:
    mock-observations:
      path: builtin
      method: MockObservations
      global-config:
        timestamp-from: '2024-03-05T00:00:04.000Z'
        timestamp-to: '2024-03-05T00:00:07.000Z'
        duration: 1
        components:
          - name: server-1
            cloud/instance-type: Standard_E64_v3
            cloud/region: westus3
        generators:
          common:
            cloud/vendor: azure
          randint:
            cpu/energy:
              min: 1
              max: 99
            mem/energy:
              min: 1
              max: 99
    sum:
      path: builtin
      method: Sum
      global-config:
        input-parameters:
          - cpu/energy
          - mem/energy
        output-parameter: energy
execution:
  command: >-
    /home/user/.npm/_npx/1bf7c3c15bf47d04/node_modules/.bin/ts-node
    /home/user/Code/if/src/index.ts -m
    manifests/examples/mock-cpu-util-to-carbon.yml -s
  environment:
    if-version: 0.4.0
    os: linux
    os-version: 5.15.0-107-generic
    node-version: 21.4.0
    date-time: 2024-06-18T14:18:44.864Z (UTC)
    dependencies:
      - '@babel/[email protected]'
      - '@babel/[email protected]'
      - '@commitlint/[email protected]'
      - '@commitlint/[email protected]'
      - '@grnsft/[email protected]'
      - '@jest/[email protected]'
      - '@types/[email protected]'
      - '@types/[email protected]'
      - '@types/[email protected]'
      - '@types/[email protected]'
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
  status: success
tree:
  pipeline:
    compute:
      - mock-observations
      - sum
    regroup:
      - cloud/region
      - name
  defaults: null
  inputs:
    - timestamp: '2024-03-05T00:00:00.000Z'
      duration: 1
      name: server-1
      cloud/instance-type: Standard_E64_v3
      cloud/region: westus3
      cloud/vendor: azure
      cpu/energy: 5
      mem/energy: 10
    - timestamp: '2024-03-05T00:00:01.000Z'
      duration: 1
      name: server-1
      cloud/instance-type: Standard_E64_v3
      cloud/region: westus3
      cloud/vendor: azure
      cpu/energy: 71
      mem/energy: 5
    - timestamp: '2024-03-05T00:00:02.000Z'
      duration: 1
      name: server-1
      cloud/instance-type: Standard_E64_v3
      cloud/region: westus3
      cloud/vendor: azure
      cpu/energy: 36
      mem/energy: 74
  outputs:
    - timestamp: '2024-03-05T00:00:00.000Z'
      duration: 1
      name: server-1
      cloud/instance-type: Standard_E64_v3
      cloud/region: westus3
      cloud/vendor: azure
      cpu/energy: 5
      mem/energy: 10
      energy: 15
    - timestamp: '2024-03-05T00:00:01.000Z'
      duration: 1
      name: server-1
      cloud/instance-type: Standard_E64_v3
      cloud/region: westus3
      cloud/vendor: azure
      cpu/energy: 71
      mem/energy: 5
      energy: 76
    - timestamp: '2024-03-05T00:00:02.000Z'
      duration: 1
      name: server-1
      cloud/instance-type: Standard_E64_v3
      cloud/region: westus3
      cloud/vendor: azure
      cpu/energy: 36
      mem/energy: 74
      energy: 110

and running

npm run if-run -- -m manifests/outputs/features/append.yaml -o manifests/outputs/features/re-append --append

you then get output:

name: append
description: >-
  a complete pipeline that starts with mocked CPU utilization data and outputs
  operational carbon in gCO2eq
initialize:
  plugins:
    mock-observations:
      path: builtin
      method: MockObservations
      global-config:
        timestamp-from: '2024-03-05T00:00:04.000Z'
        timestamp-to: '2024-03-05T00:00:07.000Z'
        duration: 1
        components:
          - name: server-1
            cloud/instance-type: Standard_E64_v3
            cloud/region: westus3
        generators:
          common:
            cloud/vendor: azure
          randint:
            cpu/energy:
              min: 1
              max: 99
            mem/energy:
              min: 1
              max: 99
    sum:
      path: builtin
      method: Sum
      global-config:
        input-parameters:
          - cpu/energy
          - mem/energy
        output-parameter: energy
execution:
  command: >-
    /Users/jcrowley/.npm/_npx/1bf7c3c15bf47d04/node_modules/.bin/ts-node
    /Users/jcrowley/Development/gsf/if/src/if-run/index.ts -m
    manifests/outputs/features/append.yaml -o
    manifests/outputs/features/re-append --append
  environment:
    if-version: 0.6.0
    os: macOS
    os-version: 14.6.1
    node-version: 20.16.0
    date-time: 2024-09-04T01:05:58.758Z (UTC)
    dependencies:
      - '@babel/[email protected]'
      - '@babel/[email protected]'
      - '@commitlint/[email protected]'
      - '@commitlint/[email protected]'
      - '@grnsft/[email protected]'
      - '@jest/[email protected]'
      - '@types/[email protected]'
      - '@types/[email protected]'
      - '@types/[email protected]'
      - '@types/[email protected]'
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
      - [email protected]
  status: success
tree:
  pipeline:
    compute:
      - mock-observations
      - sum
    regroup:
      - cloud/region
      - name
  defaults: null
  children:
    westus3:
      children:
        server-1:
          inputs:
            - timestamp: '2024-03-05T00:00:00.000Z'
              duration: 1
              name: server-1
              cloud/instance-type: Standard_E64_v3
              cloud/region: westus3
              cloud/vendor: azure
              cpu/energy: 5
              mem/energy: 10
            - timestamp: '2024-03-05T00:00:01.000Z'
              duration: 1
              name: server-1
              cloud/instance-type: Standard_E64_v3
              cloud/region: westus3
              cloud/vendor: azure
              cpu/energy: 71
              mem/energy: 5
            - timestamp: '2024-03-05T00:00:02.000Z'
              duration: 1
              name: server-1
              cloud/instance-type: Standard_E64_v3
              cloud/region: westus3
              cloud/vendor: azure
              cpu/energy: 36
              mem/energy: 74
          outputs:
            - timestamp: '2024-03-05T00:00:00.000Z'
              duration: 1
              name: server-1
              cloud/instance-type: Standard_E64_v3
              cloud/region: westus3
              cloud/vendor: azure
              cpu/energy: 5
              mem/energy: 10
              energy: 15
            - timestamp: '2024-03-05T00:00:01.000Z'
              duration: 1
              name: server-1
              cloud/instance-type: Standard_E64_v3
              cloud/region: westus3
              cloud/vendor: azure
              cpu/energy: 71
              mem/energy: 5
              energy: 76
            - timestamp: '2024-03-05T00:00:02.000Z'
              duration: 1
              name: server-1
              cloud/instance-type: Standard_E64_v3
              cloud/region: westus3
              cloud/vendor: azure
              cpu/energy: 36
              mem/energy: 74
              energy: 110
            - timestamp: '2024-03-05T00:00:04.000Z'
              duration: 1
              name: server-1
              cloud/instance-type: Standard_E64_v3
              cloud/region: westus3
              cloud/vendor: azure
              cpu/energy: 2
              mem/energy: 26
              energy: 28
            - timestamp: '2024-03-05T00:00:05.000Z'
              duration: 1
              name: server-1
              cloud/instance-type: Standard_E64_v3
              cloud/region: westus3
              cloud/vendor: azure
              cpu/energy: 67
              mem/energy: 27
              energy: 94
            - timestamp: '2024-03-05T00:00:06.000Z'
              duration: 1
              name: server-1
              cloud/instance-type: Standard_E64_v3
              cloud/region: westus3
              cloud/vendor: azure
              cpu/energy: 88
              mem/energy: 6
              energy: 94

which includes the additional outputs. Ideally would include this as an integration test if we can codify how we want to pass 'append' into the tests.

Let me know if you get the above, or if I've misunderstood somewhere along the way.

Sep 04 '24 01:09 jamescrowley

got it - thanks, yes I have consistent results to you now. Will dig in and try to understand your changes a bit more today - all seems to behave as expected so far. Then will pass to @manushak to approve.

Sep 04 '24 09:09 jmcook1186

Hi @jamescrowley, I've reviewed the PR, and it seems the append works only once. I've recorded a short video. In the video, it is shown that the last output data from 04-07 is missing

Sep 09 '24 11:09 manushak

@manushak Thanks for taking a look. The current implementation appends to the outputs that already exist in the input manifest, but it does not append to any outputs that exist in a separate output manifest if supplied - that is always overwritten.

So in the example given, if you then supplied append-output.yaml as the input manifest for the second run (having modified the mock parameters) then it would behave as I believe you desired.

If the intent is to append to both outputs on the input manifest and outputs to an existing output manifest (if it exists) I’d need to do some more digging - and probably clarify with you all on other scenarios needing support here.

Hope that makes sense!

Sep 09 '24 11:09 jamescrowley

Yeah, you're right, @jamescrowley. I'm sorry for the misunderstanding. It works as expected.

Sep 09 '24 11:09 manushak

if if copied to clipboard

feat(src): add support for appending to existing outputs

Types of changes

A description of the changes proposed in the Pull Request

if
if copied to clipboard