performance profiler with visualization
E.g., profiling the run of card `cards/cola.json` (8096 instances in train, 455 in validation, and 1043 in test) in the non-eager (usual) mode, we get the following. Searching (Ctrl-F in the browser) for "profiler_" filters out most lines, leaving the methods of the profiler (and a few more). Note that most of the runtime goes into printing, i.e. into `list(ms[stream_name])`.
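For anyone who wants to reproduce this kind of report locally, the mechanics are essentially the following minimal sketch using Python's standard `cProfile`/`pstats` (the function `profile_and_filter` and the toy workload are illustrative assumptions, not the actual code of this PR):

```python
import cProfile
import pstats

def profile_and_filter(func, pattern="profiler_"):
    """Profile `func` and print only rows whose name matches `pattern`,
    mirroring the Ctrl-F filtering described above."""
    with cProfile.Profile() as prof:
        func()
    stats = pstats.Stats(prof)
    stats.sort_stats(pstats.SortKey.CUMULATIVE)
    stats.print_stats(pattern)  # the pattern is a regex over the printed rows

if __name__ == "__main__":
    # Toy workload; in the real profiler the callable would run a card.
    profile_and_filter(lambda: sum(i * i for i in range(10**6)), pattern="genexpr")
```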
And the same as above, but in eager mode. Most of the time goes to standardization, as expected, and loading now accounts for actually loading all the instances, so it takes longer:
The same for the first part of `examples/evaluate_a_judge_model_capabilities_on_arena_hard.py`, which generates one stream, `test`, of 39990 instances. First, the non-eager mode:
And the above example in eager mode:
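The lazy/eager contrast explains both observations above: in the non-eager (lazy) mode the streams are generators, so the cost is paid only when something consumes them, which is why `list(ms[stream_name])` dominates; in eager mode the instances are materialized up front, so the cost shows up in loading and standardization instead. A toy illustration of the same effect (pure Python, no unitxt; all names are made up):

```python
import time

def expensive(i):
    time.sleep(0.001)  # stand-in for per-instance processing
    return i

# Lazy (non-eager): building the "stream" is instant; consuming it pays the cost.
t0 = time.perf_counter()
stream = (expensive(i) for i in range(100))
print(f"build (lazy):   {time.perf_counter() - t0:.3f}s")

t0 = time.perf_counter()
instances = list(stream)  # analogous to list(ms[stream_name]) above
print(f"consume (lazy): {time.perf_counter() - t0:.3f}s")

# Eager: the same cost moves to construction time.
t0 = time.perf_counter()
stream = [expensive(i) for i in range(100)]
print(f"build (eager):  {time.perf_counter() - t0:.3f}s")
```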
That is very neat, @dafnapension! Can we create a mechanism to sum up the total time of a few cards, excluding the loading? How can we compare times? We need some way to measure the difference between branches.
I tried to suggest solutions to the important issues you raised, @elronbandel.
Yes, @elronbandel, I am close to packing it all into one Python script, no shell script. Coming soon.
All in one Python script now, @elronbandel. I am not sure how to make it a GitHub Action.
I saw that the other actions refer to the branch suggested in the PR as main.
My Python script compares the current branch (which I think of as the new branch suggested in the PR, e.g. `performance_profiler` in this very PR) against branch `main`.
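Conceptually, the comparison step of such a script could look like the sketch below, assuming each branch's run writes its total runtime in seconds as a single number to a text file (the file names, the `compare` function, and the 5% threshold are placeholders, not the actual code of this PR):

```python
def compare(main_file="main_score.txt", pr_file="pr_score.txt", threshold_pct=5.0):
    with open(main_file) as f:
        main_score = float(f.read().strip())
    with open(pr_file) as f:
        pr_score = float(f.read().strip())
    # Positive degradation means the PR branch is slower than main.
    degradation = 100.0 * (pr_score - main_score) / main_score
    print(f"main: {main_score:.2f}s, PR: {pr_score:.2f}s, degradation: {degradation:.2f}%")
    if degradation > threshold_pct:
        raise SystemExit("Performance degradation exceeds threshold!")
```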
Also, which cards would you consider typical and representative of unitxt's users? Those would make up the benchmark, and need to be listed in `cards=[..,..,..]` around line 140.
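For illustration only, such a list might look like this (`cards.cola` is the card profiled above; the other entries are hypothetical placeholders, since a representative set is exactly what is being asked here):

```python
# Around line ~140 of profile/card_profiler.py; entries other than
# cards.cola (profiled above) are hypothetical placeholders:
cards = ["cards.cola", "cards.wnli", "cards.sst2"]
```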
The workflow itself should be something in the spirit of this:
```yaml
name: Test Performance

on:
  pull_request:
    branches:
      - main

jobs:
  run-performance:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout main branch
        uses: actions/checkout@v3
        with:
          ref: main

      - name: Run performance on main branch
        run: |
          python profile/card_profiler.py > main_score.txt

      # The checkout of the PR branch below cleans the workspace, so the
      # main result is stashed as an artifact and restored afterwards.
      - name: Save main performance result
        uses: actions/upload-artifact@v3
        with:
          name: main_score
          path: main_score.txt

      - name: Checkout PR branch
        uses: actions/checkout@v3
        with:
          ref: ${{ github.head_ref }}

      - name: Run performance on PR branch
        run: |
          python profile/card_profiler.py > pr_score.txt

      - name: Download main performance result
        uses: actions/download-artifact@v3
        with:
          name: main_score
          path: .  # download-artifact expects a destination directory

      - name: Compare main and PR performance
        run: |
          echo "Comparing performance between main and PR"
          main_score=$(cat main_score.txt)
          pr_score=$(cat pr_score.txt)

          # Guard against division by zero (bc handles fractional scores)
          if (( $(echo "$main_score == 0" | bc -l) )); then
            echo "Main score is 0, can't calculate degradation."
            exit 1
          fi

          # Percentage degradation: positive when the PR branch is slower than main
          degradation=$(echo "scale=2; 100 * ($pr_score - $main_score) / $main_score" | bc)
          echo "Main score: $main_score"
          echo "PR score: $pr_score"
          echo "Degradation: $degradation%"

          # Fail the job if degradation exceeds 5%
          if (( $(echo "$degradation > 5" | bc -l) )); then
            echo "Performance degradation exceeds 5%!"
            exit 1
          else
            echo "Performance is within acceptable limits."
          fi
```
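One caveat about the workflow sketched above: `python profile/card_profiler.py > main_score.txt` captures everything printed to stdout, so for the `bc` arithmetic in the last step to work, the script has to emit exactly one number there; diagnostics can go to stderr. A hedged sketch of that convention (`run_benchmark` is a placeholder for the actual profiling):

```python
import sys
import time

def run_benchmark() -> float:
    """Placeholder: profile the benchmark cards and return total seconds."""
    start = time.perf_counter()
    ...  # run the cards listed in cards=[...]
    return time.perf_counter() - start

if __name__ == "__main__":
    total = run_benchmark()
    print(f"total runtime: {total:.2f}s", file=sys.stderr)  # human-readable diagnostics
    print(f"{total:.2f}")  # the single stdout line consumed by the workflow
```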