llama-cpp-python
Workflow update
Add CPU wheels with AVX, AVX2, and AVX512 with OpenBLAS, and remove unnecessary 32-bit wheels (a sketch of the corresponding build matrix follows the list below):
- Without AVX: Ubuntu, Windows => 32 bits, macOS => 64 bits
- AVX: Ubuntu, Windows, macOS => 32/64 bits
- AVX2: Ubuntu, Windows, macOS => 64 bits
- AVX512: Ubuntu, Windows, macOS => 64 bits
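A minimal sketch of how that matrix could be laid out in a GitHub Actions job, assuming the wheels are built by passing llama.cpp's instruction-set options through CMAKE_ARGS; the job name, runner list, and exact flag spellings are illustrative, not the PR's actual workflow (recent llama.cpp versions also renamed the LLAMA_* options to GGML_*):

  build-cpu-wheels:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        isa:
          # each entry selects the instruction-set flags fed to llama.cpp's CMake build
          - { name: avx,    cmake: "-DLLAMA_AVX=ON -DLLAMA_AVX2=OFF -DLLAMA_AVX512=OFF" }
          - { name: avx2,   cmake: "-DLLAMA_AVX=ON -DLLAMA_AVX2=ON -DLLAMA_AVX512=OFF" }
          - { name: avx512, cmake: "-DLLAMA_AVX=ON -DLLAMA_AVX2=ON -DLLAMA_AVX512=ON" }
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Build ${{ matrix.isa.name }} wheel
        env:
          # OpenBLAS flags are likewise illustrative; option names depend on the vendored llama.cpp
          CMAKE_ARGS: "${{ matrix.isa.cmake }} -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
        run: python -m pip wheel . -w dist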
- CUDA compiled with AVX
- Remove Python 3.8
- Remove macos-11 (deprecated)
- Add Python 3.9 when missing
- Upgrade macos-13 to macos-latest in tests
- Upgrade ubuntu-20.04 to ubuntu-latest
- Upgrade windows-2019 to windows-latest
- Refactoring of Metal building
Tests (11 May 2024):
- CPU Build Test: https://github.com/Smartappli/llama-cpp-python/actions/runs/9044907773
- CUDA Build Test: https://github.com/Smartappli/llama-cpp-python/actions/runs/9044908928
- Metal Build Test: https://github.com/Smartappli/llama-cpp-python/actions/runs/9044910394
https://github.com/abetlen/llama-cpp-python/issues/1342#issuecomment-2054099460
I'll paste my comment here, and maybe we can open a new discussion. Basically, I'm concerned about the size of releases ballooning with the number of prebuilt wheel variants. I had some suggestions for long-term solutions there, but I'm not sure what the right approach is.
Anecdotally, @oobabooga claims to have run into issues with GitHub throttling his prebuilt-wheel repo because of this.
If you generate too many wheels, there is a 100% chance you will hit a storage quota, and GitHub will ask you to start paying for storage or else your wheels will fail to upload. It's not too expensive (a few $ a month at most), but it's worth keeping in mind.
I avoided the API rate-limit problems by adding a timer step in my YAML:
- name: ⌛ rate 1
  shell: pwsh
  run: |
    # add random sleep since we run on a fixed schedule
    sleep (get-random -max 1200)
    # get the currently authenticated user's rate limit info
    $rate = gh api rate_limit | convertfrom-json | select -expandproperty rate
    # if we don't have at least 400 requests left, wait until reset
    if ($rate.remaining -lt 400) {
        $wait = ($rate.reset - (Get-Date (Get-Date).ToUniversalTime() -UFormat %s))
        echo "Rate limit remaining is $($rate.remaining), waiting for $($wait) seconds to reset"
        sleep $wait
        $rate = gh api rate_limit | convertfrom-json | select -expandproperty rate
        echo "Rate limit has reset to $($rate.remaining) requests"
    }
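One note on the step above: gh refuses to call the API from a runner without a token, so the step also needs one exposed through its environment, for example:

- name: ⌛ rate 1
  shell: pwsh
  env:
    # standard Actions token; without it, gh api cannot authenticate on the runner
    GH_TOKEN: ${{ github.token }}
  run: |
    # ... same script as above ...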
https://github.com/Smartappli/serge-wheels/actions
Not enabling AVX penalizes llama-cpp-python performance for both the CPU and CUDA builds.
Maybe the list can be shrunk down a bit. For example:
- Not many people have AVX512, remove until there's enough demand.
- Make AVX support the minimum?
- Remove python3.8, it's EOL in a few months.
@Smartappli Your changes are adding AVX for the CUDA wheels, is that needed? At that point the user is using the GPU.
It makes sense for the basic wheels to have AVX and AVX2 variants, not so much for the CUDA ones.
I copy that, thx @gaby
In summary: AVX and AVX2 on CPU are enough.
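For illustration only (llama.cpp has used both LLAMA_CUBLAS and, later, LLAMA_CUDA for the GPU option, so treat the flags as an assumption rather than the PR's exact change), keeping just the AVX baseline on the CUDA wheels could look like this:

- name: Build CUDA wheel
  env:
    # the GPU does the heavy lifting, so only the AVX baseline is kept on the CPU side
    CMAKE_ARGS: "-DLLAMA_CUBLAS=ON -DLLAMA_AVX=ON -DLLAMA_AVX2=OFF -DLLAMA_AVX512=OFF"
  run: python -m pip wheel . -w dist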
@abetlen workflow update done
@gaby @oobabooga @abetlen What do you think?
Up to @abetlen. I was going to mention that the current CI (outside of this PR) is building i386 and win32 wheels; is that even necessary?
@gaby Makes sense
@gaby Before: no AVX -> 32 bits, AVX -> 32+64 bits, AVX2 -> 64 bits, AVX512 -> 64 bits
@gaby @oobabooga @abetlen What do you think?
@gaby thx for the code review
Ping @gaby
It's up to @abetlen :-)
Hey @Smartappli will review soon.
@abetlen I found another improvement: if you look at the release for Metal wheels https://github.com/abetlen/llama-cpp-python/releases/tag/v0.2.71-metal
it's publishing x86_64 wheels, but that's not a platform with Metal. It should only be aarch64/arm64.
@gaby Can you not run Metal on Intel Macs? I assumed that was possible. Additionally, the Metal wheels are actually fairly small / fast to build.
@abetlen It only works with Intel UHD and some AMD GPUs. But Apple devices are arm64.
@gaby @abetlen x86_64 architecture removed
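For reference, if the Metal wheels were built through cibuildwheel (an assumption; the actual workflow may drive CMake directly), restricting them to Apple Silicon is a one-line setting:

- name: Build Metal wheels (Apple Silicon only)
  env:
    # Metal wheels target Apple GPUs only, so skip the x86_64 build entirely
    CIBW_ARCHS_MACOS: "arm64"
    CIBW_ENVIRONMENT: 'CMAKE_ARGS="-DLLAMA_METAL=ON"'
  run: |
    python -m pip install cibuildwheel
    python -m cibuildwheel --output-dir wheelhouse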
ping @gaby
@abetlen can you review plz?
Hey @Smartappli, thanks for your patience and the PR. Busy month, so I'm just catching up on open PRs right now. Do you mind splitting this one up into two, with one that includes the following:
CUDA compiled with AVX
Remove Python 3.8
Remove macos-11 deprecated
Add python 3.9 when missing
Upgrade macos-13 to macos-latest in tests
Upgrade ubuntu-20.04 to ubuntu-latest
Upgrade windows-2019 to windows-latest
refactoring of metal building
and another just for the CPU wheel changes?
@abetlen Done: https://github.com/abetlen/llama-cpp-python/pull/1515
Has anyone managed to fix the CUDA workflows? Mine keep failing with this error:
C:\Miniconda3\envs\build\include\crt/host_config.h(153): fatal error C1189: #error: -- unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk. [C:\Users\runneradmin\AppData\Local\Temp\tmpwbsbwtdg\build\CMakeFiles\CMakeScratch\TryCompile-uh6ciq\cmTC_cbbed.vcxproj]
See https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/actions/runs/9457447475/job/26051277254
I see that @abetlen's workflow also fails with the same error: https://github.com/abetlen/llama-cpp-python/actions/runs/9457182450/job/26051175939
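Not a confirmed fix, but since the error message itself names nvcc's -allow-unsupported-compiler escape hatch, one thing to try is forwarding that flag through the build (at your own risk, per the nvcc warning), or alternatively pinning the Windows runner/MSVC toolset to a version the installed CUDA toolkit supports. A hedged sketch of the first option:

- name: Build CUDA wheel (workaround sketch, untested)
  env:
    # forwards the override flag named in the error; may mask real compiler incompatibilities
    CMAKE_ARGS: "-DLLAMA_CUBLAS=ON -DCMAKE_CUDA_FLAGS=-allow-unsupported-compiler"
  run: python -m pip wheel . -w dist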