Adding GPU Support

varun19299 opened this issue 3 years ago · 12 comments

Early ideas:

  1. Accept device flag for GF instances. If CUDA, use a cupy array.

  2. Use cupy.get_array_module for device agnostic code where possible.

  3. PyTorch-like .to(device): allow transferring between host and device(s). Internally this would just be a numpy/cupy asarray or Array.view(np/cp.ndarray) call (see the sketch after this list).

  4. Most numpy functions in galois/field/linalg.py have corresponding cupy ones with identical syntax.

  5. Numba JIT functions and ufuncs may require separate GPU implementations, especially where thread and block indices need to be accessed.
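
As a rough illustration of ideas 2 and 3, here is a minimal sketch; the to() helper and its device handling are hypothetical, not existing galois API:

import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None  # CPU-only fallback

def to(x, device):
    # Hypothetical PyTorch-style transfer helper (idea 3). A real
    # FieldArray implementation would also need to re-attach the field
    # subclass when moving back to the host.
    if device == "cuda":
        if cp is None:
            raise RuntimeError("CuPy is not installed")
        return cp.asarray(x)  # host -> device copy
    return cp.asnumpy(x) if cp is not None else np.asarray(x)

def axpy(a, x, y):
    # Device-agnostic code (idea 2): xp resolves to numpy or cupy
    # depending on where the operands live, so one code path serves
    # both backends.
    xp = cp.get_array_module(x) if cp is not None else np
    return xp.add(xp.multiply(a, x), y)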

varun19299 · Jun 13 '21

I'll try adding a provision for 1) in the FieldArray class.

I will be testing it on Colab. Here's a notebook to get started (it installs CuPy; Colab already ships numpy, scipy, and numba). Please check that you have enabled a GPU runtime.
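
For anyone following along, a quick sanity check that the Colab runtime actually has a GPU might look like the snippet below; the CuPy wheel name is an assumption, since it depends on the runtime's CUDA version:

# In a Colab cell, pick the wheel matching the runtime's CUDA version:
# !pip install cupy-cuda11x

from numba import cuda
import cupy as cp

print(cuda.is_available())               # True if Numba can see a CUDA GPU
print(cp.cuda.runtime.getDeviceCount())  # number of GPUs visible to CuPy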

varun19299 · Jun 20 '21

Just wanted to say I'd be interested in trying the GPU support for cupy.matmul with 32-bit GF FieldArrays, to see how they compare speed-wise with regular cupy.matmul on uint32 data types. I was thinking of maybe using this library to matmul some large arrays in real time, but it's too slow as it currently stands. I expect that even with CuPy acceleration it will still be an order of magnitude slower than operating on native data types like uint32.

peter64 · Sep 17 '21

@peter64 thanks for the feedback. I'm not too surprised that matmul isn't the fastest currently.

Some clarifying questions:

  • What kind of finite field? GF(2^m) or GF(p^m)? The latter is significantly slower, and has some performance left to squeeze out, even on CPU.
  • Can you give me example matrix dimensions (e.g., (1000,2000) x (2000,3000))? I'd like to run some speed tests too.

What is your current slowdown as compared to normal integer matmul? 10x? 100x? I've seen that for GF(2^8), matrix multiplication is ~10x slower than normal integer matrix multiplication, as discussed here.

mhostetter · Sep 17 '21

I'm using GF(p^m), I think; it's 2**8 (256). Honestly, I just need this to do binary multiplication modulo 2 for a kind of entropy extractor I'm trying to reproduce from a paper. Here's some output from the tests I just ran using GF(256), since GF(2^32) wasn't completing in any reasonable period of time and the docs said a smaller p^m value might allow lookup tables, so I gave it a try. In reality, I would prefer to use GF(2^32).

>>> import numpy as np
>>> import galois
>>> import datetime
>>> GF = galois.GF(2**8)
>>> extractor_output_size = 1024
>>> input_data = GF.Random((2048,2048));
>>> extractor = GF.Random((2048,extractor_output_size));
>>> start = datetime.datetime.now()
>>> y = np.matmul(input_data, extractor)
>>> print(datetime.datetime.now()-start)
0:06:02.313386


>>> start = datetime.datetime.now()
>>> input_data_np, extractor_np = input_data.view(np.ndarray), extractor.view(np.ndarray)
>>> y_np = np.matmul(input_data_np, extractor_np)
>>> print(datetime.datetime.now()-start)
0:00:07.198755

>>> import cupy as cp
>>> start = datetime.datetime.now()
>>> input_data_cp, extractor_cp = cp.array(input_data_np), cp.array(extractor_np)
>>> y_cp = np.matmul(input_data_cp, extractor_cp)
>>> cp.cuda.Stream.null.synchronize()
>>> print(datetime.datetime.now()-start)
0:00:01.247846

>>> import cupy as cp
>>> start = datetime.datetime.now()
>>> input_data_cp, extractor_cp = cp.array(input_data_np), cp.array(extractor_np)
>>> y_cp = np.matmul(input_data_cp, extractor_cp)
>>> cp.cuda.Stream.null.synchronize()
>>> print(datetime.datetime.now()-start)
0:00:00.017759

>>> start = datetime.datetime.now()
>>> input_data_np, extractor_np = input_data.view(np.ndarray), extractor.view(np.ndarray)
>>> y_np = np.matmul(input_data_np, extractor_np)
>>> print(datetime.datetime.now()-start)
0:00:07.692863

So GF is about 50x-60x slower than numpy (6 minutes vs ~7 seconds). cupy is about 5x faster than numpy on its first run (with a GTX 1050 and a quad-core i7), but ends up about 400x faster than numpy on subsequent runs.

peter64 · Sep 17 '21

@peter64 thanks for the example. Yes, that is slower than I would expect (which is ~10x slower than NumPy). Let me run some speed tests later today and maybe test a few potential speed ups. I'll report back.

mhostetter · Sep 17 '21

@peter64 can you confirm which version you're using?

mhostetter · Sep 17 '21

>>> print(galois.__version__)
0.0.21

Perhaps it's because my arrays are large enough that they don't fit in the CPU cache or something...

peter64 · Sep 17 '21

@varun19299 and @peter64, I now have a GPU to test against. I'm considering starting work on GPU support. Do you have any updated thoughts on a desired API interface regarding transfer to/from GPU, etc? If not, I'll use my best judgment. Just wondering if you have given it any thought. Thanks.
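
For what it's worth, one possible user-facing shape for the interface, purely as a strawman (the to("cuda")/to("cpu") calls and the device-resident FieldArray are hypothetical; none of this exists in galois today):

import galois

GF = galois.GF(2**8)

x = GF.Random((2048, 2048))  # created on the host, as today
A = GF.Random((2048, 1024))

x_gpu = x.to("cuda")         # hypothetical: back the FieldArray with a CuPy array
A_gpu = A.to("cuda")
y_gpu = x_gpu @ A_gpu        # matmul dispatched to GPU field-arithmetic kernels
y = y_gpu.to("cpu")          # hypothetical: copy back to a NumPy-backed FieldArray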

mhostetter · Oct 19 '21

Hey @mhostetter, thanks so much for asking, but I have no thoughts regarding a desired API. I can't promise I'll end up using the library, but I'm very curious to see how it performs and will be happy to test it! Thanks again for writing this library and for adding GPU support!

peter64 · Oct 19 '21

Hi Matt, any news on this one? GPU support would be great for high-order calculations!

geostergiop · Mar 12 '22

No update as of yet. It's going to be a big change, and one I just haven't embarked on. Perhaps soon.

Just curious, @geostergiop, what are you looking to speed up? I doubt "large" finite fields (those using dtype=np.object_) will improve with GPU support, because currently I can't even JIT-compile their ufuncs. Instead, they use pure-Python ufuncs operating on Python integers (because those have unlimited size).
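
For context, the compilation mode is visible on the field class itself. ufunc_mode and dtypes are real galois properties; the printed values below are what I'd expect for these fields, shown for illustration:

import galois

# Small fields compile to JIT'd ufuncs (lookup tables or explicit
# calculation), which is what a GPU port could plausibly target.
GF256 = galois.GF(2**8)
print(GF256.ufunc_mode)    # "jit-lookup"

# Fields with order > 2**64 cannot use a fixed-width integer dtype, so
# they fall back to pure-Python ufuncs over unbounded Python ints;
# there is nothing for Numba (or a GPU kernel) to compile.
GF2_100 = galois.GF(2**100)
print(GF2_100.ufunc_mode)  # "python-calculate"
print(GF2_100.dtypes)      # [numpy.object_]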

mhostetter · Mar 13 '22

Well, I am currently calculating about 2,000 * 19,993 * 18,431 field_trace() calls and the respective norms over GF(2^256) and GF(2^233) elements, so it takes some time, to say the least :-) I hoped to speed up the np.arange and exponentiation calculations.
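
For reference, a toy-scale version of that computation; field_trace() and field_norm() are the actual galois FieldArray methods, with the field sizes shrunk drastically here:

import galois

GF = galois.GF(2**8)      # small stand-in for the GF(2^233) / GF(2^256) fields above
x = GF.Random(10)

traces = x.field_trace()  # traces of each element down to the prime subfield GF(2)
norms = x.field_norm()    # norms of each element down to GF(2)
print(traces)
print(norms)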

geostergiop · Mar 13 '22