
Use CFFI instead of ctypes

Open thedrow opened this issue 8 years ago • 18 comments

CFFI outperforms ctypes, even on CPython. It will also make arrayfire-python the first library that enables GPU access from PyPy.
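To make the comparison concrete, here is a minimal sketch of the same foreign call expressed both ways. This uses libc's `abs` purely as an illustration (it is not ArrayFire's API); the cffi path is guarded since cffi may not be installed:

```python
import ctypes
import ctypes.util

# ctypes: the library handle, argument marshalling, and call all happen
# through Python-level FFI objects at runtime.
libc_ct = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc_ct.abs.restype = ctypes.c_int
libc_ct.abs.argtypes = [ctypes.c_int]
result_ctypes = libc_ct.abs(-42)

# cffi (ABI mode): declarations are parsed from C source text. On PyPy the
# JIT understands cffi calls natively, which is where the speedup comes from.
try:
    from cffi import FFI
    ffi = FFI()
    ffi.cdef("int abs(int);")
    libc_ffi = ffi.dlopen(None)
    result_cffi = libc_ffi.abs(-42)
except Exception:  # cffi may be absent, e.g. on a stock CPython
    result_cffi = result_ctypes

print(result_ctypes, result_cffi)  # 42 42
```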

thedrow avatar Dec 16 '15 10:12 thedrow

It will also make arrayfire-python the first library that enables GPU access from PyPy.

PyPy already supports ctypes.

$ pypy examples/helloworld/helloworld.py 
ArrayFire v3.2.2 (CUDA, 64-bit Linux, build a1d6213)
Platform: CUDA Toolkit 7.5, Driver: 358.16
[0] GeForce GTX 690, 2047 MB, CUDA Compute 3.0
-1- GeForce GTX 690, 2048 MB, CUDA Compute 3.0
arrayfire.Array()
Type: float
[5 1 1 1]
    0.7402 
    0.9210 
    0.0390 
    0.9690 
    0.9251 


('Minimum, Maximum: ', 0.039020489901304245, 0.9689629077911377)

$ pypy3 examples/helloworld/helloworld.py 
ArrayFire v3.2.2 (CUDA, 64-bit Linux, build a1d6213)
Platform: CUDA Toolkit 7.5, Driver: 358.16
[0] GeForce GTX 690, 2047 MB, CUDA Compute 3.0
-1- GeForce GTX 690, 2048 MB, CUDA Compute 3.0
arrayfire.Array()
Type: float
[5 1 1 1]
    0.7402 
    0.9210 
    0.0390 
    0.9690 
    0.9251 


Minimum, Maximum:  0.039020489901304245 0.9689629077911377

pavanky avatar Dec 16 '15 16:12 pavanky

@thedrow using cffi would require a significant internal rewrite. I'll need to check for a performance hit before I go ahead and do a full rewrite.

pavanky avatar Dec 16 '15 16:12 pavanky

ctypes is extremely slow on PyPy. Even slower than C extensions.

thedrow avatar Dec 16 '15 16:12 thedrow

@thedrow You have a point there.

 $ pypy3 examples/benchmarks/monte_carlo_pi.py 
ArrayFire v3.2.2 (CUDA, 64-bit Linux, build a1d6213)
Platform: CUDA Toolkit 7.5, Driver: 358.16
[0] GeForce GTX 690, 2047 MB, CUDA Compute 3.0
-1- GeForce GTX 690, 2048 MB, CUDA Compute 3.0
Monte carlo estimate of pi on device with 1 million samples: 3.146276
Average time taken: 3.540478 ms
Monte carlo estimate of pi on host with 1 million samples: 3.141208
Average time taken: 47.459078 ms

$ python examples/benchmarks/monte_carlo_pi.py 
ArrayFire v3.2.2 (CUDA, 64-bit Linux, build a1d6213)
Platform: CUDA Toolkit 7.5, Driver: 358.16
[0] GeForce GTX 690, 2047 MB, CUDA Compute 3.0
-1- GeForce GTX 690, 2048 MB, CUDA Compute 3.0
Monte carlo estimate of pi on device with 1 million samples: 3.146276
Average time taken: 0.517330 ms
Monte carlo estimate of pi on host with 1 million samples: 3.140640
Average time taken: 345.248337 ms

However, I don't know how much of that is ctypes and how much is garbage collection. I am not familiar with pypy, and it looks like it runs out of GPU memory when I run more samples, whereas python doesn't.

This means there is a garbage collector of some kind running in the background that is not being triggered, because it is unaware of GPU memory. ArrayFire slows down in those situations. So I'll need to investigate how much of the overhead is from ctypes and how much is from garbage collection.

pavanky avatar Dec 16 '15 16:12 pavanky

You can try setting the PYPY_GC_DEBUG environment variable to 1 to debug major collections and 2 to debug minor collections. Also vmprof is your friend.

I just realized you're not warming the JIT before you run the benchmarks. You have to run the benchmark for a few thousand loops (usually 10k is enough for any type of code) in order for the JIT to "learn" how to optimize this code. Also pypy3 is slow. It's just an initial effort to make pypy work with Python 3. Use PyPy 4.0.1 with Python 2 when benchmarking.
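A warmed-up benchmark harness along the lines suggested above might look like this. The `bench` helper is hypothetical, not part of the examples in the repo; the untimed warm-up loop gives PyPy's tracing JIT a chance to compile the hot path before measurement starts:

```python
import time

def bench(fn, warmup=10000, iters=1000):
    # Warm-up: PyPy's tracing JIT only compiles a loop after it has run
    # a few thousand times, so these iterations are deliberately untimed.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

avg = bench(lambda: sum(range(100)))
print(avg >= 0.0)  # True
```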

thedrow avatar Dec 16 '15 17:12 thedrow

@thedrow the host side part is about 8x faster on pypy so I think it's fine.

pavanky avatar Dec 16 '15 17:12 pavanky

But the avg time is too high, I think. Even with ctypes.

thedrow avatar Dec 16 '15 17:12 thedrow

I don't believe that it will ever be possible to run AF Python efficiently on PyPy, because of differences in `__del__` handling.

CPython will call `__del__` as soon as a scope completes, exactly like C++. So even though it's not aware of GPU resources, it simply releases temporaries as soon as possible. PyPy won't do this: `__del__` gets executed only during the sweep phase of the GC, not the mark phase. The sweep runs occasionally, whenever PyPy thinks it should, and PyPy only considers regular memory usage for that schedule. So temporaries in VRAM pile up almost indefinitely in loops, which leads to out-of-device-memory errors. We had a discussion of this on Gitter a while ago.

I believe CPython's `__del__` behavior should be supported by PyPy, instead of GC-aware hacks on the AF side.
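The timing difference described above can be demonstrated with a stand-in class (`DeviceArray` here is hypothetical, just a proxy for an object owning VRAM). On CPython each temporary's `__del__` fires the moment its refcount hits zero inside the loop; on PyPy the finalizers would only run at the next collection, which the explicit `gc.collect()` forces for comparison:

```python
import gc

freed = []

class DeviceArray:
    """Hypothetical stand-in for an object that owns GPU memory."""
    def __init__(self, n):
        self.n = n
    def __del__(self):
        freed.append(self.n)   # a real binding would release VRAM here

for i in range(3):
    tmp = DeviceArray(i)       # temporary created each iteration
    tmp = None                 # CPython: refcount hits zero, __del__ runs now

# On PyPy nothing may have been freed yet; force a collection to compare.
gc.collect()
print(sorted(freed))           # [0, 1, 2]
```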

unbornchikken avatar Jan 03 '17 08:01 unbornchikken

This is frustrating. Some days I just want to yell "Just use C++ if you want to use the GPU".

pavanky avatar Jan 03 '17 09:01 pavanky

Or CPython. :) It's no mystery why it's so popular in this field.

unbornchikken avatar Jan 03 '17 11:01 unbornchikken

So PyPy will never be supported?

thedrow avatar Jan 05 '17 12:01 thedrow

It's impossible to support with the current design. With a tracing GC you're going to need some explicit scoping mechanism, like the one I use in Node.js. Or we could beg the PyPy developers to call `__del__` on scope exit like CPython does.
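An explicit scoping mechanism of the kind mentioned could be sketched as a context manager. Everything here (`Scope`, `FakeArray`, the `release()` method) is hypothetical illustration, not an ArrayFire API; the point is that teardown is deterministic on any Python implementation, with no reliance on `__del__` timing:

```python
class Scope:
    """Hypothetical explicit scope: tracked objects are released
    deterministically on exit, regardless of GC strategy."""
    def __init__(self):
        self._owned = []
    def track(self, obj):
        self._owned.append(obj)
        return obj
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        for obj in reversed(self._owned):
            obj.release()      # assumed release() frees the device memory
        self._owned.clear()

class FakeArray:
    def __init__(self):
        self.alive = True
    def release(self):
        self.alive = False

with Scope() as s:
    a = s.track(FakeArray())
    assert a.alive             # still held inside the scope
print(a.alive)                 # False: released on scope exit
```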

unbornchikken avatar Jan 05 '17 12:01 unbornchikken

@thedrow It'll be supported as long as we can force-call garbage collection. But it will have the performance issues @unbornchikken mentioned. Depending on your application you may or may not notice it.
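Force-calling collection could look like the following sketch. `run_iterations` and the list allocation are hypothetical stand-ins for a real device workload; the idea is simply to invoke `gc.collect()` periodically inside long loops so that `__del__`-based frees actually run under PyPy:

```python
import gc

def run_iterations(n, collect_every=100):
    for i in range(n):
        tmp = list(range(1000))            # stand-in for a device temporary
        if i % collect_every == collect_every - 1:
            gc.collect()                   # forces finalizers to run on PyPy
    return n

print(run_iterations(500))  # 500
```

The obvious cost is that each forced collection stalls the loop, which is exactly the overhead being discussed.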

pavanky avatar Jan 05 '17 17:01 pavanky

Yeah, sorry, I missed the word efficiently in my previous sentence.

It's impossible to support efficiently with the current design.

unbornchikken avatar Jan 06 '17 08:01 unbornchikken

No, CPython will not call `__del__` as soon as a scope completes. It calls `__del__` as soon as the last reference to an object disappears, which is very different from C++ (even if in the simplest case it gives the same result). That's why we can't implement it in PyPy.

You might have some luck in this case by using __pypy__.add_memory_pressure(n). It pretends there are n extra bytes allocated, which will make the next garbage collection cycle occur faster. You might try to call this function when allocating n bytes of GPU memory. This interface is experimental until we get good feedback that it works in practice, but it has been there for a while already.
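A sketch of wiring this into an allocation path might look like the following. `alloc_gpu_buffer` and its `object()` handle are hypothetical stand-ins; the `__pypy__` import is guarded so the same code runs unchanged on CPython:

```python
try:
    from __pypy__ import add_memory_pressure   # PyPy-only, experimental
except ImportError:
    def add_memory_pressure(nbytes):           # no-op on CPython
        pass

def alloc_gpu_buffer(nbytes):
    handle = object()            # stand-in for a real device allocation
    # Tell PyPy's GC that this object effectively pins nbytes of memory
    # it cannot see, so the next collection cycle is scheduled sooner.
    add_memory_pressure(nbytes)
    return handle

buf = alloc_gpu_buffer(16 * 1024 * 1024)
print(buf is not None)  # True
```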

arigo avatar Jan 08 '17 10:01 arigo

@arigo Thanks for taking the time to comment on this. I think adding memory pressure this way will be a bit better than checking the number of locked references and force-calling garbage collection to see if any of them can be cleared.

That said, it is a bit hard for library developers like us, who have to explain to users why the performance of our library is worse in some cases on a supposedly better-performing Python implementation.

Now if there were a way to plug in a custom garbage collector for cleaning up objects of a certain class, it would make things a lot easier.

pavanky avatar Jan 08 '17 11:01 pavanky

Oh yes, implementing a custom garbage collector and plugging it inside an existing system with another very different garbage collector---that sounds easy to invent and exactly like what any library wants to do. </sarcasm> :-)

The current GC in PyPy is not overly complicated, but it took years (and several attempts) for us to reach that point and discover all the subtle rare bugs. There are many reasons why PyPy performs seriously better than CPython+Psyco (the 2001-2006 JIT), but the fact that the GC is not based on reference counting is one of them. Reference counting works nicely enough for non-JIT interpreters, but the JIT cannot remove all of its overheads the way it can with other GCs. Likely, it would end up being a larger and larger fraction of the runtime the more you optimize the JIT.

We thought in several contexts about mixed solutions, with attempts to track some (but not all) objects with a refcounting-like approach, to guarantee prompt freeing of them. Either we didn't find the correct solution yet, or (as we now think) such a solution doesn't exist.

arigo avatar Jan 08 '17 13:01 arigo

Also consider caching functions instead of looking them up every time.
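For a ctypes-style binding, "caching functions" means resolving the symbol once instead of doing an attribute lookup on the library handle in every call. A minimal sketch, using libm's `cos` purely as an illustration (not ArrayFire's actual wrapper code):

```python
import ctypes
import ctypes.util

# Resolve the library and symbol once, at import time.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

cos = libm.cos              # cache the bound function object

total = 0.0
for _ in range(1000):
    total += cos(0.0)       # no per-call attribute lookup on the CDLL
print(total)                # 1000.0
```

Each `libm.cos` attribute access on a `ctypes.CDLL` goes through `__getattr__` and is relatively expensive, so hoisting it out of hot loops is a cheap win on any interpreter.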

pavanky avatar Jul 18 '17 16:07 pavanky