fury [Question] Could use Python bindings instead of Cython

From the code, the Buffer in Cython is just a wrapper around the Buffer in cpp. It might be beneficial to use Python bindings directly, as this could reduce performance overhead and the cost of code maintenance.

Oct 16 '24 10:10 penguin-wwy

You mean implement Buffer using python only instead of CBuffer? We implement some varint encoding in Cpp Buffer. I'm not sure using python will make it faster, maybe need some benchmarks

Oct 16 '24 11:10 chaokunyang

My suggestion is to use pybind11 or directly write a C-API to call the cpp fury::Buffer. The current Cython-generated C++ code still calls the fury::Buffer, which I think is unnecessary.

static CYTHON_INLINE int64_t __pyx_f_6pyfury_5_util_6Buffer_get_int64(struct __pyx_obj_6pyfury_5_util_Buffer *__pyx_v_self, uint32_t __pyx_v_offset, CYTHON_UNUSED int __pyx_skip_dispatch) {
  ...

  /* "pyfury/_util.pyx":140
 *     cpdef inline int64_t get_int64(self, uint32_t offset):
 *         self.check_bound(offset, <int32_t>8)
 *         return self.c_buffer.get().GetInt64(offset)             # <<<<<<<<<<<<<<
 * 
 *     cpdef inline float get_float(self, uint32_t offset):
 */
  __pyx_r = __pyx_v_self->c_buffer.get()->GetInt64(__pyx_v_offset);  // This is an indirect call, which is not really necessary.
  goto __pyx_L0;

  /* function exit code */
  ...
}

If you agree, I will try to provide a simple demo to verify the performance and code maintenance benefits.

Oct 16 '24 12:10 penguin-wwy

That would be great, if we get a big performance gain, we can compromise some code maintenance costs

Oct 16 '24 12:10 chaokunyang

Hi, I conducted comparative experiments, trying out pybind11, nanobind, and directly writing C-API code.

pybind11 had the worst performance, which aligns with my understanding. It doesn't perform any specific optimizations for different scenarios and has relatively complex type conversion operations. However, its maintenance code is the simplest. For code that only requires API binding, it can be written as follows:

PYBIND11_MODULE(_pyutil, util_mod) {
  py::class_<fury::Buffer>(util_mod, "Buffer")
      .def(py::init<>())
      .def("own_data", &fury::Buffer::own_data)
      .def("reserve", &fury::Buffer::Reserve)
      .def("put_bool", [](fury::Buffer &self, uint32_t offset,
                          bool v) { self.UnsafePutByte(offset, v); })
      .def("put_int8", [](fury::Buffer &self, uint32_t offset,
                          int8_t v) { self.UnsafePutByte(offset, v); })
      .def("get_bool", &fury::Buffer::GetBool)
      .def("get_int8", &fury::Buffer::GetInt8)
      ...
      .def_static("allocate", [](uint32_t size) { return fury::AllocateBuffer(size); });
}

Nanobind's performance is slightly better than Cython's, and its binding method is not much different from pybind11. However, it only supports Python 3.8+.
Directly writing C-API code can perform better than Cython if optimized for different versions (especially >= 3.11). However, is detrimental to the goal of maintaining code more easily. For example:
- Cython generates redundant checks when creating get_bool, and due to the unreasonable setting of ml_flag (it should choose METH_O instead of METH_FASTCALL | METH_KEYWORDS), parameter parsing also introduces additional overhead.

static PyObject *
cbuffer_get_bool(CBufferObject *self, PyObject *offset)
{
    long off_val = PyLong_AsLong(offset);
    assert(off_val <= UINT32_MAX);
    return self->buffer->GetBool(off_val) ? Py_NewRef(Py_True) : Py_NewRef(Py_False);
}

static PyMethodDef cbuffer_methods[] = {
    {"get_bool", (PyCFunction)cbuffer_get_bool, METH_O, nullptr},
    ...
    {NULL, NULL}           /* sentinel */
};

Additionally, after analyzing the Cython code, I found that some performance optimizations can be achieved by directly calling certain C-API functions in the .pyx file. The principle behind this is to use some higher-level knowledge to avoid Cython generating certain guard code. I will attempt to submit these optimizations as a PR in the future.

Oct 23 '24 11:10 penguin-wwy

Thanks for sharing so insightful experiments. I used to think pybind is fast since it avoid generating code like cython. It turns out it's the slowest, this superised me a little. Cython seems provide some annotations to help generate c++ code, are you going to employ such kind of optimization?

Oct 23 '24 14:10 chaokunyang

I suggest using Nanobind, nanobind is the next generation of pybind11 from the same guy that invented it. It is suggested officially to use nanobind over Pybind11. Python 3.8+ is not a problem since anything below is already subset by CPython.

Nov 19 '24 19:11 Superskyyy

Yeah from my side generally cython is hard to debug and maintain, also lacks good IDE/LSP support.

Many libraries with python interfaces, like tensorflow, pytorch, xla and so on, use cpython api (or in a wrapper like pybind11/nanobind) instead of cython.

Dec 06 '24 11:12 PragmaTwice

I have gained new insights into this issue:

I found that pybind is not always suitable for pyfury. In other projects where pybind is used, it is often employed as a wrapper layer to provide interfaces and data transfer, with the core content being unrelated to the Python Object structure.

However, for pyfury, it is inherently closely related to the Python Object structure. More precise parsing of the Python Object structure means more opportunities for optimization to reduce overhead.

For example, consider the python dict type. In the cpython dict structure, a linear memory stores indices and key-value pairs. By accessing this structure, we can achieve efficient traversal of the dict, similar to arrays and lists. However, accessing this memory is internal in CPython, and the access method varies across different versions.

       +----------------------------------------------------
       |  dk_indices | PyDictKeyEntry | PyDictKeyEntry | ...
       +----------------------------------------------------
      ^                    
       |
dict->ma_keys->dk_indices

Therefore, we need a general Python object access code and high-performance access code tailored to specific types and versions.

For general Python objects, Cython is more suitable than pybind. For example, in the current list or set serialization protocol code, if we use pybind, we need to manually call the C-API and handle error processing, reference counting, and version differences, which may be more difficult to maintain than using Cython.
For high-performance specialized objects, there is no difference between using Cython or pybind, as it essentially involves manually parsing PyObject.

Therefore, my idea is that we need to use Cython more lightly, using Cython to write general serialization protocols, and transferring the specialized parts to C++.

// cpp
void _write_pydict_to_buffer(PyDictObject *dict, Buffer buffer) {
#if PYTHON_VERSION
   PyDictKeyEntry *entries = dict->ma_keys->dk_indices[...]
#elif PYTHON_VERSION
    PyDict_NEXT(...)
#endif
}

# cython
class SetSerializer:
    ...

class ListSerializer:
    ...

class MapSerializer:
    cpdef inline write(self, Buffer buffer, o):
        _write_pydict_to_buffer(<PyDictObject *>o, buffer->c_ptr())

Dec 09 '24 17:12 penguin-wwy

I'm rewriting pyfury recently into our new xlang protocol in https://github.com/apache/fury/pull/1690. One of the biggest obstacle is that cython doesn't support switch case syntax, which is important for performance optimization. We've used swisstable for fast Type dispatch, switch can be used for fast serialization of common types without the cost of virtual method invocation.

I'm not sure whether we can invoke cython code in C++ in an inline way.

Dec 10 '24 01:12 chaokunyang

I'm rewriting pyfury recently into our new xlang protocol in #1690. One of the biggest obstacle is that cython doesn't support switch case syntax, which is important for performance optimization. We've used swisstable for fast Type dispatch, switch can be used for fast serialization of common types without the cost of virtual method invocation.

I'm not sure whether we can invoke cython code in C++ in an inline way.

The C++ functions generated by Cython store a function pointer in its simulated virtual table. Although it cannot be called in an inline way, the overhead of virtual method invocation can be reduced by optimizing access to this table.

Dec 10 '24 15:12 penguin-wwy

Hmmm from my understanding Cython is a transpiler from python (extended) to C (with CPython API calls?). So it should be relatively easy to migrate from Cython to C/C++ code mixed with CPython API (for performance-critical sections) and nanobind (for maintanance).

BTW if we still want to keep some Cython code, maybe we can try the pure python syntax of cython. Refer to https://cython.readthedocs.io/en/latest/src/tutorial/pure.html#pep484-type-annotations.

Dec 13 '24 02:12 PragmaTwice

I strongly suggest to use nanobind, it was very difficult to debug cython compiled code. And when I want to use some code in C++, I must refine the pxd header, which is not convenient. Other considerations is that it lacks complete c++ template support and don't have switch clause support.

It would be much easier if we are using nanobind to implement pyfury serialization in #1690

Dec 20 '24 15:12 chaokunyang