Expose Python interface for other rust applications

Open jg2562 opened this issue 4 years ago • 25 comments

Currently the Python-Rust interface lives within py-polars and is only published to PyPI. It would be helpful for other applications that need to pass dataframes over that interface to have access to the PyO3 wrapper type.

Is there any way to provide access to the wrapper type, so that a Rust function can return a dataframe to Python using pyo3?

jg2562 avatar Sep 10 '21 02:09 jg2562

Hi @jg2562 what would you like to do, so that I have a bit more of an understanding what is possible.

ritchie46 avatar Sep 10 '21 09:09 ritchie46

Hi @ritchie46, thanks for the reply. We are working on an application where the core is written in Rust. We use Python to call functions in Rust (as most of the legacy code is written in Python) and we also use Python for quick proofs of concept before finalizing them in Rust.

For a more concrete example, we are using serde on a struct containing a DataFrame, combined with zstd, to create a compressed version of our data (which is non-homogeneous in terms of data types). Since Rust is loading the data, we currently need to unpack the data from the dataframe into structs which can be passed back to Python.

I was wondering if there was a way to expose the Python interface as a rust library to allow for us to simply pass the DataFrame to Python directly. It seems like other libraries that are written in rust for Python that want to build off of polars will also run into this issue, so it could help them too!

jg2562 avatar Sep 10 '21 19:09 jg2562

The easiest thing to do is using arrow and pyarrow to communicate the memory. Then those arrow arrays can be used to create polars dataframes/series in python polars as well as rust polars.

This will mostly be zero copy. Here is the code polars uses to communicate between pyarrow/rust-arrow: https://github.com/pola-rs/polars/tree/master/py-polars/src/arrow_interop

ritchie46 avatar Sep 10 '21 19:09 ritchie46

Thank you so much! I will definitely look into that. Just out of curiosity, is there something that makes exposing the interface difficult?

jg2562 avatar Sep 10 '21 20:09 jg2562

Just out of curiosity, is there something that makes exposing the interface difficult?

Well... TBH, I don't really know what exposing the interface means. Do you mean compiling Rust against Python polars?

Or interacting with a precompiled Rust binary? Or using Rust polars and sending a dataframe to a Python polars process?

ritchie46 avatar Sep 13 '21 09:09 ritchie46

That's fair, it's pretty vague. I was imagining the last one, having Rust polars send a dataframe to the Python polars process, when I said exposing the interface.

jg2562 avatar Sep 13 '21 09:09 jg2562

I was imagining the last one of having rust polars and sending a dataframe to the python polars processes when I said exposing the interface.

In that case you should use pyo3 and some copy pasting of the code snippets I referenced. That should work!

ritchie46 avatar Sep 17 '21 14:09 ritchie46

Hey @ritchie46! I ended up working on a different project for a bit but I finally got around to making a small example. I was able to get the snippets to work, so at least I can better show an example of what I was thinking and why I was wondering if the PyDataFrame could be exposed.

Here is the repo. The use case would be running example.py, but you can see that there was a lot of scripting just to emulate passing the dataframe back and forth across the FFI boundary. Let me know what you think, and thank you so much for the direction and help!

jg2562 avatar Oct 01 '21 23:10 jg2562

Not sure if this is related. I am looking to reuse PyDataFrame in my own library built with pyo3. Is the arrow conversion as @jg2562 did the best way to do it or is there something easier/more direct? Thank you.

I would like to do something like this:

use pyo3::prelude::*;

#[pyfunction]
fn read_my_format() -> PyResult<PyDataFrame> {
    Ok(read_my_format_into_polars_df("my_file"))
}

#[pymodule]
fn my_lib(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(read_my_format, m)?)?;
    Ok(())
}

MarcoLugo avatar Oct 27 '21 21:10 MarcoLugo

@MarcoLugo, After a recent update the repo that I posted breaks if you try to use some data types (DateTime64 for example). I think it would still be valuable to have access to the PyDataFrame if that's doable, since it will be properly tied to the library and isn't a hack on top of it. However, I really do not know how difficult this is, and so we should consult more with @ritchie46 since he would know much more.

jg2562 avatar Oct 28 '21 21:10 jg2562

@ritchie46 I wrote some code that converts a Rust dataframe to a Python polars dataframe:

pub fn rust_dataframe_to_py_dataframe(dataframe: &mut DataFrame) -> PyResult<PyObject> {
    let dataframe = dataframe.rechunk();

    let gil = Python::acquire_gil();
    let py = gil.python();

    let names = dataframe.get_column_names();

    let pyarrow = py.import("pyarrow")?;
    let polars = py.import("polars")?;
    let rbs: Vec<PyObject> = dataframe
        .iter_chunks()
        .map(|rb| to_py_rb(&rb, &names, py, pyarrow).unwrap())
        .collect::<Vec<PyObject>>();
    let rbs: PyObject = rbs.into_py(py);
    let rbs: &PyList = rbs.extract(py)?;
    let py_table = pyarrow.getattr("Table")?.call_method1("from_batches", (rbs, ))?;
    let py_df = polars.call_method1("from_arrow", (py_table, ))?;  // << This line takes much time
    Ok(py_df.to_object(py))
}

but this takes too much time.

I guess there is a much easier and faster way to convert a Rust dataframe to a Python dataframe, because the Python dataframe is just a wrapper around the Rust dataframe.

But I don't know how to implement this. Could you help me?

If it were possible to import py-polars in Rust, it would be easy to implement the idea above, but for some reason I cannot import py-polars even when I add py-polars as a Cargo dependency (ex

[dependencies]
py-polars = { path = "polars/py-polars" }

)

gunjunlee avatar Jun 10 '22 09:06 gunjunlee

Hello, I was casually looking into this and just wanted to share some insight with @gunjunlee. I'm no Rust expert, so this may be inaccurate. If so, please correct me 🙂

py-polars uses cdylib as its crate-type (have a look at the linkage reference), which means it cannot be imported in other crates. That specific crate-type is required by PyO3, because it needs to build a dynamic library to end up in the Python wheel. I don't have enough understanding of PyO3 and CPython internals to tell you if (and how) it's possible to create some kind of interface to just write a Rust function returning a PyDataFrame from py-polars and make everything work.

I don't think there is any reasonable alternative to using arrow and pyarrow.

cavenditti avatar Jul 26 '22 12:07 cavenditti

I've seen this issue pop up a few times in the last few days (#4264, #4212, kinda #1830). I wanted to reopen discussion to talk about creating an API that is tied to polars development for people to link against. While the current example works and is very helpful, it has to be reimplemented in every code base, which makes it not very ergonomic to use. It also isn't tied to the development of polars since it's being reimplemented, so it falls out of sync and breaks during updates in different people's projects. @ritchie46 mentioned in #4212 that he was considering making an API if he had time. If you would like help with creating it, please let us know!

jg2562 avatar Aug 05 '22 21:08 jg2562

The way I've done this for my projects is to split up the Python content into multiple crates. For example, I have a py-interface rlib crate that contains the #[pyfunction]s, #[pyclass]es, etc., and can be used from other Rust projects (and would be published to crates.io). Then I have a py-module cdylib crate that simply includes functions/classes from py-interface and exports them through a #[pymodule].

In this case, we could keep py-polars as the cdylib and make a new (rlib) crate that contains the pyo3 type definitions. I can work on this if people think this is the right direction to go.

jmrgibson avatar Aug 08 '22 21:08 jmrgibson

To me, that's exactly the right direction to go! Just separating them and allowing access to py-interface on crates.io would, I think, greatly help the Rust community use polars.

jg2562 avatar Aug 09 '22 00:08 jg2562

@ritchie46 Do you think this is the correct approach?

jmrgibson avatar Aug 10 '22 14:08 jmrgibson

I'm working on this here: https://github.com/jmrgibson/polars/tree/user/jgibson/split_out_py_polars_as_rust_crate

It appears to work using the nightly compiler. Looks like newer polars relies on simd which is nightly only? I'll continue to investigate, I'd like to get this working on stable.

For example, the following code works:

use py_polars_core::PyDataFrame;
let time: Series = time_ns.into_iter().collect();
let df = DataFrame::new(
    vec![data.clone(), time]
);
let df = PyDataFrame {
    df
};
let args = (df,);
let res = Python::with_gil(|py| -> PyResult<DataFrame> {
     let res = pyfunc.call1(py, args)?;
     let pdf = res.extract::<PyDataFrame>(py)?;
     Ok(pdf.df)
});

jmrgibson avatar Aug 26 '22 17:08 jmrgibson

I don't think we should ship the Python interface for that. We could use Arrow's C interface for that. That is zero copy and much slimmer.

ritchie46 avatar Aug 26 '22 19:08 ritchie46

I don't think we should ship the Python interface for that. We could use Arrow's C interface for that. That is zero copy and much slimmer.

I don't think I understand enough about pyo3 to figure out where the copying is happening in this case.

E.g. if I want to call a Python function with a dataframe I create in Rust, and get a dataframe back in Rust:

# module.py
def manipulate_df(df: pl.DataFrame) -> pl.DataFrame:
    ...  # user writes manipulation function here

fn main(){
  let df = df!(
      "data" => [1.0, 2.0],
      "time" => [1.0, 2.0],
  );
  
  let modified_df = Python::with_gil(|py| {
      let module = PyModule::import(py, "module")?;
      let pydf: PyDataFrame = df.into();
      let args = (pydf,);
      let result: PyDataFrame = module.getattr("manipulate_df")?.call1(args)?.extract()?;
      Ok(result.df)
  })?;
}

Based on the docs for Py::new, which is what the default #[pyclass] uses, this is creating a new object on the python heap. Does that mean the entire inner DataFrame is getting copied from the rust stack to the python heap?
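One way to reason about this question is with a toy model. The sketch below assumes that a dataframe is a small handle over reference-counted column buffers (an assumption about polars internals, not something confirmed in this thread). `ToyFrame`, `first_buffer_addr` and `move_to_heap` are hypothetical names for illustration only. Under that assumption, moving the handle into a new heap allocation copies only the handle, and the column data stays where it was:

```rust
use std::sync::Arc;

/// Toy stand-in for a dataframe (assumption: columns are reference-counted buffers).
struct ToyFrame {
    columns: Vec<Arc<Vec<f64>>>,
}

/// Address of the first value in the first column's buffer.
fn first_buffer_addr(frame: &ToyFrame) -> *const f64 {
    frame.columns[0].as_slice().as_ptr()
}

/// Move the frame into a fresh heap allocation and report the buffer
/// address before and after the move.
fn move_to_heap(frame: ToyFrame) -> (*const f64, *const f64, Box<ToyFrame>) {
    let before = first_buffer_addr(&frame);
    // Only the small handle (the Vec's pointer, length and capacity) is moved;
    // the column buffers themselves are not touched.
    let boxed = Box::new(frame);
    let after = first_buffer_addr(&boxed);
    (before, after, boxed)
}
```

If the two addresses returned by `move_to_heap` are equal, the buffer was not copied. Whether the same holds for a real `PyDataFrame` depends on how the wrapped type stores its data.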

jmrgibson avatar Aug 26 '22 21:08 jmrgibson

@ritchie46, do you think it's possible to convert a LazyFrame from Python to Rust and back like you did here with the eager frame?

AnatolyBuga avatar Dec 14 '22 22:12 AnatolyBuga

@ritchie46, do you think it's possible to convert a LazyFrame from Python to Rust and back like you did here with the eager frame?

You'd need to serialize the query plan. This will copy data if you use df.lazy(). If you start your query with pl.scan_x then it won't.

ritchie46 avatar Dec 15 '22 10:12 ritchie46

I don't think we should ship the Python interface for that. We could use Arrow's C interface for that. That is zero copy and much slimmer.

I think this is a good suggestion for making the Python interface easier to use from third-party bindings. The example code in the python_rust_compiled_function directory only shows how to transfer a single Series through the C Data interface. The C Data interface doesn't define how to transfer an entire DataFrame per se, but you can do it by convention, by treating a DataFrame as a struct of all the columns you wish to move. That would be useful helper code to make available to people who want to extend Polars but don't have a ton of Arrow experience.
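The struct-of-columns convention can be sketched with a toy model. This is not the Arrow C Data Interface itself and uses no Arrow crate; `Column`, `StructColumn`, `pack_frame` and `unpack_frame` are hypothetical names chosen for illustration. The idea is that a single struct value with one named child per column can carry a whole frame, provided all children have the same length:

```rust
/// Toy column type; real Arrow arrays also carry validity bitmaps and typed buffers.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int64(Vec<i64>),
    Float64(Vec<f64>),
    Utf8(Vec<String>),
}

impl Column {
    fn len(&self) -> usize {
        match self {
            Column::Int64(v) => v.len(),
            Column::Float64(v) => v.len(),
            Column::Utf8(v) => v.len(),
        }
    }
}

/// Hypothetical stand-in for a struct array: one named child per column.
#[derive(Debug, Clone, PartialEq)]
struct StructColumn {
    len: usize,
    fields: Vec<(String, Column)>,
}

/// Pack the columns of a frame into a single struct value.
/// Fails if the columns do not all have the same length.
fn pack_frame(columns: Vec<(String, Column)>) -> Result<StructColumn, String> {
    let len = columns.first().map(|(_, c)| c.len()).unwrap_or(0);
    for (name, col) in &columns {
        if col.len() != len {
            return Err(format!(
                "column '{}' has length {}, expected {}",
                name,
                col.len(),
                len
            ));
        }
    }
    Ok(StructColumn { len, fields: columns })
}

/// Unpack the struct value back into the frame's named columns.
fn unpack_frame(array: StructColumn) -> Vec<(String, Column)> {
    array.fields
}
```

With a real Arrow implementation, the single struct array would be the one object exported and imported across the boundary, and the receiving side would rebuild the frame from its children.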

kylebarron avatar Dec 23 '22 13:12 kylebarron

I have a setup of a crate that does this for you hidden behind pyo3 bindings. But haven't yet had the bandwidth/priority to finish this.

ritchie46 avatar Dec 23 '22 13:12 ritchie46

I have a setup of a crate that does this for you hidden behind pyo3 bindings. But haven't yet had the bandwidth/priority to finish this.

@ritchie46 that would be really useful, especially for types beyond Series/DataFrame (like LazyFrame). I can try helping (although I am still a bit of a noob).

AnatolyBuga avatar Dec 23 '22 21:12 AnatolyBuga

I just want to echo that a succinct example of how to create a PyDataFrame in a new Rust project and pass it back into Python code would be very helpful to me and @andyjslee.

iskandr avatar Dec 23 '22 21:12 iskandr

@ritchie46 mentioned on discord: https://github.com/pola-rs/pyo3-polars

kylebarron avatar Jan 07 '23 21:01 kylebarron

Yes, this is the way to go.

ritchie46 avatar Jan 08 '23 07:01 ritchie46

Thanks, the pyo3-polars crate is exactly what I was looking for!

OliverEvans96 avatar Jan 05 '24 20:01 OliverEvans96