gokart
[Feature Request] Using Polars for loading and dumping data
Hello, thank you for developing such a cool tool!
Summary
I have one feature request: using Polars for loading and dumping data. Polars is a fast DataFrame library implemented in Rust that uses the Apache Arrow columnar format as its memory model. If gokart supported it, the machine learning cycle could become even faster.
Implementation idea
I have tried a very simple implementation for parquet files here. The changes are as follows.
- Add a config module as gokart/config, with an __init__.py in it.
# gokart/config/__init__.py
from gokart.config import config
from gokart.config.config import (
    get_option,
    set_option,
)
- Create config.py in gokart/config. This file contains the "_global_config" variable and the "register_option", "get_option", and "set_option" functions. "_global_config" holds the global settings as a dictionary and is accessed through these functions. (Currently, config_init.py registers only the "use_polars" option in "_global_config".)
# gokart/config/config.py
from typing import Any, Dict

_global_config: Dict[str, Any] = {}


def register_option(
    key: str,
    val: object,
    doc: str = "",
) -> None:
    _global_config.update({key: val})


def get_option(
    key: str,
) -> object:
    assert key in _global_config, f"No such keys: {key}"
    return _global_config[key]


def set_option(
    key: str,
    val: object,
    doc: str = "",
) -> None:
    assert key in _global_config, f"No such keys: {key}"
    _global_config.update({key: val})
- Create config_init.py in gokart/config. This file is used for "_global_config" initialization.
# gokart/config/config_init.py
import gokart.config.config as cf
use_polars = """
: boolean
    Whether to use polars instead of pandas
"""

cf.register_option(
    "use_polars",
    False,
    use_polars,
)
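To illustrate how the registry described above would behave once config_init.py has run, here is a minimal standalone sketch. It re-declares simplified versions of the proposed functions so that it runs without gokart; it is not gokart's actual API. As an assumption of this sketch rather than part of the proposal, it raises `KeyError` instead of using `assert`, because assertions are skipped when Python runs with `-O`.

```python
from typing import Any, Dict

# Standalone stand-in for the proposed gokart/config/config.py (not gokart's actual API).
_global_config: Dict[str, Any] = {}


def register_option(key: str, val: Any, doc: str = "") -> None:
    """Declare an option together with its default value."""
    _global_config[key] = val


def get_option(key: str) -> Any:
    """Return the current value of a registered option."""
    if key not in _global_config:
        raise KeyError(f"No such key: {key}")
    return _global_config[key]


def set_option(key: str, val: Any) -> None:
    """Change the value of an option that has already been registered."""
    if key not in _global_config:
        raise KeyError(f"No such key: {key}")
    _global_config[key] = val


# Mirrors what config_init.py does at import time.
register_option("use_polars", False, "Whether to use polars instead of pandas")

default_value = get_option("use_polars")  # the default registered above
set_option("use_polars", True)            # opt in to Polars
updated_value = get_option("use_polars")
```

With this shape, a user would opt in once at startup and every file processor could consult the same flag; setting an unregistered key fails loudly instead of silently creating a new option.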
- Modify gokart/__init__.py to import gokart.config.
# gokart/__init__.py
from gokart.config import config_init, get_option, set_option
from gokart.build import build
...
- Modify the ParquetFileProcessor class in gokart/file_processor.py to load and dump data with Polars when the "use_polars" option is True.
class ParquetFileProcessor(FileProcessor):
    ...

    def load(self, file):
        # MEMO: read_parquet only supports a filepath as string (not a file handle)
        if get_option("use_polars"):
            return pl.read_parquet(file.name)
        else:
            return pd.read_parquet(file.name)

    def dump(self, obj, file):
        # pl.DataFrame is the public name of the Polars DataFrame class
        assert isinstance(obj, (pd.DataFrame, pl.DataFrame)), \
            f'requires pd.DataFrame or pl.DataFrame, but {type(obj)} is passed.'
        # MEMO: to_parquet only supports a filepath as string (not a file handle)
        if isinstance(obj, pd.DataFrame):
            obj.to_parquet(file.name, index=False, compression=self._compression)
        else:
            obj.write_parquet(file.name, compression=self._compression if self._compression is not None else 'zstd')
I am not very familiar with the best practices for an option like this, but if you tell me what needs to be fixed, I can work on it and make a pull request.
@takeyama0 Thanks for your suggestion and implementation idea! I'm in favor of supporting polars because of its good performance, as you suggest.
IMO, I would like to move pandas and polars to Python extras and raise an import error when users use pandas/polars features without having the library installed.
This is because I think no application uses both pandas and polars.
@Hi-king @ujiuji1259 @mski-iksm What do you think about this?
@takeyama0 Thanks for your suggestion! I think it’s great to support Polars too.
And I basically agree with @hirosassa ’s idea to minimize dependencies, but I’m a little concerned about moving pandas to extras because some common methods (like TaskOnKart.load_data_frame) already use pandas.
@hirosassa , @ujiuji1259 Thank you for your replies! I am glad to hear your positive feedback about supporting polars.