fastexcel icon indicating copy to clipboard operation
fastexcel copied to clipboard

feat: introduce eager loading functions

Open lukapeschke opened this issue 1 year ago • 6 comments

What

This introduces eager loading functions that make use of the calamine's new DataTypeRef.

This prevents some allocations, resulting in a lower memory footprint.

Caveats

  • The API is kinda rough for now, it will probably need some cleaning (I mostly wanted to check if the memory gain was interesting here).

  • The functions need to be eager because DataTypeRef has an explicit lifetime, which is not allowed by PyO3 (lifetimes are hard to enforce on the python side: https://pyo3.rs/v0.20.0/class.html#no-lifetime-parameters)

  • In order for this to work, some changes are needed in calamine, and we don't know if this is something the library maintainers had in mind. PR and discussion: https://github.com/tafia/calamine/pull/390

Gains

While the speed stays roughly the same (it was even 3~5% faster on my machine on several tests), the memory footprint decreases by almost 25%. . This means that we're almost as good as pandas memory-wise :partying_face: (they still beat us by a few MBs), while being about 10 times faster

Before

before

After

after

Pandas

pandas

lukapeschke avatar Dec 22 '23 14:12 lukapeschke

Some work is still required in calamine: https://github.com/tafia/calamine/pull/409

lukapeschke avatar Feb 20 '24 12:02 lukapeschke

Okay well just noticed that the API changed so we actually need to use workshet_range_ref in case Sheets are the Xlsx variant

lukapeschke avatar Feb 26 '24 13:02 lukapeschke

Glad to see https://github.com/tafia/calamine/pull/409 has been merged. Hopefully we get a new release soon 👍

PrettyWood avatar Feb 27 '24 16:02 PrettyWood

new data

main

import argparse
from time import sleep
import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()


if __name__ == "__main__":
    main()

main

this branch

import argparse
from time import sleep
import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_eager(sheet_name)
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()


if __name__ == "__main__":
    main()

branch

lukapeschke avatar Feb 28 '24 11:02 lukapeschke

New benchmark looks great 😃

PrettyWood avatar Feb 28 '24 11:02 PrettyWood

Good news, looks like we should be able to have lazy-by-ref once a new calamine version is out :partying_face:

Benchmarks with the latest version:

iterations owned by ref
1 lazy eager
20 lazy_20 eager_20

lukapeschke avatar Mar 04 '24 21:03 lukapeschke

calamine 0.25.0 should be released soon, meaning I should finally be able to finish this :slightly_smiling_face: https://github.com/tafia/calamine/issues/435

lukapeschke avatar May 22 '24 08:05 lukapeschke

latest measurements with this branch

iterations master this branch (lazy) this branch (eager)
1 master_1 lazy_1 eager_1
20 master_20 lazy_20 eager_20

lukapeschke avatar Jun 18 '24 14:06 lukapeschke