What

This introduces eager loading functions that make use of the calamine's new DataTypeRef.

This prevents some allocations, resulting in a lower memory footprint.

Caveats

The API is kinda rough for now, it will probably need some cleaning (I mostly wanted to check if the memory gain was interesting here).
The functions need to be eager because DataTypeRef has an explicit lifetime, which is not allowed by PyO3 (lifetimes are hard to enforce on the python side: https://pyo3.rs/v0.20.0/class.html#no-lifetime-parameters)
In order for this to work, some changes are needed in calamine, and we don't know if this is something the library maintainers had in mind. PR and discussion: https://github.com/tafia/calamine/pull/390

Gains

While the speed stays roughly the same (it was even 3~5% faster on my machine on several tests), the memory footprint decreases by almost 25%. . This means that we're almost as good as pandas memory-wise :partying_face: (they still beat us by a few MBs), while being about 10 times faster

Before

before

After

after

Pandas

pandas

Dec 22 '23 14:12 lukapeschke

Some work is still required in calamine: https://github.com/tafia/calamine/pull/409

Feb 20 '24 12:02 lukapeschke

Okay well just noticed that the API changed so we actually need to use workshet_range_ref in case Sheets are the Xlsx variant

Feb 26 '24 13:02 lukapeschke

Glad to see https://github.com/tafia/calamine/pull/409 has been merged. Hopefully we get a new release soon 👍

Feb 27 '24 16:02 PrettyWood

new data

`main`

import argparse
from time import sleep
import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()


if __name__ == "__main__":
    main()

main

this branch

import argparse
from time import sleep
import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_eager(sheet_name)
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()


if __name__ == "__main__":
    main()

branch

Feb 28 '24 11:02 lukapeschke

New benchmark looks great 😃

Feb 28 '24 11:02 PrettyWood

Good news, looks like we should be able to have lazy-by-ref once a new calamine version is out :partying_face:

Benchmarks with the latest version:

iterations	owned	by ref
1
20

Mar 04 '24 21:03 lukapeschke

calamine 0.25.0 should be released soon, meaning I should finally be able to finish this :slightly_smiling_face: https://github.com/tafia/calamine/issues/435

May 22 '24 08:05 lukapeschke