fastexcel
fastexcel copied to clipboard
feat: introduce eager loading functions
What
This introduces eager loading functions that make use of the calamine's new DataTypeRef
.
This prevents some allocations, resulting in a lower memory footprint.
Caveats
-
The API is kinda rough for now, it will probably need some cleaning (I mostly wanted to check if the memory gain was interesting here).
-
The functions need to be eager because
DataTypeRef
has an explicit lifetime, which is not allowed by PyO3 (lifetimes are hard to enforce on the python side: https://pyo3.rs/v0.20.0/class.html#no-lifetime-parameters) -
In order for this to work, some changes are needed in calamine, and we don't know if this is something the library maintainers had in mind. PR and discussion: https://github.com/tafia/calamine/pull/390
Gains
While the speed stays roughly the same (it was even 3~5% faster on my machine on several tests), the memory footprint decreases by almost 25%. . This means that we're almost as good as pandas memory-wise :partying_face: (they still beat us by a few MBs), while being about 10 times faster
Before
After
Pandas
Some work is still required in calamine: https://github.com/tafia/calamine/pull/409
Okay well just noticed that the API changed so we actually need to use workshet_range_ref
in case Sheets
are the Xlsx
variant
Glad to see https://github.com/tafia/calamine/pull/409 has been merged. Hopefully we get a new release soon 👍
new data
main
import argparse
from time import sleep
import fastexcel
def get_args() -> argparse.Namespace:
parser = argparse.ArgumentParser()
parser.add_argument("file")
parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
return parser.parse_args()
def main():
args = get_args()
excel_file = fastexcel.read_excel(args.file)
use_columns = args.column or None
for sheet_name in excel_file.sheet_names:
arrow_data = excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
# sleeping to be really visible on the resulting graph
sleep(1)
arrow_data.to_pandas()
if __name__ == "__main__":
main()
this branch
import argparse
from time import sleep
import fastexcel
def get_args() -> argparse.Namespace:
parser = argparse.ArgumentParser()
parser.add_argument("file")
parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
return parser.parse_args()
def main():
args = get_args()
excel_file = fastexcel.read_excel(args.file)
use_columns = args.column or None
for sheet_name in excel_file.sheet_names:
arrow_data = excel_file.load_sheet_eager(sheet_name)
# sleeping to be really visible on the resulting graph
sleep(1)
arrow_data.to_pandas()
if __name__ == "__main__":
main()
New benchmark looks great 😃
Good news, looks like we should be able to have lazy-by-ref once a new calamine version is out :partying_face:
Benchmarks with the latest version:
iterations | owned | by ref |
---|---|---|
1 | ||
20 |
calamine 0.25.0 should be released soon, meaning I should finally be able to finish this :slightly_smiling_face: https://github.com/tafia/calamine/issues/435
latest measurements with this branch
iterations | master | this branch (lazy) | this branch (eager) |
---|---|---|---|
1 | |||
20 |