
Feature: Support XLSM, XLSB & Replace excel engine with faster calamine

Open yeungadrian opened this issue 1 month ago • 0 comments

Background

python-calamine (https://pypi.org/project/python-calamine/) is a Rust-based Excel reader that supports .xls, .xlsx, .xlsm and .xlsb (and also .ods).

  • Calamine should provide a "free" bump in performance
  • Using calamine & tabulate instead of pandas should also result in fewer dependencies
  • It should allow a cleaner solution to #52 (filtering empty rows & columns) with a small additional change

Changes

  • Replace pandas, openpyxl & xlrd with calamine & tabulate
  • Support xlsm & xlsb
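A rough sketch of what the swap might look like, assuming `python-calamine` and `tabulate` are both installed. `workbook_to_markdown` and `is_supported` are hypothetical names used for illustration only, not existing markitdown functions, and the `github` table format is just one possible choice:

```python
from pathlib import Path

# Extensions listed in this issue as readable by python-calamine.
SUPPORTED_EXTENSIONS = {".xls", ".xlsx", ".xlsm", ".xlsb", ".ods"}


def is_supported(path: str) -> bool:
    """Return True if the file extension is in the list above (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def workbook_to_markdown(path: str) -> str:
    """Render every sheet of a workbook as a Markdown table (hypothetical helper)."""
    # Imported lazily so this module still loads without the optional dependencies.
    from python_calamine import CalamineWorkbook
    from tabulate import tabulate

    if not is_supported(path):
        raise ValueError(f"Unsupported spreadsheet format: {path}")

    workbook = CalamineWorkbook.from_path(path)
    sections = []
    for name in workbook.sheet_names:
        # to_python() returns the sheet as a list of rows (lists of cell values).
        rows = workbook.get_sheet_by_name(name).to_python()
        # Assumption: the first row of each sheet is treated as the header row.
        table = tabulate(rows, headers="firstrow", tablefmt="github") if rows else ""
        sections.append(f"## {name}\n{table}")
    return "\n\n".join(sections)
```

Since the table is produced directly from the row lists, no intermediate DataFrame or HTML step would be needed in this version.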

Open questions

  • An argument against: python-calamine is not as well known
  • Could use tabulate to build the table directly, instead of the additional HTML-to-string conversion
  • A future PR could expose skip_empty_area as a CLI option / argument

yeungadrian, Jan 03 '25 23:01