bids-specification
Removing Pandas

bidsschematools.render imports Pandas and bidsschematools.render.utils includes a lot of functions which operate on dataframes.
While we have moved pandas to optional dependency status and smart package managers will even omit installing the render module to avoid errors in such a case, this still means we are potentially requiring over 350MB of dependency stack to simply render a table (well, not even to render it, really):
chymera@darkhost ~/src/bids-specification/tools/schemacode/bidsschematools $ du -sh /usr/lib/python3.10/site-packages/pandas/
87M /usr/lib/python3.10/site-packages/pandas/
chymera@darkhost ~/src/bids-specification/tools/schemacode/bidsschematools $ du -sh /usr/lib/python3.10/site-packages/numpy
40M /usr/lib/python3.10/site-packages/numpy
chymera@darkhost ~/src/bids-specification/tools/schemacode/bidsschematools $ du -sh /usr/lib/python3.10/site-packages/matplotlib
70M /usr/lib/python3.10/site-packages/matplotlib
chymera@darkhost ~/src/bids-specification/tools/schemacode/bidsschematools $ du -sh /usr/lib/python3.10/site-packages/scipy
100M /usr/lib/python3.10/site-packages/scipy
chymera@darkhost ~/src/bids-specification/tools/schemacode/bidsschematools $ du -sh /usr/lib/python3.10/site-packages/statsmodels
70M /usr/lib/python3.10/site-packages/statsmodels
Additionally, pandas' sprawling dependency graph can impede support for newer Python versions (currently pandas is still not marked as working on Python 3.11 because jedi, a dependency of its optional dependency ipython, is having issues with it).
Since we are only using it to pass an input to tabulate, which works perfectly well on a list of lists, and since we're not using any of the numerical/statistical features which are the actual purpose of pandas, I think it would be a good idea to fully drop it. As far as I've read into the code, the only thing we really need is filtering, which can be done perfectly well, and perhaps even more easily and/or more transparently, with vanilla Python. @effigies suggested raccoon (small package, basically no dependencies), but that can't really be used as a drop-in replacement, and the author states in the readme that he hopes the package will be deprecated in favour of better pandas support for rapid line adding. Given all of that, I went ahead and started duplicating the functions which use pandas, adding transition tests of the following form to check whether we get the same results without it:
def test_make_entity_table_transition(schema_obj):
    entity_table = render.make_entity_table(schema_obj)
    _entity_table = render._make_entity_table(schema_obj)
    assert entity_table == _entity_table
Now, this will take more time, but I'm already noticing an issue, namely that the resulting strings are very large and confusing to diff. @tsalo do you perhaps have any suggestion for how to easily view the outputs, based on what you used during development? Also, starting with the entity table was probably a bad decision since it's one of our bigger ones... do you perhaps know which pandas-using function generates the smallest table, so I can start with that one and work my way up?
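For what it's worth, one way to make the comparison of two large rendered strings readable is a line-based unified diff from the standard library. This is only a sketch: `diff_tables` is a hypothetical helper name, and the two table strings are toy stand-ins, not real output of `make_entity_table`.

```python
import difflib


def diff_tables(old: str, new: str, context: int = 1) -> str:
    """Return a unified diff of two rendered tables, compared line by line."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="pandas",
        tofile="vanilla",
        n=context,      # number of unchanged context lines around each change
        lineterm="",    # inputs have no trailing newlines, so add none here
    )
    return "\n".join(diff)


# Toy stand-ins for the two rendered tables (assumption, not real schema output).
old_table = "| suffix | name |\n|--------|------|\n| bold | BOLD |\n| T1w | T1w image |"
new_table = "| suffix | name |\n|--------|------|\n| bold | BOLD |\n| T1w | T1-weighted |"

print(diff_tables(old_table, new_table))
```

In a transition test, the diff could be passed as the assertion message, e.g. `assert entity_table == _entity_table, diff_tables(entity_table, _entity_table)`, so that only the differing rows show up on failure.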
We should check out polars. It seems much more lightweight than pandas and is actively developed. I was able to replace
pddf = pd.DataFrame.from_records(list(schema.objects.suffixes.values()))
With:
pldf = pl.from_records(list(schema.objects.suffixes.to_dict().values()))
how does this relate to #1214?
Yep, forgot there was one. In any case there's more info here. @sappelhoff regarding the point you raised over there: I too agree that a widespread syntax has advantages over a more niche dependency, which is why I think vanilla python would be worth exploring (since pandas is somewhat niche as well). If that doesn't work, polars might still be good... going by the example above, its syntax seems fairly similar to that of pandas.
my 2 cents on this are:

> since pandas is somewhat niche as well

I don't consider pandas niche, but a core package on a level with numpy, scipy, and matplotlib; every scientist working with Python will know of pandas.

> polars might still be good

I am not convinced by the benefits of dropping pandas in favour of another library. Do I understand correctly that these are the benefits?
- not having to download as many MB per installation (~300MB or so less)
- being able to more quickly adopt the newest Python versions

> vanilla python would be worth exploring

If that would be viable with a few functions that re-use dependencies that we already have, I would prefer that route.
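To make the vanilla route a bit more concrete, here is a minimal sketch of filtering a list of records and flattening it into a list of lists, which is the shape tabulate accepts. The helper name `records_to_rows` and the example records are hypothetical and only loosely modelled on schema entries; this is not how the render module currently does it.

```python
def records_to_rows(records, columns, keep=lambda record: True):
    """Filter a list of dicts and flatten it into (headers, rows)."""
    rows = [
        # Missing keys become empty cells instead of raising KeyError.
        [record.get(column, "") for column in columns]
        for record in records
        if keep(record)
    ]
    return list(columns), rows


# Hypothetical records, loosely shaped like schema suffix entries.
suffixes = [
    {"value": "bold", "display_name": "BOLD", "unit": "arbitrary"},
    {"value": "T1w", "display_name": "T1-weighted image"},
    {"value": "phase", "display_name": "Phase image", "unit": "arbitrary"},
]

headers, rows = records_to_rows(
    suffixes,
    columns=["value", "display_name"],
    keep=lambda record: "unit" in record,  # keep only entries that define a unit
)
# The result could then be handed to tabulate, e.g. tabulate(rows, headers=headers).
print(headers, rows)
```

The filter is an ordinary predicate on a dict, so the selection logic stays visible at the call site and needs nothing beyond the standard library.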