pandantic
use df.itertuples instead of df.to_dict()
Hi there! Since a key motivation here is performance, you should consider using `itertuples` instead of `to_dict` for iterating over the dataframe; I've found the snippet below to be about 3x faster:
```python
col_names = list(df.columns)
dict_iter = (dict(zip(col_names, row)) for row in df.itertuples(index=False, name=None))
```
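For context, here is a minimal self-contained illustration of what that generator yields (the toy frame and column names are my own, purely illustrative):

```python
import pandas as pd

# Toy frame standing in for any dataframe; values are illustrative only.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

col_names = list(df.columns)
dict_iter = (dict(zip(col_names, row)) for row in df.itertuples(index=False, name=None))

# Materialise the generator: one dict per row, keyed by column name,
# equivalent to df.to_dict(orient="records").
rows = list(dict_iter)
```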
apologies, performance of DataFrame.to_dict seems to have vastly improved in pandas 2.0, so this is no longer appropriate
Hi @hottwaj, thank you for the suggestion. I see you found some more info and closed the issue. Nonetheless, thank you for thinking along, as performance is indeed an important aspect.
apologies again, I got confused! The itertuples snippet is still about twice as fast in pandas 2.0, so it is still worth a look :)
It is? I will have a look, thank you for the suggestion again!
See unittest below
```python
import datetime
import timeit
import unittest

import numpy
import pandas


def iter_df_as_dicts(df):
    col_names = list(df.columns)
    return (dict(zip(col_names, row)) for row in df.itertuples(index=False, name=None))


class TestIterDf(unittest.TestCase):
    def get_source_data(self, n=10000):
        return pandas.DataFrame({
            'floats': numpy.random.random(size=n),
            'ints': numpy.random.randint(0, 1000, size=n),
            'strings': ['blah'] * n,
            'dates': [datetime.date(2023, 1, 1)] * n,
        })

    def _tolist_using_iter(self, source_data: pandas.DataFrame) -> list:
        return list(iter_df_as_dicts(source_data))

    def _tolist_using_to_dict(self, source_data: pandas.DataFrame) -> list:
        return source_data.to_dict(orient='records')

    def test_iteration(self):
        source_data = self.get_source_data()
        iter_dicts = self._tolist_using_iter(source_data)
        df_records = self._tolist_using_to_dict(source_data)
        self.assertEqual(iter_dicts, df_records)

    def test_performance(self):
        source_data = self.get_source_data(n=100000)
        iter_time = timeit.timeit(lambda: self._tolist_using_iter(source_data), number=5)
        todict_time = timeit.timeit(lambda: self._tolist_using_to_dict(source_data), number=5)
        # assert that iter_df_as_dicts is at least 1.5x faster
        # (in pandas 2.0.0 it is ~2x faster, but allow a margin for error)
        self.assertLess(iter_time, todict_time / 1.5)
```
I actually tried this on a 3M numerical dataset, and it seems the method you proposed decreases performance there, so I will be closing this issue. Again, thank you for bringing this up.
Hey, thanks for this.

I think this makes the most difference when you have columns of different types in your DF, rather than all columns of the same type.

When you have mixed-type columns, iterating over rows using `to_records` or similar internally works by casting each row to a Series of dtype `object`, which slows things down compared to iterating over plain tuples.

If the DF contains only a single dtype across every column, I can see how `to_records` would be faster.

I guess I am mostly working on DFs with mixed column types (int, float, categorical, date), hence I found this snippet faster for those.
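A quick timing sketch of this point (my own, not from the thread): it compares both approaches on a mixed-dtype frame versus a uniform-dtype frame. Exact numbers will vary by pandas version and machine, so no speedup is asserted here:

```python
import datetime
import timeit

import numpy as np
import pandas as pd

n = 50_000

# Frame with four different dtypes, mirroring the unittest above.
mixed = pd.DataFrame({
    "floats": np.random.random(size=n),
    "ints": np.random.randint(0, 1000, size=n),
    "strings": ["blah"] * n,
    "dates": [datetime.date(2023, 1, 1)] * n,
})

# Frame where every column is float64.
uniform = pd.DataFrame(np.random.random(size=(n, 4)), columns=list("abcd"))


def via_itertuples(df):
    cols = list(df.columns)
    return [dict(zip(cols, row)) for row in df.itertuples(index=False, name=None)]


for label, df in [("mixed", mixed), ("uniform", uniform)]:
    t_iter = timeit.timeit(lambda: via_itertuples(df), number=2)
    t_dict = timeit.timeit(lambda: df.to_dict(orient="records"), number=2)
    print(f"{label}: itertuples={t_iter:.2f}s  to_dict={t_dict:.2f}s")
```

Both paths produce equivalent row dicts, so any gap in the printed timings is purely the iteration strategy.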
It is true that this dataset only consists of floats and ints. Do you have a proposal for another dummy dataset I can use for benchmarking? I have tried this approach on the dunnhumby dataset (the transaction_data.csv set). I am happy to give it another spin with a more representative dataset.
you could increase n in the unittest I posted above? https://github.com/wesselhuising/pandantic/issues/11#issuecomment-1642696875
a check could also be added to `iter_df_as_dicts` to use `to_records`/similar if all columns have the same dtype
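That check might look something like the following (a sketch of mine, not code from the library; it dispatches on dtype uniformity using the `iter_df_as_dicts` helper from the snippet above, with `to_dict` standing in for the uniform-dtype fast path):

```python
import datetime

import pandas as pd


def iter_df_as_dicts(df: pd.DataFrame):
    """Yield each row as a dict, choosing a strategy based on the dtypes present."""
    # If every column shares one dtype, to_dict's single-dtype path tends to win;
    # otherwise use itertuples to avoid per-row object-dtype casting.
    if df.dtypes.nunique() == 1:
        yield from df.to_dict(orient="records")
    else:
        col_names = list(df.columns)
        yield from (dict(zip(col_names, row))
                    for row in df.itertuples(index=False, name=None))


# Example frames: mixed dtypes take the itertuples path,
# uniform dtypes take the to_dict path.
mixed = pd.DataFrame({"x": [1, 2], "d": [datetime.date(2023, 1, 1)] * 2})
uniform = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
```

Either path yields the same records, so callers don't need to know which branch ran.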
you could also try e.g. https://www.kaggle.com/datasets/regulad/amazon https://www.kaggle.com/datasets/jordankrishnayah/45m-headlines-from-2007-2022-10-largest-sites
@hottwaj in upcoming changes we are looking to create this functionality.
@wesselhuising Since this is pandas-specific, it should be functionality in the plugin from my perspective. Both `df.pandantic.iterrows(MyModel)` and `df.pandantic.iter_tuples(MyModel)` should be possible. Assigning to myself, and attaching to our v0.1.0 milestone.
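One way such a `df.pandantic` namespace could be prototyped is a pandas dataframe accessor. This is purely a sketch, not the planned implementation: the method names follow the comment above, and the `Record` dataclass stands in for a pydantic model (anything constructible from row values):

```python
from dataclasses import dataclass

import pandas as pd


@pd.api.extensions.register_dataframe_accessor("pandantic")
class PandanticAccessor:
    """Sketch of a df.pandantic namespace; model_cls stands in for a pydantic model."""

    def __init__(self, df: pd.DataFrame):
        self._df = df

    def iterrows(self, model_cls):
        # Build each model from a column-name -> value dict (keyword construction).
        cols = list(self._df.columns)
        for row in self._df.itertuples(index=False, name=None):
            yield model_cls(**dict(zip(cols, row)))

    def iter_tuples(self, model_cls):
        # Hand the model positional values in column order.
        for row in self._df.itertuples(index=False, name=None):
            yield model_cls(*row)


@dataclass
class Record:  # hypothetical stand-in for a pydantic model
    name: str
    score: float


df = pd.DataFrame({"name": ["a", "b"], "score": [1.0, 2.0]})
records = list(df.pandantic.iterrows(Record))
```

With a real pydantic model, construction in `iterrows` would also trigger per-row validation, which is presumably the point of the proposed API.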
Closing in favor of similar, but broader, issue #27