pandantic

use df.itertuples instead of df.to_dict()

hottwaj opened this issue on Jul 19 '23 • 10 comments

Hi there! Since a key motivation here is performance, you should consider using itertuples instead of to_dict for iterating over the DataFrame; I've found the snippet below to be about 3x faster:

# Build dicts by zipping column names with plain row tuples (no index, no namedtuples).
col_names = list(df.columns)
dict_iter = (dict(zip(col_names, row)) for row in df.itertuples(index=False, name=None))
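
For context, here is a hedged sketch of how such a generator could feed Pydantic validation (the Row model and columns below are purely illustrative, not part of pandantic's API):

import pandas
import pydantic

class Row(pydantic.BaseModel):
    name: str
    price: float

df = pandas.DataFrame({'name': ['a', 'b'], 'price': [1.0, 2.5]})
col_names = list(df.columns)
dict_iter = (dict(zip(col_names, row)) for row in df.itertuples(index=False, name=None))
validated = [Row(**record) for record in dict_iter]  # raises pydantic.ValidationError on bad rows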

hottwaj commented Jul 19 '23 13:07

Apologies, the performance of DataFrame.to_dict seems to have improved vastly in pandas 2.0, so this suggestion is no longer appropriate.

hottwaj commented Jul 19 '23 13:07

Hi @hottwaj, thank you for the suggestion. I see you found some more info and closed the issue. Nonetheless, thank you for thinking along, as performance is indeed an important aspect.

wesselhuising commented Jul 19 '23 13:07

Apologies again, I got confused! The itertuples snippet is still about twice as fast in pandas 2.0, so it is still worth a look :)

hottwaj commented Jul 19 '23 13:07

It is? I will have a look, thank you for the suggestion again!

wesselhuising commented Jul 19 '23 14:07

See the unittest below:

import unittest
import datetime
import numpy
import pandas
import timeit

def iter_df_as_dicts(df):
    """Yield each row of df as a dict by zipping column names with plain row tuples."""
    col_names = list(df.columns)
    return (dict(zip(col_names, row)) for row in df.itertuples(index=False, name=None))

class TestIterDf(unittest.TestCase):
    def get_source_data(self, n=10000):
        # Mixed-dtype frame: floats, ints, strings and dates.
        return pandas.DataFrame({
            'floats': numpy.random.random(size=n),
            'ints': numpy.random.randint(0, 1000, size=n),
            'strings': ['blah'] * n,
            'dates': [datetime.date(2023, 1, 1)] * n,
        })

    def _tolist_using_iter(self, source_data: pandas.DataFrame) -> list:
        return list(iter_df_as_dicts(source_data))

    def _tolist_using_to_dict(self, source_data: pandas.DataFrame) -> list:
        return source_data.to_dict(orient='records')

    def test_iteration(self):
        # Both approaches should produce identical lists of row dicts.
        source_data = self.get_source_data()
        iter_dicts = self._tolist_using_iter(source_data)
        df_records = self._tolist_using_to_dict(source_data)

        self.assertEqual(iter_dicts, df_records)

    def test_performance(self):
        source_data = self.get_source_data(n=100000)

        iter_time = timeit.timeit(lambda: self._tolist_using_iter(source_data), number=5)
        todict_time = timeit.timeit(lambda: self._tolist_using_to_dict(source_data), number=5)

        # assert that iter_df_as_dicts is at least 1.5x faster
        # (in pandas 2.0.0 it is ~2x faster but allow a margin for error)
        self.assertLess(iter_time, todict_time / 1.5)
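
For reference, assuming the snippet above is saved as test_iter_df.py, it can be run with the standard unittest runner:

python -m unittest test_iter_df -v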

hottwaj commented Jul 19 '23 20:07

I actually tried this on a 3M-row numerical dataset and it seems that the method you proposed decreases performance, so I will be closing this issue. Again, thank you for bringing this up.

wesselhuising commented Jul 27 '23 10:07

Hey, thanks for this!

I think this makes the most difference when you have columns of different types in your DF, rather than all columns of the same type.

When you have mixed-type columns, iterating over rows using to_records or similar internally works by casting each row to a Series of dtype "object", which slows things down compared to iterating plain tuples.

If the DF contains only a single dtype (across every column), I can see how to_records would be faster.

I guess I am mostly working on DFs with columns of different types (int, float, categorical, date), hence I found this snippet faster for those.
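
To illustrate the casting point, here is a standalone sketch (not pandantic code): selecting a single row from a mixed-dtype frame upcasts it to dtype "object", while the plain tuples from itertuples keep the per-column values untouched.

import datetime
import pandas

df = pandas.DataFrame({
    'ints': [1, 2],
    'floats': [0.5, 1.5],
    'dates': [datetime.date(2023, 1, 1)] * 2,
})
print(df.dtypes.tolist())    # int64, float64 and object dtypes side by side
print(df.iloc[0].dtype)      # object: the row Series is upcast to a common dtype
print(next(df.itertuples(index=False, name=None)))  # (1, 0.5, datetime.date(2023, 1, 1))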

hottwaj commented Jul 27 '23 10:07

It is true that this dataset only consists of floats and ints. Do you have a proposal for another dummy dataset I can use for benchmarking? I have tried this approach on the dunnhumby dataset, which is the transaction_data.csv set. I am happy to give it another spin with a more representative dataset.

wesselhuising commented Jul 27 '23 10:07

You could increase n in the unittest I posted above: https://github.com/wesselhuising/pandantic/issues/11#issuecomment-1642696875

A check could also be added to iter_df_as_dicts to use to_records (or similar) if all columns have the same dtype.
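
A rough sketch of that check (purely illustrative, assuming df.dtypes.nunique() as the way to detect a homogeneous frame):

import pandas

def iter_df_as_dicts(df: pandas.DataFrame):
    if df.dtypes.nunique() == 1:
        # Homogeneous frame: to_dict(orient='records') tends to be fast here.
        return iter(df.to_dict(orient='records'))
    # Mixed dtypes: zip plain tuples to avoid casting each row to object.
    col_names = list(df.columns)
    return (dict(zip(col_names, row)) for row in df.itertuples(index=False, name=None))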

hottwaj commented Jul 27 '23 10:07

You could also try e.g. https://www.kaggle.com/datasets/regulad/amazon or https://www.kaggle.com/datasets/jordankrishnayah/45m-headlines-from-2007-2022-10-largest-sites

hottwaj commented Jul 27 '23 11:07

@hottwaj in upcoming changes we are looking to create this functionality.

@wesselhuising since this is pandas specific, it should be functionality in the plugin from my perspective: df.pandantic.iterrows(MyModel) and df.pandantic.iter_tuples(MyModel) should both be possible. Assigning to myself and attaching to our v0.1.0 milestone.
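
For what it is worth, here is a minimal sketch of how such an accessor could be registered via pandas' extension API; the names and signatures are assumptions, not the actual pandantic plugin:

import pandas
import pydantic

@pandas.api.extensions.register_dataframe_accessor('pandantic')
class PandanticAccessor:
    def __init__(self, df: pandas.DataFrame):
        self._df = df

    def iter_tuples(self, model: type[pydantic.BaseModel]):
        # Validate each row against the given model, yielding model instances.
        col_names = list(self._df.columns)
        for row in self._df.itertuples(index=False, name=None):
            yield model(**dict(zip(col_names, row)))

With something like that registered, df.pandantic.iter_tuples(MyModel) would yield one validated MyModel instance per row.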

xaviernogueira commented Aug 27 '24 16:08

Closing in favor of similar, but broader, issue #27

xaviernogueira commented Sep 30 '24 04:09