pandantic

use df.itertuples instead of df.to_dict()

hottwaj opened this issue on Jul 19 '23 • 10 comments

Hi there! Since a key motivation here is performance, you should consider using itertuples instead of to_dict for iterating over the DataFrame; I've found the snippet below to be about 3x faster:

# Build dicts by zipping column names with plain row tuples (no index, no namedtuples).
col_names = list(df.columns)
dict_iter = (dict(zip(col_names, row)) for row in df.itertuples(index=False, name=None))
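
For context, here is a hedged sketch of how such a generator could feed Pydantic validation (the Row model and columns below are purely illustrative, not part of pandantic's API):

import pandas
import pydantic

class Row(pydantic.BaseModel):
    name: str
    price: float

df = pandas.DataFrame({'name': ['a', 'b'], 'price': [1.0, 2.5]})
col_names = list(df.columns)
dict_iter = (dict(zip(col_names, row)) for row in df.itertuples(index=False, name=None))
validated = [Row(**record) for record in dict_iter]  # raises pydantic.ValidationError on bad rows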

hottwaj commented Jul 19 '23 13:07

Apologies, the performance of DataFrame.to_dict seems to have improved vastly in pandas 2.0, so this suggestion is no longer appropriate.

hottwaj commented Jul 19 '23 13:07

Hi @hottwaj, thank you for the suggestion. I see you found some more info and closed the issue. Nonetheless, thank you for thinking along, as performance is indeed an important aspect.

wesselhuising commented Jul 19 '23 13:07

Apologies again, I got confused! The itertuples snippet is still about twice as fast in pandas 2.0, so it is still worth a look :)

hottwaj commented Jul 19 '23 13:07

It is? I will have a look, thank you for the suggestion again!

wesselhuising commented Jul 19 '23 14:07

See the unittest below:

import unittest
import datetime
import numpy
import pandas
import timeit

def iter_df_as_dicts(df):
    """Yield each row of df as a dict by zipping column names with plain row tuples."""
    col_names = list(df.columns)
    return (dict(zip(col_names, row)) for row in df.itertuples(index=False, name=None))

class TestIterDf(unittest.TestCase):
    def get_source_data(self, n=10000):
        # Mixed-dtype frame: floats, ints, strings and dates.
        return pandas.DataFrame({
            'floats': numpy.random.random(size=n),
            'ints': numpy.random.randint(0, 1000, size=n),
            'strings': ['blah'] * n,
            'dates': [datetime.date(2023, 1, 1)] * n,
        })

    def _tolist_using_iter(self, source_data: pandas.DataFrame) -> list:
        return list(iter_df_as_dicts(source_data))

    def _tolist_using_to_dict(self, source_data: pandas.DataFrame) -> list:
        return source_data.to_dict(orient='records')

    def test_iteration(self):
        # Both approaches should produce identical lists of row dicts.
        source_data = self.get_source_data()
        iter_dicts = self._tolist_using_iter(source_data)
        df_records = self._tolist_using_to_dict(source_data)

        self.assertEqual(iter_dicts, df_records)

    def test_performance(self):
        source_data = self.get_source_data(n=100000)

        iter_time = timeit.timeit(lambda: self._tolist_using_iter(source_data), number=5)
        todict_time = timeit.timeit(lambda: self._tolist_using_to_dict(source_data), number=5)

        # assert that iter_df_as_dicts is at least 1.5x faster
        # (in pandas 2.0.0 it is ~2x faster but allow a margin for error)
        self.assertLess(iter_time, todict_time / 1.5)
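
For reference, assuming the snippet above is saved as test_iter_df.py, it can be run with the standard unittest runner:

python -m unittest test_iter_df -v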

hottwaj commented Jul 19 '23 20:07

I actually tried this on a 3M-row numerical dataset and it seems that the method you proposed decreases performance, so I will be closing this issue. Again, thank you for bringing this up.

wesselhuising commented Jul 27 '23 10:07

Hey, thanks for this!

I think this makes the most difference when you have columns of different types in your DF, rather than all columns of the same type.

When you have mixed-type columns, iterating over rows using to_records or similar internally works by casting each row to a Series of dtype "object", which slows things down compared to iterating plain tuples.

If the DF contains only a single dtype (across every column), I can see how to_records would be faster.

I guess I am mostly working on DFs with columns of different types (int, float, categorical, date), hence I found this snippet faster for those.
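
To illustrate the casting point, here is a standalone sketch (not pandantic code): selecting a single row from a mixed-dtype frame upcasts it to dtype "object", while the plain tuples from itertuples keep the per-column values untouched.

import datetime
import pandas

df = pandas.DataFrame({
    'ints': [1, 2],
    'floats': [0.5, 1.5],
    'dates': [datetime.date(2023, 1, 1)] * 2,
})
print(df.dtypes.tolist())    # int64, float64 and object dtypes side by side
print(df.iloc[0].dtype)      # object: the row Series is upcast to a common dtype
print(next(df.itertuples(index=False, name=None)))  # (1, 0.5, datetime.date(2023, 1, 1))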

hottwaj commented Jul 27 '23 10:07

It is true that this dataset only consists of floats and ints. Do you have a proposal for another dummy dataset I can use for benchmarking? I have tried this approach on the dunnhumby dataset, which is the transaction_data.csv set. I am happy to give it another spin with a more representative dataset.

wesselhuising commented Jul 27 '23 10:07

You could increase n in the unittest I posted above: https://github.com/wesselhuising/pandantic/issues/11#issuecomment-1642696875

A check could also be added to iter_df_as_dicts to use to_records (or similar) if all columns have the same dtype.
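
A rough sketch of that check (purely illustrative, assuming df.dtypes.nunique() as the way to detect a homogeneous frame):

import pandas

def iter_df_as_dicts(df: pandas.DataFrame):
    if df.dtypes.nunique() == 1:
        # Homogeneous frame: to_dict(orient='records') tends to be fast here.
        return iter(df.to_dict(orient='records'))
    # Mixed dtypes: zip plain tuples to avoid casting each row to object.
    col_names = list(df.columns)
    return (dict(zip(col_names, row)) for row in df.itertuples(index=False, name=None))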

hottwaj commented Jul 27 '23 10:07

You could also try e.g. https://www.kaggle.com/datasets/regulad/amazon or https://www.kaggle.com/datasets/jordankrishnayah/45m-headlines-from-2007-2022-10-largest-sites

hottwaj commented Jul 27 '23 11:07

@hottwaj in upcoming changes we are looking to create this functionality.

@wesselhuising since this is pandas specific, it should be functionality in the plugin from my perspective: df.pandantic.iterrows(MyModel) and df.pandantic.iter_tuples(MyModel) should both be possible. Assigning to myself and attaching to our v0.1.0 milestone.
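
For what it is worth, here is a minimal sketch of how such an accessor could be registered via pandas' extension API; the names and signatures are assumptions, not the actual pandantic plugin:

import pandas
import pydantic

@pandas.api.extensions.register_dataframe_accessor('pandantic')
class PandanticAccessor:
    def __init__(self, df: pandas.DataFrame):
        self._df = df

    def iter_tuples(self, model: type[pydantic.BaseModel]):
        # Validate each row against the given model, yielding model instances.
        col_names = list(self._df.columns)
        for row in self._df.itertuples(index=False, name=None):
            yield model(**dict(zip(col_names, row)))

With something like that registered, df.pandantic.iter_tuples(MyModel) would yield one validated MyModel instance per row.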

xaviernogueira commented Aug 27 '24 16:08

Closing in favor of similar, but broader, issue #27

xaviernogueira commented Sep 30 '24 04:09