
[BUG-REPORT] Too much memory consumption and not releasing it

Open ashsharma96 opened this issue 2 years ago • 17 comments

Hey There Vaex Team,

Basically I'm working on a task using vaex. I have a dataset with 3 million rows and 7 columns, and a function that does 3-4 groupbys, 2-3 joins, and 2-3 timedelta operations. Whenever I call this function, memory usage jumps from 6.7 GB to 18 GB and is not released even after the function has fully executed. Even after 5 minutes the memory is not freed, so usage eventually reaches my full 32 GB. If I close my Jupyter notebook the memory is released instantly, but I don't want to do that because 80% of my code remains unexecuted due to the high memory consumption. Why does vaex use this much RAM? I have even tried gc.collect() but it didn't work for us. Attaching a screenshot of the Python terminal. Any idea how I should solve this?

[screenshot: Python terminal showing memory usage]

ashsharma96 avatar Feb 09 '22 18:02 ashsharma96

@JovanVeljanoski @maartenbreddels Any idea about this?

ashsharma96 avatar Feb 10 '22 04:02 ashsharma96

Very hard to tell without a code example that we can reproduce. Also, there were a bunch of questions in the issue form that you didn't answer.

JovanVeljanoski avatar Feb 10 '22 07:02 JovanVeljanoski

@JovanVeljanoski Thanks for replying. I'll prepare everything and share it with you within an hour.

ashsharma96 avatar Feb 10 '22 07:02 ashsharma96

@JovanVeljanoski I'll share the exact code and dataframes I'm using. I have two vaex dataframes:

  1. Common (details attached): common
  2. Transacting_onetime (details attached): trans_one

Now you know the structure of the dataframes. Going forward, I have two functions, named gap_days and RMC. After running the gap_days function on the common dataframe, my RAM usage looked like this (photo attached below):

[screenshot: RAM usage before running gap_days]

And my gap_days function looks like this:

def gap_days(common):
    # Sort chronologically and derive a day-level key from the timestamp
    result1 = common.sort(by='Date', ascending=True)
    result1['Date1'] = result1['Date'].dt.strftime('%Y-%m-%d')
    # Count rows per (customer, day), then count days per customer
    res = result1.groupby(['tm_cid', 'Date1'], agg={'col2': 'count'})
    res = res.tm_cid.value_counts().reset_index(name='visitNum')
    res['tm_cid'] = res['index']
    del res['index']
    # Map the per-customer visit count back onto the sorted frame
    mapping = dict(res[['tm_cid', 'visitNum']].values)
    result1['maxvisit'] = result1['tm_cid'].map(mapping)
    result3 = result1[['Date','tm_cid','tm_mid','tm_value','tm_id','tm_reftransid','maxvisit']]
    return result3

result3 = gap_days(common)

So first thing: this simple gap_days function took 301 seconds (about 5 minutes) to execute over the 1.75 million rows of the common dataframe. Can you tell me why vaex takes this much time for such a simple function?

Secondly, I ran my RMC function after the gap_days function; it took around 50 seconds to complete over the 4.85 million rows of the transacting_onetime dataframe. The RMC function looks like this:

import numpy as np

def RMC(Transacting_onetime):
    # Total spend per (customer, merchant)
    df1 = Transacting_onetime.groupby(['tm_cid','tm_mid'], agg={'tm_value':'sum'})
    # Cart size: unique items per unique transaction
    cart_size = Transacting_onetime.groupby(['tm_cid','tm_mid'], agg={'tm_id':'nunique','tm_reftransid':'nunique'})
    cart_size['Cart_size'] = np.divide(cart_size.tm_id, cart_size.tm_reftransid)
    cart_size = cart_size.drop(['tm_id','tm_reftransid'])

    df1 = df1.join(cart_size, on="tm_cid", how='right', lsuffix="_")
    df1 = df1.drop(['tm_mid_','tm_cid_'])

    # Recency: days between today and the customer's most recent transaction
    Mydf1 = Transacting_onetime.groupby('tm_cid', agg={'Date':'max'})
    Mydf1['mini'] = np.repeat(np.datetime64('today', 'D'), Mydf1.shape[0])
    Mydf1['Date'] = Mydf1['Date'].astype('datetime64[D]')
    Mydf1['mini'] = Mydf1['mini'].astype('datetime64[D]')
    Mydf1['Recency'] = Mydf1['mini'] - Mydf1['Date']
    Mydf1 = Mydf1.drop(['Date','mini'])
    Mydf1['Recency'] = Mydf1['Recency'].td.days
    df1 = df1.join(Mydf1, on="tm_cid")
    return df1

After running this RMC function my memory usage jumped from 6.1 GB to 15 GB (the before screenshot is attached above). Attaching the memory consumption screenshot after running the RMC function:

[screenshot: RAM usage after running RMC]

Conclusion: I have exactly two questions:

  1. Why did the gap_days function take 5 minutes to execute over just 1.8 million rows (common dataframe)?
  2. Why is there such a large memory hike, from 6.1 GB to 16 GB, after running the RMC function over 4.7 million rows (transacting_onetime dataframe)?

*Note - Common and transacting_onetime are both vaex dataframes.

Let me know what you think about this. Immediate help is appreciated.

Thanks & Regards
Atal Sharma

ashsharma96 avatar Feb 10 '22 08:02 ashsharma96

I appreciate the example and the insight into what you are doing, but this is not something I can copy-paste, run, and debug. Can you provide some (fake) data with the same dtypes and so on so we can test?
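For example, something along these lines could generate synthetic data with matching column names (a rough sketch; the dtypes, sizes, and value ranges here are illustrative guesses, not your real schema):

import numpy as np
import vaex

n = 3_000_000  # illustrative row count
rng = np.random.default_rng(42)
fake = vaex.from_arrays(
    tm_cid=rng.integers(0, 3_000_000, n),          # customer id (guess)
    tm_mid=rng.integers(0, 10, n),                 # merchant id (guess)
    tm_value=rng.random(n) * 100.0,                # transaction value (guess)
    tm_id=rng.integers(0, 10_000_000, n),          # item/line id (guess)
    tm_reftransid=rng.integers(0, 5_000_000, n),   # transaction id (guess)
    Date=np.datetime64('2022-01-01')
        + rng.integers(0, 365, n).astype('timedelta64[D]'),
)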

I can take a look at some point later in more detail. As for immediate help... that is quite hard unless it is a simple fix. Perhaps @maartenbreddels can look into it if he has time. For urgent and more in-depth assistance and optimization you can always reach out to vaex.io.

JovanVeljanoski avatar Feb 10 '22 09:02 JovanVeljanoski

@JovanVeljanoski If I provide some data (1k rows), can you generate the same kind of fake data at scale (millions of rows)?

ashsharma96 avatar Feb 10 '22 09:02 ashsharma96

Hey @JovanVeljanoski @maartenbreddels I've attached two CSVs which represent the two datasets above, i.e. Common & Transacting_onetime. Can you check now and see the reason behind it? *Note - The data in these CSVs is fake, and the amount of data is smaller than the actual one, but I've tried to keep it as close as possible to the original. So please test it out. For more detail about the two dataframes you can check the comments above.

ashsharma96 avatar Feb 14 '22 09:02 ashsharma96

result1['Date1'] = result1['Date'].dt.strftime('%Y-%m-%d')

grouping on string is much slower than primitive data, use `astype('datetime64[D]')`
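That is, the conversion above could be replaced with something like this (a sketch of the suggested change):

result1['Date1'] = result1['Date'].astype('datetime64[D]')  # primitive day key instead of a string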

res= result1.groupby(['tm_cid', 'Date1'], agg={'col2' : 'count'})
res= res.tm_cid.value_counts().reset_index(name='visitNum')
res['tm_cid'] = res['index']

I'm confused why you do a value_counts after the groupby; can't you do that in one groupby?
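For instance, something along these lines might do it in one pass (an untested sketch, assuming vaex.agg.nunique; it counts the number of distinct days per tm_cid, which is what the groupby + value_counts combination computes):

import vaex

res = result1.groupby('tm_cid', agg={'visitNum': vaex.agg.nunique('Date1')})
mapping = dict(res[['tm_cid', 'visitNum']].values)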

def gap_days(common):
    result1 = common.sort(by='Date', ascending=True)
    result1['Date1'] = result1['Date'].dt.strftime('%Y-%m-%d')
    res = result1.groupby(['tm_cid', 'Date1'], agg={'col2': 'count'})
    res = res.tm_cid.value_counts().reset_index(name='visitNum')
    res['tm_cid'] = res['index']
    del res['index']
    mapping = dict(res[['tm_cid', 'visitNum']].values)

I'm confused: visitNum is not a column in res, unless this is pandas code. Are you maybe using pandas?

maartenbreddels avatar Feb 14 '22 13:02 maartenbreddels

@maartenbreddels Thank you for answering. After changing it to astype('datetime64[D]'), the time dropped from 300 seconds to 60 seconds. But I think that is still a long time for 1.7 million rows, given vaex's claim of a billion rows per second. Thank you for this improvement to the code. How can I reduce the time of the gap_days function further? Also, about the following code:

def gap_days(common):
    result1 = common.sort(by='Date', ascending=True)
    result1['Date1'] = result1['Date'].dt.strftime('%Y-%m-%d')
    res = result1.groupby(['tm_cid', 'Date1'], agg={'col2': 'count'})
    res = res.tm_cid.value_counts().reset_index(name='visitNum')
    res['tm_cid'] = res['index']
    del res['index']
    mapping = dict(res[['tm_cid', 'visitNum']].values)
    result1['maxvisit'] = result1['tm_cid'].map(mapping)
    result3 = result1[['Date','tm_cid','tm_mid','tm_value','tm_id','tm_reftransid','maxvisit']]
    return result3

result3 = gap_days(common)

This is the line of code where visitNum is created: res = res.tm_cid.value_counts().reset_index(name='visitNum'). If you use the reset_index function with the name parameter, it creates a dataframe from the value_counts result, with the column name you pass in. I first did the groupby to get the cids per unique date, but what I actually have to calculate is the number of distinct dates on which a cid appears. For example, with tm_cid = 1, suppose I visited 6 times in total:

02-02-2022 - 3 times
03-02-2022 - 2 times
04-02-2022 - 1 time

That is 3 visits with respect to unique dates, and value_counts after the groupby calculates exactly that. That's why I used value_counts.

I also want to add that whenever I execute the vaex code, it mostly puts load on one core of the system. I've attached an image below in which you can see that out of 4 cores, 1 core is used to its full potential while the others are at very low utilization.

[screenshot: per-core CPU load]

I've also seen some similar multi-core-not-used issues in which you suggest the vaex.delayed registered function. But that mostly works on series-like objects, whereas mine is a dataframe. @JovanVeljanoski @maartenbreddels any idea how I can deal with the multi-core issue?

It would also be helpful if you could help me with the RMC function, which gives me a sudden hike in RAM consumption even though it is a very simple function. I don't think vaex should use this much RAM so suddenly.

def RMC(Transacting_onetime):
    df1 = Transacting_onetime.groupby(['tm_cid','tm_mid'], agg={'tm_value':'sum'})
    cart_size = Transacting_onetime.groupby(['tm_cid','tm_mid'], agg={'tm_id':'nunique','tm_reftransid':'nunique'})
    cart_size['Cart_size'] = np.divide(cart_size.tm_id, cart_size.tm_reftransid)
    cart_size = cart_size.drop(['tm_id','tm_reftransid'])

    df1 = df1.join(cart_size, on="tm_cid", how='right', lsuffix="_")
    df1 = df1.drop(['tm_mid_','tm_cid_'])

    Mydf1 = Transacting_onetime.groupby('tm_cid', agg={'Date':'max'})
    Mydf1['mini'] = np.repeat(np.datetime64('today', 'D'), Mydf1.shape[0])
    Mydf1['Date'] = Mydf1['Date'].astype('datetime64[D]')
    Mydf1['mini'] = Mydf1['mini'].astype('datetime64[D]')
    Mydf1['Recency'] = Mydf1['mini'] - Mydf1['Date']
    Mydf1 = Mydf1.drop(['Date','mini'])
    Mydf1['Recency'] = Mydf1['Recency'].td.days
    df1 = df1.join(Mydf1, on="tm_cid")
    return df1

df1 = RMC(Transacting_onetime)

The function above causes a sudden hike in RAM consumption over the 4.8 million rows of the Transacting_onetime dataframe. The function and fake data are shared above; you can use them.

Thank you for helping me reduce the gap_days function from 300 seconds to 60 seconds, but 60 seconds is still a lot. How can I reduce it further? Secondly, how should I use all the cores, not just one? And lastly, why is there a sudden increase in RAM consumption for the RMC function, and how can we prevent it or reduce it to a minimum? @JovanVeljanoski @maartenbreddels please help me with this too. Regards

ashsharma96 avatar Feb 15 '22 06:02 ashsharma96

In both functions you are mixing vaex and non-vaex code (numpy/pandas). That might cause memory spikes. It might be a good idea to add a bunch of print statements and timing commands to figure out where the RAM spikes / slowness happen. That will help you isolate the issues.
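A minimal sketch of what that instrumentation could look like (assuming psutil is installed; the step shown is just the first groupby from your RMC function):

import time
import psutil

proc = psutil.Process()
t0 = time.time()

def checkpoint(label):
    # Print elapsed wall time and current resident memory after each step
    rss_gb = proc.memory_info().rss / 1e9
    print(f'{label}: {time.time() - t0:.1f}s, RSS {rss_gb:.2f} GB')

df1 = Transacting_onetime.groupby(['tm_cid', 'tm_mid'], agg={'tm_value': 'sum'})
checkpoint('groupby sum')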

I just don't have the time or resources to help you optimize the code. If you have a particular reproducible issue, please report it and we'll be happy to take a look and hopefully fix things.

JovanVeljanoski avatar Feb 15 '22 08:02 JovanVeljanoski

@JovanVeljanoski I'm not asking you to optimize my code. I'm asking about the memory increase: when I use pandas my whole code runs under 8 GB of RAM, while vaex uses up to 18 GB in total. That is the main reason for raising this ticket. Also, vaex supports multicore/multithreading, but my code doesn't seem to utilize it. These two are the main reasons for opening this ticket. I shared the function and data because you asked for them.

ashsharma96 avatar Feb 15 '22 09:02 ashsharma96

Hey @JovanVeljanoski, this part of the code is what increases the memory suddenly: cart_size = Transacting_onetime.groupby(['tm_cid','tm_mid'], agg={'tm_id':'nunique','tm_reftransid':'nunique'}). All the dataframes are in vaex format; no pandas code is involved. Any idea why it takes this much memory?

ashsharma96 avatar Feb 24 '22 05:02 ashsharma96

How many unique values do you have in tm_cid and tm_mid?
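If I remember correctly, you can check with something like:

print(Transacting_onetime.tm_cid.nunique())
print(Transacting_onetime.tm_mid.nunique())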

JovanVeljanoski avatar Feb 24 '22 09:02 JovanVeljanoski

@JovanVeljanoski tm_cid has about 3 million unique values and tm_mid has only 1 unique value.

ashsharma96 avatar Feb 24 '22 09:02 ashsharma96

@JovanVeljanoski any findings?

ashsharma96 avatar Feb 28 '22 05:02 ashsharma96

We found some memory issues, see https://github.com/vaexio/vaex/pull/1961 and https://github.com/vaexio/vaex/pull/1893, which should be released soon

maartenbreddels avatar Mar 16 '22 16:03 maartenbreddels

There have been a few releases since this thread was active. @ashsharma96 can you check whether things have improved for your use case in the latest version?

JovanVeljanoski avatar Aug 07 '22 19:08 JovanVeljanoski