
Add atomic keyword

Open WillAyd opened this issue 1 year ago • 2 comments

closes https://github.com/innobi/pantab/issues/380

WillAyd avatar Oct 28 '24 13:10 WillAyd

Hi @skyth540 - if you get the chance to test this out, I would greatly appreciate it. Adding atomic=False to your keywords should resolve the performance issues you have seen when looping appends.

The risk with this keyword is that the Hyper file could end up in a corrupt state if any loop iteration fails.
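One way to limit that risk is sketched below; it is not part of this PR. The idea is to run all the non-atomic appends against a scratch copy and swap it into place only after every iteration succeeds, so the copy cost is paid once rather than per append. The helper name `append_all_or_nothing` and the `write_chunk` callback are hypothetical, and the pantab call in the trailing comment assumes the `atomic` keyword from this branch.

```python
import os
import shutil
import tempfile
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def append_all_or_nothing(
    chunks: Iterable[T],
    target: str,
    write_chunk: Callable[[T, str], None],
) -> None:
    """Append every chunk to a scratch copy of `target`, then swap it in.

    `write_chunk(chunk, path)` is a hypothetical callback that appends one
    chunk to the file at `path`.
    """
    target_dir = os.path.dirname(os.path.abspath(target))
    # Keep the scratch file next to the target so os.replace stays on one
    # filesystem.
    fd, scratch = tempfile.mkstemp(suffix=".hyper", dir=target_dir)
    os.close(fd)
    try:
        if os.path.exists(target):
            shutil.copyfile(target, scratch)  # a single copy up front
        else:
            os.remove(scratch)  # let the writer create the file itself
        for chunk in chunks:
            write_chunk(chunk, scratch)
        # Only reached when every chunk was written without an exception.
        if os.path.exists(scratch):
            os.replace(scratch, target)
    finally:
        # On failure, drop the partial scratch file; the target is untouched.
        if os.path.exists(scratch):
            os.remove(scratch)


# Hypothetical usage with pantab (assumes the `atomic` keyword from this branch):
#
#   import pantab as pt
#   append_all_or_nothing(
#       frames,
#       "example.hyper",
#       lambda df, path: pt.frame_to_hyper(
#           df, path, table="table", table_mode="a", atomic=False
#       ),
#   )
```

If any iteration raises, the original file is left as it was and the exception propagates to the caller.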

WillAyd avatar Oct 28 '24 13:10 WillAyd

You can install from this branch with:

pip install git+https://github.com/innobi/pantab.git@add-atomic-keyword

WillAyd avatar Oct 28 '24 14:10 WillAyd

From what I can tell, it didn't make any change... each iteration still takes longer and longer

skyth540 avatar Oct 29 '24 17:10 skyth540

Hmm that's unfortunate. Do you have any code I can use to reproduce?

WillAyd avatar Oct 29 '24 18:10 WillAyd

If it's not the file copy that is the problem, then there might be some limitations with the Hyper API around its insertion time. We can ask that team, but it would be great to rule out other issues with a self-contained example first!
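To get a feel for how much a per-append file copy alone could contribute, here is a rough stdlib-only sketch. It assumes the atomic path copies the existing file before each append, which is only what the "file copy" remark above suggests. The function name, the step count, and the 8 MB chunk size are arbitrary illustration values, and a plain binary file stands in for the Hyper file.

```python
import os
import shutil
import tempfile
import time


def time_copies_of_growing_file(
    steps: int = 5, chunk_bytes: int = 8_000_000
) -> list[tuple[int, float]]:
    """Grow a file by `chunk_bytes` per step and time a full copy after each step.

    Returns a list of (file size in bytes, copy time in seconds) pairs.
    """
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "growing.bin")
        dst = os.path.join(tmp, "copy.bin")
        payload = os.urandom(chunk_bytes)
        for _ in range(steps):
            # Simulate one append by extending the source file.
            with open(src, "ab") as f:
                f.write(payload)
            # Time only the copy, not the append itself.
            start = time.perf_counter()
            shutil.copyfile(src, dst)
            elapsed = time.perf_counter() - start
            results.append((os.path.getsize(src), elapsed))
    return results


if __name__ == "__main__":
    for size, seconds in time_copies_of_growing_file():
        print(f"{size / 1e6:.0f} MB copied in {seconds:.4f} s")
```

If the copy times stay small relative to the per-iteration times seen with pantab, the copy is probably not the main cost and the insertion itself deserves a closer look.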

WillAyd avatar Oct 29 '24 18:10 WillAyd

@skyth540 this might be a better MRE:

import pandas as pd
import numpy as np
import pantab as pt

import time

df = pd.DataFrame(np.random.randn(100_000, 10), columns=list("abcdefghij"))
for i in range(100):
    start = time.time()
    pt.frame_to_hyper(
        df,
        "example.hyper",
        table="table",
        table_mode="a",
    )
    end = time.time()
    print(f"Iteration {i} took {end - start}")

Running that yields the following runtime for me:

[image: plot of per-iteration runtime]

Adding atomic=False:

import pandas as pd
import numpy as np
import pantab as pt

import time

df = pd.DataFrame(np.random.randn(100_000, 10), columns=list("abcdefghij"))
for i in range(100):
    start = time.time()
    pt.frame_to_hyper(
        df,
        "example.hyper",
        table="table",
        table_mode="a",
        atomic=False,
    )
    end = time.time()
    print(f"Iteration {i} took {end - start}")

With atomic=False, the per-iteration runtime appears much closer to constant:

[image: plot of per-iteration runtime with atomic=False]

Do you see the same results?
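For comparing runs without relying on plots, a small helper like the one below can reduce the timings to one number. This is only a sketch: `timing_slope` is a hypothetical name, and it assumes you collect each `end - start` value from the loop into a list instead of only printing it.

```python
def timing_slope(durations: list[float]) -> float:
    """Least-squares slope of duration versus iteration index.

    The result is in seconds per iteration: close to zero means roughly
    constant time per append, clearly positive means each append is slower
    than the one before.
    """
    n = len(durations)
    if n < 2:
        raise ValueError("need at least two timings")
    mean_x = (n - 1) / 2  # mean of the indices 0..n-1
    mean_y = sum(durations) / n
    cov = sum((i - mean_x) * (d - mean_y) for i, d in enumerate(durations))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var
```

Appending `end - start` to a list inside the loop and printing `timing_slope(durations)` after it gives a single figure to compare between the default run and the atomic=False run.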

WillAyd avatar Oct 30 '24 18:10 WillAyd

Merging for now as I want to cut a release candidate soon. If you can provide an MRE for whatever issue remains, let's open a new issue and I can take a look.

WillAyd avatar Oct 31 '24 13:10 WillAyd