Add atomic= keyword toggle file copy behavior

Open skyth540 opened this issue 4 months ago • 6 comments

Describe the bug

As I append more data to a hyper file, each append takes longer. I have a large folder of parquet files, each holding roughly the same amount of data at roughly the same size. I iterated through the folder, and as the hyper file grew from the appended data, the write time per file rose from about a minute to over 20 minutes. Is this expected? I'm curious why this happens.

To Reproduce

import glob
import os

import pantab as pt
import polars as pl

folder_path = r'G:\PATH\Parquet'
hyper_path = r'G:\PATH\test.hyper'
params = {"default_database_version": "1"}

# hist_prdc_ref, hist_prd_ref and hist_mrkt_ref are reference frames defined elsewhere (not shown)
files = glob.glob(f"{folder_path}/*.parquet")

for counter, file_path in enumerate(files, start=1):
    print(f"({counter} / {len(files)}) Processing {os.path.basename(file_path)}:")
    # Lazily scan the file, join the reference tables, and widen Float32 columns to Float64
    hist_fct = (
        pl.scan_parquet(file_path)
        .join(hist_prdc_ref, on='PRODUCT KEY', how='inner')
        .join(hist_prd_ref, on='Period Key', how='inner')
        .join(hist_mrkt_ref, on='Market Key', how='inner')
        .with_columns(pl.col(pl.Float32).cast(pl.Float64))
    )

    # Append this file's rows to the existing table in the hyper file
    pt.frame_to_hyper(hist_fct.collect(), hyper_path, table='table', table_mode='a', process_params=params)

Expected behavior

Writing takes the same amount of time regardless of the initial hyper file size.
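To put numbers on the slowdown, something like the helper below could record the elapsed time and resulting file size for each append. This is only a rough sketch: `time_appends` and its arguments are hypothetical names, not part of pantab, and the append step is passed in as a callable so the measurement stays independent of how the write is done.

```python
import os
import time
from typing import Callable, Iterable


def time_appends(
    paths: Iterable[str],
    append_fn: Callable[[str], None],
    target_path: str,
) -> list[tuple[str, float, int]]:
    """Run append_fn(path) for each path and record how long each call took.

    Returns one (file name, elapsed seconds, target size in bytes) tuple per
    input path, where the size is read after the append has finished.
    """
    results = []
    for path in paths:
        start = time.perf_counter()
        append_fn(path)  # hypothetical hook: e.g. build the frame and append it
        elapsed = time.perf_counter() - start
        # The target may not exist if append_fn wrote nothing
        size = os.path.getsize(target_path) if os.path.exists(target_path) else 0
        results.append((os.path.basename(path), elapsed, size))
    return results
```

If the elapsed time climbs roughly in step with the recorded file size while the input files stay the same size, that would suggest the cost scales with the existing hyper file rather than with the rows being appended.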

skyth540, Oct 24 '24 14:10