pantab
Add atomic= keyword toggle file copy behavior
Describe the bug
As I append more and more data to a Hyper file, the time each write takes keeps increasing. I have a large folder of parquet files, each with roughly the same amount of data and size. I iterated through the folder, and as the Hyper file grew from the appended data, the write time per file went from about a minute to over 20 minutes. Is this expected? I'm curious why this happens.
To Reproduce
import glob
import os

import polars as pl
import pantab as pt

folder_path = r'G:\PATH\Parquet'
hyper_path = r'G:\PATH\test.hyper'
params = {"default_database_version": "1"}

# hist_prdc_ref, hist_prd_ref and hist_mrkt_ref are reference frames defined elsewhere (not shown in this report)
parquet_files = glob.glob(f"{folder_path}/*.parquet")
for counter, file_path in enumerate(parquet_files, start=1):
    print(f"({counter} / {len(parquet_files)}) Processing {os.path.basename(file_path)}:")
    hist_fct = pl.scan_parquet(file_path)
    hist_fct = hist_fct \
        .join(hist_prdc_ref, on='PRODUCT KEY', how='inner') \
        .join(hist_prd_ref, on='Period Key', how='inner') \
        .join(hist_mrkt_ref, on='Market Key', how='inner') \
        .with_columns(pl.col(pl.Float32).cast(pl.Float64))
    # Append this file's rows to the existing Hyper file
    pt.frame_to_hyper(hist_fct.collect(), hyper_path, table='table', table_mode='a', process_params=params)
Expected behavior
Writing takes the same amount of time regardless of the initial Hyper file size.
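
To put numbers on the slowdown, the duration of each append could be recorded separately from the parquet scan and joins. The helper below is only a sketch: `time_each` and `fake_append` are hypothetical names that are not part of pantab, and the stand-in action would need to be replaced with the real `pt.frame_to_hyper(..., table_mode='a')` call for one file.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_each(items: Iterable[str], action: Callable[[str], None]) -> List[Tuple[str, float]]:
    """Run action(item) for every item and record how long each call took."""
    timings = []
    for item in items:
        start = time.perf_counter()
        action(item)
        timings.append((item, time.perf_counter() - start))
    return timings


if __name__ == "__main__":
    # Hypothetical stand-in for the real append; in the report this would be
    # the pt.frame_to_hyper(..., table_mode='a') call for one parquet file.
    def fake_append(path: str) -> None:
        time.sleep(0.01)

    for path, seconds in time_each(["a.parquet", "b.parquet"], fake_append):
        print(f"{path}: {seconds:.3f}s")
```

Timing only the write call would show whether the growth comes from the append itself rather than from collecting the joined frame.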