fastparquet
fastparquet copied to clipboard
Ability to drop a partition (hive partitioned format)
In our application we mainly append new data to a parquet dataset, say when a user uploads content to our application. Sometimes however a user wants to close his account and we need to remove all data related to this user.
Luckily we partition by the user-id so we would simple run a "rm -rf" on those parquet files. Now obviously the issue is that these files are referenced by the _metadata.
We have been searching but can't find any ability to "edit" this top level metadata.
One option is to use the metadata_from_many but on larger datasets it's not a feasible option.
Questions we have:
- What is the impact of simply removing those prqt files?
- Are there "edit"-like libraries/tools available to modify _metadata?
Tips are appreciated, Paul
Hi @Paul424 What a coincidence :) Your ticket appears to me closely related to #676 And I am actually working on it at this very moment through PR #689
Applied to your own need, when PR #689 will be completed, following code should do the trick:
# Your 'Parquet database'
pf = ParquetFile('/home/my_data/my_database')
# Identify row groups to remove, let say for customers 'Luc' and 'Georges'
# This works because you have used partition, otherwise, there would be a risk to delete data for other customers as well,
# as it will remove complete row groups
row_groups_to_remove = filter_row_groups(pf, [['customers', 'in', ['Luc', 'Georges']])
# Well... remove
pf.remove_row_groups(row_groups_to_remove)
I will certainly be interested in your feedback once the PR is completed. Bests
For reference @Paul424 , the essence is to mutate the ParquetFile's .fmd.row_groups
list to remove what you don't want, and then use write.write_metadata
to update the _metadata file from the altered fmd
. You can do this now without the linked PR, but it would take some lines of code.
@Paul424
If you are interested in PR #689, you can clone the corresponding repo and give it a try.
Beware that if you are accessing remote parquet dataset using fsspec
, you will need to set remove_with
and open_with
parameters. Here is an example using LocalFileSystem
.
This may be subjected to change, depending @martindurant 's feedback regarding the code.
from fsspec.implementations.local import LocalFileSystem
fs = LocalFileSystem()
remove_with = (fs.rm_file, fs.listdir, fs.rmdir)
open_with = fs.open
...
pf.remove_row_groups(rgs, open_with=open_with, remove_with=remove_with)
Bests