duckdb_azure
AzureStorageFileSystem: DirectoryExists not implemented
What happens?
duckdb.duckdb.NotImplementedException: Not implemented Error: AzureStorageFileSystem: DirectoryExists is not implemented!
This occurs while copying a DuckDB table to Azure.
To Reproduce
Simply copying a table to an Azure path reproduces the error.
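A minimal sketch of the kind of code that triggers the error; the connection string, table, and container path below are placeholders, not taken from the original report:

import duckdb

con = duckdb.connect()
con.install_extension("azure")
con.load_extension("azure")

# Placeholder credentials for illustration only.
con.sql("""
    CREATE SECRET az_secret (
        TYPE AZURE,
        CONNECTION_STRING '<connection string>'
    );
""")

con.sql("CREATE TABLE t AS SELECT 42 AS answer;")

# Writing through the azure extension fails with:
# NotImplementedException: AzureStorageFileSystem: DirectoryExists is not implemented!
con.sql("COPY t TO 'az://my-container/t.parquet' (FORMAT PARQUET);")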
OS:
Ubuntu
DuckDB Version:
0.10.0
DuckDB Client:
Python
Full Name:
Tejinderpal Singh
Affiliation:
Atlan
Have you tried this on the latest nightly build?
I have not tested with any build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [X] Yes, I have
Hello, yes, at the moment some features are not yet available. This one, for example, is not implemented because the method's signature does not include the FileOpener, so we cannot access some context information that the extension requires. I will see if I can make these changes. Nevertheless, the notion of a directory doesn't really make sense for a Blob storage account; it does for DFS, but for Blob I think it will always return false :(
Hello, to keep you updated on this issue: the long story is available here; the short one is that the DuckDB team will change the API of DuckDB's FileSystem class, which impacts a lot of extensions. It will take some time, but it will arrive :)
I am getting the same error trying to write hive-partitioned GeoParquet to Azure Blob. Is this currently not possible, or am I missing something?
write_query = f"""
    COPY (
        SELECT *,
            ST_Point(longitude, latitude) AS geom,
            year(base_date_time) AS year,
            month(base_date_time) AS month
        FROM read_csv('az://ais/ais2019/csv2/ais-2019-01-*.csv.zst', ignore_errors = true)
    )
    TO 'abfs://ais/parquet' (
        FORMAT PARQUET,
        COMPRESSION ZSTD,
        ROW_GROUP_SIZE 122_880,
        PARTITION_BY (year, month)
    );
"""
Azure writes are not yet supported, unfortunately.
@samansmink this comment and the following one on another issue made it seem like it works, which is what got me confused.
https://github.com/duckdb/duckdb-azure/issues/44#issuecomment-2427744888
@samansmink In the meantime, I am considering using rclone to first generate the hive-partitioned Parquet locally and then sync it over. However, we are working with many TBs of data that we have to keep updated.
Is there any way to get a progress callback as each partition is written locally, so I can sync just that partition over? In theory I could sync the entire directory structure, but with the volume of the data I will never have the entire hive locally (space constraints). Here's what I want to achieve (rough sketch after the list):
- Write a partition (the CSVs are in a glob pattern and can generate multiple Parquet files)
- Sync it over
- Delete it locally
- Loop to the next partition
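Something like this rough, untested sketch is what I have in mind; the local paths, the 'azblob' rclone remote, and the per-month loop are placeholders rather than a working setup (it also assumes the azure and spatial extensions are installed and a source secret has been created):

import os
import shutil
import subprocess
import duckdb

con = duckdb.connect()
con.load_extension("azure")    # for reading az:// sources
con.load_extension("spatial")  # for ST_Point
# (CREATE SECRET for the source account omitted here)

# Process one month at a time so the full hive never has to exist locally.
for year, month in [(2019, m) for m in range(1, 13)]:
    local_dir = f"/tmp/ais/year={year}/month={month}"
    os.makedirs(local_dir, exist_ok=True)

    # Write just this partition to a local Parquet file.
    con.sql(f"""
        COPY (
            SELECT *, ST_Point(longitude, latitude) AS geom
            FROM read_csv('az://ais/ais2019/csv2/ais-{year}-{month:02d}-*.csv.zst',
                          ignore_errors = true)
        ) TO '{local_dir}/data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
    """)

    # Push the finished partition to Azure, then free the local space.
    subprocess.run(
        ["rclone", "copy", local_dir, f"azblob:ais/parquet/year={year}/month={month}"],
        check=True,
    )
    shutil.rmtree(local_dir)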
@shaunakv1 the comment you link uses fsspec, which is separate from the DuckDB Azure extension and is Python-only.
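For reference, a minimal sketch of that fsspec path (not confirmed against the linked comment; the container name and connection string are placeholders). URLs using the registered protocol are handled by fsspec/adlfs rather than by the azure extension:

import duckdb
from fsspec import filesystem

con = duckdb.connect()

# Register an adlfs filesystem on this connection; 'abfs://...' URLs then go
# through fsspec (Python-only), not through the DuckDB Azure extension.
con.register_filesystem(filesystem("abfs", connection_string="<connection string>"))

con.sql("""
    COPY (SELECT 42 AS answer)
    TO 'abfs://my-container/answer.parquet' (FORMAT PARQUET);
""")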
@samansmink I am using the same. Here's my full code and I still get the same error:
import duckdb
from dotenv import load_dotenv
import os
from fsspec import filesystem

load_dotenv()

AIS_SRC_CONNECTION_STRING = os.getenv("AIS_SRC_CONNECTION_STRING")
AIS_DEST_CONNECTION_STRING = os.getenv("AIS_DEST_CONNECTION_STRING")

duckdb.register_filesystem(
    filesystem("abfs", connection_string=AIS_DEST_CONNECTION_STRING)
)

con = duckdb.connect()
con.install_extension("azure")
con.load_extension("azure")
con.install_extension("spatial")
con.load_extension("spatial")
con.install_extension("h3", repository="community")
con.load_extension("h3")

### Create secret
create_secret = f"""
    CREATE SECRET ais_src (
        TYPE AZURE,
        CONNECTION_STRING '{AIS_SRC_CONNECTION_STRING}'
    );
"""
con.sql(create_secret)

### Configure DuckDB performance params for Azure
con.sql("SET azure_http_stats = true;")
con.sql("SET azure_read_transfer_concurrency = 8;")
con.sql("SET azure_read_transfer_chunk_size = 1_048_576;")
con.sql("SET azure_read_buffer_size = 1_048_576;")

count_query = f"""
    SELECT *
    FROM 'az://<redacted>/ais-2019-01-01.csv.zst'
    LIMIT 10
"""
con.sql(count_query).show()

print(f"Writing to parquet...")

write_query = f"""
    COPY (
        SELECT *,
            ST_Point(longitude, latitude) AS geom,
            year(base_date_time) AS year,
            month(base_date_time) AS month
        FROM read_csv('az://<redacted>/ais-2019-01-*.csv.zst', ignore_errors = true)
    )
    TO 'abfs://ais/parquet' (
        FORMAT PARQUET,
        COMPRESSION ZSTD,
        ROW_GROUP_SIZE 122_880,
        PARTITION_BY (year, month)
    );
"""
con.sql(write_query).show()
Any update here? It is breaking Iceberg on Azure :( https://github.com/duckdb/duckdb-iceberg/issues/66