
AzureStorageFileSystem Directory Exists not implemented

Open patialashahi31 opened this issue 1 year ago • 10 comments

What happens?

duckdb.duckdb.NotImplementedException: Not implemented Error: AzureStorageFileSystem: DirectoryExists is not implemented!

This occurs while copying a DuckDB table to Azure.

To Reproduce

Simply copying a table to an Azure destination produces the error.

OS:

Ubuntu

DuckDB Version:

0.10.0

DuckDB Client:

Python

Full Name:

Tejinderpal Singh

Affiliation:

Atlan

Have you tried this on the latest nightly build?

I have not tested with any build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • [X] Yes, I have

patialashahi31 avatar Mar 08 '24 16:03 patialashahi31

Hello, yes, at the moment some features are not yet available. This one, for example, is not implemented because we cannot implement it yet: the signature of the method does not include the FileOpener, so we cannot access the context information that the extension requires. I will see if I can make this change. Nevertheless, the notion of a directory doesn't really make sense for a blob storage account. It does for DFS (hierarchical namespace), but for blob I think it will always return false :(
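To illustrate the point (a rough Python sketch, not the extension's code; the container, prefix, and connection string are placeholders): on a plain blob account a "directory" is just a name prefix, so the closest equivalent to DirectoryExists is checking whether any blob name starts with that prefix, for example with the azure-storage-blob SDK:

from azure.storage.blob import BlobServiceClient

def blob_prefix_exists(connection_string: str, container: str, prefix: str) -> bool:
    # A "directory" in blob storage only exists as long as at least one blob
    # carries it as a name prefix, so we just probe for the first such blob.
    service = BlobServiceClient.from_connection_string(connection_string)
    container_client = service.get_container_client(container)
    blobs = container_client.list_blobs(name_starts_with=prefix)
    return next(iter(blobs), None) is not None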

quentingodeau avatar Mar 13 '24 07:03 quentingodeau

Hello, to keep you updated on this issue: the long story is available here; the short one is that the DuckDB team will change the API of the DuckDB FileSystem class, which has an impact on a lot of extensions. It will take some time, but it will arrive :)

quentingodeau avatar Mar 21 '24 18:03 quentingodeau

I am getting the same error trying to write a hive-partitioned GeoParquet to Azure Blob. Is this currently not possible, or am I missing something?

write_query = f"""
    COPY
        (
            SELECT *,
                    ST_Point(longitude, latitude) AS geom,
                    year(base_date_time) AS year,
                    month(base_date_time) AS month
            FROM read_csv('az://ais/ais2019/csv2/ais-2019-01-*.csv.zst', ignore_errors = true)
        )
    TO 'abfs://ais/parquet' (
            FORMAT PARQUET, 
            COMPRESSION ZSTD, 
            ROW_GROUP_SIZE 122_880, 
            PARTITION_BY (year, month)
    );
"""

shaunakv1 avatar Jan 08 '25 05:01 shaunakv1

Azure writes are not yet supported unfortunately

samansmink avatar Jan 08 '25 09:01 samansmink

@samansmink this comment and the following one on another issue made it seem like it works; that's what got me confused.

https://github.com/duckdb/duckdb-azure/issues/44#issuecomment-2427744888

shaunakv1 avatar Jan 09 '25 03:01 shaunakv1

@samansmink In the meantime, I am considering using rclone to first generate the hive-partitioned Parquet locally and then sync it over. However, we are working with many TBs of data that we have to keep updated.

Is there any way, while writing the hive partitions locally, to get a progress callback as each partition is written, so I can sync just that partition over? In theory I could sync the entire directory structure, but with the volume of data I will never have the entire hive locally (space constraints). Here's what I want to achieve (rough sketch after the list):

  1. Write one partition (the CSVs are in a glob pattern, and one partition can generate multiple Parquet files)
  2. Sync it over
  3. Delete it from local storage
  4. Loop to the next partition
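Something along these lines (a rough sketch; the rclone remote name, local paths, and per-month filter are placeholders, and it loops over partitions explicitly instead of relying on a per-partition callback, which DuckDB doesn't expose; it assumes the az:// read setup and loaded extensions from my query above):

import shutil
import subprocess
from pathlib import Path

import duckdb

con = duckdb.connect()
# (azure/spatial extensions and the read secret set up as in the query above)

for year in [2019]:
    for month in range(1, 13):
        local_dir = Path(f"/tmp/ais_parquet/year={year}/month={month}")
        local_dir.mkdir(parents=True, exist_ok=True)
        # Write a single partition locally instead of the whole hive.
        con.sql(f"""
            COPY (
                SELECT *, ST_Point(longitude, latitude) AS geom
                FROM read_csv('az://ais/ais2019/csv2/ais-{year}-*.csv.zst', ignore_errors = true)
                WHERE year(base_date_time) = {year} AND month(base_date_time) = {month}
            )
            TO '{local_dir}/part-0.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
        """)
        # Push just this partition to blob storage, then free the local space.
        subprocess.run(
            ["rclone", "sync", str(local_dir), f"azremote:ais/parquet/year={year}/month={month}"],
            check=True,
        )
        shutil.rmtree(local_dir)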

shaunakv1 avatar Jan 09 '25 03:01 shaunakv1

@shaunakv1 the comment you link uses fsspec, which is separate from the DuckDB Azure extension and is Python-only.
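For reference, the fsspec-only route from that comment looks roughly like this (a sketch; the connection string, container, and path are placeholders, and whether hive PARTITION_BY works over a registered fsspec filesystem is a separate question):

import duckdb
from fsspec import filesystem

# Register an adlfs/fsspec filesystem with the DuckDB Python client; writes to
# abfs:// paths then go through fsspec, not through the azure extension.
duckdb.register_filesystem(filesystem("abfs", connection_string="<redacted>"))

duckdb.sql("""
    COPY (SELECT 42 AS answer)
    TO 'abfs://some-container/answer.parquet' (FORMAT PARQUET)
""")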

samansmink avatar Jan 09 '25 09:01 samansmink

@samansmink I am using the same. Here's my full code and I still get the same error:

import duckdb
from dotenv import load_dotenv
import os
from fsspec import filesystem

load_dotenv()

AIS_SRC_CONNECTION_STRING = os.getenv("AIS_SRC_CONNECTION_STRING")
AIS_DEST_CONNECTION_STRING = os.getenv("AIS_DEST_CONNECTION_STRING")

duckdb.register_filesystem(
    filesystem("abfs", connection_string=AIS_DEST_CONNECTION_STRING)
)
con = duckdb.connect()

con.install_extension("azure")
con.load_extension("azure")

con.install_extension("spatial")
con.load_extension("spatial")

con.install_extension("h3", repository="community")
con.load_extension("h3")


### Create secret
create_secret = f"""    
    CREATE SECRET ais_src (
    TYPE AZURE,
    CONNECTION_STRING '{AIS_SRC_CONNECTION_STRING}'
    );
"""
con.sql(create_secret)

### configure Duckdb performance params for azure
con.sql("SET azure_http_stats = true;")
con.sql("SET azure_read_transfer_concurrency = 8;")
con.sql("SET azure_read_transfer_chunk_size = 1_048_576;")
con.sql("SET azure_read_buffer_size = 1_048_576;")

count_query = f"""
    SELECT *
    FROM 'az://<redacted>/ais-2019-01-01.csv.zst'
    LIMIT 10
"""
con.sql(count_query).show()

print(f"Writing to parquet...")

write_query = f"""
    COPY
        (
            SELECT *,
                    ST_Point(longitude, latitude) AS geom,
                    year(base_date_time) AS year,
                    month(base_date_time) AS month
            FROM read_csv('az://<redacted>/ais-2019-01-*.csv.zst', ignore_errors = true)
        )
    TO 'abfs://ais/parquet' (
            FORMAT PARQUET, 
            COMPRESSION ZSTD, 
            ROW_GROUP_SIZE 122_880, 
            PARTITION_BY (year, month)
    );
"""

con.sql(write_query).show()

shaunakv1 avatar Jan 10 '25 03:01 shaunakv1

Any update here? It is breaking Iceberg on Azure :( https://github.com/duckdb/duckdb-iceberg/issues/66

djouallah avatar Mar 10 '25 23:03 djouallah