
Accessing Minio with Pyiceberg

Open muniatl opened this issue 1 year ago • 2 comments

Query engine

No response

Question

I have a piece of code that works with an S3 endpoint and a SqlCatalog backed by SQLite. For testing, however, I want to run it against a MinIO deployment hosted on localhost. I have tried various options with no luck. What parameters do I need to pass to SqlCatalog and create_table? My code looks like this:

catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        #"uri" : f"postgresql+psycopg2://postgres:ph1@localhost:5433/template1",
        "warehouse": "s3://127.0.0.1:9000/iceberg",  # have tried "s3://iceberg" "s3://127.0.0.1/iceberg" and completely commenting out warehouse
        "s3.endpoint" : "s3://127.0.0.1:9000",
        #"minio-root-user": "admin",
        #"minio-root-password": "password",
        #"minio-domain" : "minio",
        #"s3.access-key-id": "admin",
        #"s3.secret-access-key": "password",
    },
)

table = catalog.create_table(
    "default1.taxi_dataset",
    schema=df.schema,
)

OSError: When getting information for key 'iceberg/default1.db/taxi_dataset/metadata/00000-671ce9cf-73ff-49a2-a22e-408d8758625b.metadata.json' in bucket '127.0.0.1:9000': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 6, Couldn't resolve host name.

I am able to reach the MinIO server, log in, and even upload files. Any pointers on the valid properties to pass for MinIO would be much appreciated.

muniatl, Jul 16 '24 16:07
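
The bucket name reported in the error ('127.0.0.1:9000') hints at how the warehouse URI was interpreted: in an s3:// URI, the part right after the scheme is read as the bucket, not as a server address. A minimal sketch using only the standard library illustrates this; the helper name `split_s3_uri` is hypothetical and this is not PyIceberg's own parser, only an illustration consistent with the error message above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key prefix). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # The network-location part of the URI is what ends up as the bucket name.
    return parsed.netloc, parsed.path.lstrip("/")


# The host:port is taken as the bucket, matching "in bucket '127.0.0.1:9000'".
print(split_s3_uri("s3://127.0.0.1:9000/iceberg"))  # ('127.0.0.1:9000', 'iceberg')

# With only the bucket in the URI, the bucket is 'iceberg' as intended.
print(split_s3_uri("s3://iceberg"))  # ('iceberg', '')
```

This suggests the server address belongs in a separate endpoint property rather than in the warehouse URI.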

@muniatl - I think the MinIO endpoint configuration should not use the s3:// prefix. It should use the HTTP/HTTPS protocol instead, e.g.:

"warehouse": "s3://iceberg",  # S3 URI with the bucket only, without the endpoint
"s3.endpoint": "http://127.0.0.1:9000",  # MinIO endpoint over HTTP

Could you please try this?

rggyanav, Jul 30 '24 06:07
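
The suggestion above can be sketched as a small helper that assembles the catalog properties and rejects a non-HTTP endpoint early. The function name `minio_catalog_properties` and the example values are assumptions for illustration; the property keys are the ones that appear in this thread. Whether this configuration resolves the original error is not confirmed in the thread.

```python
from urllib.parse import urlparse


def minio_catalog_properties(catalog_uri, bucket, endpoint, access_key, secret_key):
    """Build SqlCatalog properties for an S3-compatible store (hypothetical helper)."""
    # The endpoint is a server address, so it needs an http:// or https:// scheme.
    if urlparse(endpoint).scheme not in ("http", "https"):
        raise ValueError(f"s3.endpoint must start with http:// or https://, got {endpoint!r}")
    return {
        "uri": catalog_uri,
        "warehouse": f"s3://{bucket}",  # bucket only, no host or port
        "s3.endpoint": endpoint,
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
    }


# Example values are placeholders taken from the question above.
props = minio_catalog_properties(
    "sqlite:///warehouse/pyiceberg_catalog.db",
    "iceberg",
    "http://127.0.0.1:9000",
    "admin",
    "password",
)

# With pyiceberg installed, the properties would be passed on like this:
# from pyiceberg.catalog.sql import SqlCatalog
# catalog = SqlCatalog("default", **props)
```

Validating the scheme up front turns a late "Couldn't resolve host name" failure into an immediate, readable error.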

I tried something similar with my local config:

import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.catalog import load_catalog

warehouse_path = "local_s3"
catalog = SqlCatalog(
   "catalog_1",
   **{
       "uri": f"sqlite:///{warehouse_path}/catalog.db",
       "warehouse":"s3://iceberg",
       "s3.endpoint": "http://localhost:9001",
       "s3.access-key-id": "minio_user",
       "s3.secret-access-key": "minio1234",
   },
)
catalog.create_namespace_if_not_exists('test')

Then creating the table raises an error.

# Define Schema for Projects Table
projects_schema = pa.schema([
    pa.field('id', pa.uint8(), nullable=False),
    pa.field('name', pa.string(), nullable=False),
    pa.field('description', pa.string()),
    pa.field('creation_date', pa.timestamp('s')),
    pa.field('modification_date', pa.timestamp('s'))
])
projects_table = catalog.create_table_if_not_exists(
    'test.projects', 
    schema=projects_schema,
)

The error:

OSError: When getting information for key 'test.db/projects/metadata/00000-5a3bb77f-7161-4bfe-a7af-b823f6f0cb71.metadata.json' in bucket 'iceberg': AWS Error UNKNOWN (HTTP status 400) during HeadObject operation: No response body.

cfrancois7, Aug 27 '24 09:08
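
One thing worth ruling out for an HTTP 400 like the one above is that the endpoint points at something other than MinIO's S3 API, for example the web console, which is often served on a separate port. This is only a possible cause and is not confirmed in the thread. A minimal sketch, assuming the server exposes MinIO's liveness path `/minio/health/live` on its API port; the helper names are hypothetical.

```python
import http.client
import urllib.request


def minio_health_url(endpoint):
    """Return the liveness URL for a MinIO endpoint (hypothetical helper)."""
    return endpoint.rstrip("/") + "/minio/health/live"


def looks_like_minio_api(endpoint, timeout=3.0):
    """Return True if the liveness probe answers with HTTP 200."""
    try:
        with urllib.request.urlopen(minio_health_url(endpoint), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


# Example: compare the two candidate ports before configuring "s3.endpoint".
# print(looks_like_minio_api("http://localhost:9000"))
# print(looks_like_minio_api("http://localhost:9001"))
```

The port that answers the probe is the one to use for "s3.endpoint"; the calls are left commented out because they need a running server.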

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot], Feb 24 '25 00:02

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot], Mar 11 '25 00:03