pylance writes to local disk when given a path starting with `s3://`
With the code below, I want to write to an S3 path, but it writes to the local disk directory `s3:` instead.
import lance
import duckdb
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset
import shutil
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('hehe').getOrCreate()
data = [("James", "", "Smith", "36636", "M", 60000),
        ("Michael", "Rose", "", "40288", "M", 70000),
        ("Robert", "", "Williams", "42114", "", 400000),
        ("Maria", "Anne", "Jones", "39192", "F", 500000),
        ("Jen", "Mary", "Brown", "", "F", 0)]
columns = ["first_name", "middle_name", "last_name", "dob", "gender", "salary"]
pysparkDF = spark.createDataFrame(data=data, schema=columns)
pysparkDF.printSchema()
pysparkDF.show(truncate=False)
df = pysparkDF.toPandas()
print("converted to pandas")
# df = pd.DataFrame({"a": [5]})
# shutil.rmtree("/tmp/test_df.lance", ignore_errors=True)
import os
os.environ['AWS_PROFILE'] = 'some-profile-name'
dataset = lance.write_dataset(df, "s3://the-bucket-name/the-path-under-bucket")
If I remove the AWS_PROFILE environment variable, I get
File "/Users/renkaige/tubi/Grinder/python/hehe.py", line 32, in <module>
dataset = lance.write_dataset(df, "s3://the-bucket-name/the-path-under-bucket")
File "/usr/local/Caskroom/miniconda/base/envs/spock3/lib/python3.10/site-packages/lance/dataset.py", line 654, in write_dataset
_write_dataset(reader, uri, params)
OSError: LanceError(I/O): Generic S3 error: response error "request error", after 0 retries: error sending request for url (http://169.254.169.254/latest/api/token): error trying to connect: tcp connect error: Operation timed out (os error 60)
instead of the local directory tree below, which is what the script produces when AWS_PROFILE is set:
➜ python git:(master) ✗ tree s3:
s3:
└── the-bucket-name
└── the-path-under-bucket
├── _latest.manifest
├── _versions
│ └── 1.manifest
└── data
└── 6e3c3a9c-8a43-4c96-b2bf-7c8d0018170f.lance
5 directories, 3 files
The pylance version is 0.4.3 from PyPI.
By adding some logging in my branch, I got the error: IO("Generic S3 error: Profile support requires aws_profile feature").
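In the meantime, a possible workaround is to sidestep profile resolution entirely and supply static credentials through the standard AWS environment variables. Whether this particular pylance build picks them up is an assumption on my part, so treat the snippet below as a sketch; the key and region values are placeholders.

import os

import lance
import pandas as pd

# Sketch of a workaround: avoid AWS_PROFILE (which needs the aws_profile
# feature in the underlying object_store crate) and provide static
# credentials via the standard AWS environment variables instead.
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"          # placeholder
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"  # placeholder
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"               # the bucket's region

df = pd.DataFrame({"a": [5]})
dataset = lance.write_dataset(df, "s3://the-bucket-name/the-path-under-bucket")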
Here we try the object store first. If that fails, we fall back to writing to a local path with the same URL. If the failing URL uses an object-store scheme, the local path should not be the fallback.
https://github.com/eto-ai/lance/blob/d66f2b3887c4fd75bbefbc3e0e055eba9ce618ad/rust/src/io/object_store.rs#L101-L102
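To illustrate what that check could look like, here is a minimal Python sketch of scheme-based dispatch. It is only an illustration of the idea, not the Rust code linked above, and the scheme list and function name are assumptions.

from urllib.parse import urlparse

# Illustration only, not the Rust code in object_store.rs: a URI with a known
# object-store scheme should surface its error rather than fall back to a
# local directory literally named "s3:".
OBJECT_STORE_SCHEMES = {"s3", "gs", "az"}  # assumed set of remote schemes

def classify_uri(uri: str) -> str:
    scheme = urlparse(uri).scheme.lower()
    if scheme in OBJECT_STORE_SCHEMES:
        return "remote"   # credential/config errors should propagate, never fall back
    if scheme in ("", "file") or len(scheme) == 1:
        # No scheme, file://, or a Windows drive letter such as "C:\data"
        # (urlparse reports "c" as the scheme): treat all of these as local.
        return "local"
    raise ValueError(f"unsupported scheme {scheme!r} in {uri!r}")

With a rule like this, the misconfigured-profile error above would be raised to the caller instead of producing the local s3: directory.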
Thanks for the bug submission @Renkai! Appreciate you attaching a PR. We had to move away from handling ParseError::RelativeUrlWithoutBase
because it doesn't work well with Windows paths.
This part of the code was a little brittle, so I refactored it to make it more explicit which schemes are treated as local versus remote file systems. It will be part of the next Lance release.
One final thing to test:
if you don't have any S3 credentials set up, writing to S3 should raise an exception and not silently write to the local drive.
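A rough sketch of such a check, assuming the test machine has no ambient AWS credentials or reachable instance metadata; the bucket name is a placeholder and the expected exception type is deliberately left broad.

import lance
import pandas as pd
import pytest

def test_s3_write_without_credentials_raises(monkeypatch, tmp_path):
    # Strip common credential sources so the object store cannot authenticate.
    for var in ("AWS_PROFILE", "AWS_ACCESS_KEY_ID",
                "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"):
        monkeypatch.delenv(var, raising=False)
    # Run from an empty directory so an accidental local fallback is easy to spot.
    monkeypatch.chdir(tmp_path)

    df = pd.DataFrame({"a": [5]})
    with pytest.raises(Exception):
        lance.write_dataset(df, "s3://the-bucket-name/the-path-under-bucket")

    # The buggy behaviour would have created a local directory named "s3:".
    assert not (tmp_path / "s3:").exists()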