pylance writes to local disk when given a path starting with `s3://`

Open Renkai opened this issue 1 year ago • 3 comments

With the code below, I want to write to an S3 path, but it writes to a local directory named s3: instead.

import lance
import duckdb
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset
import shutil

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('hehe').getOrCreate()

data = [("James", "", "Smith", "36636", "M", 60000),
        ("Michael", "Rose", "", "40288", "M", 70000),
        ("Robert", "", "Williams", "42114", "", 400000),
        ("Maria", "Anne", "Jones", "39192", "F", 500000),
        ("Jen", "Mary", "Brown", "", "F", 0)]

columns = ["first_name", "middle_name", "last_name", "dob", "gender", "salary"]
pysparkDF = spark.createDataFrame(data=data, schema=columns)
pysparkDF.printSchema()
pysparkDF.show(truncate=False)

df = pysparkDF.toPandas()
print("converted to pandas")
# df = pd.DataFrame({"a": [5]})
# shutil.rmtree("/tmp/test_df.lance", ignore_errors=True)
import os

os.environ['AWS_PROFILE'] = 'some-profile-name'
dataset = lance.write_dataset(df, "s3://the-bucket-name/the-path-under-bucket")

If I remove the AWS_PROFILE environment variable, I get

  File "/Users/renkaige/tubi/Grinder/python/hehe.py", line 32, in <module>
    dataset = lance.write_dataset(df, "s3://the-bucket-name/the-path-under-bucket")
  File "/usr/local/Caskroom/miniconda/base/envs/spock3/lib/python3.10/site-packages/lance/dataset.py", line 654, in write_dataset
    _write_dataset(reader, uri, params)
OSError: LanceError(I/O): Generic S3 error: response error "request error", after 0 retries: error sending request for url (http://169.254.169.254/latest/api/token): error trying to connect: tcp connect error: Operation timed out (os error 60)

instead. With AWS_PROFILE set, the write ends up in this local directory tree:

➜  python git:(master) ✗ tree s3:
s3:
└── the-bucket-name
    └── the-path-under-bucket
        ├── _latest.manifest
        ├── _versions
        │   └── 1.manifest
        └── data
            └── 6e3c3a9c-8a43-4c96-b2bf-7c8d0018170f.lance

5 directories, 3 files

The pylance version is 0.4.3, installed from PyPI.

Renkai • Apr 24 '23 08:04

After adding some logging in my branch, I got err: IO("Generic S3 error: Profile support requires aws_profile feature").

Renkai • Apr 24 '23 09:04

The code linked below tries the object store first. If that fails, it tries to write to a local path using the same URL. When the failed URL uses an object store scheme, the local path should not be the fallback.

https://github.com/eto-ai/lance/blob/d66f2b3887c4fd75bbefbc3e0e055eba9ce618ad/rust/src/io/object_store.rs#L101-L102
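The behavior proposed above can be sketched in Python. This is only an illustration of the idea, not lance's Rust implementation: the `REMOTE_SCHEMES` set and the `open_remote` / `open_local` callables are hypothetical names introduced here.

```python
from urllib.parse import urlparse

# Hypothetical list of object store schemes, for illustration only.
REMOTE_SCHEMES = {"s3", "gs", "az"}


def open_store(uri, open_remote, open_local):
    """Pick a backend from the URI scheme and never fall back across backends.

    `open_remote` and `open_local` are caller-supplied callables (assumed
    here, not part of lance) that take the URI and return a store handle.
    """
    scheme = urlparse(uri).scheme.lower()
    if scheme in REMOTE_SCHEMES:
        # Any error from the object store propagates to the caller.
        # There is deliberately no "try local path instead" branch.
        return open_remote(uri)
    return open_local(uri)
```

With this shape, a credentials or feature error on an `s3://` URI surfaces as an exception instead of producing a local `s3:` directory.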

Renkai • Apr 24 '23 09:04

Thanks for the bug submission @Renkai! Appreciate you attaching a PR. We had to move away from handling ParseError::RelativeUrlWithoutBase because it doesn't work well with Windows paths.

This part of the code was a little brittle, so I refactored it to make explicit which schemes are for local files and which are for remote files. It will be part of the next lance release.

gsilvestrin • Apr 24 '23 22:04

One final thing to test:

If you don't have any S3 credentials set up, writing to S3 should raise an exception and not silently write to the local drive.
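One way to check the "not silently write to local drive" half of that test is a small helper that looks for the stray directory a fallback would leave behind. This is a sketch only; `stray_local_dir` is a hypothetical helper, not part of lance.

```python
import os
from typing import Optional
from urllib.parse import urlparse


def stray_local_dir(uri: str, base_dir: str = ".") -> Optional[str]:
    """Return the local "<scheme>:" directory under base_dir if it exists.

    For "s3://bucket/path" a silent local fallback creates a directory
    named "s3:" relative to the working directory, as in the tree output
    in the original report.
    """
    scheme = urlparse(uri).scheme
    if not scheme:
        # Plain paths have no scheme, so there is nothing to check.
        return None
    candidate = os.path.join(base_dir, scheme + ":")
    return candidate if os.path.isdir(candidate) else None
```

A test could call `lance.write_dataset` without credentials, expect it to raise (the traceback in the report shows an `OSError`), and then assert that `stray_local_dir(uri)` returns `None`.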

changhiskhan • Jul 02 '23 23:07