
lance `storage_options` doesn't work well for MinIO s3 backend

zeddit opened this issue 1 year ago · 13 comments

The version of Lance I used

v0.10.9

The problem

I cannot `write_dataset` because it cannot connect to the MinIO S3 backend.

How to reproduce?

The code I used is pasted below.

import os
os.environ["RUST_LOG"] = "debug"
os.environ["LANCE_LOG"] = "debug"

import lance

# followed the instructions given by the official site for MinIO
storage_options = {
    "region": "us-east-1",
    "endpoint": "http://10.0.33.151:31900",
    "access_key_id": "xxx",
    "secret_access_key": "xxx",
    "allow_http": "True",
}

import pyarrow as pa

t = pa.table({'a': [4, 5]})

ds = lance.write_dataset(t, "s3://tmp/test.lance", storage_options=storage_options)
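
As a possible workaround sketch while the `endpoint` option is not honored (assumed, not confirmed to work): supply the same settings through environment variables, so the AWS default provider chain resolves credentials from the environment and never falls through to the instance metadata service. AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION are standard AWS variables; AWS_ENDPOINT and AWS_ALLOW_HTTP are object_store crate conventions, so whether they take effect depends on how Lance builds its store.

import os

# Assumed workaround: configure via environment variables so the default
# credential chain never probes the EC2 metadata service (169.254.169.254).
os.environ["AWS_ACCESS_KEY_ID"] = "xxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxx"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
os.environ["AWS_ENDPOINT"] = "http://10.0.33.151:31900"  # object_store convention
os.environ["AWS_ALLOW_HTTP"] = "true"                    # object_store convention

import lance
import pyarrow as pa

t = pa.table({"a": [4, 5]})
ds = lance.write_dataset(t, "s3://tmp/test.lance")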

The log

[2024-04-08T10:22:04Z INFO  lance::dataset] write_impl; uri="s3://tmp/test.lance"
[2024-04-08T10:22:04Z INFO  aws_config::meta::region] load_region; provider=DefaultRegionChain(RegionProviderChain { providers: [EnvironmentVariableRegionProvider { env: Env(Real) }, ProfileFileRegionProvider { provider_config: ProviderConfig { env: Env(Real), fs: Fs(Real), sleep: Some(SharedAsyncSleep(TokioSleep)), region: None } }, ImdsRegionProvider { client: LazyClient { client: OnceCell { value: None }, builder: Builder { max_attempts: None, endpoint: None, mode_override: None, token_ttl: None, connect_timeout: None, read_timeout: None, config: Some(ProviderConfig { env: Env(Real), fs: Fs(Real), sleep: Some(SharedAsyncSleep(TokioSleep)), region: None }) } }, env: Env(Real) }] })
[2024-04-08T10:22:04Z INFO  aws_config::meta::region] load_region; provider=EnvironmentVariableRegionProvider { env: Env(Real) }
[2024-04-08T10:22:04Z INFO  aws_config::meta::region] load_region; provider=ProfileFileRegionProvider { provider_config: ProviderConfig { env: Env(Real), fs: Fs(Real), sleep: Some(SharedAsyncSleep(TokioSleep)), region: None } }
[2024-04-08T10:22:04Z DEBUG aws_config::fs_util] loaded home directory src="HOME"
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] load_config_file; file=Default(Config)
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] performing home directory substitution home="/home/zhangdai" path="~/.aws/config"
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] home directory expanded before="~/.aws/config" after="/home/zhangdai/.aws/config"
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] config file not found path=~/.aws/config
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] config file loaded path=Some("/home/zhangdai/.aws/config") size=0
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] load_config_file; file=Default(Credentials)
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] performing home directory substitution home="/home/zhangdai" path="~/.aws/credentials"
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] home directory expanded before="~/.aws/credentials" after="/home/zhangdai/.aws/credentials"
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] config file not found path=~/.aws/credentials
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] config file loaded path=Some("/home/zhangdai/.aws/credentials") size=0
[2024-04-08T10:22:04Z DEBUG tracing::span] imds_load_region;
[2024-04-08T10:22:04Z INFO  aws_config::meta::region] load_region; provider=ImdsRegionProvider { client: LazyClient { client: OnceCell { value: None }, builder: Builder { max_attempts: None, endpoint: None, mode_override: None, token_ttl: None, connect_timeout: None, read_timeout: None, config: Some(ProviderConfig { env: Env(Real), fs: Fs(Real), sleep: Some(SharedAsyncSleep(TokioSleep)), region: None }) } }, env: Env(Real) }
[2024-04-08T10:22:04Z DEBUG aws_config::fs_util] loaded home directory src="HOME"
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] load_config_file; file=Default(Config)
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] performing home directory substitution home="/home/zhangdai" path="~/.aws/config"
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] home directory expanded before="~/.aws/config" after="/home/zhangdai/.aws/config"
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] config file not found path=~/.aws/config
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] config file loaded path=Some("/home/zhangdai/.aws/config") size=0
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] load_config_file; file=Default(Credentials)
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] performing home directory substitution home="/home/zhangdai" path="~/.aws/credentials"
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] home directory expanded before="~/.aws/credentials" after="/home/zhangdai/.aws/credentials"
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] config file not found path=~/.aws/credentials
[2024-04-08T10:22:04Z DEBUG aws_config::profile::parser::source] config file loaded path=Some("/home/zhangdai/.aws/credentials") size=0
[2024-04-08T10:22:04Z DEBUG aws_smithy_client] send_operation;
[2024-04-08T10:22:04Z DEBUG aws_smithy_client] send_operation; operation="get"
[2024-04-08T10:22:04Z DEBUG aws_smithy_client] send_operation; service="imds"
[2024-04-08T10:22:04Z DEBUG aws_smithy_http_tower::map_request] async_map_request; name="attach_imds_token"
[2024-04-08T10:22:04Z DEBUG aws_smithy_client] send_operation;
[2024-04-08T10:22:04Z DEBUG aws_smithy_client] send_operation; operation="get-token"
[2024-04-08T10:22:04Z DEBUG aws_smithy_client] send_operation; service="imds"
[2024-04-08T10:22:04Z DEBUG aws_smithy_http_tower::map_request] map_request; name="generate_user_agent"
[2024-04-08T10:22:04Z DEBUG tracing::span] dispatch;
[2024-04-08T10:22:04Z DEBUG hyper::client::connect::http] connecting to 169.254.169.254:80
[2024-04-08T10:22:05Z DEBUG aws_smithy_client] send_operation; status="dispatch_failure"
[2024-04-08T10:22:05Z DEBUG aws_smithy_client] send_operation; message=dispatch failure: timeout: error trying to connect: HTTP connect timeout occurred after 1s: HTTP connect timeout occurred after 1s: timed out (DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 1s }), connection: Unknown } }))
[2024-04-08T10:22:05Z DEBUG aws_smithy_client] send_operation; status="construction_failure"
[2024-04-08T10:22:05Z DEBUG aws_smithy_client] send_operation; message=failed to construct request: failed to load IMDS session token: dispatch failure: timeout: error trying to connect: HTTP connect timeout occurred after 1s: HTTP connect timeout occurred after 1s: timed out (ConstructionFailure(ConstructionFailure { source: FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 1s }), connection: Unknown } }) }) }))
[2024-04-08T10:22:05Z WARN  aws_config::imds::region] failed to load region from IMDS err=failed to load IMDS session token: dispatch failure: timeout: error trying to connect: HTTP connect timeout occurred after 1s: HTTP connect timeout occurred after 1s: timed out (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 1s }), connection: Unknown } }) }))
[2024-04-08T10:22:05Z INFO  aws_config::meta::region] load_region; provider="us-west-2"
[2024-04-08T10:22:05Z INFO  aws_config::meta::region] load_region; provider=DefaultRegionChain(RegionProviderChain { providers: [EnvironmentVariableRegionProvider { env: Env(Real) }, ProfileFileRegionProvider { provider_config: ProviderConfig { env: Env(Real), fs: Fs(Real), sleep: Some(SharedAsyncSleep(TokioSleep)), region: None } }, ImdsRegionProvider { client: LazyClient { client: OnceCell { value: Some(Ok(Client { inner: ClientInner { endpoint: http://169.254.169.254/, smithy_client: Client { connector: DynConnector, middleware: ImdsMiddleware { token_loader: ImdsTokenMiddleware }, retry_policy: Standard { config: Config { initial_retry_tokens: 500, retry_cost: 5, no_retry_increment: 1, timeout_retry_cost: 10, max_attempts: 4, initial_backoff: 1s, max_backoff: 20s, base: 0x7f733eb5ead0 }, shared_state: CrossRequestRetryState { quota_available: Mutex { data: 500, poisoned: false, .. } } }, reconnect_mode: ReuseAllConnections, operation_timeout_config: OperationTimeoutConfig { operation_timeout: None, operation_attempt_timeout: None }, sleep_impl: Some(SharedAsyncSleep(TokioSleep)) } } })) }, builder: Builder { max_attempts: None, endpoint: None, mode_override: None, token_ttl: None, connect_timeout: None, read_timeout: None, config: Some(ProviderConfig { env: Env(Real), fs: Fs(Real), sleep: Some(SharedAsyncSleep(TokioSleep)), region: None }) } }, env: Env(Real) }] })
[2024-04-08T10:22:05Z INFO  aws_config::meta::region] load_region; provider=EnvironmentVariableRegionProvider { env: Env(Real) }
[2024-04-08T10:22:05Z INFO  aws_config::meta::region] load_region; provider=ProfileFileRegionProvider { provider_config: ProviderConfig { env: Env(Real), fs: Fs(Real), sleep: Some(SharedAsyncSleep(TokioSleep)), region: None } }
[2024-04-08T10:22:05Z DEBUG tracing::span] imds_load_region;
[2024-04-08T10:22:05Z INFO  aws_config::meta::region] load_region; provider=ImdsRegionProvider { client: LazyClient { client: OnceCell { value: Some(Ok(Client { inner: ClientInner { endpoint: http://169.254.169.254/, smithy_client: Client { connector: DynConnector, middleware: ImdsMiddleware { token_loader: ImdsTokenMiddleware }, retry_policy: Standard { config: Config { initial_retry_tokens: 500, retry_cost: 5, no_retry_increment: 1, timeout_retry_cost: 10, max_attempts: 4, initial_backoff: 1s, max_backoff: 20s, base: 0x7f733eb5ead0 }, shared_state: CrossRequestRetryState { quota_available: Mutex { data: 500, poisoned: false, .. } } }, reconnect_mode: ReuseAllConnections, operation_timeout_config: OperationTimeoutConfig { operation_timeout: None, operation_attempt_timeout: None }, sleep_impl: Some(SharedAsyncSleep(TokioSleep)) } } })) }, builder: Builder { max_attempts: None, endpoint: None, mode_override: None, token_ttl: None, connect_timeout: None, read_timeout: None, config: Some(ProviderConfig { env: Env(Real), fs: Fs(Real), sleep: Some(SharedAsyncSleep(TokioSleep)), region: None }) } }, env: Env(Real) }
[2024-04-08T10:22:05Z DEBUG aws_smithy_client] send_operation;
[2024-04-08T10:22:05Z DEBUG aws_smithy_client] send_operation; operation="get"
[2024-04-08T10:22:05Z DEBUG aws_smithy_client] send_operation; service="imds"
[2024-04-08T10:22:05Z DEBUG aws_smithy_http_tower::map_request] async_map_request; name="attach_imds_token"
[2024-04-08T10:22:05Z DEBUG aws_smithy_client] send_operation;
[2024-04-08T10:22:05Z DEBUG aws_smithy_client] send_operation; operation="get-token"
[2024-04-08T10:22:05Z DEBUG aws_smithy_client] send_operation; service="imds"
[2024-04-08T10:22:05Z DEBUG aws_smithy_http_tower::map_request] map_request; name="generate_user_agent"
[2024-04-08T10:22:05Z DEBUG tracing::span] dispatch;
[2024-04-08T10:22:05Z DEBUG hyper::client::connect::http] connecting to 169.254.169.254:80
[2024-04-08T10:22:06Z DEBUG aws_smithy_client] send_operation; status="dispatch_failure"
[2024-04-08T10:22:06Z DEBUG aws_smithy_client] send_operation; message=dispatch failure: timeout: error trying to connect: HTTP connect timeout occurred after 1s: HTTP connect timeout occurred after 1s: timed out (DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 1s }), connection: Unknown } }))
[2024-04-08T10:22:06Z DEBUG aws_smithy_client] send_operation; status="construction_failure"
[2024-04-08T10:22:06Z DEBUG aws_smithy_client] send_operation; message=failed to construct request: failed to load IMDS session token: dispatch failure: timeout: error trying to connect: HTTP connect timeout occurred after 1s: HTTP connect timeout occurred after 1s: timed out (ConstructionFailure(ConstructionFailure { source: FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 1s }), connection: Unknown } }) }) }))
[2024-04-08T10:22:06Z WARN  aws_config::imds::region] failed to load region from IMDS err=failed to load IMDS session token: dispatch failure: timeout: error trying to connect: HTTP connect timeout occurred after 1s: HTTP connect timeout occurred after 1s: timed out (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 1s }), connection: Unknown } }) }))
[2024-04-08T10:22:06Z INFO  aws_config::meta::region] load_region; provider="us-west-2"
[2024-04-08T10:22:06Z DEBUG tracing::span] build_profile_provider;
[2024-04-08T10:22:06Z DEBUG aws_sdk_sts::config] using retry strategy with partition 'sts'
[2024-04-08T10:22:06Z DEBUG aws_config::default_provider::credentials] provide_credentials; provider=default_chain
[2024-04-08T10:22:06Z DEBUG aws_config::meta::credentials::chain] load_credentials; provider=Environment
[2024-04-08T10:22:06Z DEBUG aws_config::meta::credentials::chain] provider in chain did not provide credentials provider=Environment context=the credential provider was not enabled: environment variable not set (CredentialsNotLoaded(CredentialsNotLoaded { source: "environment variable not set" }))
[2024-04-08T10:22:06Z DEBUG aws_config::meta::credentials::chain] load_credentials; provider=Profile
[2024-04-08T10:22:06Z DEBUG aws_config::fs_util] loaded home directory src="HOME"
[2024-04-08T10:22:06Z DEBUG aws_config::profile::parser::source] load_config_file; file=Default(Config)
[2024-04-08T10:22:06Z DEBUG aws_config::profile::parser::source] performing home directory substitution home="/home/zhangdai" path="~/.aws/config"
[2024-04-08T10:22:06Z DEBUG aws_config::profile::parser::source] home directory expanded before="~/.aws/config" after="/home/zhangdai/.aws/config"
[2024-04-08T10:22:06Z DEBUG aws_config::profile::parser::source] config file not found path=~/.aws/config
[2024-04-08T10:22:06Z DEBUG aws_config::profile::parser::source] config file loaded path=Some("/home/zhangdai/.aws/config") size=0
[2024-04-08T10:22:06Z DEBUG aws_config::profile::parser::source] load_config_file; file=Default(Credentials)
[2024-04-08T10:22:06Z DEBUG aws_config::profile::parser::source] performing home directory substitution home="/home/zhangdai" path="~/.aws/credentials"
[2024-04-08T10:22:06Z DEBUG aws_config::profile::parser::source] home directory expanded before="~/.aws/credentials" after="/home/zhangdai/.aws/credentials"
[2024-04-08T10:22:06Z DEBUG aws_config::profile::parser::source] config file not found path=~/.aws/credentials
[2024-04-08T10:22:06Z DEBUG aws_config::profile::parser::source] config file loaded path=Some("/home/zhangdai/.aws/credentials") size=0
[2024-04-08T10:22:06Z DEBUG aws_config::meta::credentials::chain] provider in chain did not provide credentials provider=Profile context=the credential provider was not enabled: No profiles were defined (CredentialsNotLoaded(CredentialsNotLoaded { source: NoProfilesDefined }))
[2024-04-08T10:22:06Z DEBUG aws_config::meta::credentials::chain] load_credentials; provider=WebIdentityToken
[2024-04-08T10:22:06Z DEBUG aws_config::meta::credentials::chain] provider in chain did not provide credentials provider=WebIdentityToken context=the credential provider was not enabled: $AWS_WEB_IDENTITY_TOKEN_FILE was not set (CredentialsNotLoaded(CredentialsNotLoaded { source: "$AWS_WEB_IDENTITY_TOKEN_FILE was not set" }))
[2024-04-08T10:22:06Z DEBUG aws_config::meta::credentials::chain] load_credentials; provider=EcsContainer
[2024-04-08T10:22:06Z DEBUG aws_config::meta::credentials::chain] provider in chain did not provide credentials provider=EcsContainer context=the credential provider was not enabled: ECS provider not configured (CredentialsNotLoaded(CredentialsNotLoaded { source: "ECS provider not configured" }))
[2024-04-08T10:22:06Z DEBUG aws_config::meta::credentials::chain] load_credentials; provider=Ec2InstanceMetadata
[2024-04-08T10:22:06Z DEBUG aws_config::imds::credentials] loading credentials from IMDS
[2024-04-08T10:22:06Z DEBUG aws_smithy_client] send_operation;
[2024-04-08T10:22:06Z DEBUG aws_smithy_client] send_operation; operation="get"
[2024-04-08T10:22:06Z DEBUG aws_smithy_client] send_operation; service="imds"
[2024-04-08T10:22:06Z DEBUG aws_smithy_http_tower::map_request] async_map_request; name="attach_imds_token"
[2024-04-08T10:22:06Z DEBUG aws_smithy_client] send_operation;
[2024-04-08T10:22:06Z DEBUG aws_smithy_client] send_operation; operation="get-token"
[2024-04-08T10:22:06Z DEBUG aws_smithy_client] send_operation; service="imds"
[2024-04-08T10:22:06Z DEBUG aws_smithy_http_tower::map_request] map_request; name="generate_user_agent"
[2024-04-08T10:22:06Z DEBUG tracing::span] dispatch;
[2024-04-08T10:22:06Z DEBUG hyper::client::connect::http] connecting to 169.254.169.254:80
[2024-04-08T10:22:07Z DEBUG aws_smithy_client] send_operation; status="dispatch_failure"
[2024-04-08T10:22:07Z DEBUG aws_smithy_client] send_operation; message=dispatch failure: timeout: error trying to connect: HTTP connect timeout occurred after 1s: HTTP connect timeout occurred after 1s: timed out (DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 1s }), connection: Unknown } }))
[2024-04-08T10:22:07Z DEBUG aws_smithy_client] send_operation; status="construction_failure"
[2024-04-08T10:22:07Z DEBUG aws_smithy_client] send_operation; message=failed to construct request: failed to load IMDS session token: dispatch failure: timeout: error trying to connect: HTTP connect timeout occurred after 1s: HTTP connect timeout occurred after 1s: timed out (ConstructionFailure(ConstructionFailure { source: FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 1s }), connection: Unknown } }) }) }))
[2024-04-08T10:22:07Z DEBUG aws_config::meta::credentials::chain] provider in chain did not provide credentials provider=Ec2InstanceMetadata context=the credential provider was not enabled: could not communicate with IMDS: dispatch failure: timeout: error trying to connect: HTTP connect timeout occurred after 1s: HTTP connect timeout occurred after 1s: timed out (CredentialsNotLoaded(CredentialsNotLoaded { source: ImdsCommunicationError { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 1s }), connection: Unknown } }) } }))
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[29], line 1
----> 1 ds2 = lance.write_dataset(t,"s3://tmp/test.lance", storage_options=storage_options)

File ~/.conda/arctic/lib/python3.10/site-packages/lance/dataset.py:2410, in write_dataset(data_obj, uri, schema, mode, max_rows_per_file, max_rows_per_group, max_bytes_per_file, commit_lock, progress, storage_options)
   2407     params["commit_handler"] = commit_lock
   2409 uri = os.fspath(uri) if isinstance(uri, Path) else uri
-> 2410 inner_ds = _write_dataset(reader, uri, params)
   2412 ds = LanceDataset.__new__(LanceDataset)
   2413 ds._ds = inner_ds

OSError: LanceError(IO): Generic N/A error: Encountered internal error. Please file a bug report at https://github.com/lancedb/lance/issues. Failed to get AWS credentials: the credential provider was not enabled, /home/runner/work/lance/lance/rust/lance-io/src/object_store.rs:171:31, /home/runner/work/lance/lance/rust/lance-table/src/io/commit.rs:100:26

zeddit · Apr 08 '24 10:04

There is a strange IP address, 169.254.169.254:80, that it keeps trying to connect to, so I think the `endpoint` setting didn't take effect.

zeddit · Apr 08 '24 10:04

I believe this should be fixed by https://github.com/lancedb/lance/pull/2147, which will be part of the next release.

wjones127 · Apr 08 '24 15:04

@wjones127 I can test it and give feedback here. Is there a plan for when the next version of Lance will be released?

zeddit · Apr 08 '24 15:04

New release is out. LMK if your issue persists or if we can close this issue.

https://github.com/lancedb/lance/releases/tag/v0.10.10

wjones127 · Apr 08 '24 18:04

@wjones127 I have just tested write_dataset with the create, overwrite, and append modes, and it works!

It only generates a warning like:

[2024-04-09T09:09:03Z WARN  lance_table::io::commit] Using unsafe commit handler. Concurrent writes may result in data loss. Consider providing a commit handler that prevents conflicting writes.

I will test more write methods like add_column later and give feedback here.

zeddit · Apr 09 '24 09:04
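
Regarding the "unsafe commit handler" warning above: a minimal sketch of passing a commit_lock to write_dataset, assuming the CommitLock protocol is a callable that takes a version number and returns a context manager (the commit_lock parameter itself is visible in the write_dataset signature in the traceback). The threading.Lock below only guards writers within one process; a real deployment would back this with a distributed lock (DynamoDB, Redis, ZooKeeper, ...).

import threading
from contextlib import contextmanager

import lance
import pyarrow as pa

_lock = threading.Lock()  # illustrative only: protects a single process

@contextmanager
def commit_lock(version: int):
    # Assumed interface: Lance invokes this around each commit attempt and
    # holds the returned context manager for the duration of the commit.
    with _lock:
        yield

t = pa.table({"a": [4, 5]})
ds = lance.write_dataset(
    t,
    "s3://tmp/test.lance",
    storage_options=storage_options,  # as defined earlier in the thread
    commit_lock=commit_lock,
)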

@wjones127 the performance is poor when writing a large dataset to the MinIO S3 backend.

I measured a write bandwidth of about 35 MB/s by writing a 6 GB dataset in append mode; it takes about 250 seconds, and the result is reproducible.

The schema of the tested table is:

date: timestamp[ns]
time: duration[ns]
sym: string
prevclose: double
open: double
high: double
low: double
close: double
volume: double
turnover: double
tradecount: double
bid1: double
bsize1: double
bid2: double
bsize2: double
bid3: double
bsize3: double
bid4: double
bsize4: double
bid5: double
bsize5: double
bid6: double
bsize6: double
bid7: double
bsize7: double
bid8: double
bsize8: double
bid9: double
bsize9: double
bid10: double
bsize10: double
ask1: double
asize1: double
ask2: double
asize2: double
ask3: double
asize3: double
ask4: double
asize4: double
ask5: double
asize5: double
ask6: double
asize6: double
ask7: double
asize7: double
ask8: double
asize8: double
ask9: double
asize9: double
ask10: double
asize10: double
avgbid: double
avgask: double
totalbsize: double
totalasize: double
iopv: double

which contains 56 columns; each row is about 448 bytes.

My MinIO cluster is composed of HDDs and has shown a peak write bandwidth of over 500 MB/s in previous tests.

I wonder if it would be better with a MinIO cluster on NVMe flash disks. I have also noticed a storage option named s3_express; I wonder if it would help to switch that flag on.

zeddit · Apr 10 '24 07:04

I measured a write bandwidth of about 35 MB/s by writing a 6 GB dataset in append mode

I think we simply haven't done much to optimize the write speed. So there are probably some basic improvements we could make. Here's an example: https://github.com/lancedb/lance/issues/1980

wjones127 · Apr 10 '24 15:04

@westonpace I don't know if we are benchmarking write performance yet as part of the V2 file format, but it might be worth examining when we get there. It seems like wide schemas (like the one above) are worth benchmarking.

wjones127 · Apr 10 '24 15:04

I have not been benchmarking write performance. I don't know that I'd expect much difference from v1. Writes should be nicely batched already by object_store.

@zeddit thanks for the feedback; this is something we want to be fast. Are you noticing better performance with other filesystems? In other words, is this poor performance isolated to MinIO and the storage options?

Can you provide some example code showing how you are doing the appends (e.g., what batch size, etc.)?

westonpace · Apr 11 '24 02:04

@westonpace I am using the default configuration for write_dataset in append mode. Here is my code for repeatedly appending data into the storage.

import time

import lance
import pandas as pd
import pyarrow as pa

storage_options = {
    "region": "us-east-1",
    "endpoint": "http://xxx:31900",
    "access_key_id": "xxx",
    "secret_access_key": "xxx",
    "allow_http": "True",
}
...
    for date in pd.date_range('2022-01-01', '2022-12-31'):
        print("processing ", date)
        time1 = time.time()
        df = load_cstick_data(q, date)  # user-specific loader for one day of tick data
        time2 = time.time()
        if len(df) == 0:
            print("empty dataset", ", load time ", time2 - time1)
            continue

        table = pa.Table.from_pandas(df)
        size_in_mb = table.get_total_buffer_size() / 1024 / 1024

        time3 = time.time()
        ds = lance.write_dataset(table, "s3://lance-aqdatac/cstick.lance",
                                 storage_options=storage_options, mode='append')
        time4 = time.time()

        del table
        del df
        print("processing ", date, ": load time ", time2 - time1,
              ", store time ", time4 - time3,
              ", write bw ", size_in_mb / (time4 - time3), "MB/s")

Every day's dataframe looks like the screenshot below: it's data for different stocks within the same day, with rows grouped by stock name and sorted by time. [Screenshot 2024-04-11 13:24:00]

And the log looks like:

processing  2022-12-18 00:00:00
empty dataset , load time  0.02532219886779785
processing  2022-12-19 00:00:00
processing  2022-12-19 00:00:00 : load time  44.19400453567505 , store time  234.0744378566742 , write bw  38.103867030709225 MB/s
processing  2022-12-20 00:00:00
processing  2022-12-20 00:00:00 : load time  42.77091145515442 , store time  221.6676104068756 , write bw  39.01740158925599 MB/s
processing  2022-12-21 00:00:00
processing  2022-12-21 00:00:00 : load time  43.32177972793579 , store time  225.58643007278442 , write bw  37.925962583030326 MB/s
processing  2022-12-22 00:00:00
processing  2022-12-22 00:00:00 : load time  43.047850131988525 , store time  225.87686729431152 , write bw  38.28062925155262 MB/s
processing  2022-12-23 00:00:00
processing  2022-12-23 00:00:00 : load time  41.90263867378235 , store time  221.04602098464966 , write bw  38.45453332587091 MB/s
processing  2022-12-24 00:00:00
empty dataset , load time  0.4522101879119873
processing  2022-12-25 00:00:00
empty dataset , load time  0.027242660522460938
processing  2022-12-26 00:00:00
processing  2022-12-26 00:00:00 : load time  45.25144124031067 , store time  220.3717076778412 , write bw  38.28469509070068 MB/s
processing  2022-12-27 00:00:00
processing  2022-12-27 00:00:00 : load time  41.54160690307617 , store time  227.8965196609497 , write bw  37.161483162551676 MB/s
processing  2022-12-28 00:00:00
processing  2022-12-28 00:00:00 : load time  51.093907833099365 , store time  218.39990854263306 , write bw  38.846988444040946 MB/s
processing  2022-12-29 00:00:00
processing  2022-12-29 00:00:00 : load time  41.18129205703735 , store time  218.02396368980408 , write bw  38.74893823399365 MB/s
processing  2022-12-30 00:00:00
processing  2022-12-30 00:00:00 : load time  41.51966452598572 , store time  206.42415380477905 , write bw  40.5470651685072 MB/s
processing  2022-12-31 00:00:00
empty dataset , load time  0.05005764961242676

Are you noticing better performance with other filesystems? In other words, is this poor performance isolated to MinIO and the storage options?

I have tested writes with pyiceberg v0.6.0 and ArcticDB v4.3.1; both are much faster than Lance. ArcticDB is super fast, and I know it achieves this with parallel writes. Writing the same 6 GB dataset with ArcticDB takes about 4.3 s, which is almost line rate for our 10 Gbps NIC. pyiceberg takes a comparable amount of time.

zeddit · Apr 11 '24 05:04

@westonpace one more thing: I have written about 2 TB of data in total into the Lance dataset in the way described above, and I found that the dataset becomes super slow when I query out all the data for a single stock, which is only about 500 MB.

My code looks like this:

ds.to_table(filter="sym = '000002.SZ'").to_pandas()

and it takes much more time than I expected to return.

I know it would be better to partition the dataset by stock name, but Lance doesn't currently support that, so I expect filter push-down to reduce the IO and hence the data load time.

I am not sure whether the long load time is caused by read performance (similar to the write performance) or because filter push-down didn't filter out enough row groups.

How can I find out the actual amount of data Lance loads? I could do some more benchmarks and give feedback here.

zeddit · Apr 11 '24 05:04
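
One crude way to put numbers on this, as a sketch (assuming lance.dataset accepts the same storage_options as write_dataset; Table.nbytes is the logical size of the data returned, not the bytes actually fetched from S3, so it only bounds the scan from below):

import time

import lance

ds = lance.dataset("s3://lance-aqdatac/cstick.lance", storage_options=storage_options)

start = time.time()
tbl = ds.to_table(filter="sym = '000002.SZ'")
elapsed = time.time() - start

# Compare the ~500 MB the filter should return against wall-clock time.
print(f"rows={tbl.num_rows}"
      f" logical_mb={tbl.nbytes / 2**20:.1f}"
      f" seconds={elapsed:.1f}"
      f" effective_mb_per_s={tbl.nbytes / 2**20 / elapsed:.1f}")

If only a few columns are needed, passing columns=[...] to to_table should shrink the IO further on a schema this wide.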

Thinking back, our bad write performance on S3 might be a consequence of this change: https://github.com/lancedb/lance/pull/1921

We flush after every batch, which for datasets like this might mean making many tiny requests. Something we should probably rewrite soon.

@zeddit Have you tried bumping up the max_rows_per_group parameter? The default of 1024 is designed for heavy AI data (vectors, images, and so on), but for time series data something like 10240 or 102400 might be more appropriate.

wjones127 · Apr 11 '24 16:04
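
For concreteness, a minimal sketch of the suggested change (table and storage_options as in the append loop above; max_rows_per_group appears in the write_dataset signature in the traceback):

import lance

# Larger row groups mean fewer, larger flushes to the object store, which
# should help with narrow time-series rows like the ~448-byte rows above.
ds = lance.write_dataset(
    table,
    "s3://lance-aqdatac/cstick.lance",
    storage_options=storage_options,
    mode="append",
    max_rows_per_group=102400,  # default is 1024
)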

@wjones127 I increased max_rows_per_group to 1024*128 and the write performance got a little better, but it is still not impressive.

Writing about 6 GB of data now takes a bit over a minute, a write bandwidth of about 100 MB/s.

Besides, I found the dataset is stored on S3 with no compression.

zeddit · Apr 12 '24 14:04

Closing this since the original issue was resolved.

Feel free to open another issue about write performance if you want to discuss that further.

wjones127 · Jun 26 '24 22:06