Performance issues when loading data from S3
We are doing a small PoC comparing performance when loading Parquet files directly from S3 versus from the local file system.
Jupyter notebook code snippet (the notebook and the S3 bucket are in the same region):
```python
import pyarrow.dataset as ds
import time

s3Dataset = ds.dataset('location')  # S3 or local file system
scanner = s3Dataset.scanner()
batches = scanner.to_batches()

st = time.time()
for batch in batches:
    batch.num_rows
et = time.time()
elapsed = et - st
```
Total files: 50, total size: ~500 MB.
Time taken to read from S3: 970 seconds. Time taken to read from the local file system: 5 seconds.
Any insight into why S3 is taking so much time?
Is there a setting we are missing when reading files from S3 that would give decent performance?
Any help appreciated.
Component(s)
FlightRPC, Parquet, Python, Other
Generally, reading from S3 is much slower than reading from a local file, but this degradation seems far too large. Would you mind profiling to find which part is spending the most time blocking?
Also, `to_batches` can take a filter and several read-ahead arguments, and you can configure the batch size and threading; you could try enlarging them.
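Something like this is what I mean (untested; the parameter values are only examples, and the defaults may already be reasonable depending on the PyArrow version):

```python
import pyarrow.dataset as ds

dataset = ds.dataset('location')
scanner = dataset.scanner(
    batch_size=1_000_000,    # rows per batch handed back by to_batches()
    batch_readahead=32,      # how many batches to prefetch within a file
    fragment_readahead=8,    # how many files to read ahead concurrently
    use_threads=True,
)
for batch in scanner.to_batches():
    batch.num_rows
```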
Will do the profiling and share the details with you.
S3 performance will depend on your connection to the S3 servers. Are you running this test on a local device, or on an EC2 server?
If it's a local device, then what kind of connection do you have to the internet? 970 seconds for 500 MB of data is very slow (roughly 4 Mbps, or about 0.5 MB/s), so any decent connection should be faster than this. If you set the environment variable AWS_EC2_METADATA_DISABLED (e.g. `export AWS_EC2_METADATA_DISABLED=true`), does it have any effect on performance?
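For example, from inside the notebook (a sketch; the variable just needs to be set before the first S3 access is made in the process):

```python
# Disable the EC2 metadata lookup before anything touches S3.
import os
os.environ["AWS_EC2_METADATA_DISABLED"] = "true"

import pyarrow.dataset as ds
dataset = ds.dataset('location')  # same S3 path as in your snippet
```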
@westonpace The code is running in a Kubernetes pod.
Are you able to try setting the environment variable I suggested?
I tried setting the environment variable AWS_EC2_METADATA_DISABLED=true, but I was not able to connect to S3.
I am a little confused then.
When any operation is run by the S3 filesystem, the AWS SDK will attempt to determine credentials for that action. Typically this is done by looking in the user's config file (e.g. ~/.aws/config). If this configuration file is not found, it will attempt to contact a special IP address that EC2 machines have configured, which tells the EC2 machine what its configuration is.
That attempt to contact the special IP address can be very slow, depending on the network configuration of the machine (sometimes it will spend minutes waiting for a timeout). Setting the variable AWS_EC2_METADATA_DISABLED disables that check, but it should only affect your connection if you are on an EC2 machine to begin with. So I do not understand how setting that variable to true can cause connection issues to S3.
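As an aside (just an illustration, not something you necessarily need), you can reduce how much the SDK has to auto-resolve by configuring the filesystem explicitly, e.g. with the bucket's region, and passing it to `ds.dataset`; the region and path below are placeholders:

```python
import pyarrow.dataset as ds
from pyarrow import fs

# Region given explicitly so the SDK does not have to auto-resolve it.
s3 = fs.S3FileSystem(region="us-east-1")   # placeholder region
dataset = ds.dataset("bucket/path/to/data", filesystem=s3)
```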
Can you add these lines to the top of your script (these lines must come before you import any other pyarrow module)? This will add additional debugging information that might help us understand what is happening:
```python
import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)
```
Thanks for that info. Let me try that.
I am trying this in my Jupyter notebook. I can see the logs printed now, but nothing I can highlight. Is there anything I should look out for? Also, one thing I noticed is that my code snippet never finishes.
Also, I tried using AWS_EC2_METADATA_DISABLED=true and got access denied.
> Is there anything I should look out for? Also, one thing I noticed is that my code snippet never finishes.
I'm interested in the timing. I was looking to see whether the 970 seconds was spent waiting for responses from S3, spent trying to resolve configuration, or spent in some other way. It should be possible to determine this by correlating the timestamps in the trace logging.
You could redirect the log output to a file using an approach like this.
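For example, something along these lines (untested; this assumes the trace output is written to the process-level stdout/stderr, which Python-level redirection will not capture):

```python
import os

# Redirect the OS-level stdout/stderr file descriptors into a file so that
# logging emitted from the C++/AWS SDK side is captured as well.
log_fd = os.open("s3_trace.log", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
os.dup2(log_fd, 1)  # stdout
os.dup2(log_fd, 2)  # stderr
```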
> Also, I tried using AWS_EC2_METADATA_DISABLED=true and got access denied.
You mentioned that your code is running in a Kubernetes pod. Is this pod on an EC2 instance? Is this perhaps EKS?
So the time is spent while looping through the record batches. I will try to redirect the logs.
The code is running on EKS.
https://github.com/apache/arrow/issues/36765 Would this be the same issue? @westonpace
I would not expect #36765 to cause this large of a delay.
@shaktishp could you try running this?
```python
ds.dataset(
    'location',
    format=ds.ParquetFileFormat(
        default_fragment_scan_options=ds.ParquetFragmentScanOptions(
            pre_buffer=True
        )
    )
)
```
This may speed up your S3 read time.
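For context, it slots into the original timing snippet roughly like this (untested):

```python
import time
import pyarrow.dataset as ds

dataset = ds.dataset(
    'location',
    format=ds.ParquetFileFormat(
        default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
    ),
)

st = time.time()
for batch in dataset.scanner().to_batches():
    batch.num_rows
print(f"elapsed: {time.time() - st:.1f}s")
```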
Hi @Akshay-A-Kulkarni, I stumbled upon this issue because I was seeing major read-performance problems for files in S3 when using dataset.to_batches. Your suggestion in the previous comment helped A LOT. Would you care to explain what it does exactly and why it affects the read time so much? Thanks!
This issue has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this issue will be closed in 14 days. If this usage question has evolved into a feature request or docs update, please remove the 'Type: usage' label and add the 'Type: enhancement' label instead.