Performance issues when loading data from S3
We are doing a small PoC comparing performance when loading Parquet files directly from S3 versus from the local file system.
Jupyter notebook code snippet (the notebook and the S3 bucket are in the same region):
```python
import pyarrow.dataset as ds
import time

s3Dataset = ds.dataset('location')  # S3 or local file system
scanner = s3Dataset.scanner()
batches = scanner.to_batches()

st = time.time()
for batch in batches:
    batch.num_rows
et = time.time()
elapsed = et - st
```
Total files: 50, total size: ~500 MB.
Time taken to read from S3: 970 seconds. Time taken to read from the local file system: 5 seconds.
Any insight into why S3 is taking so much time?
Is there a setting we are missing when reading files from S3 that would give decent performance?
Any help appreciated.
Component(s)
FlightRPC, Parquet, Python, Other
Generally, reading from S3 is much slower than reading from a local file, but this degradation seems far too large. Would you mind profiling to find which part is spending the most time blocking?
Also, `to_batches` can take a filter and several read-ahead arguments, and you can configure the batch size and threading; you could try enlarging them.
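Something like this is what I mean (untested; the parameter values are only examples, and the defaults may already be reasonable depending on the PyArrow version):

```python
import pyarrow.dataset as ds

dataset = ds.dataset('location')
scanner = dataset.scanner(
    batch_size=1_000_000,    # rows per batch handed back by to_batches()
    batch_readahead=32,      # how many batches to prefetch within a file
    fragment_readahead=8,    # how many files to read ahead concurrently
    use_threads=True,
)
for batch in scanner.to_batches():
    batch.num_rows
```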
Will do the profiling and share the details with you.
S3 performance will depend on your connection to the S3 servers. Are you running this test on a local device, or on an EC2 server?
If it's a local device, then what kind of connection do you have to the internet? 970 seconds for 500 MB of data is very slow (roughly 4 Mbps, or about 0.5 MB/s), so any decent connection should be faster than this. If you set the environment variable AWS_EC2_METADATA_DISABLED (e.g. `export AWS_EC2_METADATA_DISABLED=true`), does it have any effect on performance?
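For example, from inside the notebook (a sketch; the variable just needs to be set before the first S3 access is made in the process):

```python
# Disable the EC2 metadata lookup before anything touches S3.
import os
os.environ["AWS_EC2_METADATA_DISABLED"] = "true"

import pyarrow.dataset as ds
dataset = ds.dataset('location')  # same S3 path as in your snippet
```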
@westonpace The code is running in a Kubernetes pod.
Are you able to try setting the environment variable I suggested?
I tried setting the environment variable AWS_EC2_METADATA_DISABLED=true, but I was not able to connect to S3.
I am a little confused then.
When any operation is run by the S3 filesystem, the AWS SDK will attempt to determine credentials for that action. Typically this is done by looking in the user's config file (e.g. ~/.aws/config). If this configuration file is not found, it will attempt to contact a special IP address that EC2 machines have configured, which tells the EC2 machine what its configuration is.
That attempt to contact the special IP address can be very slow, depending on the network configuration of the machine (sometimes it will spend minutes waiting for a timeout). Setting the variable AWS_EC2_METADATA_DISABLED disables that check, but it should only affect your connection if you are on an EC2 machine to begin with. So I do not understand how setting that variable to true can cause connection issues to S3.
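As an aside (just an illustration, not something you necessarily need), you can reduce how much the SDK has to auto-resolve by configuring the filesystem explicitly, e.g. with the bucket's region, and passing it to `ds.dataset`; the region and path below are placeholders:

```python
import pyarrow.dataset as ds
from pyarrow import fs

# Region given explicitly so the SDK does not have to auto-resolve it.
s3 = fs.S3FileSystem(region="us-east-1")   # placeholder region
dataset = ds.dataset("bucket/path/to/data", filesystem=s3)
```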
Can you add these lines to the top of your script (these lines must come before you import any other pyarrow module)? This will add additional debugging information that might help us understand what is happening:
```python
import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)
```
Thanks for that info. Let me try that.
I am trying this in my Jupyter notebook. I can see the logs printed now, but nothing I can highlight. Is there anything I should look out for? Also, one thing I noticed is that my code snippet never finishes.
Also, I tried using AWS_EC2_METADATA_DISABLED=true and got access denied.
> Is there anything I should look out for? Also, one thing I noticed is that my code snippet never finishes.
I'm interested in the timing. I was looking to see whether the 970 seconds was spent waiting for responses from S3, spent trying to resolve configuration, or spent in some other way. It should be possible to determine this by correlating the timestamps in the trace logging.
You could redirect the log output to a file using an approach like this.
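For example, something along these lines (untested; this assumes the trace output is written to the process-level stdout/stderr, which Python-level redirection will not capture):

```python
import os

# Redirect the OS-level stdout/stderr file descriptors into a file so that
# logging emitted from the C++/AWS SDK side is captured as well.
log_fd = os.open("s3_trace.log", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
os.dup2(log_fd, 1)  # stdout
os.dup2(log_fd, 2)  # stderr
```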
> Also, I tried using AWS_EC2_METADATA_DISABLED=true and got access denied.
You mentioned that your code is running in a Kubernetes pod. Is this pod on an EC2 instance? Is this perhaps EKS?
So the time is spent while looping through the record batches. I will try to redirect the logs.
The code is running on EKS.
https://github.com/apache/arrow/issues/36765 Would this be the same issue? @westonpace
I would not expect #36765 to cause this large of a delay.
@shaktishp could you try running this?
```python
ds.dataset(
    'location',
    format=ds.ParquetFileFormat(
        default_fragment_scan_options=ds.ParquetFragmentScanOptions(
            pre_buffer=True
        )
    )
)
```
This may speed up your S3 read time.
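For context, it slots into the original timing snippet roughly like this (untested):

```python
import time
import pyarrow.dataset as ds

dataset = ds.dataset(
    'location',
    format=ds.ParquetFileFormat(
        default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
    ),
)

st = time.time()
for batch in dataset.scanner().to_batches():
    batch.num_rows
print(f"elapsed: {time.time() - st:.1f}s")
```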
Hi @Akshay-A-Kulkarni, I stumbled upon this issue because I was seeing major read-performance problems for files in S3 when using dataset.to_batches. Your suggestion in the previous comment helped A LOT. Would you care to explain what it does exactly and why it affects the read time so much? Thanks!
This issue has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this issue will be closed in 14 days. If this usage question has evolved into a feature request or docs update, please remove the 'Type: usage' label and add the 'Type: enhancement' label instead.