
Impl s3fs cursor

Open laughingman7743 opened this issue 2 years ago • 1 comments

Implement a cursor that reads CSV files in S3 without using Pandas. It would also be good to be able to use awsathena+s3fs in SQLAlchemy.

https://github.com/fsspec/s3fs https://docs.python.org/3/library/csv.html
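A minimal sketch of the idea, assuming the cursor only needs to turn a binary file object into CSV rows with the standard `csv` module. The helper name `iter_csv_rows` and the bucket and key in the comment are hypothetical; the S3 part is left as a comment so the snippet runs without credentials.

```python
import csv
import io
from typing import BinaryIO, Iterator, List


def iter_csv_rows(binary_file: BinaryIO, encoding: str = "utf-8") -> Iterator[List[str]]:
    """Yield CSV rows from any binary file-like object, one row at a time."""
    # newline="" leaves line endings untouched so the csv module can
    # handle line breaks inside quoted fields itself.
    text = io.TextIOWrapper(binary_file, encoding=encoding, newline="")
    yield from csv.reader(text)


# With s3fs installed, the file object could come from S3
# (bucket and key below are hypothetical):
#   import s3fs
#   fs = s3fs.S3FileSystem()
#   with fs.open("s3://example-bucket/results/query-id.csv", "rb") as f:
#       for row in iter_csv_rows(f):
#           ...

# In-memory stand-in for an S3 object, so the sketch runs anywhere.
sample = io.BytesIO(b'id,name\r\n1,"a,b"\r\n2,"line\nbreak"\r\n')
rows = list(iter_csv_rows(sample))
```

Because the helper only depends on the file-object interface, the same code would work with any backend that returns a readable binary stream, not just s3fs.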

laughingman7743, Jan 23 '22 07:01

AbstractFileSystem: https://github.com/fsspec/filesystem_spec/blob/2022.7.1/fsspec/spec.py#L92

AbstractBufferedFile: https://github.com/fsspec/filesystem_spec/blob/2022.7.1/fsspec/spec.py#L1299

S3FileSystem: https://github.com/fsspec/s3fs/blob/2022.7.1/s3fs/core.py#L168

S3File: https://github.com/fsspec/s3fs/blob/2022.7.1/s3fs/core.py#L1822

It appears that awswrangler splits each file into smaller chunks and uses ThreadPoolExecutor to retrieve the chunks in parallel. https://github.com/awslabs/aws-data-wrangler/blob/2.16.1/awswrangler/s3/_fs.py#L262-L300
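The chunk-and-parallelize approach can be sketched roughly as follows. This is not awswrangler's code: `split_ranges`, `fetch_parallel`, and the `fetch_range` callable are hypothetical names, and the in-memory `fake_fetch` stands in for a ranged S3 read (with boto3 that would presumably be `get_object` with a `Range` header, whose end offset is inclusive).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_ranges(start: int, end: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split the half-open byte range [start, end) into chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(pos, min(pos + chunk_size, end)) for pos in range(start, end, chunk_size)]


def fetch_parallel(
    fetch_range: Callable[[int, int], bytes],
    start: int,
    end: int,
    chunk_size: int = 4,
    max_workers: int = 4,
) -> bytes:
    """Fetch [start, end) in parallel chunks and join them in order."""
    ranges = split_ranges(start, end, chunk_size)
    if not ranges:
        return b""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, even if the
        # underlying requests complete out of order.
        parts = executor.map(lambda r: fetch_range(r[0], r[1]), ranges)
        return b"".join(parts)


# In-memory stand-in for a ranged S3 GET (assumption for illustration).
data = bytes(range(256))


def fake_fetch(range_start: int, range_end: int) -> bytes:
    return data[range_start:range_end]


result = fetch_parallel(fake_fetch, 0, len(data), chunk_size=10)
```

Threads fit here because each chunk request is I/O bound, so the work overlaps on network latency without needing an event loop.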

s3fs depends on aiobotocore, and aiobotocore pins botocore to strict version ranges. It therefore seems like a good idea to write my own S3 file system that uses ThreadPoolExecutor, similar to awswrangler's approach, instead of asyncio.
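One possible shape for the file object of such a file system, as a sketch only: a read-only `io.RawIOBase` subclass that delegates every read to a ranged-read callable. `RangeReader` and `fetch_range` are hypothetical names, the object size is assumed to be known up front (for example from a HEAD request), and the example reads from an in-memory payload rather than S3. The callable is kept synchronous here; it could itself be backed by a thread pool.

```python
import io
from typing import Callable


class RangeReader(io.RawIOBase):
    """Read-only, seekable file object backed by a ranged-read callable."""

    def __init__(self, fetch_range: Callable[[int, int], bytes], size: int) -> None:
        super().__init__()
        self._fetch_range = fetch_range  # fetches the half-open range [start, end)
        self._size = size
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer) -> int:
        # Clamp the request to the object size; returning 0 signals EOF.
        end = min(self._pos + len(buffer), self._size)
        if end <= self._pos:
            return 0
        chunk = self._fetch_range(self._pos, end)
        buffer[: len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


# In-memory stand-in for an S3 object (assumption for illustration).
payload = b"a,b\n1,2\n"
raw = RangeReader(lambda start, end: payload[start:end], len(payload))
content = io.BufferedReader(raw, buffer_size=4).read()
```

Implementing only `readinto`, `seek`, and `tell` lets the standard `io.BufferedReader` and `io.TextIOWrapper` layers handle buffering and decoding, so the result can be passed straight to `csv.reader`.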

laughingman7743, Jul 31 '22 17:07