PyAthena
Implement s3fs cursor
A cursor implementation that reads the CSV result files in S3 without using Pandas.
It would also be good to be able to use awsathena+s3fs in SQLAlchemy.
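As a rough sketch of what such a cursor would do internally, the rows can be parsed with the standard csv module from any binary file-like object over the result object (for example, one returned by `s3fs.S3FileSystem.open`). The `iter_rows` helper below is illustrative, not PyAthena's API, and the in-memory `body` stands in for the S3 object, assuming Athena's quoted-CSV output format:

```python
import csv
import io

def iter_rows(fileobj):
    """Yield result rows as tuples from a binary CSV stream, skipping the header."""
    # newline="" is required by the csv module's line handling
    text = io.TextIOWrapper(fileobj, encoding="utf-8", newline="")
    reader = csv.reader(text)
    next(reader, None)  # first line is the column names
    for row in reader:
        yield tuple(row)

# Stand-in for an S3 object body; in practice this would be something like
# s3fs.S3FileSystem().open("s3://bucket/path/result.csv", "rb")
body = io.BytesIO(b'"id","name"\r\n"1","foo"\r\n"2","bar"\r\n')
rows = list(iter_rows(body))
```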
https://github.com/fsspec/s3fs
https://docs.python.org/3/library/csv.html

AbstractFileSystem: https://github.com/fsspec/filesystem_spec/blob/2022.7.1/fsspec/spec.py#L92
AbstractBufferedFile: https://github.com/fsspec/filesystem_spec/blob/2022.7.1/fsspec/spec.py#L1299
S3FileSystem: https://github.com/fsspec/s3fs/blob/2022.7.1/s3fs/core.py#L168
S3File: https://github.com/fsspec/s3fs/blob/2022.7.1/s3fs/core.py#L1822
It appears that awswrangler splits files into smaller chunks and retrieves them in parallel using a ThreadPoolExecutor. https://github.com/awslabs/aws-data-wrangler/blob/2.16.1/awswrangler/s3/_fs.py#L262-L300
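That chunked, parallel approach can be sketched as follows. This is an assumption based on a reading of the linked code, not a copy of it; `get_range` stands in for an S3 GetObject call with a Range header, and the chunk size is illustrative:

```python
import concurrent.futures

def read_parallel(get_range, size, chunk_size, max_workers=4):
    """Fetch bytes [0, size) as chunk_size ranges concurrently, preserving order."""
    ranges = [(start, min(start + chunk_size, size))
              for start in range(0, size, chunk_size)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so the chunks concatenate cleanly
        parts = pool.map(lambda r: get_range(r[0], r[1]), ranges)
        return b"".join(parts)

# In-memory stand-in for s3.get_object(..., Range=f"bytes={start}-{end - 1}")
data = bytes(range(256)) * 10
fetched = read_parallel(lambda s, e: data[s:e], len(data), chunk_size=100)
```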
Since s3fs depends on aiobotocore, and aiobotocore pins botocore to a strict version range, it seems better to implement my own S3 file system on top of a ThreadPoolExecutor, taking a similar approach to awswrangler, rather than relying on asyncio.
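A minimal version of that idea is a file-like object whose reads are served by ranged gets, so it can be handed to `io.BufferedReader` and then to the csv module without asyncio or aiobotocore. The class and names below are a hypothetical sketch, not PyAthena's implementation; `get_range` again stands in for a boto3 GetObject call with a Range header:

```python
import io

class RangedS3File(io.RawIOBase):
    """Read-only, seekable file over an S3 object, fetched via ranged gets."""

    def __init__(self, get_range, size):
        self._get_range = get_range
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, pos, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = pos
        elif whence == io.SEEK_CUR:
            self._pos += pos
        elif whence == io.SEEK_END:
            self._pos = self._size + pos
        return self._pos

    def readinto(self, b):
        # RawIOBase derives read() from readinto(), so this is all we implement
        n = min(len(b), self._size - self._pos)
        if n <= 0:
            return 0
        data = self._get_range(self._pos, self._pos + n)
        b[:len(data)] = data
        self._pos += len(data)
        return len(data)

# In-memory stand-in for the S3 object
backing = b'"a","b"\r\n"1","x"\r\n'
buffered = io.BufferedReader(RangedS3File(lambda s, e: backing[s:e], len(backing)))
first = buffered.read(7)
```

Combining this with the parallel chunk fetching above would give the awswrangler-like behavior without the aiobotocore dependency.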