
Filesystem - AWS Adapter

Open norberttech opened this issue 1 year ago • 8 comments

To support AWS S3 remote storage we should implement the following components first:

  • AWS SDK (covering only authentication and S3 storage operations)
  • Filesystem AWS Bridge (providing an implementation of the Filesystem interface)

The SDK must be designed to work with PSR contracts instead of any specific HTTP client implementation. It should also not bring in any external dependencies unless absolutely critical.

norberttech avatar Jul 05 '24 11:07 norberttech

Good Morning @norberttech :)

I will try to move this task forward, and I would appreciate any help from your side (I believe it won't be a problem since you already support me a lot 😄 ).

As I can see from your description, we MUST implement an AWS SDK, right? Or do you think we can reuse anything from aws-sdk, or should we implement an AWS SDK under the Flow organisation, like azure-sdk? If yes, maybe you could create an aws-sdk repository in the Flow organisation :).

I took a look at azure-sdk and it doesn't look like a small one hehehehe. But could you give me an idea of where I should start? What is the minimum implementation I should focus on to get this working for writing and reading Parquet files?

Maybe you could list what you have in mind for this ticket?

Note: I am using this task instead of Discord because, IMO, this way what we planned will be "documented" :)

Thank you in advance.

eerison avatar Aug 27 '24 07:08 eerison

You also mentioned on Discord that Parquet files won't work well with other filesystems like https://github.com/thephpleague/flysystem?tab=readme-ov-file

Could you explain in more detail what kind of issues you faced?

eerison avatar Aug 27 '24 07:08 eerison

hey @eerison!

> I will try to move this task forward, and I would appreciate any help from your side (I believe it won't be a problem since you already support me a lot 😄 ).

Of course, I'm going to do my best to assist you in this task :)

> Note: I am using this task instead of Discord because, IMO, this way what we planned will be "documented" :)

Perfect! For quick questions Discord is better/faster, but you are absolutely right about keeping it documented somewhere, and GitHub is the best place for it.

AWS SDK

> As I can see from your description, we MUST implement an AWS SDK, right? Or do you think we can reuse anything from aws-sdk, or should we implement an AWS SDK under the Flow organisation, like azure-sdk? If yes, maybe you could create an aws-sdk repository in the Flow organisation :).

So here is the thing: it's not a MUST, but more of a VERY NICE TO HAVE. Why? Because if you take a look at the existing aws-sdk, you will notice that it comes with an insane amount of dependencies for an SDK.

```
php: >=7.2.5
ext-json: *
ext-pcre: *
ext-simplexml: *
aws/aws-crt-php: ^1.2.3
guzzlehttp/guzzle: ^6.5.8 || ^7.4.5
guzzlehttp/promises: ^1.4.0 || ^2.0
guzzlehttp/psr7: ^1.9.1 || ^2.4.5
mtdowling/jmespath.php: ^2.6
psr/http-message: ^1.0 || ^2.0
```

And I'm more into letting users choose their own HTTP client implementation, so I prefer to rely on PSR contracts rather than injecting Guzzle into every project that would like to use ETL.
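For comparison, a PSR-first SDK would only need the contract packages themselves, roughly along these lines (a hypothetical target, not a final composer.json):

```
psr/http-client: ^1.0
psr/http-factory: ^1.0
psr/http-message: ^1.0 || ^2.0
```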

However, the SDK can be replaced, as it is only a technical detail. So if it makes things easier for you, you can start by using an existing AWS SDK, which over time we can replace with our own custom implementation covering only the S3 API.

> You also mentioned on Discord that Parquet files won't work well with other filesystems like https://github.com/thephpleague/flysystem?tab=readme-ov-file Could you explain in more detail what kind of issues you faced?

Parquet pretty much requires seeking, since we need to go to the bottom of the file first, read its metadata, and then seek back to read specific row groups. The Flysystem abstraction opens a regular HTTP stream which does not allow seeking; that's why I had to implement our own filesystem abstraction.
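To make the seeking requirement concrete, here is a minimal sketch of the access pattern (the local file handle is just for illustration; the footer layout in the comments is part of the Parquet format):

```php
<?php

// A Parquet file ends with: [footer metadata][4-byte little-endian footer length]["PAR1"].
$handle = \fopen('example.parquet', 'rb'); // illustrative local file

\fseek($handle, -8, \SEEK_END);                      // 1. jump to the end of the file
$footerLength = \unpack('V', \fread($handle, 4))[1]; // 2. read the footer metadata size
$magic = \fread($handle, 4);                         //    must equal "PAR1"

\fseek($handle, -8 - $footerLength, \SEEK_END);      // 3. seek back to the footer metadata
$metadata = \fread($handle, $footerLength);          //    Thrift-encoded file metadata

// 4. only now do we know the row group offsets, and we seek again to read each group
```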

How to implement an AWS Filesystem for Flow

The Flow filesystem is pretty much split into three parts:

1) Source Stream

In order to implement a source stream, you need to rely on byte-range fetching, which pretty much does what seeking does: reading a specific number of bytes from a given offset.

For inspiration, you can take a look at AzureSourceStream.

Once you get an AmazonS3SourceStream working, you can then create an AmazonS3Filesystem that implements the Filesystem interface.
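A minimal sketch of that byte-range idea against the PSR contracts (class and method names here are illustrative, and AWS Signature V4 signing is omitted):

```php
<?php

use Psr\Http\Client\ClientInterface;
use Psr\Http\Message\RequestFactoryInterface;

final class AmazonS3SourceStream
{
    public function __construct(
        private readonly ClientInterface $client,
        private readonly RequestFactoryInterface $requestFactory,
        private readonly string $uri, // e.g. https://bucket.s3.eu-west-1.amazonaws.com/file.parquet
    ) {
    }

    /**
     * Replaces fseek()+fread(): an HTTP Range request reads $length bytes
     * starting at $offset without downloading the whole object.
     */
    public function read(int $length, int $offset) : string
    {
        $request = $this->requestFactory
            ->createRequest('GET', $this->uri)
            ->withHeader('Range', \sprintf('bytes=%d-%d', $offset, $offset + $length - 1));

        // A real implementation must also sign the request before sending it.
        return (string) $this->client->sendRequest($request)->getBody();
    }
}
```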

2) Filesystem

A SourceStream allows us to read from a file, but a Filesystem provides a few more things; one of the most important is list.

This method allows us to iterate through all files in a folder using a glob pattern.
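A hypothetical sketch of how list could map to S3, where fetching the keys (e.g. via a paginated ListObjectsV2 call) is assumed to happen elsewhere:

```php
<?php

/**
 * Yield all object keys matching a glob pattern.
 *
 * @param iterable<string> $keys object keys under a prefix, e.g. collected
 *                               from paginated ListObjectsV2 responses
 */
function listMatching(iterable $keys, string $pattern) : \Generator
{
    foreach ($keys as $key) {
        if (\fnmatch($pattern, $key)) {
            yield $key;
        }
    }
}

// e.g. listMatching($keys, 'reports/2024/*.parquet')
```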

3) Destination Stream

This one is a bit more tricky. What the Azure DestinationStream does is write to blocks: whenever a block reaches the block size limit, it uploads that block to the cloud while it keeps writing to the next block in memory. Once the whole stream is uploaded (in one or multiple blocks), it closes the connection by submitting the list of all blocks to Azure. From what I remember, a very similar mechanism (multipart upload) exists on S3 storage.
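A rough sketch of that buffering strategy mapped onto S3 Multipart Upload; uploadPart() and completeUpload() are assumed helpers wrapping the UploadPart and CompleteMultipartUpload API calls:

```php
<?php

final class AmazonS3DestinationStream
{
    // S3 requires every part except the last one to be at least 5 MiB.
    private const PART_SIZE = 5 * 1024 * 1024;

    private string $buffer = '';

    /** @var list<string> ETags of uploaded parts, in order */
    private array $parts = [];

    public function append(string $data) : void
    {
        $this->buffer .= $data;

        // Flush full parts to the cloud while we keep writing into memory.
        while (\strlen($this->buffer) >= self::PART_SIZE) {
            $this->parts[] = $this->uploadPart(\substr($this->buffer, 0, self::PART_SIZE));
            $this->buffer = \substr($this->buffer, self::PART_SIZE);
        }
    }

    public function close() : void
    {
        if ($this->buffer !== '') {
            $this->parts[] = $this->uploadPart($this->buffer); // last part may be smaller
        }

        // Submitting the ordered part list finalizes the object, just like
        // submitting the block list does on Azure.
        $this->completeUpload($this->parts);
    }

    private function uploadPart(string $bytes) : string
    {
        // Assumed: sends an UploadPart request and returns the part's ETag.
        throw new \RuntimeException('Sketch only.');
    }

    private function completeUpload(array $parts) : void
    {
        // Assumed: sends CompleteMultipartUpload with the collected ETags.
        throw new \RuntimeException('Sketch only.');
    }
}
```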

> [!TIP]
> I would start with the SourceStream and Filesystem first and only then look at the DestinationStream. We can even merge the SourceStream and Filesystem without a DestinationStream initially.


I would put this under the following path in the monorepo:

- src
  - bridge
    - filesystem
      - aws 

and name it filesystem-aws-bridge.

Summary

So like I mentioned at the beginning, it's fine to start with aws/aws-sdk-php; however, I would later implement our own, lighter version of this SDK anyway, based on PSR contracts rather than on specific implementations.

It's also perfectly fine (maybe even recommended) to work on smaller chunks at a time, and my suggestion would be to:

  1. implement the SourceStream
  2. implement all methods on the Filesystem except those related to writing to a destination
  3. implement the DestinationStream

All 3 steps can even be separate pull requests (but smaller chunks are also fine).

I'm currently on vacation, so my availability is a bit limited, but feel free to reach out on Discord with quick questions; I will try to answer them as soon as possible.

It might seem like a lot, but once you split it into smaller steps and focus on one step at a time, it should turn out to be doable and not that scary anymore.

Also, I want to say that this is a HUGE contribution, for which I and everyone else should be very grateful, so if you manage to implement it, I will make sure it is remembered 💪

norberttech avatar Aug 27 '24 10:08 norberttech

Hey @norberttech

Thank you for the really detailed info.

Do you think it's possible to start with the DestinationStream and already use it? Or can't I make this work with just the destination?

eerison avatar Aug 28 '24 09:08 eerison

> Hey @norberttech
>
> Thank you for the really detailed info.
>
> Do you think it's possible to start with the DestinationStream and already use it? Or can't I make this work with just the destination?

Yeah, you can start with the DestinationStream, but in order to use it you will at least need to cover writeTo/mv/rm, since writers are using those methods with specific saveModes. I recommended starting from reading because it's just easier :)
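If it helps, here is a hypothetical shape of that write-only first iteration, reusing the AmazonS3DestinationStream sketched earlier; the method names follow the comment above, and the real Flow Filesystem interface may differ:

```php
<?php

// Write-only first iteration: only the methods the writers need are covered.
final class WriteOnlyAmazonS3Filesystem
{
    public function writeTo(string $path) : AmazonS3DestinationStream
    {
        // Assumed: starts a multipart upload for $path.
        throw new \RuntimeException('Sketch only.');
    }

    public function mv(string $from, string $to) : void
    {
        // On S3 a move is a CopyObject followed by a DeleteObject.
        throw new \RuntimeException('Sketch only.');
    }

    public function rm(string $path) : void
    {
        // Maps to a single DeleteObject call.
        throw new \RuntimeException('Sketch only.');
    }
}
```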

norberttech avatar Aug 28 '24 21:08 norberttech

hey @eerison, just wanted to check if you need any assistance with this one? Of course no pressure 😊

norberttech avatar Sep 04 '24 15:09 norberttech

Hey @norberttech

Well, I was asked just to upload the file and read it via Athena (an AWS service), so I had to stop this implementation for now 😞.

After I finish my current task, I will try to get back to it.

eerison avatar Sep 04 '24 16:09 eerison

No worries, in case you need any help, feel free to reach out 🙂

norberttech avatar Sep 04 '24 16:09 norberttech

Resolved by #1281

norberttech avatar Dec 28 '24 14:12 norberttech