
No documentation on S3 sink setup

Open · tatianafrank opened this issue 5 years ago • 10 comments

The documentation says you can use S3 as a file sink but gives zero details on how to do so. There is one line linking somewhere else but the link is broken. These are the docs: http://divolte-releases.s3-website-eu-west-1.amazonaws.com/divolte-collector/0.9.0/userdoc/html/configuration.html and this is the broken link: https://wiki.apache.org/hadoop/AmazonS3

tatianafrank · Jul 17 '19 22:07

Divolte doesn't treat S3 any differently from HDFS. This means you can use the built-in support of the HDFS client to access S3 buckets with a particular layout.

Divolte currently ships with Hadoop 3.2.0, so the relevant updated link on AWS integration (including using S3 filesystems) is here: https://hadoop.apache.org/docs/r2.9.2/hadoop-aws/tools/hadoop-aws/index.html

Note that there are now three different S3 client implementations in Hadoop, which all use different layouts on S3. If your aim is to use Divolte just for collection and subsequently use the Avro files on S3 with tools other than Hadoop, s3n or s3a is probably what you want. s3n has been available for a while, whereas s3a is still under development but is intended to be a drop-in replacement for s3n down the line. s3a is mostly aimed at use cases at substantial scale, involving large files that can become a performance issue for s3n.

friso · Jul 18 '19 07:07
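Putting this into practice means pointing Divolte's regular hdfs sink at s3a:// paths. A minimal sketch of a divolte-collector.conf, assuming the 0.9.x HOCON layout; the bucket name and credentials are placeholders:

```
divolte {
  global {
    hdfs {
      enabled = true
      client {
        // Everything under `client` is handed to the Hadoop FileSystem client.
        fs.defaultFS = "s3a://my-avro-bucket"   // placeholder bucket
        fs.s3a.access.key = "ACCESS_KEY"        // placeholder credentials
        fs.s3a.secret.key = "SECRET_KEY"
      }
    }
  }
  sinks {
    hdfs {
      type = hdfs
      file_strategy {
        // The sink writes through the Hadoop FileSystem API, so any supported
        // scheme works for these directories, including s3a://.
        working_dir = "s3a://my-avro-bucket/tmp/working"
        publish_dir = "s3a://my-avro-bucket/tmp/published"
      }
    }
  }
}
```

The comments below cover the extra fs.s3a.* properties and jars that are needed before the s3a:// scheme actually resolves.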

OK, I'm using s3a with the following config:

```
client {
  fs.DEFAULT_FS = "https://s3.us.cloud-object-storage.appdomain.cloud"
  fs.defaultFS = "https://s3.us.cloud-object-storage.appdomain.cloud"
  fs.s3a.bucket.BUCKET_NAME.access.key = ""
  fs.s3a.bucket.BUCKET_NAME.secret.key = ""
  fs.s3a.bucket.BUCKET_NAME.endpoint = "https://s3.us.cloud-object-storage.appdomain.cloud"
}
```

But I'm getting the following error, even though I do have a /tmp/working directory:

```
Path for in-flight AVRO records is not a directory: /tmp/working
```

So I'm guessing it's not properly connecting to S3, since the directory DOES exist. Is something wrong with my config? My S3 provider is not AWS but another cloud provider, so the URL structure is a little different. Am I supposed to set fs.defaultFS to the S3 URL? Where do I set the bucket?

tatianafrank · Jul 18 '19 21:07

I changed my settings to the below and tried s3a, s3n, and s3, and got the same error for each scheme: `org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3a"` (or "s3n"/"s3", respectively).

```
client {
  fs.DEFAULT_FS = "s3a://BUCKET-NAME"
  fs.defaultFS = "s3a://BUCKET-NAME"
  fs.s3a.access.key = ""
  fs.s3a.secret.key = ""
  fs.s3a.endpoint = "https://s3.us.cloud-object-storage.appdomain.cloud"
}
```

tatianafrank · Jul 18 '19 21:07

The libraries might not be shipped with Divolte, and you need some additional settings:

```
fs.s3a.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
```

Depending on your version of Hadoop, you also need these jars:

http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar

krisgeus · Jul 19 '19 08:07
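In Divolte's configuration these Hadoop properties live under divolte.global.hdfs.client, next to the credentials. A sketch, again assuming the 0.9.x layout and with placeholder values for the non-AWS endpoint:

```
divolte.global.hdfs {
  enabled = true
  client {
    fs.defaultFS = "s3a://BUCKET-NAME"            // placeholder
    fs.s3a.access.key = "ACCESS_KEY"              // placeholder
    fs.s3a.secret.key = "SECRET_KEY"              // placeholder
    fs.s3a.endpoint = "s3.example-provider.com"   // placeholder non-AWS endpoint
    // Map the s3a:// and s3:// schemes to the S3A implementation from hadoop-aws:
    fs.s3a.impl = "org.apache.hadoop.fs.s3a.S3AFileSystem"
    fs.s3.impl = "org.apache.hadoop.fs.s3a.S3AFileSystem"
  }
}
```

The "No FileSystem for scheme" error above typically means exactly this: the hadoop-aws jar (plus the matching AWS SDK jar) is not on Divolte's classpath, so the scheme cannot be resolved even when the properties are set.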

I did a quick check with the Divolte Docker image. This is what was needed:

Libraries downloaded and put into /opt/divolte/divolte-collector/lib:

http://central.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.271/aws-java-sdk-bundle-1.11.271.jar
http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.1.1/hadoop-aws-3.1.1.jar

Config:

```
client {
  fs.DEFAULT_FS = "s3a://avro-bucket"
  fs.defaultFS = "s3a://avro-bucket"
  fs.s3a.access.key = foo
  fs.s3a.secret.key = bar
  fs.s3a.endpoint = "s3-server:4563"
  fs.s3a.path.style.access = true
  fs.s3a.connection.ssl.enabled = false
  fs.s3a.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
  fs.s3.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
}
```

Enable HDFS through env vars in docker-compose:

```
DIVOLTE_HDFS_ENABLED: "true"
DIVOLTE_HDFS_SINK_WORKING_DIR: "s3a://avro-bucket/tmp/s3working"
DIVOLTE_HDFS_SINK_PUBLISH_DIR: "s3a://avro-bucket/tmp/s3publish"
```

s3-server is a LocalStack Docker container that mimics S3.

krisgeus · Jul 19 '19 13:07
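For anyone not using the Docker image's environment variables, the same working example can be written directly in divolte-collector.conf. A sketch under the assumptions above (LocalStack at s3-server:4563, bucket avro-bucket, dummy credentials) and the 0.9.x config layout:

```
divolte {
  global {
    hdfs {
      enabled = true                                      // DIVOLTE_HDFS_ENABLED
      client {
        fs.defaultFS = "s3a://avro-bucket"
        fs.s3a.access.key = "foo"                         // dummy credentials for LocalStack
        fs.s3a.secret.key = "bar"
        fs.s3a.endpoint = "s3-server:4563"
        fs.s3a.path.style.access = true                   // typically needed for non-AWS endpoints
        fs.s3a.connection.ssl.enabled = false
        fs.s3a.impl = "org.apache.hadoop.fs.s3a.S3AFileSystem"
        fs.s3.impl = "org.apache.hadoop.fs.s3a.S3AFileSystem"
      }
    }
  }
  sinks {
    hdfs {
      type = hdfs
      file_strategy {
        working_dir = "s3a://avro-bucket/tmp/s3working"   // DIVOLTE_HDFS_SINK_WORKING_DIR
        publish_dir = "s3a://avro-bucket/tmp/s3publish"   // DIVOLTE_HDFS_SINK_PUBLISH_DIR
      }
    }
  }
}
```

Either way, the two jars still need to be present in /opt/divolte/divolte-collector/lib.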

Oh, and make sure the bucket is available and the tmp/s3working and tmp/s3publish keys are present. (A directory-exists check is done, so adding a file to the bucket with the correct key prefix fools the HDFS client.)

krisgeus · Jul 19 '19 13:07

Thanks for looking into this @krisgeus. I'm just a little confused about something: I'm trying to use S3 instead of HDFS, so why do I need HDFS to be running for this to work?

tatianafrank · Jul 26 '19 19:07

I did everything you listed above and it's not working. I got an error about a missing hadoop.tmp.dir var, so I added that, and now there's no error, but there are also no files being added to S3. Since there's no error, I'm not sure what the issue is.

tatianafrank · Jul 26 '19 21:07

Sorry for the late response (holiday season). Without an error I cannot help you out either. With the steps provided above I managed to create a working example based on the Divolte Docker image.

krisgeus · Aug 13 '19 12:08

> I did a quick check with the Divolte Docker image. This is what was needed: […]

Been trying this, but I keep getting an `Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found` error. Do I need to install the complete Hadoop application as well, or am I missing something else?

Edit: it seems the libraries are very particular about the versions you use. Solution: https://hadoop.apache.org/docs/r3.3.1/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html

rakzcs · Jan 24 '22 09:01