
EMR Serverless Samples

This repository contains example code for getting started with EMR Serverless and using it with Apache Spark and Apache Hive.

In addition, it provides container images for the Spark History Server and the Tez UI to help you debug your jobs.

For full details about using EMR Serverless, please see the EMR Serverless documentation.

Prerequisites

These demos assume you are using an Administrator-level role in your AWS account.

  1. Amazon EMR Serverless is now generally available! Check out the console to get started with EMR Serverless.

  2. Create an Amazon S3 bucket in the region where you want to use EMR Serverless (we'll assume us-east-1).

aws s3 mb s3://BUCKET-NAME --region us-east-1
  3. Create an EMR Serverless execution role (replacing BUCKET-NAME with the one you created above).

This role provides S3 access to specific buckets as well as read and write access to the Glue Data Catalog.

aws iam create-role --role-name emr-serverless-job-role --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "emr-serverless.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }'

aws iam put-role-policy --role-name emr-serverless-job-role --policy-name S3Access --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadFromOutputAndInputBuckets",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::noaa-gsod-pds",
                "arn:aws:s3:::noaa-gsod-pds/*",
                "arn:aws:s3:::BUCKET-NAME",
                "arn:aws:s3:::BUCKET-NAME/*"
            ]
        },
        {
            "Sid": "WriteToOutputDataBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET-NAME/*"
            ]
        }
    ]
}'

aws iam put-role-policy --role-name emr-serverless-job-role --policy-name GlueAccess --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "GlueCreateAndReadDataCatalog",
        "Effect": "Allow",
        "Action": [
            "glue:GetDatabase",
            "glue:GetDatabases",
            "glue:CreateTable",
            "glue:GetTable",
            "glue:GetTables",
            "glue:GetPartition",
            "glue:GetPartitions",
            "glue:CreatePartition",
            "glue:BatchCreatePartition",
            "glue:GetUserDefinedFunctions"
        ],
        "Resource": ["*"]
      }
    ]
  }'
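The policy documents above require substituting BUCKET-NAME by hand. If you are scripting the setup, a small helper can render the S3 policy with the bucket name filled in. This is a sketch mirroring the S3Access policy above, not part of these samples:

```python
import json


def s3_access_policy(bucket: str) -> str:
    """Render the S3Access policy above with the bucket name substituted."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadFromOutputAndInputBuckets",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::noaa-gsod-pds",
                    "arn:aws:s3:::noaa-gsod-pds/*",
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            },
            {
                "Sid": "WriteToOutputDataBucket",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    })


# The rendered document can then be passed to `aws iam put-role-policy
# --policy-document ...` or to boto3's iam.put_role_policy(PolicyDocument=...).
```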

Now you're ready to go! Check out the examples below.

Examples

  • CloudFormation Templates

    Sample templates for creating an EMR Serverless application as well as various dependencies.

  • CloudWatch Dashboard Template

    Template for creating a CloudWatch Dashboard for monitoring your EMR Serverless application.

  • CDK Examples

    Examples of building EMR Serverless environments with Amazon CDK.

  • Airflow Operator

    Sample DAGs and preview version of the Airflow Operator. Check the releases page for updates.

  • EMR Serverless PySpark job

    This sample script shows how to use EMR Serverless to run a PySpark job that analyzes data from the open NOAA Global Surface Summary of Day dataset.

  • Python Dependencies

    Shows how to package Python dependencies (Great Expectations) using a Virtualenv and venv-pack.

  • Custom Python version

    Shows how to use a different Python version than the default (3.7.10) provided by EMR Serverless.

  • Genomics analysis using Glow

    This sample shows how to use EMR Serverless to combine both Python and Java dependencies in order to run genomic analysis using Glow and 1000 Genomes.

  • EMR Serverless Hive query

    This sample script shows how to use Hive in EMR Serverless to query the same NOAA data.

SDK Usage

You can call EMR Serverless APIs using standard AWS SDKs. The examples below show how to do this.

  • EMR Serverless boto3 example
  • EMR Serverless Java SDK example
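To sketch the overall boto3 flow (this is not the linked example itself; the application name, release label, and entry point below are illustrative assumptions):

```python
def run_spark_job(client, role_arn, entry_point, log_bucket):
    """Create an EMR Serverless application, start a Spark job run,
    and return the job run id.

    `client` is a boto3 "emr-serverless" client; passing it in as an
    argument also lets you substitute a stub when testing offline.
    """
    app = client.create_application(
        name="sample-spark-app",   # illustrative name
        releaseLabel="emr-6.6.0",  # pick a current release label
        type="SPARK",
    )
    app_id = app["applicationId"]

    # The application must be started before it can accept job runs.
    client.start_application(applicationId=app_id)

    run = client.start_job_run(
        applicationId=app_id,
        executionRoleArn=role_arn,
        jobDriver={"sparkSubmit": {"entryPoint": entry_point}},
        configurationOverrides={
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {
                    "logUri": f"s3://{log_bucket}/logs/"
                }
            }
        },
    )
    return run["jobRunId"]


# With AWS credentials configured, the real call would look like:
# import boto3
# job_id = run_spark_job(
#     boto3.client("emr-serverless", region_name="us-east-1"),
#     "arn:aws:iam::123456789012:role/emr-serverless-job-role",
#     "s3://BUCKET-NAME/scripts/script.py",
#     "BUCKET-NAME",
# )
```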

Utilities

The following UIs are available in the EMR Serverless console, but you can still use them locally if you wish.

  • Spark UI - Use this Dockerfile to run the Spark History Server in a container.

  • Tez UI - Use this Dockerfile to run the Tez UI and Application Timeline Server in a container.

Other Resources

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.