Feature Request: Validate DagNode Type enum members in Glue CreateScript

Open MrGossett opened this issue 5 years ago • 6 comments

aws glue create-script will generate a script to use for a Glue Job given a description of DAG nodes and the edges between them. Nodes can be sources, sinks, or transforms.

API Docs for the CodeGenNode structure show that its NodeType attribute is required, and that it's a UTF-8 string. The description says "The type of node that this is."

aws glue create-script help reinforces this:

       --dag-nodes (list)
          A list of the nodes in the DAG.

       Shorthand Syntax:

          Id=string,NodeType=string,Args=[{Name=string,Value=string,Param=boolean},{Name=string,Value=string,Param=boolean}],LineNumber=integer ...

       JSON Syntax:

          [
            {
              "Id": "string",
              "NodeType": "string",
              "Args": [
                {
                  "Name": "string",
                  "Value": "string",
                  "Param": true|false
                }
                ...
              ],
              "LineNumber": integer
            }
            ...
          ]

However, I can't find anywhere in the docs or in CLI help the list of supported values for NodeType.

Here is a JSON file describing the input to aws glue create-script:

{
  "DagNodes": [
    {
      "Id": "source",
      "NodeType": "DataSource",
      "Args": [
        { "Name": "database", "Value": "\"MyTestDatabase\"" },
        { "Name": "table_name", "Value": "\"MyTestTableSource\"" }
      ]
    },
    {
      "Id": "transform",
      "NodeType": "ResolveChoice",
      "Args": [{ "Name": "specs", "Value": "[('amount_due', 'cast:double')]" }]
    },
    {
      "Id": "sink",
      "NodeType": "DataSink",
      "Args": [
        { "Name": "database", "Value": "\"MyTestDatabase\"" },
        { "Name": "table_name", "Value": "\"MyTestTableSink\"" }
      ]
    }
  ],
  "DagEdges": [
    { "Source": "source", "Target": "transform" },
    { "Source": "transform", "Target": "sink" }
  ],
  "Language": "PYTHON"
}

Generating a script using that JSON input is successful:

$ aws glue create-script --cli-input-json file://input.json
{
    "PythonScript": "import sys\nfrom awsglue.transforms import *\nfrom awsglue.utils import getResolvedOptions\nfrom pyspark.context import SparkContext\nfrom awsglue.context import GlueContext\nfrom awsglue.job import Job\n\n## @params: [JOB_NAME]\nargs = getResolvedOptions(sys.argv, ['JOB_NAME'])\n\nsc = SparkContext()\nglueContext = GlueContext(sc)\nspark = glueContext.spark_session\njob = Job(glueContext)\njob.init(args['JOB_NAME'], args)\n## @type: DataSource\n## @args: [database = \"MyTestDatabase\", table_name = \"MyTestTableSource\", transformation_ctx = \"source\"]\n## @return: source\n## @inputs: []\nsource = glueContext.create_dynamic_frame.from_catalog(database = \"MyTestDatabase\", table_name = \"MyTestTableSource\", transformation_ctx = \"source\")\n## @type: ResolveChoice\n## @args: [specs = [('amount_due', 'cast:double')], transformation_ctx = \"transform\"]\n## @return: transform\n## @inputs: [frame = source]\ntransform = ResolveChoice.apply(frame = source, specs = [('amount_due', 'cast:double')], transformation_ctx = \"transform\")\n## @type: DataSink\n## @args: [database = \"MyTestDatabase\", table_name = \"MyTestTableSink\", transformation_ctx = \"sink\"]\n## @return: sink\n## @inputs: [frame = transform]\nsink = glueContext.write_dynamic_frame.from_catalog(frame = transform, database = \"MyTestDatabase\", table_name = \"MyTestTableSink\", transformation_ctx = \"sink\")\njob.commit()"
}
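
The script comes back as a single JSON string with escaped newlines, which is hard to read in the terminal. A minimal Python sketch for unpacking it, assuming the CLI's JSON output has been captured as a string (the helper name extract_script is hypothetical, not part of any AWS tooling):

```python
import json


def extract_script(cli_output: str) -> str:
    """Return the generated script text from create-script's JSON output."""
    response = json.loads(cli_output)
    # "PythonScript" is the key shown in the output above.
    return response["PythonScript"]


# Example: decode a (shortened) response and print it with real newlines.
sample = '{"PythonScript": "import sys\\njob.commit()"}'
print(extract_script(sample))
```

The same function could be fed the captured output of aws glue create-script and its result written to a .py file.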

However, if I change the transformation from ResolveChoice to Map, I get an error.

Here is the updated input.json:

{
  "DagNodes": [
    {
      "Id": "source",
      "NodeType": "DataSource",
      "Args": [
        { "Name": "database", "Value": "\"MyTestDatabase\"" },
        { "Name": "table_name", "Value": "\"MyTestTableSource\"" }
      ]
    },
    {
      "Id": "transform",
      "NodeType": "Map",
      "Args": [{ "Name": "f", "Value": "my_custom_function" }]
    },
    {
      "Id": "sink",
      "NodeType": "DataSink",
      "Args": [
        { "Name": "database", "Value": "\"MyTestDatabase\"" },
        { "Name": "table_name", "Value": "\"MyTestTableSink\"" }
      ]
    }
  ],
  "DagEdges": [
    { "Source": "source", "Target": "transform" },
    { "Source": "transform", "Target": "sink" }
  ],
  "Language": "PYTHON"
}

Notice the only thing that has changed is the definition of the transform node.

The create-script action now returns an error:

$ aws glue create-script --cli-input-json file://input.json

An error occurred (InvalidInputException) when calling the CreateScript operation: Unknown NodeType Map in GenerateCode

Apparently Map is not supported, but ResolveChoice is supported.

It would be very helpful if the documentation listed which transforms the aws glue create-script action supports.

MrGossett avatar Nov 20 '19 21:11 MrGossett

I ran a brute-force search, updating the transform node in my example input above with each of the transforms listed in the PySpark Transforms section of the Glue docs. Here are my results:

Transform               Result
ApplyMapping            supported ✅
DropFields              supported ✅
DropNullFields          supported ✅
ErrorsAsDynamicFrame    unsupported ❌
Filter                  unsupported ❌
FlatMap                 unsupported ❌
Join                    supported ✅
Map                     unsupported ❌
MapToCollection         unsupported ❌
Relationalize           supported ✅
RenameField             supported ✅
ResolveChoice           supported ✅
SelectFields            supported ✅
SelectFromCollection    unsupported ❌
Spigot                  supported ✅
SplitFields             supported ✅
SplitRows               supported ✅
Unbox                   supported ✅
UnnestFrame             unsupported ❌
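
For anyone who wants to repeat this search, here is a rough sketch of how it could be automated. It is not the exact procedure used for the results above. Assumptions: create_script is any callable that accepts the CreateScript parameters as keyword arguments (for example boto3.client("glue").create_script), the transform node is sent with an empty Args list, and an error whose message contains "Unknown NodeType" is treated as "unsupported". Any other failure is reported as inconclusive, since a supported transform may still be rejected for missing arguments. The function names are hypothetical.

```python
def build_request(node_type, transform_args=()):
    """Build a source -> transform -> sink input with the given NodeType."""
    def catalog_args(table):
        return [
            {"Name": "database", "Value": '"MyTestDatabase"'},
            {"Name": "table_name", "Value": f'"{table}"'},
        ]

    return {
        "DagNodes": [
            {"Id": "source", "NodeType": "DataSource",
             "Args": catalog_args("MyTestTableSource")},
            {"Id": "transform", "NodeType": node_type,
             "Args": list(transform_args)},
            {"Id": "sink", "NodeType": "DataSink",
             "Args": catalog_args("MyTestTableSink")},
        ],
        "DagEdges": [
            {"Source": "source", "Target": "transform"},
            {"Source": "transform", "Target": "sink"},
        ],
        "Language": "PYTHON",
    }


def probe_node_types(create_script, node_types):
    """Call create_script once per NodeType and classify each outcome."""
    results = {}
    for node_type in node_types:
        try:
            create_script(**build_request(node_type))
        except Exception as exc:
            # botocore's ClientError carries the service error in exc.response.
            error = getattr(exc, "response", {}).get("Error", {})
            if "Unknown NodeType" in error.get("Message", ""):
                results[node_type] = "unsupported"
            else:
                results[node_type] = f"inconclusive: {exc}"
        else:
            results[node_type] = "supported"
    return results
```

With credentials configured, a call might look like probe_node_types(boto3.client("glue").create_script, ["Map", "ResolveChoice"]).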

MrGossett avatar Nov 20 '19 22:11 MrGossett

@MrGossett, I've filed an internal ticket to pass this request on to the Glue doc writing team. They own the content that ends up in this particular CLI description. I'll ask them to flesh out the meaning of the NodeType element. Thanks for the feedback!

(V156194273)

bisdavid avatar Nov 20 '19 22:11 bisdavid

@bisdavid any idea if an update to the Glue docs is planned?

MrGossett avatar Mar 17 '20 21:03 MrGossett

Hi @MrGossett, I confirmed that the Glue team is aware of the issue, but no ETA as to when it will be changed.

kdaily avatar Sep 23 '20 21:09 kdaily

I ran into the same issue. This works for me:

{
  "DagNodes": [
    {
      "Id": "DataSource0",
      "NodeType": "DataSource",
      "Args": [
        { "Name": "database", "Value": "mydatabase_source" },
        { "Name": "table_name", "Value": "mytable_source" },
        { "Name": "transformation_ctx", "Value": "DataSource0" }
      ]
    },
    {
      "Id": "Transform1",
      "NodeType": "CustomCode",
      "Args": [
        { "Name": "code", "Value": "pass" },
        { "Name": "className", "Value": "MyTransform" },
        { "Name": "dynamicFrameConstruction", "Value": "DynamicFrameCollection{\"DataSource0\":DataSource0}" },
        { "Name": "classification", "Value": "Transform" },
        { "Name": "dfc", "Value": "Transform1" },
        { "Name": "transformation_ctx", "Value": "Transform1" }
      ]
    },
    {
      "Id": "Transform0",
      "NodeType": "SelectFromCollection",
      "Args": [
        { "Name": "key", "Value": "list(Transform1.keys())[0]" },
        { "Name": "transformation_ctx", "Value": "Transform0" }
      ]
    },
    {
      "Id": "DataSink0",
      "NodeType": "DataSink",
      "Args": [
        { "Name": "database", "Value": "mydatabase_sink" },
        { "Name": "table_name", "Value": "mytable_sink" },
        { "Name": "transformation_ctx", "Value": "DataSink0" }
      ]
    }
  ],
  "DagEdges": [
    { "Source": "DataSource0", "Target": "Transform1" },
    { "Source": "Transform1", "Target": "Transform0" },
    { "Source": "Transform0", "Target": "DataSink0" }
  ],
  "Language": "PYTHON"
}
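
With a longer DAG like this one, it is easy to mistype an Id in DagEdges. A small local sanity check can catch that before the input is sent to create-script. This is only a sketch: the function name check_dag is hypothetical, and it says nothing about what the Glue service itself validates.

```python
def check_dag(request):
    """Return a list of problems found in a create-script input dict."""
    problems = []
    seen = set()
    for node in request["DagNodes"]:
        node_id = node["Id"]
        if node_id in seen:
            problems.append(f"duplicate node Id: {node_id}")
        seen.add(node_id)
    for edge in request["DagEdges"]:
        for end in ("Source", "Target"):
            # Every edge endpoint should name a node defined in DagNodes.
            if edge[end] not in seen:
                problems.append(
                    f"edge {end} refers to unknown node Id: {edge[end]}"
                )
    return problems
```

The input could be loaded with json.load from input.json; an empty list means every edge points at a defined node.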

vivshri avatar Aug 08 '21 18:08 vivshri

Greetings! It looks like this issue hasn’t been active in longer than one year. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.

github-actions[bot] avatar Aug 09 '22 20:08 github-actions[bot]