aws-serverless-data-lake-framework

small change to legislators to support NDJSON

Open mariandumitrascu opened this issue 2 years ago • 0 comments

Dear SDLF Team,

I recently came across SDLF and I love the concept and the architecture. I have 3 items about the "Testing the Framework" section of the Serverless Data Lake Workshop.

  1. The provided code in the legislators pipeline expects each input file to be a single, fully valid JSON document, that is, an array of dictionaries, while the original example from the AWS Glue documentation (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html) deals with NDJSON, that is, one JSON object per line (newline being the separator, with no comma between objects). In the real world, NDJSON is the most common case. For example, AWS Connect log files are NDJSON.

  2. One of the input files in ./sdlf-utils/pipeline-examples/legislators/data, namely regions.json, is corrupted, and I understand this is by design. This is explained in closed issue #28. However, I think this should be stated more explicitly in a README file associated with the example.

  3. Can I open a pull request to modify the legislators example so that it handles NDJSON files and detects which type of JSON it has been given? It looks like this project is somewhat inactive. I would like to be active in it and also make some additions, such as more examples and a UI, in the near future.
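
To illustrate the detection mentioned in item 3, here is a minimal sketch that uses only the Python standard library. It assumes the file content has already been read into a string. `load_records` is a hypothetical helper name and is not part of SDLF or the legislators example.

```python
import json


def load_records(text):
    """Return a list of records from either a JSON array or NDJSON text.

    Hypothetical helper for illustration only; not part of SDLF.
    """
    try:
        # A fully valid JSON document parses in one call.
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: treat it as NDJSON, one object per line.
        # A truly corrupted line still raises json.JSONDecodeError here.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A single document is either an array of records or one lone record.
    return parsed if isinstance(parsed, list) else [parsed]


array_text = '[{"id": 1}, {"id": 2}]'
ndjson_text = '{"id": 1}\n{"id": 2}\n'

# Both layouts should yield the same list of dictionaries.
print(load_records(array_text) == load_records(ndjson_text))
```

With this approach a deliberately corrupted file such as regions.json would still fail to parse, so the error-handling path of the example would remain testable.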

Thank you, -Marian

mariandumitrascu, Jun 08 '22 21:06