
Native Grok Reader Implementation

Open bangtim opened this issue 8 months ago • 3 comments

Description

Native reader implementation for Grok format.

This PR implements a GrokDeserializer and ports over the entire Java Grok library (Athena depends on release 0.1.4, with some minor bug fixes and changes to support the date data type).

The Java Grok library can be found here: https://github.com/thekrakken/java-grok/tree/grok-0.1.4

  • The library includes an API that lets us parse logs, as well as some basic unit tests

Questions/concerns:

  • ~~One thing to pay attention to is the LICENSE~~
  • ~~The header is different, thus the build fails (with the same header as other files, the build succeeds locally) - How should we make sure the header properly cites the authors/contributors of the open source grok library? cc: @martint~~
  • ~~What should the getHiveSerDeClassNames value be?~~

The implementation of the reader (everything aside from the Java Grok library) was done in the following files:

  • trino-hive-formats module:
    • GrokDeserializer + GrokDeserializerFactory --> our implementation of the Deserializer
      • Very similar to regex
    • TestGrokFormat --> some additional unit tests + tests against examples found in the Athena docs (reading line, following format of other native reader tests)
    • pom.xml
  • trino-hive module:
    • HiveModule
    • HiveClassNames
    • HiveMetadata
    • HiveStorageFormat
    • HiveTableProperties
    • GrokFileWriterFactory
    • GrokPageSourceFactory
    • BaseHiveConnectorTest
    • HiveTestUtils
    • TestGrokTable
    • TestHivePageSink
    • pom.xml

Example of how it's used

Say we have a log file that looks like so:

55.3.244.1 GET /index.html 15824 10
10.0.0.15 POST /login.php 2341 15
144.76.92.155 GET /downloads/file.zip 234567 45
55.3.244.1 GET /index.html  10
209.85.231.104 POST /checkout 12345 22

And we create a table:

CREATE TABLE test_grok_table (
    client VARCHAR,
    method VARCHAR,
    request VARCHAR,
    bytes BIGINT,
    duration BIGINT)
WITH (
    format = 'grok',
    grok_input_format = '%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}',
    external_location = '<enter location of log file>')
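The grok_input_format value above is a sequence of %{PATTERN:field} references. As a rough illustration of how such an expression can be turned into a regular expression with one named capture group per field, here is a minimal sketch. It is not the ported library's code: the class name `GrokExpansionSketch`, the `expand` helper, and the four simplified pattern definitions are assumptions made for this example, and the real IP, WORD, URIPATHPARAM and NUMBER definitions are considerably more thorough.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrokExpansionSketch {
    // Simplified stand-ins for the library's pattern definitions.
    // These are assumptions for illustration, not the real definitions.
    static final Map<String, String> DEFINITIONS = new LinkedHashMap<>();

    static {
        DEFINITIONS.put("IP", "\\d{1,3}(?:\\.\\d{1,3}){3}");
        DEFINITIONS.put("WORD", "\\w+");
        DEFINITIONS.put("URIPATHPARAM", "/\\S*");
        DEFINITIONS.put("NUMBER", "-?\\d+(?:\\.\\d+)?");
    }

    // Matches %{PATTERN:field}. The field name is restricted to what Java
    // accepts as a named-group identifier (letters and digits only).
    private static final Pattern REFERENCE =
            Pattern.compile("%\\{(\\w+):([A-Za-z][A-Za-z0-9]*)\\}");

    // Replaces every %{PATTERN:field} with (?<field>definition-of-PATTERN).
    // Text between references (here, single spaces) is kept as-is.
    static String expand(String grokExpression) {
        Matcher matcher = REFERENCE.matcher(grokExpression);
        StringBuilder regex = new StringBuilder();
        while (matcher.find()) {
            String definition = DEFINITIONS.get(matcher.group(1));
            if (definition == null) {
                throw new IllegalArgumentException("Unknown pattern: " + matcher.group(1));
            }
            String group = "(?<" + matcher.group(2) + ">" + definition + ")";
            matcher.appendReplacement(regex, Matcher.quoteReplacement(group));
        }
        matcher.appendTail(regex);
        return regex.toString();
    }

    public static void main(String[] args) {
        String format = "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} "
                + "%{NUMBER:bytes} %{NUMBER:duration}";
        Pattern compiled = Pattern.compile(expand(format));
        Matcher line = compiled.matcher("10.0.0.15 POST /login.php 2341 15");
        if (line.matches()) {
            System.out.println(line.group("client") + " -> " + line.group("request"));
        }
    }
}
```

Each field name in the expression lines up with a column name in the table definition, which is how a captured value can be routed to its column.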

If we run SELECT * on this table, we should expect the following results:

| client         | method | request              | bytes  | duration |
|----------------|--------|----------------------|--------|----------|
| 55.3.244.1     | GET    | /index.html          | 15824  | 10       |
| 10.0.0.15      | POST   | /login.php           | 2341   | 15       |
| 144.76.92.155  | GET    | /downloads/file.zip  | 234567 | 45       |
| null           | null   | null                 | null   | null     | **
| 209.85.231.104 | POST   | /checkout            | 12345  | 22       |

** Notice row 4: the log line is missing the bytes value, so it doesn't match the grok_input_format and null is returned for every column.
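The all-or-nothing behaviour in the table can be sketched with plain java.util.regex. This is only an illustration under stated assumptions, not the GrokDeserializer from this PR: the class `GrokRowSketch`, the `toRow` helper, and the hand-written regex are hypothetical, and the sketch assumes the whole line must match and that the two numeric fields are non-negative integers small enough to fit in a long.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrokRowSketch {
    // Hand-written approximation of the example grok_input_format.
    // The sub-patterns are simplified assumptions, not the library's definitions.
    private static final Pattern LINE = Pattern.compile(
            "(?<client>\\d{1,3}(?:\\.\\d{1,3}){3}) "
                    + "(?<method>\\w+) "
                    + "(?<request>/\\S*) "
                    + "(?<bytes>\\d+) "
                    + "(?<duration>\\d+)");

    // Returns one value per column: client, method, request, bytes, duration.
    // If the line does not match as a whole, every column stays null.
    static Object[] toRow(String line) {
        Object[] row = new Object[5];
        Matcher matcher = LINE.matcher(line);
        if (!matcher.matches()) {
            return row;
        }
        row[0] = matcher.group("client");
        row[1] = matcher.group("method");
        row[2] = matcher.group("request");
        // BIGINT columns: convert the captured text to a long.
        row[3] = Long.valueOf(matcher.group("bytes"));
        row[4] = Long.valueOf(matcher.group("duration"));
        return row;
    }

    public static void main(String[] args) {
        String[] lines = {
                "55.3.244.1 GET /index.html 15824 10",
                "55.3.244.1 GET /index.html  10", // bytes value is missing
        };
        for (String line : lines) {
            System.out.println(Arrays.toString(toRow(line)));
        }
    }
}
```

Returning a row of nulls rather than failing the query keeps a single malformed log line from breaking a scan over an otherwise valid file.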

Additional context and related issues

Athena supports the GrokSerde, and this is a bug-for-bug implementation of what Athena currently has.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Section
* Add native Grok file format reader. ({issue}`25205`)

bangtim, Mar 03 '25 19:03