Native Grok Reader Implementation
Description
Native reader implementation for the Grok format.
This PR implements a GrokDeserializer and ports over the entire Java Grok library. Athena depends on release 0.1.4 of that library, with some minor bug fixes and changes to support the date data type.
The Java Grok library can be found here: https://github.com/thekrakken/java-grok/tree/grok-0.1.4
- The library includes an API that allows us to parse logs, as well as some basic unit tests
Questions/concerns:
- ~~One thing to pay attention to is the LICENSE~~
- ~~The header is different, thus the build fails (with the same header as other files, the build succeeds locally) - How should we make sure the header is properly citing the authors/contributors of the open source grok library? cc: @martint~~
- ~~What should the `getHiveSerDeClassNames` value be?~~
The implementation (everything aside from the Java Grok library) for the reader was done in the following files:
- `trino-hive-formats` module:
  - `GrokDeserializer` + `GrokDeserializerFactory` --> our implementation of the Deserializer
    - Very similar to regex
  - `TestGrokFormat` --> some additional unit tests + tests against examples found in the Athena docs (reading line, following format of other native reader tests)
  - `pom.xml`
- `trino-hive` module:
  - `HiveModule`
  - `HiveClassNames`
  - `HiveMetadata`
  - `HiveStorageFormat`
  - `HiveTableProperties`
  - `GrokFileWriterFactory`
  - `GrokPageSourceFactory`
  - `BaseHiveConnectorTest`
  - `HiveTestUtils`
  - `TestGrokTable`
  - `TestHivePageSink`
  - `pom.xml`
Example of how it's used
Say we have a log file that looks like so:
55.3.244.1 GET /index.html 15824 10
10.0.0.15 POST /login.php 2341 15
144.76.92.155 GET /downloads/file.zip 234567 45
55.3.244.1 GET /index.html 10
209.85.231.104 POST /checkout 12345 22
And we create a table:
CREATE TABLE test_grok_table (
client VARCHAR,
method VARCHAR,
request VARCHAR,
bytes BIGINT,
duration BIGINT)
WITH (
format = 'grok',
grok_input_format = '%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}',
external_location = '<enter location of log file>')
If we run a SELECT * on this table we should expect the results:
| client | method | request | bytes | duration |
|----------------|--------|----------------------|--------|----------|
| 55.3.244.1 | GET | /index.html | 15824 | 10 |
| 10.0.0.15 | POST | /login.php | 2341 | 15 |
| 144.76.92.155 | GET | /downloads/file.zip | 234567 | 45 |
| null | null | null | null | null | **
| 209.85.231.104 | POST | /checkout | 12345 | 22 |
** Notice row 4: the log line doesn't match the `grok_input_format` (it has only one numeric field instead of two), so null is returned for every column.
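To make the matching behavior above concrete, here is a minimal, self-contained sketch of how a grok expression can be expanded into a regular expression with named groups and applied line by line. It is not the code in this PR and does not use the Java Grok library. The class name `GrokSketch`, the four simplified pattern definitions, and the choice to require a full-line match are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the GrokDeserializer from this PR.
class GrokSketch {
    // Simplified stand-ins for the library's pattern dictionary (assumptions,
    // much looser than the real IP/WORD/URIPATHPARAM/NUMBER definitions).
    static final Map<String, String> PATTERNS = Map.of(
            "IP", "\\d{1,3}(?:\\.\\d{1,3}){3}",
            "WORD", "\\w+",
            "URIPATHPARAM", "/\\S*",
            "NUMBER", "-?\\d+(?:\\.\\d+)?");

    // Finds %{NAME:field} references inside a grok expression.
    static final Pattern REFERENCE = Pattern.compile("%\\{(\\w+):(\\w+)\\}");

    final List<String> fields = new ArrayList<>();
    final Pattern regex;

    GrokSketch(String grokExpression) {
        Matcher m = REFERENCE.matcher(grokExpression);
        StringBuilder expanded = new StringBuilder();
        while (m.find()) {
            String definition = PATTERNS.get(m.group(1));
            if (definition == null) {
                throw new IllegalArgumentException("Unknown pattern: " + m.group(1));
            }
            fields.add(m.group(2));
            // Replace %{NAME:field} with a named capturing group (?<field>...).
            String group = "(?<" + m.group(2) + ">" + definition + ")";
            m.appendReplacement(expanded, Matcher.quoteReplacement(group));
        }
        m.appendTail(expanded);
        regex = Pattern.compile(expanded.toString());
    }

    // Returns one value per field, in declaration order.
    // Assumption: a line that does not match as a whole yields null for every field.
    Map<String, String> parse(String line) {
        Matcher m = regex.matcher(line);
        boolean matched = m.matches();
        Map<String, String> row = new LinkedHashMap<>();
        for (String field : fields) {
            row.put(field, matched ? m.group(field) : null);
        }
        return row;
    }

    public static void main(String[] args) {
        GrokSketch grok = new GrokSketch(
                "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}");
        System.out.println(grok.parse("55.3.244.1 GET /index.html 15824 10"));
        // Only one numeric field, so the line does not match and every value is null.
        System.out.println(grok.parse("55.3.244.1 GET /index.html 10"));
    }
}
```

In this sketch every captured value stays a string. Converting `bytes` and `duration` to BIGINT would be a separate step driven by the table's column types.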
Additional context and related issues
Athena supports the GrokSerde, and this is a bug-for-bug implementation of what Athena currently has.
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:
## Section
* Add native Grok file format reader. ({issue}`25205`)