snowplow-rdb-loader
Stores Snowplow enriched events in Redshift, Snowflake and Databricks
The JSON for the Iglu resolver needs to be completely "rendered" when the loader starts; it can't contain any environment variables (e.g. `API_KEY`), which is an issue when defining a task...
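One possible workaround is to render the resolver JSON before the loader starts. The sketch below is only an illustration, not part of the loader: the `${API_KEY}` placeholder syntax, the `apikey` field and the `render_resolver` helper are all assumptions made for the example.

```python
import json
from string import Template


def render_resolver(template_text: str, env: dict) -> str:
    """Substitute ${VAR} placeholders and check that the result is valid JSON.

    Hypothetical helper: the placeholder syntax is an assumption, not
    something the loader itself understands.
    """
    # Template.substitute raises KeyError when a placeholder has no value,
    # so a missing variable fails before the loader is ever started.
    rendered = Template(template_text).substitute(env)
    # json.loads raises ValueError if substitution produced invalid JSON.
    json.loads(rendered)
    return rendered


if __name__ == "__main__":
    template = '{"apikey": "${API_KEY}"}'
    print(render_resolver(template, {"API_KEY": "example-key"}))
```

In practice `env` would be `os.environ`, and the rendered text would be written to the file (or encoded into the argument) that the loader reads at startup.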
`ATOMIC.EVENTS` [is hardcoded](https://github.com/snowplow/snowplow-rdb-loader/blob/5.3.2/modules/loader/src/main/scala/com/snowplowanalytics/snowplow/rdbloader/db/Statement.scala#L99) in the logs. To know if configuration changes have been taken into account and for troubleshooting, it would be useful to get the values from the configuration...
Currently, a single batch of parquet files can contain parquet files with different columns. For example, if there are 1000 events per file, but `context_1` is not seen until the...
Currently, if the streaming transformer is configured with 5-minute windows, then it emits batches at exactly 12:00, 12:05, 12:10, etc. If there are, say, 50 instances of the streaming...
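The alignment described here can be illustrated with a small sketch. It assumes windows are aligned to the Unix epoch in UTC, which matches the 12:00, 12:05, 12:10 pattern but is not taken from the transformer's source; `next_window_end` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def next_window_end(now: datetime, window: timedelta) -> datetime:
    """Return the first epoch-aligned window boundary strictly after `now`."""
    completed = (now - EPOCH) // window  # whole windows elapsed since the epoch
    return EPOCH + (completed + 1) * window


if __name__ == "__main__":
    window = timedelta(minutes=5)
    # Two instances that started at different moments share the same boundary.
    for start in (
        datetime(2024, 1, 1, 12, 1, 30, tzinfo=timezone.utc),
        datetime(2024, 1, 1, 12, 3, 10, tzinfo=timezone.utc),
    ):
        print(start.time(), "->", next_window_end(start, window).time())
```

Because the boundary depends only on the clock and the window length, every instance with the same configuration closes its window at the same instant.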
Schema name in all loaders and catalog name in Databricks Loader should only contain letters, underscores, digits. Other characters shouldn't be allowed. We should add some check if names in...
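A check of this kind could look like the following sketch. It assumes "letters" means ASCII letters, and `is_valid_name` is a hypothetical helper rather than anything in the loader.

```python
import re

# Assumption: only ASCII letters, digits and underscores are acceptable.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_name(name: str) -> bool:
    """Return True if `name` consists solely of letters, digits and underscores."""
    # fullmatch requires the whole string to match, so empty names are rejected.
    return VALID_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("atomic", "my_catalog_1", "my-schema", "bad name", ""):
        print(repr(candidate), is_valid_name(candidate))
```

Running such a check at startup would reject a bad schema or catalog name before any statement is built from it.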
We got this exception from the RDB Loader in version 5.1.2: ``` software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: RBPK6A3W5NR56MXB, Extended Request ID: 1aCgHbQ0qrB/pGglT2OnZadetbhzZM3l8CViLztGRQvNgCsSQC+/EEgV32BObneHGzThng+/WoQ=)...
RDB Loader is unusual among Snowplow apps as [we override the default simplelogger properties](https://github.com/snowplow/snowplow-rdb-loader/blob/5.2.0/modules/loader/src/main/resources/simplelogger.properties):

```
org.slf4j.simpleLogger.showLogName=false
org.slf4j.simpleLogger.showThreadName=false
```

I think this was done because historically we were grepping the logs...
If schema 1-0-0 had `{ ..."fieldX": {"type" : [ "string"] ...}` and schema 1-0-1 had `{ ..."fieldX": {"type" : [ "string", "null"]...}`, then the loader would not generate the ALTER table...
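The change between the two schema versions can be detected with a comparison like the sketch below. It is not the loader's migration logic; it assumes nullability is expressed by the string `"null"` in the `type` keyword, as in JSON Schema, and `types_of` and `became_nullable` are hypothetical helpers.

```python
def types_of(field_schema: dict) -> set:
    """Normalise the JSON Schema `type` keyword to a set of type names."""
    declared = field_schema.get("type", [])
    # `type` may be a single string or an array of strings.
    return {declared} if isinstance(declared, str) else set(declared)


def became_nullable(old_field: dict, new_field: dict) -> bool:
    """True when the new version of a field accepts null and the old one did not."""
    return "null" not in types_of(old_field) and "null" in types_of(new_field)


if __name__ == "__main__":
    v1_0_0 = {"type": ["string"]}
    v1_0_1 = {"type": ["string", "null"]}
    print(became_nullable(v1_0_0, v1_0_1))
```

A migration step could use a result like this to decide whether the column for `fieldX` needs altering.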
These metrics can very quickly rack up a lot of cost in CloudWatch, as they are very chatty. I would suggest disabling them by default:
- https://github.com/snowplow/snowplow-rdb-loader/blob/9c7eee1b541e42225c35fa354cd4c5659531d679/modules/common-transformer-stream/src/main/resources/application.conf#L29
- https://github.com/snowplow/snowplow-rdb-loader/blob/master/config/transformer/aws/transformer.kinesis.config.reference.hocon#L149
### Problem

Some users highlighted issues with parquet loading:

- Format options in [PR #1083](https://github.com/snowplow/snowplow-rdb-loader/pull/1083)
- Performance degradation during COPY TO reported via [Discourse](https://discourse.snowplow.io/t/impossible-to-run-multiple-databricks-loaders-due-to-fifo-requirement/)

The investigation concluded that...