kamu-cli icon indicating copy to clipboard operation
kamu-cli copied to clipboard

Kamu EventTime column can not have nulls exception

Open suresh852456 opened this issue 3 years ago • 1 comments

I am facting the issue with EventTime column, even though I have followed the date format as mentioned in the yaml file. please help yaml file start kind: DatasetSnapshot version: 1 content: name: Hiding Name kind: root metadata: - kind: setPollingSource fetch: kind: url url: Hiding URL read: kind: csv separator: "," header: true nullValue: "" preprocess: kind: sql engine: spark query: > SELECT CAST(id as BIGINT) as id, CAST(UNIX_TIMESTAMP(date, "yyyy-MM-dd") as TIMESTAMP) as date, username as username,name as name, tweet as tweet, language as language, mentions as mentions, urls as urls, photos as photos, replies_count as replies_count, retweets_count as retweets_count, likes_count as likes,hashtags as hashtags,link as link, retweet as retweet, quote_url as quote_url, video as video, thumbnail as thumbnail, reply_to as reply_to FROM input merge: kind: ledger primaryKey: - id - kind: setVocab eventTimeColumn: date yaml file end

Kamu_Error Sample_data_to_upload.csv

suresh852456 avatar Jul 29 '22 16:07 suresh852456

Hi @suresh852456, thanks for providing all the information and a data sample.

I reformated the YAML as follows:

kind: DatasetSnapshot
version: 1
content:
  name: test
  kind: root
  metadata:
    - kind: setPollingSource
      fetch:
        kind: url
        url: file:///home/smikhtoniuk/Downloads/2/sample.csv
      read:
        kind: csv
        separator: ","
        header: true
        nullValue: ""
      preprocess:
        kind: sql
        engine: spark
        query: >
          SELECT
            CAST(id as BIGINT) as id,
            CAST(UNIX_TIMESTAMP(date, "yyyy-MM-dd") as TIMESTAMP) as date,
            username as username,
            name as name, 
            tweet as tweet, 
            language as language, 
            mentions as mentions, 
            urls as urls, 
            photos as photos, 
            replies_count as replies_count,
            retweets_count as retweets_count,
            likes_count as likes,
            hashtags as hashtags,
            link as link, 
            retweet as retweet,
            quote_url as quote_url,
            video as video, 
            thumbnail as thumbnail, 
            reply_to as reply_to
          FROM input
      merge:
        kind: ledger
        primaryKey:
          - id
    - kind: setVocab
      eventTimeColumn: date

Ran:

kamu init
kamu add test.yaml
kamu pull test

And everything worked just fine using the sample data you provided:

$ kamu tail test -n 3
┌────────┬─────────────────────────┬─────────────────────┬────┬────────────┬────────────┬──────────────┬──────────┬──────────┬──────┬────────┬───────────────┬────────────────┬───────┬──────────┬────────────────────────────┬─────────┬───────────┬───────┬───────────┬─────────────────────────────────────────────────────────────────┐
│ offset │       system_time       │        date         │ id │  username  │    name    │    tweet     │ language │ mentions │ urls │ photos │ replies_count │ retweets_count │ likes │ hashtags │            link            │ retweet │ quote_url │ video │ thumbnail │                            reply_to                             │
├────────┼─────────────────────────┼─────────────────────┼────┼────────────┼────────────┼──────────────┼──────────┼──────────┼──────┼────────┼───────────────┼────────────────┼───────┼──────────┼────────────────────────────┼─────────┼───────────┼───────┼───────────┼─────────────────────────────────────────────────────────────────┤
│     99 │ 2022-07-29 19:32:25.404 │ 2021-07-09 00:00:00 │    │ sampleuser │ samplename │ sample tweet │       en │       [] │   [] │     [] │             1 │              1 │     0 │       [] │ https://www.samplelink.com │   FALSE │           │     0 │           │ [{'screen_name': 'sample', 'name': 'smpl', 'id': '2320301582'}] │
│     98 │ 2022-07-29 19:32:25.404 │ 2021-06-28 00:00:00 │    │ sampleuser │ samplename │ sample tweet │       en │       [] │   [] │     [] │             2 │              0 │     0 │       [] │ https://www.samplelink.com │   FALSE │           │     0 │           │ [{'screen_name': 'sample', 'name': 'smpl', 'id': '2320301582'}] │
│     97 │ 2022-07-29 19:32:25.404 │ 2021-06-24 00:00:00 │    │ sampleuser │ samplename │ sample tweet │       en │       [] │   [] │     [] │             1 │              0 │     1 │       [] │ https://www.samplelink.com │   FALSE │           │     0 │           │ [{'screen_name': 'sample', 'name': 'smpl', 'id': '2320301582'}] │
└────────┴─────────────────────────┴─────────────────────┴────┴────────────┴────────────┴──────────────┴──────────┴──────────┴──────┴────────┴───────────────┴────────────────┴───────┴──────────┴────────────────────────────┴─────────┴───────────┴───────┴───────────┴─────────────────────────────────────────────────────────────────┘

I'm using latest version:

$ kamu --version
kamu 0.96.0

...but I don't think your problem is version-related.

Could it be that when you redacted your data to create a sample this made the problem go away?

In your screenshot where the error shows sample row: you can see:

  • first two values are offset and system_time columns that are added automatically by kamu
  • next two columns are supposed to be your id and date, but BOTH of them are NULL
  • judging by the offset == 0 - this should be the very first record in your input CSV which has this issue.

To debug this:

  1. Please try ingesting your own sample data and see if you get the issue. If yes - try updating to latest kamu version.
  2. If no - try ingesting a CSV with just the first row (otherwise unredacted) and see if you can isolate the problem.

Let me know what you find!

sergiimk avatar Jul 29 '22 20:07 sergiimk