bug: Timestamp timezone awareness differences between the Databricks and PySpark/spark-connect backends

Open kyrre opened this issue 1 year ago • 1 comments

What happened?

The Databricks and PySpark/spark-connect backends use different timestamp types. The Databricks backend always returns a timezone-aware timestamp, while the PySpark backend returns a timezone-naive one.

For example, using PySpark/spark-connect and the following query:

(
    con.table(
        "device_process_events",
        database="old_security_logs.mde",
    )
    .select(_.time)
)

returns a timezone-naive timestamp:

r0 := DatabaseTable: old_security_logs.mde.device_process_events
  time          timestamp
  operationName string
  category      string
  tenantId      string
  properties    Timestamp:                                    timestamp
  DeviceId:                                     string
  DeviceName:                                   string
  ActionType:                                   string
  FileName:                                     string
  FolderPath:                                   string
  SHA1:                                         string
  SHA256:                                       string
  MD5:                                          string
  FileSize:                                     int64
  ProcessVersionInfoCompanyName:                string
  ProcessVersionInfoProductName:                string
  ProcessVersionInfoProductVersion:             string
  ProcessVersionInfoInternalFileName:           string
  ProcessVersionInfoOriginalFileName:           string
  ProcessVersionInfoFileDescription:            string
  ProcessId:                                    int32
  ProcessCommandLine:                           string
  ProcessIntegrityLevel:                        string
  ProcessTokenElevation:                        string
  ProcessCreationTime:                          timestamp
  AccountDomain:                                string
  AccountName:                                  string
  AccountSid:                                   string
  AccountUpn:                                   string
  AccountObjectId:                              string
  LogonId:                                      string
  InitiatingProcessAccountDomain:               string
  InitiatingProcessAccountName:                 string
  InitiatingProcessAccountSid:                  string
  InitiatingProcessAccountUpn:                  string
  InitiatingProcessAccountObjectId:             string
  InitiatingProcessLogonId:                     string
  InitiatingProcessIntegrityLevel:              string
  InitiatingProcessTokenElevation:              string
  InitiatingProcessSHA1:                        string
  InitiatingProcessSHA256:                      string
  InitiatingProcessMD5:                         string
  InitiatingProcessFileName:                    string
  InitiatingProcessFileSize:                    int64
  InitiatingProcessVersionInfoCompanyName:      string
  InitiatingProcessVersionInfoProductName:      string
  InitiatingProcessVersionInfoProductVersion:   string
  InitiatingProcessVersionInfoInternalFileName: string
  InitiatingProcessVersionInfoOriginalFileName: string
  InitiatingProcessVersionInfoFileDescription:  string
  InitiatingProcessId:                          int32
  InitiatingProcessCommandLine:                 string
  InitiatingProcessCreationTime:                timestamp
  InitiatingProcessFolderPath:                  string
  InitiatingProcessParentId:                    int32
  InitiatingProcessParentFileName:              string
  InitiatingProcessParentCreationTime:          timestamp
  InitiatingProcessSignerType:                  string
  InitiatingProcessSignatureStatus:             string
  ReportId:                                     int64
  AppGuardContainerId:                          string
  AdditionalFields:                             string
  MachineGroup:                                 string
  Tenant        string
  _rescued_data string
  timestamp     timestamp
  parse_details status: string
  at:     timestamp
  info:   input-file-name: string
  p_date        string

Project[r0]
  time: r0.time

while the same query with the Databricks backend:

(
    con2.table(
        "device_process_events",
        database="old_security_logs.mde",
    )
    .select(_.time)
)

returns a timezone-aware timestamp:

r0 := DatabaseTable: old_security_logs.mde.device_process_events
  time          timestamp('UTC')
  operationName string
  category      string
  tenantId      string
  properties    Timestamp:                                    timestamp('UTC')
  DeviceId:                                     string
  DeviceName:                                   string
  ActionType:                                   string
  FileName:                                     string
  FolderPath:                                   string
  SHA1:                                         string
  SHA256:                                       string
  MD5:                                          string
  FileSize:                                     int64
  ProcessVersionInfoCompanyName:                string
  ProcessVersionInfoProductName:                string
  ProcessVersionInfoProductVersion:             string
  ProcessVersionInfoInternalFileName:           string
  ProcessVersionInfoOriginalFileName:           string
  ProcessVersionInfoFileDescription:            string
  ProcessId:                                    int32
  ProcessCommandLine:                           string
  ProcessIntegrityLevel:                        string
  ProcessTokenElevation:                        string
  ProcessCreationTime:                          timestamp('UTC')
  AccountDomain:                                string
  AccountName:                                  string
  AccountSid:                                   string
  AccountUpn:                                   string
  AccountObjectId:                              string
  LogonId:                                      string
  InitiatingProcessAccountDomain:               string
  InitiatingProcessAccountName:                 string
  InitiatingProcessAccountSid:                  string
  InitiatingProcessAccountUpn:                  string
  InitiatingProcessAccountObjectId:             string
  InitiatingProcessLogonId:                     string
  InitiatingProcessIntegrityLevel:              string
  InitiatingProcessTokenElevation:              string
  InitiatingProcessSHA1:                        string
  InitiatingProcessSHA256:                      string
  InitiatingProcessMD5:                         string
  InitiatingProcessFileName:                    string
  InitiatingProcessFileSize:                    int64
  InitiatingProcessVersionInfoCompanyName:      string
  InitiatingProcessVersionInfoProductName:      string
  InitiatingProcessVersionInfoProductVersion:   string
  InitiatingProcessVersionInfoInternalFileName: string
  InitiatingProcessVersionInfoOriginalFileName: string
  InitiatingProcessVersionInfoFileDescription:  string
  InitiatingProcessId:                          int32
  InitiatingProcessCommandLine:                 string
  InitiatingProcessCreationTime:                timestamp('UTC')
  InitiatingProcessFolderPath:                  string
  InitiatingProcessParentId:                    int32
  InitiatingProcessParentFileName:              string
  InitiatingProcessParentCreationTime:          timestamp('UTC')
  InitiatingProcessSignerType:                  string
  InitiatingProcessSignatureStatus:             string
  ReportId:                                     int64
  AppGuardContainerId:                          string
  AdditionalFields:                             string
  MachineGroup:                                 string
  Tenant        string
  _rescued_data string
  timestamp     timestamp('UTC')
  parse_details status: string
  at:     timestamp('UTC')
  info:   input-file-name: string
  p_date        string

Project[r0]
  time: r0.time

I'm not sure this is actually a bug, but I could not find a way to configure or enforce the same behaviour for both backends.

What version of ibis are you using?

commit d55a5eee5a9fa45d53ea51775f53d1215ed819ac

What backend(s) are you using, if any?

Databricks, PySpark/databricks-connect

Relevant log output


Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

kyrre, Apr 18 '25 08:04

The main issue here is that PySpark broke backwards compatibility when they introduced TimestampNTZType (timestamp without a time zone) and changed TimestampType to mean "timestamp with time zone".

I'm going to see how difficult it is for us to support both by just mapping the types differently, depending on whether pyspark.sql.types.TimestampNTZType is a valid attribute.
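
A minimal sketch of that kind of feature detection, not the ibis implementation. The function name `spark_timestamp_mapping` is hypothetical, and the `SimpleNamespace` objects are stand-ins for `pyspark.sql.types` so the example runs without PySpark installed. The dtype strings mirror the schema output shown in the issue.

```python
from types import SimpleNamespace


def spark_timestamp_mapping(spark_types):
    """Return {Spark type name: ibis-style dtype string} for timestamp types.

    `spark_types` is any object exposing the attributes of `pyspark.sql.types`.
    This helper is hypothetical and only illustrates the detection step.
    """
    if hasattr(spark_types, "TimestampNTZType"):
        # TimestampNTZType is available: per the comment above, treat
        # TimestampType as zone-aware and TimestampNTZType as naive.
        return {
            "TimestampType": "timestamp('UTC')",
            "TimestampNTZType": "timestamp",
        }
    # TimestampNTZType is missing: only TimestampType exists, keep it naive.
    return {"TimestampType": "timestamp"}


# Stand-ins for pyspark.sql.types without and with TimestampNTZType.
old_types = SimpleNamespace(TimestampType=object)
new_types = SimpleNamespace(TimestampType=object, TimestampNTZType=object)

print(spark_timestamp_mapping(old_types))
print(spark_timestamp_mapping(new_types))
```

With a real installation, the argument would be the `pyspark.sql.types` module itself, and the values would be ibis datatype objects rather than strings.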

I suspect the main difficulty will be in the tests, mainly dealing with different behaviors across different versions of Spark 😬

cpcloud, Apr 21 '25 12:04