bug: Timestamp timezone awareness differences between the Databricks and PySpark/spark-connect backends
### What happened?
The Databricks and PySpark/spark-connect backends use different timestamp types: the Databricks backend always returns a timezone-aware timestamp, while the PySpark backend returns a timezone-naive one.
For example, with the PySpark/spark-connect backend, the following query:
```python
(
    con.table(
        "device_process_events",
        database="old_security_logs.mde",
    )
    .select(_.time)
)
```
returns a timezone-naive timestamp:
```
r0 := DatabaseTable: old_security_logs.mde.device_process_events
  time           timestamp
  operationName  string
  category       string
  tenantId       string
  properties     struct<Timestamp: timestamp, DeviceId: string, DeviceName: string, ActionType: string, FileName: string, FolderPath: string, SHA1: string, SHA256: string, MD5: string, FileSize: int64, ProcessVersionInfoCompanyName: string, ProcessVersionInfoProductName: string, ProcessVersionInfoProductVersion: string, ProcessVersionInfoInternalFileName: string, ProcessVersionInfoOriginalFileName: string, ProcessVersionInfoFileDescription: string, ProcessId: int32, ProcessCommandLine: string, ProcessIntegrityLevel: string, ProcessTokenElevation: string, ProcessCreationTime: timestamp, AccountDomain: string, AccountName: string, AccountSid: string, AccountUpn: string, AccountObjectId: string, LogonId: string, InitiatingProcessAccountDomain: string, InitiatingProcessAccountName: string, InitiatingProcessAccountSid: string, InitiatingProcessAccountUpn: string, InitiatingProcessAccountObjectId: string, InitiatingProcessLogonId: string, InitiatingProcessIntegrityLevel: string, InitiatingProcessTokenElevation: string, InitiatingProcessSHA1: string, InitiatingProcessSHA256: string, InitiatingProcessMD5: string, InitiatingProcessFileName: string, InitiatingProcessFileSize: int64, InitiatingProcessVersionInfoCompanyName: string, InitiatingProcessVersionInfoProductName: string, InitiatingProcessVersionInfoProductVersion: string, InitiatingProcessVersionInfoInternalFileName: string, InitiatingProcessVersionInfoOriginalFileName: string, InitiatingProcessVersionInfoFileDescription: string, InitiatingProcessId: int32, InitiatingProcessCommandLine: string, InitiatingProcessCreationTime: timestamp, InitiatingProcessFolderPath: string, InitiatingProcessParentId: int32, InitiatingProcessParentFileName: string, InitiatingProcessParentCreationTime: timestamp, InitiatingProcessSignerType: string, InitiatingProcessSignatureStatus: string, ReportId: int64, AppGuardContainerId: string, AdditionalFields: string, MachineGroup: string>
  Tenant         string
  _rescued_data  string
  timestamp      timestamp
  parse_details  struct<status: string, at: timestamp, info: struct<input-file-name: string>>
  p_date         string

Project[r0]
  time: r0.time
```
while the same query with the Databricks backend:
```python
(
    con2.table(
        "device_process_events",
        database="old_security_logs.mde",
    )
    .select(_.time)
)
```
uses a timezone-aware timestamp type:
```
r0 := DatabaseTable: old_security_logs.mde.device_process_events
  time           timestamp('UTC')
  operationName  string
  category       string
  tenantId       string
  properties     struct<Timestamp: timestamp('UTC'), DeviceId: string, DeviceName: string, ActionType: string, FileName: string, FolderPath: string, SHA1: string, SHA256: string, MD5: string, FileSize: int64, ProcessVersionInfoCompanyName: string, ProcessVersionInfoProductName: string, ProcessVersionInfoProductVersion: string, ProcessVersionInfoInternalFileName: string, ProcessVersionInfoOriginalFileName: string, ProcessVersionInfoFileDescription: string, ProcessId: int32, ProcessCommandLine: string, ProcessIntegrityLevel: string, ProcessTokenElevation: string, ProcessCreationTime: timestamp('UTC'), AccountDomain: string, AccountName: string, AccountSid: string, AccountUpn: string, AccountObjectId: string, LogonId: string, InitiatingProcessAccountDomain: string, InitiatingProcessAccountName: string, InitiatingProcessAccountSid: string, InitiatingProcessAccountUpn: string, InitiatingProcessAccountObjectId: string, InitiatingProcessLogonId: string, InitiatingProcessIntegrityLevel: string, InitiatingProcessTokenElevation: string, InitiatingProcessSHA1: string, InitiatingProcessSHA256: string, InitiatingProcessMD5: string, InitiatingProcessFileName: string, InitiatingProcessFileSize: int64, InitiatingProcessVersionInfoCompanyName: string, InitiatingProcessVersionInfoProductName: string, InitiatingProcessVersionInfoProductVersion: string, InitiatingProcessVersionInfoInternalFileName: string, InitiatingProcessVersionInfoOriginalFileName: string, InitiatingProcessVersionInfoFileDescription: string, InitiatingProcessId: int32, InitiatingProcessCommandLine: string, InitiatingProcessCreationTime: timestamp('UTC'), InitiatingProcessFolderPath: string, InitiatingProcessParentId: int32, InitiatingProcessParentFileName: string, InitiatingProcessParentCreationTime: timestamp('UTC'), InitiatingProcessSignerType: string, InitiatingProcessSignatureStatus: string, ReportId: int64, AppGuardContainerId: string, AdditionalFields: string, MachineGroup: string>
  Tenant         string
  _rescued_data  string
  timestamp      timestamp('UTC')
  parse_details  struct<status: string, at: timestamp('UTC'), info: struct<input-file-name: string>>
  p_date         string

Project[r0]
  time: r0.time
```
I'm not sure this is actually a bug, but I could not find a way to configure or enforce the same behaviour for both backends.
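As a client-side workaround (not a fix for the backend mismatch itself), the timezone awareness can be normalized after materializing the result; this is a sketch, and I haven't verified it against a live Databricks connection:

```python
import pandas as pd

# Stand-in for what the Databricks backend returns after
# something like expr.to_pandas(): a timezone-aware (UTC) column.
df = pd.DataFrame({"time": pd.to_datetime(["2024-01-01 00:00:00"], utc=True)})

# Strip the UTC awareness so the result matches the naive
# timestamps the PySpark/spark-connect backend returns.
df["time"] = df["time"].dt.tz_localize(None)

assert df["time"].dt.tz is None
```

Casting inside ibis (e.g. `_.time.cast("timestamp")`) might achieve the same thing before execution, but I haven't checked how that compiles on Databricks.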
### What version of ibis are you using?
commit d55a5eee5a9fa45d53ea51775f53d1215ed819ac
### What backend(s) are you using, if any?
Databricks, PySpark/databricks-connect
### Relevant log output
### Code of Conduct
- [x] I agree to follow this project's Code of Conduct
The main issue here is that PySpark broke backwards compatibility when they introduced TimestampNTZType (timestamp without a time zone) and changed TimestampType to mean "timestamp with time zone".
I'm going to see how difficult it is for us to support both by just mapping the types differently, depending on whether pyspark.sql.types.TimestampNTZType is a valid attribute.
I suspect the main difficulty will be in tests, mainly dealing with different behaviors across different versions of Spark 😬
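A minimal sketch of that version-dependent mapping (the helper name is illustrative, not actual ibis internals):

```python
import types


def spark_timestamp_ibis_type(spark_types) -> str:
    """Map Spark's TimestampType to an ibis type string.

    Hypothetical helper: if this PySpark version exposes
    TimestampNTZType, then TimestampType means "timestamp with time
    zone" and should map to timestamp('UTC'); on older versions only
    TimestampType exists and it is timezone-naive.
    """
    if hasattr(spark_types, "TimestampNTZType"):
        return "timestamp('UTC')"
    return "timestamp"


# Stand-ins for new/old pyspark.sql.types modules:
new_types = types.SimpleNamespace(TimestampNTZType=object)
old_types = types.SimpleNamespace()

assert spark_timestamp_ibis_type(new_types) == "timestamp('UTC')"
assert spark_timestamp_ibis_type(old_types) == "timestamp"
```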