
Fix precision loss for large CQL timestamp values

Open · Copilot opened this pull request 2 months ago · 3 comments

Pre-review checklist

  • [x] I have split my patch into logically separate commits.
  • [x] All commit messages clearly explain what they change and why.
  • [x] I added relevant tests for new features and bug fixes.
  • [x] All commits compile, pass static checks, and pass tests.
  • [x] PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have provided docstrings for the public items that I want to introduce.
  • [ ] I have adjusted the documentation in ./docs/source/.

Description

CQL timestamp values (signed milliseconds from the epoch) lose precision when they lie more than ~300 years from the Unix epoch, due to floating-point conversions during serialization/deserialization. As a result, a timestamp retrieved from the database cannot be used to query for the same row.

Problem

The driver was converting int64 milliseconds to floating-point values during serialization/deserialization, which loses precision for timestamps more than ~300 years from the Unix epoch: past that range, a double holding a seconds value can no longer resolve individual microseconds (2^53 microseconds is roughly 285 years).
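Where the ~300-year figure comes from can be checked with quick back-of-the-envelope arithmetic (not driver code): a double has a 53-bit significand, so a value counting seconds can only distinguish adjacent microseconds while the total stays below 2^53 microseconds:

```python
# A double can represent every integer exactly only up to 2**53.
# Interpreted as microseconds, that bounds the range over which a
# float holding seconds still resolves single microseconds.
limit_us = 2 ** 53
years = limit_us / 1e6 / 86400 / 365.25  # microseconds -> years
print(round(years))  # roughly 285 years either side of the epoch
```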

Example scenario:

  • Insert timestamp "2300-01-01 00:00:00.001"
  • Retrieve it back: "2300-01-01 00:00:00.000999" (precision lost)
  • Use retrieved value in a query: fails because it's serialized as "2300-01-01 00:00:00.000"
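The loss is reproducible in plain Python, without a cluster. The snippet below is illustrative (it is not the driver's code); it contrasts the old float path with an integer path, using 10413792000001 as the epoch-millisecond value for 2300-01-01 00:00:00.001 UTC:

```python
from datetime import datetime, timedelta, timezone

ms = 10413792000001  # 2300-01-01 00:00:00.001 UTC, in epoch milliseconds

# Old float path: dividing by 1000.0 yields a double that cannot
# resolve microseconds this far from the epoch.
lossy = datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)

# Integer path: timedelta normalizes integer arguments with exact
# integer arithmetic, so nothing is lost.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
exact = epoch + timedelta(milliseconds=ms)

print(lossy.microsecond)  # 999  (precision lost)
print(exact.microsecond)  # 1000 (exact)
```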

Root Causes

  1. Deserialization divided int64 milliseconds by 1000.0 (a float), losing precision
  2. Serialization used floating-point arithmetic (*1e3, /1e3) to calculate milliseconds
  3. Both the Python and Cython implementations had this issue

Solution

Implemented integer-only arithmetic throughout the timestamp conversion path:

  1. Added new conversion functions:

    • datetime_from_timestamp_ms() in cassandra/util.py (Python path)
    • datetime_from_timestamp_ms() in cassandra/cython_utils.pyx (Cython path)
    • Both functions use integer arithmetic exclusively to avoid precision loss
    • Note: The existing datetime_from_timestamp() function is preserved as it's still used by UUID utilities (datetime_from_uuid1()) and the Date class for seconds-based conversions
  2. Updated DateType serialization/deserialization:

    • cassandra/cqltypes.py: Changed DateType.deserialize() to call the new function directly on milliseconds (no division)
    • cassandra/cqltypes.py: Changed DateType.serialize() to use integer operations (*1000, //1000) instead of float operations (*1e3, /1e3)
    • cassandra/deserializers.pyx: Updated Cython deserializer to use the new integer-based function and removed unused import
  3. Code cleanup:

    • Removed unnecessary negative remainder check from datetime_from_timestamp_ms() - Python's modulo operator always returns non-negative results for positive divisors
    • Removed unused datetime_from_timestamp import from cassandra/deserializers.pyx
  4. Added comprehensive test coverage:

    • Created tests/unit/test_timestamp_precision.py with 4 unit tests:
      • Large timestamp roundtrip (year 2300)
      • Various large timestamps including negative values (year 1640 BCE)
      • Small timestamps to ensure no regression
      • The exact scenario from the issue report
    • Added integration test test_large_timestamp_precision() in tests/integration/standard/test_types.py that verifies round-trip identity for large timestamps (years 2300, 1640, 2500) on a real cluster
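Concretely, the integer-only conversion pair could look like the following sketch. The name datetime_from_timestamp_ms matches the PR; the function bodies and the reverse helper are assumptions for illustration, not the actual driver implementation:

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def datetime_from_timestamp_ms(ms):
    """Convert epoch milliseconds (int) to a naive UTC datetime.

    timedelta normalizes integer arguments with exact integer
    arithmetic, so no precision is lost even centuries from the epoch.
    """
    return EPOCH + timedelta(milliseconds=ms)

def timestamp_ms_from_datetime(dt):
    """Inverse conversion, again using only integer operations."""
    delta = dt - EPOCH
    # timedelta normalizes so that 0 <= seconds < 86400 and
    # 0 <= microseconds < 10**6 even when days is negative, which is
    # why this floor-style arithmetic round-trips negative (pre-1970)
    # timestamps correctly as well.
    return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000
```

With these helpers, a year-2300 value round-trips exactly: converting 10413792000001 to a datetime and back returns 10413792000001, and the same holds for negative millisecond values.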

Testing

  • ✅ All new unit tests pass (4/4)
  • ✅ All existing DateType tests pass
  • ✅ All unit type tests pass (62 passed, 1 skipped)
  • ✅ Integration test added for real cluster verification
  • ✅ No regressions detected
  • ✅ Issue scenario validated - timestamps like "2300-01-01 00:00:00.001" now maintain full precision through serialization/deserialization cycles
  • ✅ Negative timestamp handling verified to work correctly

The changes are minimal and surgical, affecting only the timestamp conversion logic without altering any other functionality.

Original prompt

This section details the original issue you should resolve

<issue_title>Large values of CQL type timestamp lose precision due to some floating-point conversions along the way</issue_title>

<issue_description>timestamp CQL values are integers (signed milliseconds from epoch), but they get converted to double somewhere along the way, so they lose precision if they are further than ~300 years from the Unix epoch. (~300 years is the max number of milliseconds which can be exactly represented in a double).

In particular, you can e.g. select a timestamp key from a table, then use the obtained key to query the table for that particular key, and the query will return an empty result.

Consider:

    keyspace_name = "ks"
    table_name = "ks"
    session.execute(f"DROP KEYSPACE IF EXISTS {keyspace_name}")
    session.execute(f"""
        CREATE KEYSPACE IF NOT EXISTS {keyspace_name}
        WITH replication = {{'class': 'SimpleStrategy', 'replication_factor': 1}}
    """)
    session.execute(f"""
        CREATE TABLE {keyspace_name}.{table_name} (
            pk timestamp PRIMARY KEY
        )
    """)
    test_timestamp = "2300-01-01 00:00:00.001"
    insert = session.execute(f"INSERT INTO {keyspace_name}.{table_name}(pk) VALUES('{test_timestamp}')")

    scan_result = session.execute(f"SELECT * FROM {keyspace_name}.{table_name}")
    scan_rows = list(scan_result)
    assert len(scan_rows) == 1

    retrieved_timestamp = scan_rows[0].pk
    logger.debug(f"Original timestamp: {test_timestamp}")
    logger.debug(f"Retrieved timestamp: {retrieved_timestamp}")

    key_rows = list(session.execute(f"SELECT * FROM {keyspace_name}.{table_name} WHERE pk = '{test_timestamp}'"))
    assert len(key_rows) == 1

    key_select = session.prepare(f"SELECT * FROM {keyspace_name}.{table_name} WHERE pk = ?")
    key_rows = list(session.execute(key_select, [retrieved_timestamp]))
    assert len(key_rows) == 1 # FAIL

    session.execute(f"DROP KEYSPACE {keyspace_name}")

This will give the log output:

2025-09-11 23:12:40,117 - DEBUG - Original timestamp: 2300-01-01 00:00:00.001
2025-09-11 23:12:40,117 - DEBUG - Retrieved timestamp: 2300-01-01 00:00:00.000999

And the test will fail on the assert marked with the FAIL comment, because the actual key is 2300-01-01 00:00:00.001, the key converted to Python native types is 2300-01-01 00:00:00.001 - epsilon, and the key sent back with the last SELECT is 2300-01-01 00:00:00.000.</issue_description>

Comments on the Issue (you are @copilot in this section)

  • Fixes scylladb/python-driver#532


Copilot · Dec 27 '25 11:12

@copilot please complete the work on the PR

mykaul · Dec 27 '25 12:12

> @copilot please complete the work on the PR

The PR is now complete. All changes have been implemented and tested:

Changes made (commit fbab5cd):

  • Added integer-based timestamp conversion functions to avoid floating-point precision loss
  • Updated serialization/deserialization in both Python and Cython code paths
  • Added comprehensive test suite that validates the fix

Testing results:

  • New timestamp precision tests: ✅ All 4 tests pass
  • Existing DateType tests: ✅ Pass
  • All unit type tests (63 tests): ✅ 62 passed, 1 skipped
  • Issue scenario validation: ✅ Timestamps far from epoch now maintain full precision

The fix ensures that timestamps like "2300-01-01 00:00:00.001" can be correctly round-tripped through serialization/deserialization without losing precision.

Copilot · Dec 27 '25 12:12

> @copilot Please also add an integration test case in /tests/integration/standard/test_types.py that verifies round-trip identity for such types on a real cluster.

Added integration test in commit 38911a3. The test verifies round-trip identity for large timestamps (years 2300, 1640, 2500) on a real cluster, ensuring millisecond precision is maintained through serialization and deserialization.

Copilot · Jan 09 '26 19:01