DateRange timestamp conversion on Windows fails with timestamps before the Unix epoch

Open bdemirtas opened this issue 1 year ago • 3 comments

Expected Behavior

When you use DateRange with a starting date before 1970 (the Unix epoch), it raises an OSError: "OSError: [Errno 22] Invalid argument". The related Python bug ticket is https://bugs.python.org/issue37527

Current Behavior

Currently it works as intended on any non-Windows OS. The workaround is to provide the datetime with a UTC timezone.
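To illustrate why a timezone-aware value helps, here is a minimal sketch using only the standard library. It assumes, without confirmation from the dbldatagen source, that the failure comes from calling `timestamp()` on a naive datetime, which relies on the platform's local-time conversion. The helper name `utc_timestamp` is hypothetical and is not part of dbldatagen.

```python
from datetime import datetime, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def utc_timestamp(text: str) -> float:
    """Parse an ISO-style string and return its POSIX timestamp, read as UTC.

    Hypothetical helper for illustration only.
    """
    aware = datetime.fromisoformat(text).replace(tzinfo=timezone.utc)
    # For aware datetimes, timestamp() is computed as
    # (aware - EPOCH_UTC).total_seconds(), so no platform
    # local-time call is involved.
    return aware.timestamp()


start = utc_timestamp("1910-10-01 00:00:00")
print(start)  # a negative number of seconds, since the date precedes 1970

naive = datetime.fromisoformat("1910-10-01 00:00:00")
try:
    # Naive datetimes are treated as local time; this is the step that
    # is reported to raise OSError on Windows for pre-1970 values.
    print(naive.timestamp())
except (OSError, OverflowError) as exc:
    print(f"naive conversion failed: {exc}")
```

On Linux or macOS both conversions are expected to succeed, which matches the report that only Windows is affected.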

Steps to Reproduce (for bugs)

This code fails and raises OSError on Windows.

import dbldatagen as dg

testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=1000, partitions=4)
    .withColumn(
        "purchase_date",
        "date",
        data_range=dg.DateRange("1910-10-01 00:00:00", "1950-10-06 11:55:00", "days=3"),
        random=True,
    )
)

Context

Your Environment

  • dbldatagen version used:
  • Databricks Runtime version:
  • Cloud environment used:

bdemirtas avatar Apr 10 '24 03:04 bdemirtas

Can you provide more details of the workaround? If there is a valid workaround, we will document it. However, the intended runtime environment is the Databricks cloud environment, and the library is tested there and on local Linux or similar environments, so we cannot validate it on Windows.

While we don't block it running on other environments, the intent is to support it running on a Databricks cloud environment or developing locally in preparation for use on a Databricks cloud environment.

ronanstokes-db avatar Jun 17 '24 18:06 ronanstokes-db

The following example shows the use of datetime instances to define the range:

import dbldatagen as dg
from datetime import datetime, timezone

startingTime = datetime.fromisoformat("1910-10-01T00:00:00").replace(tzinfo=timezone.utc)
endingTime = datetime.fromisoformat("1950-10-06T11:55:00").replace(tzinfo=timezone.utc)

testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=1000, partitions=4)
    .withColumn(
        "purchase_date",
        "date",
        data_range=dg.DateRange(startingTime, endingTime, "days=3"),
        random=True,
    )
)

display(testDataSpec.build())

ronanstokes-db avatar Jun 27 '24 18:06 ronanstokes-db

Sorry for the late answer. Yes, that is what I use as a workaround. This one works too:

from datetime import datetime, timezone

now = datetime.now()
utc_now = now.astimezone(timezone.utc)
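The two approaches in this thread are not equivalent, which the following sketch tries to show. `replace(tzinfo=timezone.utc)` keeps the clock fields and labels them as UTC, while `astimezone(timezone.utc)` treats a naive value as local time and converts it. The sample date is arbitrary and chosen only for illustration.

```python
from datetime import datetime, timezone

naive = datetime(2024, 8, 25, 16, 8)

# replace() only attaches tzinfo: the clock fields stay the same
# and are reinterpreted as UTC.
labelled = naive.replace(tzinfo=timezone.utc)

# astimezone() assumes the naive value is local time and converts
# that instant to UTC, so the clock fields may shift.
converted = naive.astimezone(timezone.utc)

print(labelled.isoformat())
print(converted.isoformat())
# The gap between the two equals the local UTC offset at that moment.
print(labelled - converted)
```

Because `astimezone()` on a naive datetime depends on the system's local-time conversion, it may hit the same Windows limitation for dates before 1970. That was not verified here, so `replace(tzinfo=timezone.utc)` looks like the safer choice for a pre-epoch DateRange.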

bdemirtas avatar Aug 25 '24 16:08 bdemirtas