datahub icon indicating copy to clipboard operation
datahub copied to clipboard

Vertica timestamp(0) causes an unable to get column information error in DataHub CLI

Open martinontcode opened this issue 2 years ago • 10 comments

Describe the bug A warning is received in Datahub CLI while executing ingestion from Vertica DB via Vertica plugin. Warning output: 'warnings': {'XXX_TABLE': ['unable to get column information due to an error -> __init__() got an unexpected keyword ' "argument 'precision'"]}

It seems that the error is caused when an attribute data type is set to timestamp(0). The error can be bypassed when commenting out the following lines in venv\Lib\site-packages\datahub\ingestion\source\sql\vertica.py, _get_column_info function.

elif attype in ("timestamp", "time"):
        kwargs["timezone"] = False
        # if charlen:
        #     kwargs["precision"] = int(charlen)
        args = ()

The created kwargs object precision attribute isn't compatible with the timestamp parameters: {'timezone': False, 'precision': 0}, it's possible that the same error would appear when working with timestampz data type.

Commenting most likely works as timestamp precision is optional - TIMESTAMP/TIMESTAMPTZ

To Reproduce Create a table in Vertica including an attribute with timestamp(0):

CREATE TABLE YYY.XXX_TABLE
(
    load_timestamp timestamp(0)
);

Execute the pipeline using datahub CLI.

Expected behavior Load meta from Vertica to file using datahub CLI w/o warnings.

Desktop (please complete the following information):

  • Python 3.8.2
  • acryl-datahub 0.8.40 / DataHub CLI 0.8.40
  • wheel 0.37.1
  • setuptools 62.6.0
  • pip 22.1.2

martinontcode avatar Jun 30 '22 08:06 martinontcode

Hi, is there a possibility of receiving an ETA on when this might be resolved?

martinontcode avatar Jul 12 '22 05:07 martinontcode

Some additional information regarding this issue.

In case of timestamp(x) or timestamptz(x) as mentioned, the following warning is raised: 'warnings': {'XSCHEMA.YTABLE': ['unable to get column information due to an error -> __init__() got an unexpected keyword ' "argument 'precision'"]}

The precision argument is initiated in \Lib\site-packages\datahub\ingestion\source\sql\vertica.py:

    elif attype in ("timestamptz", "timetz"):
        kwargs["timezone"] = True
        if charlen:
            kwargs["precision"] = int(charlen)  # type: ignore
        args = ()  # type: ignore
    elif attype in ("timestamp", "time"):
        kwargs["timezone"] = False
        if charlen:
            kwargs["precision"] = int(charlen)  # type: ignore

This calls for the TIMESTAMP class in \Lib\site-packages\sqlalchemy\sql\sqltypes.py:

class TIMESTAMP(DateTime):

    """The SQL TIMESTAMP type.

    :class:`_types.TIMESTAMP` datatypes have support for timezone
    storage on some backends, such as PostgreSQL and Oracle.  Use the
    :paramref:`~types.TIMESTAMP.timezone` argument in order to enable
    "TIMESTAMP WITH TIMEZONE" for these backends.

    """

    __visit_name__ = "TIMESTAMP"

    def __init__(self, timezone=False):
        """Construct a new :class:`_types.TIMESTAMP`.

        :param timezone: boolean.  Indicates that the TIMESTAMP type should
         enable timezone support, if available on the target database.
         On a per-dialect basis is similar to "TIMESTAMP WITH TIMEZONE".
         If the target database does not support timezones, this flag is
         ignored.


        """
        super(TIMESTAMP, self).__init__(timezone=timezone)

    def get_dbapi_type(self, dbapi):
        return dbapi.TIMESTAMP

Which returns the precision error as there's no precision argument defined.

martinontcode avatar Jul 13 '22 11:07 martinontcode

So an easy fix would be to add the following argument and code line into the class:

class TIMESTAMP(DateTime):

    """The SQL TIMESTAMP type.

    :class:`_types.TIMESTAMP` datatypes have support for timezone
    storage on some backends, such as PostgreSQL and Oracle.  Use the
    :paramref:`~types.TIMESTAMP.timezone` argument in order to enable
    "TIMESTAMP WITH TIMEZONE" for these backends.

    """

    __visit_name__ = "TIMESTAMP"

    def __init__(self, timezone=False, precision=None):
        """Construct a new :class:`_types.TIMESTAMP`.

        :param timezone: boolean.  Indicates that the TIMESTAMP type should
         enable timezone support, if available on the target database.
         On a per-dialect basis is similar to "TIMESTAMP WITH TIMEZONE".
         If the target database does not support timezones, this flag is
         ignored.


        """
        super(TIMESTAMP, self).__init__(timezone=timezone)
        self.precision = precision

    def get_dbapi_type(self, dbapi):
        return dbapi.TIMESTAMP

Which results in the following schema output:

{
"fieldPath": "DATE_CREATED",
"jsonPath": null,
"nullable": true,
"description": null,
"created": null,
"lastModified": null,
"type": {
    "type": {
        "com.linkedin.pegasus2avro.schema.TimeType": {}
    }
},
"nativeDataType": "TIMESTAMP(precision=0)",
"recursive": false,
"globalTags": null,
"glossaryTerms": null,
"isPartOfKey": false,
"isPartitioningKey": null,
"jsonProps": null
}

martinontcode avatar Jul 13 '22 11:07 martinontcode

Hi martinontcode! Would you mind opening a PR with the fix so we can move it through review?

maggiehays avatar Jul 19 '22 22:07 maggiehays

This issue is stale because it has been open for 15 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Aug 04 '22 02:08 github-actions[bot]

Hi, the issue is still present. The complexity of this issue comes from the fact that the used import from sqlalchemy.sql import sqltypes has the missing "precision" parameter.

Is there a way to fix this from Datahub perspective?

martinontcode avatar Aug 26 '22 06:08 martinontcode

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Oct 04 '22 02:10 github-actions[bot]

I submitted a PR attempting to fix this problem. It basically is a hacky implementation of the suggestion @martinontcode made.

inancdokurel avatar Oct 26 '22 21:10 inancdokurel

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Nov 27 '22 02:11 github-actions[bot]

This still is a problem in the current version. Could anybody review and comment on the opened PR?

inancdokurel avatar Nov 29 '22 11:11 inancdokurel