Tableau Connector : Unify Data Models

Open jsampaiog opened this issue 1 year ago • 9 comments

Is your feature request related to a problem? Please describe.

When ingesting data models from Tableau, multiple data models are displayed for the same data source. This explodes the total number of data sources, even though the underlying data source is unique, and makes discovery and lineage more complicated.

image

Describe the solution you'd like

Today OMD relies on the nodes segment of the Tableau metadata to create the data model:

  embeddedDatasourcesConnection(first: {first}, offset: {offset} ) {{
    nodes {{
      id
      name
      fields {{
        id
        name
        upstreamColumns{{
          id
          name
          remoteType
        }}

But perhaps a better way would be to create the data model based on the root data model, since these share the same ID across the models

image

jsampaiog avatar Feb 16 '24 10:02 jsampaiog

@harshach I wanted to chime in on this conversation. At my organisation, we're ingesting a large Tableau instance and we've also noticed this behavior where there are multiple versions of the same datasource. What we found is that a workbook (dashboard) can have its own embedded datasource, which is a workbook-specific version of an upstream datasource that it connects to (usually one that exists on Tableau Server). The reason for this seems to be that a workbook can connect to a datasource, then change field names, add calculated fields and do various other things to create its own version of the connected datasource.

Ideally, what we would like (and @jsampaiog, please jump in if you disagree) is for published datasources to be ingested into OpenMetadata as well as the embedded datasources (it would be nice to have a new icon to differentiate the data models).

I think it's important to keep both the published and embedded datasources, because that way we can see what transformations have occurred at the workbook level and compare it to the published server model.

Here's a screenshot of what it might look like:

image

chillerno1 avatar Feb 27 '24 23:02 chillerno1

Hi @chillerno1, thanks for chiming in. Indeed, the behavior you describe would be the best target scenario! But we also brainstormed internally, and to avoid complicating the OpenMetadata data model, if we were forced to choose between "Published datasources" and "Embedded datasources", we would stick with the former.

jsampaiog avatar Mar 06 '24 10:03 jsampaiog

Thanks @jsampaiog, I agree with that!

chillerno1 avatar Mar 11 '24 20:03 chillerno1

Chiming in too. Pretty much I have a similar scenario to what @jsampaiog described.

We have a published data source that is then used in multiple places. We are planning to use this for more scenarios, so the number of data models can simply explode. What @jsampaiog suggested in the original issue seems the way to go:

But perhaps a better way would be to create the data model based on the root data model, since these share the same ID across the models

nicor88 avatar Apr 19 '24 17:04 nicor88

@pmbrull are you still planning to include this in the 1.4.0 release? I see that it was removed :(

nicor88 avatar Apr 19 '24 17:04 nicor88

Hi, we had to reprioritize certain topics and ran out of time to handle this, so 1.4.1 - 1.5 would be the new ETA.

My 2 cents on the conversation above is to keep things simple. Aiming to keep only the Published DataModel would, IMO, be the way to go to reduce complexity.

pmbrull avatar Apr 22 '24 08:04 pmbrull

@pmbrull thanks for the context on the timelines.

I believe that "Published DataModel" should do the job even in the case of a "Dashboard" with embedded data models. We just need to be sure that we don't introduce a regression where data models go missing entirely.

nicor88 avatar Apr 22 '24 09:04 nicor88

Thanks @OnkarVO7 for this thread.

We have a similar problem: duplicate models rather than a single combined model for all the downstream workbooks.

OMD currently uses this query:

  query {
    embeddedDatasourcesConnection(filter: {name: "Tech Data Model"}) {
      nodes {
        id
        name
        workbook {
          id
          name
        }
      }
      totalCount
    }
  }

We checked with the Tableau team (and spent a lot of time with Tableau support to get the right information), and they proposed using the query below:

  query {
    publishedDatasources(filter: {name: "Tech Data Model"}) {
      id
      name
      hasExtracts
      downstreamWorkbooks {
        id
        luid
        name
      }
    }
  }
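
As a rough illustration of how a result from the publishedDatasources query could collapse into one data model per published datasource, here is a minimal Python sketch. It assumes the usual GraphQL response envelope with a top-level `data` key; the function name `models_from_published`, the output dictionary layout, and the sample values are hypothetical and not part of the OpenMetadata connector.

```python
def models_from_published(response):
    """Build one data model entry per published datasource.

    `response` is assumed to be the decoded JSON body returned for the
    publishedDatasources query (standard GraphQL `data` envelope).
    """
    models = {}
    for ds in response["data"]["publishedDatasources"]:
        workbooks = ds.get("downstreamWorkbooks") or []
        models[ds["id"]] = {
            "name": ds["name"],
            "hasExtracts": ds.get("hasExtracts", False),
            # Workbooks that consume this datasource; these are the
            # candidates for lineage from the single data model.
            "workbook_luids": [wb["luid"] for wb in workbooks],
        }
    return models


# Hypothetical response: one published datasource used by two workbooks.
sample_response = {
    "data": {
        "publishedDatasources": [
            {
                "id": "ds-1",
                "name": "Tech Data Model",
                "hasExtracts": True,
                "downstreamWorkbooks": [
                    {"id": "wb-1", "luid": "wb-luid-a", "name": "Sales"},
                    {"id": "wb-2", "luid": "wb-luid-b", "name": "Ops"},
                ],
            }
        ]
    }
}

models = models_from_published(sample_response)
```

With this shape, two workbooks map to a single data model entry instead of two embedded copies.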

17rahulsharma avatar Apr 30 '24 07:04 17rahulsharma

Hi,

Sorry for commenting in this thread, we are facing the same situation: the sources are duplicated for each workbook (dashboard in OM) that we ingested.

image

The dashboard data model exists only once on Tableau:

image

If the object existed only once, we could trace lineage to the workbooks and assign the owner once, instead of having independent objects.

image

In addition, we have split the ingestion into separate services, one per Tableau folder, since this allows us to assign owners by folder. Perhaps, if the Tableau folder were ingested as a level of the hierarchy, as the database service does, we could maintain that hierarchy and also filter by folder:

DB Ingestion -> Schema -> Table
Tableau Ingestion -> Folder -> Workbook & data models

thanks, Carlos

triquinielas avatar Apr 30 '24 13:04 triquinielas

Here is a recap of the conversation that I had with @OnkarVO7.

  1. The query must be changed to add this section:
     upstreamDatasources {
          id
          luid
          name
          description
          hasExtracts
          tags {
            id
          }
          fields {
            id
            name
            isHidden
          }
          upstreamTables {
            id
            luid
            name
            fullName
            schema
            referencedByQueries {
              id
              name
              query
            }
            columns {
              id
              name
            }
            database {
              id
              name
            }
          }
        }
  2. The ingestion must have logic to use the new field upstreamDatasources. If upstreamDatasources is not empty (which is the case for published datasources), we need to publish a new data model node and link it to the underlying data source downstream and upstream to the related data model in Tableau.
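
The two steps above can be sketched roughly as follows in Python. This is only one reading of the recap, not the connector's actual implementation: the function name `plan_data_models`, the `kind` labels, and the edge direction (upstream table -> published data model -> embedded data model) are assumptions made for illustration, and the sample nodes are made up.

```python
def plan_data_models(embedded_nodes):
    """Plan data model nodes and lineage edges from embedded datasource nodes.

    Each node is assumed to carry the `upstreamDatasources` section from
    the recap. Returns (models, edges), where edges are
    (upstream_id, downstream_id) pairs.
    """
    models = {}
    edges = set()
    for node in embedded_nodes:
        # Always keep the embedded model so that none goes missing.
        models.setdefault(node["id"], {"name": node["name"], "kind": "embedded"})
        for published in node.get("upstreamDatasources") or []:
            # A published datasource has the same id in every workbook,
            # so setdefault creates its node only once.
            models.setdefault(
                published["id"], {"name": published["name"], "kind": "published"}
            )
            edges.add((published["id"], node["id"]))
            for table in published.get("upstreamTables") or []:
                edges.add((table["id"], published["id"]))
    return models, edges


# Hypothetical input: two workbooks share one published datasource,
# and a third embedded datasource has no upstream datasource.
shared = {
    "id": "pub-1",
    "name": "Tech Data Model",
    "upstreamTables": [{"id": "tbl-1", "name": "orders"}],
}
embedded_nodes = [
    {"id": "emb-1", "name": "Tech Data Model", "upstreamDatasources": [shared]},
    {"id": "emb-2", "name": "Tech Data Model", "upstreamDatasources": [shared]},
    {"id": "emb-3", "name": "Local Extract", "upstreamDatasources": []},
]

models, edges = plan_data_models(embedded_nodes)
published_ids = [k for k, v in models.items() if v["kind"] == "published"]
```

Keying on the published datasource id is what prevents one copy per workbook, while the purely embedded model is still kept, so no data model disappears.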

nicor88 avatar Jul 08 '24 12:07 nicor88