Multiple Results of particular Dataset are missing with the information of schema and Platform Instance in UI

Open deepgarg760 opened this issue 7 months ago • 2 comments

Describe the bug When searching for a particular dataset in the "Manage Upstream Lineage" dialogue box, results are not differentiated by schema and platform instance, which makes it impossible to select the desired dataset.

To Reproduce

  1. Navigate to the dataset for which upstream or downstream lineage needs to be added
  2. Click the Lineage tab
  3. Click Edit; the "Manage Upstream Lineage" dialogue box opens
  4. Select a dataset which has multiple instances in different schemas or Platform instance
  5. All the datasets are fetched without schema and Platform Instance information

Expected behavior Datasets should be fetched with Schema and Platform Instance information

Screenshots

Image

Datahub Version 14.1

deepgarg760 avatar May 22 '25 12:05 deepgarg760

This is possible in the V2 UI on DataHub 1.0. We know it's a big switch, but we recommend swapping over to the new UI as it is being actively developed.

asikowitz avatar Jun 07 '25 20:06 asikowitz

Thanks for the update @asikowitz

deepgarg760 avatar Jun 09 '25 05:06 deepgarg760

I am trying to reproduce the issue with the current UI. However, it appears that you need data sources to reproduce it. The initial Docker setup does not include the data sources. What is the easiest way to add the appropriate data sources to reproduce the issue? And once the data sources are added, is it possible to have a direct link to the bug to try to reproduce it?

MaciekRakowski avatar Jun 19 '25 06:06 MaciekRakowski

You can ingest some sample data by running python -m datahub docker ingest-sample-data. If you are running DataHub locally with authentication on (the default), you'll have to generate a token and then pass it to the command: python -m datahub docker ingest-sample-data --token <token>.

To reproduce the bug, make sure you're on the V1 UI (you may have to go to settings -> appearance -> unselect "Try DataHub 2.0 (beta)"), then go to any dataset entity (e.g. http://localhost:9002/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)/Columns), go to the lineage tab, and click edit upstream or downstream. To fully reproduce, you'll need to add data that has a platform instance, which is missing from our sample data. But if you're able to get this to display the database / schema, I think that is good enough.
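
As a rough illustration of what "data that has a platform instance" would look like, the sketch below builds URNs for two datasets that share a platform and name but live in different platform instances. It assumes the convention of prefixing the platform instance to the dataset name inside the URN. The helper `dataset_urn`, the instance names `warehouse_a` and `warehouse_b`, and the table name `orders` are hypothetical and are not part of the sample data.

```python
from typing import Optional


def dataset_urn(
    platform: str,
    name: str,
    env: str = "PROD",
    platform_instance: Optional[str] = None,
) -> str:
    """Build a dataset URN string (hypothetical helper, not the DataHub SDK).

    Assumption: when a platform instance is given, it is prefixed to the
    dataset name with a dot.
    """
    qualified = f"{platform_instance}.{name}" if platform_instance else name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{qualified},{env})"


# Two datasets with the same platform and name, differing only by instance.
urns = [
    dataset_urn(
        "hive",
        "datahub_db.datahub_schema.orders",  # hypothetical table name
        platform_instance=instance,
    )
    for instance in ("warehouse_a", "warehouse_b")  # hypothetical instances
]

for urn in urns:
    print(urn)
```

Both entities would show the same name in the lineage dialogue, which is why the extra context is needed to tell them apart.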

asikowitz avatar Jun 20 '25 18:06 asikowitz

How do I generate a token? When I go to the tokens page, it shows this below. Is there anything I should do to be able to create a token? Or do I generate it differently?

Image

MaciekRakowski avatar Jun 21 '25 23:06 MaciekRakowski

Ah if it's disabled then you don't need a token at all, you can just run python -m datahub docker ingest-sample-data

asikowitz avatar Jun 23 '25 23:06 asikowitz

It looks like my issue with not seeing the same screen as shown in the reproduction steps was the default UI version of 2.0. I ran the data ingestion and changed it to 1.0 and am able to see the screen. Can I know which specific dataset I can choose from the dropdown to reproduce the issue? On step 4 it says to choose a dataset that "has multiple instances in different schemas or Platform instance".

Image

MaciekRakowski avatar Jun 24 '25 06:06 MaciekRakowski

There isn't an issue with a specific dataset. Rather, the complaint here is that entities with the same platform, entity type, and entity name are indistinguishable, and the request is to display extra information: platform instance and database / schema. We don't have such entities in the seed data, but you can still work on displaying this extra information in two places: (i) on the list of related entities and (ii) on the search cards when searching to add more related entities. We should display this like we do on the entity header, i.e. in your screenshot, towards the top left, we describe the entity as: "Dataset | Hive > datahub_db > datahub_schema".
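
To make the requested label concrete, here is a minimal sketch of the formatting logic, written in Python purely for illustration (the UI itself is not Python, and this is not the code from any PR). The function name `entity_context_label` and the placement of the platform instance right after the platform are assumptions; only the "Dataset | Hive > datahub_db > datahub_schema" shape comes from the entity header described above.

```python
from typing import Optional, Sequence


def entity_context_label(
    entity_type: str,
    platform: str,
    containers: Sequence[str] = (),
    platform_instance: Optional[str] = None,
) -> str:
    """Format a header-style label for an entity (hypothetical helper)."""
    path = [platform]
    # Assumption: the platform instance, when present, sits between the
    # platform and the database / schema containers.
    if platform_instance:
        path.append(platform_instance)
    path.extend(containers)
    return f"{entity_type} | " + " > ".join(path)


# Same shape as the entity header quoted in the comment above.
print(entity_context_label("Dataset", "Hive", ["datahub_db", "datahub_schema"]))
```

The same label could be reused in both places mentioned above, the list of related entities and the search cards, so that entities with identical names remain distinguishable.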

asikowitz avatar Jun 24 '25 16:06 asikowitz

I created a fix for this issue and have a PR for it. I tested it locally and it seems to work.

Here it is: https://github.com/datahub-project/datahub/pull/13856

Feel free to leave any comments.

MaciekRakowski avatar Jun 25 '25 06:06 MaciekRakowski

Some of the checks are failing, including lint. This time, it does not give a specific lint error. I ran lint locally on the file I changed and I got no errors. I'm not sure how to fix the pipeline errors.

MaciekRakowski avatar Jun 25 '25 07:06 MaciekRakowski

Hi @asikowitz ,

I submitted a PR a few days ago. It shows that the checks do not pass, but when I look closer, the one that is failing is the deployment task. The specific error shows this:

https://github.com/datahub-project/datahub/actions/runs/15961135460/job/45013841641?pr=13856

Run cloudflare/pages-action@1
  with:
    projectName: datahub-project-web-react
    workingDirectory: datahub-web-react
    directory: dist
    gitHubToken: ***
    wranglerVersion: 2
  env:
    JAVA_HOME: /opt/hostedtoolcache/Java_Zulu_jdk/17.0.15-6/x64
    JAVA_HOME_17_X64: /opt/hostedtoolcache/Java_Zulu_jdk/17.0.15-6/x64
    GRADLE_BUILD_ACTION_SETUP_COMPLETED: true
    GRADLE_BUILD_ACTION_CACHE_RESTORED: true
    DEVELOCITY_INJECTION_INIT_SCRIPT_NAME: gradle-actions.inject-develocity.init.gradle
    DEVELOCITY_AUTO_INJECTION_CUSTOM_VALUE: gradle-actions
    GITHUB_DEPENDENCY_GRAPH_ENABLED: false
Error: Input required and not supplied: apiToken

However, all areas within my control, such as linting, unit tests, and unit test coverage, pass. I believe the deployment issue is outside of my control. It may be because I'm creating a PR from my forked branch.

Are you or someone from your team able to review my PR? https://github.com/datahub-project/datahub/pull/13856

MaciekRakowski avatar Jun 30 '25 01:06 MaciekRakowski