Multiple results for the same dataset are shown without schema and Platform Instance information in the UI
Describe the bug
When searching for a dataset in the "Manage Upstream Lineage" dialogue box, the results are not differentiated by schema or Platform Instance, which makes it impossible to select the desired dataset.
To Reproduce
- Navigate to the dataset for which upstream or downstream lineage needs to be added
- Click the Lineage tab
- Click Edit; the "Manage Upstream Lineage" dialogue box opens
- Select the dataset which has multiple instances in different schemas or Platform instance
- All the datasets are fetched without schema and Platform Instance information
Expected behavior
Datasets should be fetched with their schema and Platform Instance information
Screenshots
Datahub Version 14.1
This is possible in the V2 UI on DataHub 1.0. We know it's a big switch, but we recommend swapping over to the new UI as it is being actively developed.
Thanks for the update @asikowitz
I am trying to reproduce the issue with the current UI. However, it appears that you need data sources to reproduce it. The initial Docker setup does not include the data sources. What is the easiest way to add the appropriate data sources to reproduce the issue? And once the data sources are added, is it possible to have a direct link to the bug to try to reproduce it?
You can ingest some sample data by running python -m datahub docker ingest-sample-data. If you are running DataHub locally with authentication on (the default), you'll have to generate a token and then pass it with --token: python -m datahub docker ingest-sample-data --token <token>.
To reproduce the bug, make sure you're on the V1 UI (you may have to go to settings -> appearance -> unselect "Try DataHub 2.0 (beta)"), then go to any dataset entity (e.g. http://localhost:9002/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)/Columns), go to the lineage tab, and click edit (upstream or downstream). To fully reproduce, you'll need to add data that has a platform instance, which is missing from our sample data. But if you're able to get this to display the database / schema, I think that is good enough.
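As an aside for anyone following along, the example URL above embeds a dataset URN, and it can help to see which pieces of information that identifier carries. The following is only a minimal sketch: parseDatasetUrn and DatasetUrnParts are hypothetical names made up for illustration and are not part of the DataHub codebase; the pattern is inferred solely from the sample URN in this thread.

```typescript
// Hypothetical helper for illustration only; not a DataHub API.
// It splits a dataset URN of the form seen in this thread:
//   urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
interface DatasetUrnParts {
    platform: string; // e.g. "hive"
    name: string; // e.g. "SampleHiveDataset"
    env: string; // e.g. "PROD"
}

// Assumption: the platform and env contain no commas; the name may.
const DATASET_URN_PATTERN = /^urn:li:dataset:\(urn:li:dataPlatform:([^,]+),(.+),([^,)]+)\)$/;

function parseDatasetUrn(urn: string): DatasetUrnParts | null {
    const match = DATASET_URN_PATTERN.exec(urn);
    if (!match) {
        return null; // not a dataset URN in the expected shape
    }
    return { platform: match[1], name: match[2], env: match[3] };
}

const parts = parseDatasetUrn('urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)');
console.log(parts); // { platform: 'hive', name: 'SampleHiveDataset', env: 'PROD' }
```

Note that none of these three fields is the database / schema shown in the entity header, which is why the dialogue needs extra information to tell similarly named datasets apart.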
How do I generate a token? When I go to the tokens page, it shows this below. Is there anything I should do to be able to create a token? Or do I generate it differently?
Ah, if it's disabled then you don't need a token at all; you can just run python -m datahub docker ingest-sample-data
It looks like my issue with not seeing the same screen as shown in the reproduction steps was the default UI version of 2.0. I ran the data ingestion and changed it to 1.0 and am able to see the screen. Can I know which specific dataset I can choose from the dropdown to reproduce the issue? On step 4 it says to choose a dataset that "has multiple instances in different schemas or Platform instance".
There isn't an issue with a specific dataset. Rather, the complaint here is that entities with the same platform, entity type, and entity name are indistinguishable, and the request is to display extra information: platform instance and database / schema. We don't have such entities in the seed data, but you can still work on displaying this extra information in two places: (i) on the list of related entities and (ii) on the search cards when searching to add more related entities. We should display this like we do on the entity header, i.e. in your screenshot, towards the top left, we describe the entity as: "Dataset | Hive > datahub_db > datahub_schema".
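For whoever picks this up, the header-style description could be assembled roughly as sketched below. This is a sketch under stated assumptions, not the actual implementation: RelatedEntitySummary, its fields, and buildContextLabel are hypothetical names, the real types in datahub-web-react will differ, and placing the platform instance directly after the platform name is a guess rather than confirmed behaviour.

```typescript
// Hypothetical shape for illustration; the real search result type differs.
interface RelatedEntitySummary {
    entityType: string; // e.g. "Dataset"
    platformName: string; // e.g. "Hive"
    platformInstance?: string; // absent in the seed data
    containerPath: string[]; // outermost first, e.g. ["datahub_db", "datahub_schema"]
    name: string; // e.g. "SampleHiveDataset"
}

// Builds a label in the same style as the entity header:
//   "Dataset | Hive > datahub_db > datahub_schema"
// Assumption: the platform instance, when present, goes right after the platform.
function buildContextLabel(entity: RelatedEntitySummary): string {
    const segments = [entity.platformName, entity.platformInstance, ...entity.containerPath].filter(
        (segment): segment is string => Boolean(segment),
    );
    return `${entity.entityType} | ${segments.join(' > ')}`;
}

const sample: RelatedEntitySummary = {
    entityType: 'Dataset',
    platformName: 'Hive',
    containerPath: ['datahub_db', 'datahub_schema'],
    name: 'SampleHiveDataset',
};
console.log(buildContextLabel(sample)); // Dataset | Hive > datahub_db > datahub_schema
```

Rendering this label next to the name in both the related-entities list and the search cards would make two datasets with the same platform, type, and name distinguishable as long as their schema or platform instance differs.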
I created a fix for this issue and have a PR for it. I tested it locally and it seems to work.
Here it is: https://github.com/datahub-project/datahub/pull/13856
Feel free to leave any comments.
Some of the checks are failing, including lint. This time, it does not give a specific lint error. I ran lint locally on the file I changed and I got no errors. I'm not sure how to fix the pipeline errors.
Hi @asikowitz ,
I submitted a PR a few days ago. It shows that the checks do not pass, but when I look closer, the one that is failing is the deployment task. The specific error shows this:
https://github.com/datahub-project/datahub/actions/runs/15961135460/job/45013841641?pr=13856
Run cloudflare/pages-action@1
  with:
    projectName: datahub-project-web-react
    workingDirectory: datahub-web-react
    directory: dist
    gitHubToken: ***
    wranglerVersion: 2
  env:
    JAVA_HOME: /opt/hostedtoolcache/Java_Zulu_jdk/17.0.15-6/x64
    JAVA_HOME_17_X64: /opt/hostedtoolcache/Java_Zulu_jdk/17.0.15-6/x64
    GRADLE_BUILD_ACTION_SETUP_COMPLETED: true
    GRADLE_BUILD_ACTION_CACHE_RESTORED: true
    DEVELOCITY_INJECTION_INIT_SCRIPT_NAME: gradle-actions.inject-develocity.init.gradle
    DEVELOCITY_AUTO_INJECTION_CUSTOM_VALUE: gradle-actions
    GITHUB_DEPENDENCY_GRAPH_ENABLED: false
Error: Input required and not supplied: apiToken
However, everything within my control, such as linting, unit tests, and unit test coverage, passes. I believe the deployment issue is outside of my control. It may be because I'm creating the PR from a branch on my fork.
Are you or someone from your team able to review my PR? https://github.com/datahub-project/datahub/pull/13856