
Showing files as datasets in Datahub

Open Shaikh-Zainab opened this issue 2 years ago • 3 comments

Description: This is an issue with how file sources, or datasets in general, are rendered by DH. It looks like the containerization of sources is handled in the DH UI, based on separators like '.' and '/'. So if FolderA contains the files Sample1.txt and Sample2.txt, DH shows them as FolderA>Sample1>txt and FolderA>Sample2>txt. In other words, FolderA appears to have two sub-folders, Sample1 and Sample2, each containing a file named txt, with the extension shown as the file name. This is a problem when I ingest metadata and build lineage where a source may be a file, because the file is not represented correctly in DH. We are trying to build lineage in which an upstream or downstream could be a file residing on a Unix or Windows server. Also, with the current design of DataHub, it seems only one file with a given name can exist in a filesystem. In reality, multiple files with the same name can reside in different paths within a filesystem.
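To make the rendering problem concrete, here is a minimal sketch of the two ways a name could be split into a hierarchy. It only illustrates the behaviour described above and is not DataHub's actual UI code; the function names `naive_hierarchy` and `path_hierarchy` are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import List


def naive_hierarchy(name: str) -> List[str]:
    """Split on every '.' or '/', which is the behaviour the UI appears to have (assumption)."""
    return [part for part in re.split(r"[./]", name) if part]


def path_hierarchy(path: str) -> List[str]:
    """Split on '/' only, so the file name keeps its extension."""
    return list(PurePosixPath(path).parts)


# Separator-based splitting turns the extension into its own level.
print(naive_hierarchy("FolderA/Sample1.txt"))
# Path-aware splitting keeps 'Sample1.txt' as a single leaf.
print(path_hierarchy("FolderA/Sample1.txt"))
```

With the first approach the extension becomes the leaf name, which matches the FolderA>Sample1>txt rendering; the second keeps the file name intact.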

Shaikh-Zainab avatar Jul 29 '22 09:07 Shaikh-Zainab

@Shaikh-Zainab If you emit a custom browse path aspect from your ingestion source code, you should be able to control the hierarchy so that the UI does not mint this on your behalf.

This doc should help you understand the browse paths aspect: https://datahubproject.io/docs/metadata-modeling/metadata-model/#browsepaths-aspect.

For example, to attach a custom browsepath to a 'Dataset' entity, you could write the following Python:

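        # Imports needed at module level:
        #   from datahub.emitter.mcp import MetadataChangeProposalWrapper
        #   from datahub.metadata.schema_classes import BrowsePathsClass

        # Browse path under which the entity should appear in the UI.
        browse_path = BrowsePathsClass(
            paths=["/powerbi/{}".format(self.__config.workspace_id)]
        )
        # Attach the browse path to the dataset as its "browsePaths" aspect.
        return MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType="UPSERT",
            entityUrn="urn:li:dataset:(urn:li:dataPlatform:custom,MyFileName,PROD)",
            aspectName="browsePaths",
            aspect=browse_path,
        )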
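For files, one possible way to derive both the URN and the browse path from the full file path is sketched below. This is only an illustration under assumptions: using the full path as the dataset name (so that same-named files in different folders get distinct URNs) is a convention chosen here, not something DataHub prescribes, and the helpers `dataset_urn` and `folder_browse_path` are hypothetical.

```python
from pathlib import PurePosixPath


def dataset_urn(platform: str, file_path: str, env: str = "PROD") -> str:
    """Build a dataset URN whose name is the full file path (assumed convention)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{file_path},{env})"


def folder_browse_path(platform: str, file_path: str) -> str:
    """Return a browse path made of the platform plus the file's parent folders."""
    folders = [p for p in PurePosixPath(file_path).parent.parts if p != "/"]
    return "/" + "/".join([platform, *folders])


# Two files with the same name in different folders get different URNs.
for path in ("FolderA/Sample.txt", "FolderA/FolderD/Sample.txt"):
    print(dataset_urn("custom", path), folder_browse_path("custom", path))
```

The returned browse path string could then be passed in the `paths` list of the browse paths aspect, and the URN used as the `entityUrn`.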

jjoyce0510 avatar Aug 01 '22 22:08 jjoyce0510

Hi @jjoyce0510, thanks for sharing the code example above. I tried creating a custom browse path using it, and although it solves the issue of '.' in file names, it has given rise to a new issue: it takes away the containerization of the path. For example, take this file system: FolderA -> FolderD -> Sample.txt

Earlier, when I just emitted a normal lineage MCP, it showed FolderA as a container holding FolderD as another container, with Sample as a dataset. Now it just shows FolderA>FolderD>Sample.txt in the browse paths (I emitted the browse path using an MCP), but when I click on the actual containers they are empty.

Shaikh-Zainab avatar Aug 04 '22 18:08 Shaikh-Zainab

In addition to the above, there is a bug in the lineage UI: a file called sample.txt is shown with only txt as its name. Code: lineage node

Issue screenshot: [image]

Shaikh-Zainab avatar Aug 09 '22 15:08 Shaikh-Zainab

Missing child containers issue, screenshots: [image] [image]

I also tried removing spaces from the strings in the browse paths. The child containers are still not being built. [image]

Shaikh-Zainab avatar Aug 10 '22 10:08 Shaikh-Zainab

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Sep 10 '22 02:09 github-actions[bot]

This issue was closed because it has been inactive for 30 days since being marked as stale.

github-actions[bot] avatar Oct 10 '22 02:10 github-actions[bot]