datahub
datahub copied to clipboard
feat(entity-registry): Enforce Field Path V2 uniqueness check during ingestion
Adds check in GMS when processing schemaMetadata aspects that all field paths are unique.
If that is not the case, reject the MCP.
The reason for this validator is that the UI will not be able to render schemas correctly if field path ids are not unique.
Checklist
- [x] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
- [x] Links to related issues (if applicable)
- [x] Tests for the changes have been added/updated (if applicable)
- [ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
- [ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub
Failing test seems unrelated:
Running: domains/nested_domains.js (8 of 30)
Estimated: 3 minutes, 5 seconds
Verify nested domains test functionalities
=== 2024-05-09T23:20:04.443Z start: Verify Create a new domain
=== 2024-05-09T23:20:12.587Z end: Verify Create a new domain
✓ Verify Create a new domain (8115ms)
=== 2024-05-09T23:20:12.648Z start: Verify Documentation tab by adding editing Description and adding link
=== 2024-05-09T23:20:27.524Z end: Verify Documentation tab by adding editing Description and adding link
✓ Verify Documentation tab by adding editing Description and adding link (14694ms)
=== 2024-05-09T23:20:27.662Z start: Verify Right side panel functionalities
=== 2024-05-09T23:20:44.947Z end: Verify Right side panel functionalities
✓ Verify Right side panel functionalities (17245ms)
=== 2024-05-09T23:20:45.076Z start: Verify Move domain parent level to root level
xxx 2024-05-09T23:20:59.[560](https://github.com/datahub-project/datahub/actions/runs/9024237858/job/24798972167?pr=10477#step:31:561)Z error
(Attempt 1 of 3) Verify Move domain parent level to root level
=== 2024-05-09T23:20:59.881Z end: Verify Move domain parent level to root level
=== 2024-05-09T23:21:00.041Z start: Verify Move domain parent level to root level
xxx 2024-05-09T23:21:14.531Z error
(Attempt 2 of 3) Verify Move domain parent level to root level
=== 2024-05-09T23:21:14.740Z end: Verify Move domain parent level to root level
=== 2024-05-09T23:21:15.022Z start: Verify Move domain parent level to root level
xxx 2024-05-09T23:21:30.316Z error
1) Verify Move domain parent level to root level
=== 2024-05-09T23:21:30.520Z end: Verify Move domain parent level to root level
=== 2024-05-09T23:21:30.716Z start: Verify Documentation tab by adding editing Description and adding link
=== 2024-05-09T23:21:47.372Z end: Verify Documentation tab by adding editing Description and adding link
✓ Verify Documentation tab by adding editing Description and adding link (16585ms)
=== 2024-05-09T23:21:47.494Z start: Verify Right side panel functionalities
=== 2024-05-09T23:22:07.371Z end: Verify Right side panel functionalities
✓ Verify Right side panel functionalities (19766ms)
=== 2024-05-09T23:22:07.529Z start: Verify Edit Domain Name
=== 2024-05-09T23:22:13.934Z end: Verify Edit Domain Name
✓ Verify Edit Domain Name (6294ms)
=== 2024-05-09T23:22:14.061Z start: Verify Remove the domain
=== 2024-05-09T23:22:22.513Z end: Verify Remove the domain
✓ Verify Remove the domain (8350ms)
=== 2024-05-09T23:22:22.815Z start: Verify Add and delete sub domain
=== 2024-05-09T23:22:38.122Z end: Verify Add and delete sub domain
✓ Verify Add and delete sub domain (15250ms)
=== 2024-05-09T23:22:38.274Z start: Verify entities tab with adding and deleting assets and performing some actions
=== 2024-05-09T23:23:05.593Z end: Verify entities tab with adding and deleting assets and performing some actions
✓ Verify entities tab with adding and deleting assets and performing some actions (27197ms)
9 passing (3m)
1 failing
1) Verify nested domains test functionalities
Verify Move domain parent level to root level:
AssertionError: 2024-05-09T23:21:30.313Z
Timed out retrying after 10000ms: Expected to find element: `prev`, but never found it. Queried from:
> cy.get([data-testid="domain-list-item"]).contains(Marketing)
AssertionError: Timed out retrying after 10000ms: Expected to find element: `prev`, but never found it. Queried from:
> cy.get([data-testid="domain-list-item"]).contains(Marketing)
at moveDomaintoParent (webpack:///./cypress/e2e/domains/nested_domains.js:42:5)
at Context.eval (webpack:///./cypress/e2e/domains/nested_domains.js:168:4)
```
Failing CI seems unrelated.
@david-leifker Let's get this one followed up on pls. Cheers!
There was some discussion around whether it was better to throw and exception here or to drop the duplicate field. I think the conclusion was to drop the duplicate field instead. @pedro93 - please let me know if I've mis-remember the conversation around this validator or not. If not, I'd close this PR.
There was some discussion around whether it was better to throw and exception here or to drop the duplicate field. I think the conclusion was to drop the duplicate field instead. @pedro93 - please let me know if I've mis-remember the conversation around this validator or not. If not, I'd close this PR.
I vaguely recall us wanting to throw an exception in the future after our connectors have been fixed. Silently dropping the duplicate fields was meant as a stop gap measure. @hsheth2 please confirm.
Based on our telemetry, ~0.1% of ingestion runs produce duplicate field paths
Among the ingestions that needed to drop at least one field, it's a fairly small fraction of assets are actually impacted. The looker/glue ones do merit looking into a bit more
Given that this is a fairly small problem for ingestion and that ingestion has protections in place now, we're clear to include the server-side validation in the 0.13.4 release.
Replaced by https://github.com/datahub-project/datahub/pull/11619