fix(gms) Write back lineage search results to in-memory cache
Checklist
- [ ] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
- [ ] Links to related issues (if applicable)
- [ ] Tests for the changes have been added/updated (if applicable)
- [ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
- [ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub
While debugging why our calls to searchAcrossLineage were taking 10s+, I noticed that the bulk of the time was being spent on the expensive graphService.getLineage() call (we use ES as our graph store). I added some logs and saw 0 cache hits: lineageResult was always null. It seems we're missing a cache write-back when we retrieve the lineage.
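To illustrate the pattern the fix restores, here is a minimal, hypothetical sketch of cache-aside reads with a write-back on miss (this is not DataHub's actual cache API; the class and method names are invented for illustration). Without the write-back line, every call falls through to the expensive graph lookup, which matches the 0-cache-hit behavior described above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LineageCacheSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private int backendCalls = 0;

    // Stand-in for the expensive graphService.getLineage() call.
    private String fetchLineageFromGraph(String urn) {
        backendCalls++;
        return "lineage-for-" + urn;
    }

    public String getLineage(String urn) {
        String result = cache.get(urn);
        if (result == null) {
            result = fetchLineageFromGraph(urn);
            cache.put(urn, result); // the missing write-back on a cache miss
        }
        return result;
    }

    public int getBackendCalls() {
        return backendCalls;
    }

    public static void main(String[] args) {
        LineageCacheSketch svc = new LineageCacheSketch();
        svc.getLineage("urn:li:dataJob:example");
        svc.getLineage("urn:li:dataJob:example");
        // Only the first call should reach the backend.
        System.out.println(svc.getBackendCalls()); // prints 1
    }
}
```

With the write-back in place, repeated calls for the same URN are served from memory, which is consistent with the latency drop shown in the timings below.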
Tested this against a GMS instance and I do see the latencies drop for repeated calls:
$ time curl --location --request POST 'http://localhost:8080/api/graphql' \
> --header 'Content-Type: application/json' \
> --data-raw '{ "query":"query getDatasetUpstreams($urn: String!) { searchAcrossLineage(input: {urn: $urn, direction: UPSTREAM, count: 1000, types:DATA_JOB}) { total searchResults { degree entity { type urn } } } }", "variables":{"urn" : "urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),airflow_functional_tests.MinutelySparkTaskQA)"}}'
{"data":{"searchAcrossLineage":{"total":3,"searchResults":[{"degree":1,"entity":{"type":"DATA_JOB","urn":"urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),ExternalTaskSensor.data_platform_minutely.data_platform_minutely.airflow_functional_tests.MinutelyPythonTaskQA)"}},{"degree":2,"entity":{"type":"DATA_JOB","urn":"urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),airflow_functional_tests.MinutelyPythonTaskQA)"}},{"degree":1,"entity":{"type":"DATA_JOB","urn":"urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),data_platform_minutely.locations.DataSourceLocations-Requiredairflow_functional_tests.MinutelySparkTaskQADataSource_airflow_functional_tests.MinutelyPythonTaskQA-airflow_functional_tests.MinutelySparkTaskQA)"}}]}}}
real 0m5.270s
user 0m0.009s
sys 0m0.008s
$ time curl --location --request POST 'http://localhost:8080/api/graphql' --header 'Content-Type: application/json' --data-raw '{ "query":"query getDatasetUpstreams($urn: String!) { searchAcrossLineage(input: {urn: $urn, direction: UPSTREAM, count: 1000, types:DATA_JOB}) { total searchResults { degree entity { type urn } } } }", "variables":{"urn" : "urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),airflow_functional_tests.MinutelySparkTaskQA)"}}'
{"data":{"searchAcrossLineage":{"total":3,"searchResults":[{"degree":1,"entity":{"type":"DATA_JOB","urn":"urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),ExternalTaskSensor.data_platform_minutely.data_platform_minutely.airflow_functional_tests.MinutelyPythonTaskQA)"}},{"degree":2,"entity":{"type":"DATA_JOB","urn":"urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),airflow_functional_tests.MinutelyPythonTaskQA)"}},{"degree":1,"entity":{"type":"DATA_JOB","urn":"urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),data_platform_minutely.locations.DataSourceLocations-Requiredairflow_functional_tests.MinutelySparkTaskQADataSource_airflow_functional_tests.MinutelyPythonTaskQA-airflow_functional_tests.MinutelySparkTaskQA)"}}]}}}
real 0m0.101s
user 0m0.005s
sys 0m0.005s
Prior to this every call for the same URN took a few seconds.
Unit Test Results (build & test)
571 tests ±0 · 571 passed :heavy_check_mark: ±0 · 0 skipped :zzz: ±0 · 0 failed :x: ±0 · 141 suites ±0 · 141 files ±0 · 14m 27s :stopwatch: -1s
Results for commit 2a8dd022. ± Comparison against base commit 5303ecbd.
:recycle: This comment has been updated with latest results.
@piyushn-stripe Can you enable pushes to your branch? We would like to add a feature flag around this so that users aren't forced into caching where a delay may be seen on updates. I have the changes written up, but am getting an error when attempting to push. There should be a checkbox to allow maintainers to push updates.
@RyanHolstien - trying to find where this setting is, don't seem to be seeing it on my fork settings. I have checked the setting while creating my PR though:
If checked, users with write access to datahub-project/datahub can add new commits to your piyushn/fix_lineage_search_caching branch. You can always change this setting later.
Feel free to just open a PR if that's easier. It's a 1-line change, so you could create one from scratch and I can close this out.
Closing, as we raised another PR #6006 and merged the change there.