
fix(gms) Write back lineage search results to in-memory cache

piyushn-stripe opened this pull request 2 years ago • 3 comments

Checklist

  • [ ] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • [ ] Links to related issues (if applicable)
  • [ ] Tests for the changes have been added/updated (if applicable)
  • [ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • [ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

While debugging why our calls to searchAcrossLineage were taking 10s+, I noticed that the bulk of the time was spent in the expensive graphService.getLineage() call (we use ES as our graph store). I added some logs and saw 0 cache hits: lineageResult was always null. It seems we're missing a cache write-back when we retrieve the lineage.
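To illustrate the missing write-back, here is a minimal, hypothetical sketch of the read-through cache pattern the fix restores. The names (LineageCacheSketch, getLineageFromGraph, the String-valued cache) are illustrative stand-ins, not DataHub's actual types; the point is the single `cache.put(...)` after a miss, without which every call falls through to the graph store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class LineageCacheSketch {
    // Illustrative in-memory cache keyed by URN; DataHub's real cache
    // keys on more than the URN (direction, depth, etc.).
    static final Map<String, String> cache = new ConcurrentHashMap<>();

    // Stand-in for the expensive ES-backed graph query.
    static String getLineageFromGraph(String urn) {
        return "lineage-for-" + urn;
    }

    static String getLineage(String urn) {
        String cached = cache.get(urn);
        if (cached != null) {
            return cached; // cache hit: skip the expensive graph call
        }
        String result = getLineageFromGraph(urn);
        cache.put(urn, result); // the missing write-back: populate the cache on a miss
        return result;
    }

    public static void main(String[] args) {
        // Second call for the same URN is served from memory.
        String first = LineageCacheSketch.getLineage("urn:li:dataJob:example");
        String second = LineageCacheSketch.getLineage("urn:li:dataJob:example");
        System.out.println(first.equals(second) && cache.size() == 1);
    }
}
```

With the write-back in place, only the first call for a URN pays the graph-store latency, which matches the timings below.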

Tested this against a GMS instance and I do see the latencies drop for repeated calls:

$ time curl --location --request POST 'http://localhost:8080/api/graphql' \
> --header 'Content-Type: application/json' \
> --data-raw '{ "query":"query getDatasetUpstreams($urn: String!) { searchAcrossLineage(input: {urn: $urn, direction: UPSTREAM, count: 1000, types:DATA_JOB}) { total searchResults { degree entity { type urn } } } }", "variables":{"urn" : "urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),airflow_functional_tests.MinutelySparkTaskQA)"}}'
{"data":{"searchAcrossLineage":{"total":3,"searchResults":[{"degree":1,"entity":{"type":"DATA_JOB","urn":"urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),ExternalTaskSensor.data_platform_minutely.data_platform_minutely.airflow_functional_tests.MinutelyPythonTaskQA)"}},{"degree":2,"entity":{"type":"DATA_JOB","urn":"urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),airflow_functional_tests.MinutelyPythonTaskQA)"}},{"degree":1,"entity":{"type":"DATA_JOB","urn":"urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),data_platform_minutely.locations.DataSourceLocations-Requiredairflow_functional_tests.MinutelySparkTaskQADataSource_airflow_functional_tests.MinutelyPythonTaskQA-airflow_functional_tests.MinutelySparkTaskQA)"}}]}}}
real	0m5.270s
user	0m0.009s
sys	0m0.008s

$ time curl --location --request POST 'http://localhost:8080/api/graphql' --header 'Content-Type: application/json' --data-raw '{ "query":"query getDatasetUpstreams($urn: String!) { searchAcrossLineage(input: {urn: $urn, direction: UPSTREAM, count: 1000, types:DATA_JOB}) { total searchResults { degree entity { type urn } } } }", "variables":{"urn" : "urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),airflow_functional_tests.MinutelySparkTaskQA)"}}'
{"data":{"searchAcrossLineage":{"total":3,"searchResults":[{"degree":1,"entity":{"type":"DATA_JOB","urn":"urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),ExternalTaskSensor.data_platform_minutely.data_platform_minutely.airflow_functional_tests.MinutelyPythonTaskQA)"}},{"degree":2,"entity":{"type":"DATA_JOB","urn":"urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),airflow_functional_tests.MinutelyPythonTaskQA)"}},{"degree":1,"entity":{"type":"DATA_JOB","urn":"urn:li:dataJob:(urn:li:dataFlow:(airflow,data_platform_minutely,QA),data_platform_minutely.locations.DataSourceLocations-Requiredairflow_functional_tests.MinutelySparkTaskQADataSource_airflow_functional_tests.MinutelyPythonTaskQA-airflow_functional_tests.MinutelySparkTaskQA)"}}]}}}
real	0m0.101s
user	0m0.005s
sys	0m0.005s

Prior to this every call for the same URN took a few seconds.

piyushn-stripe avatar Sep 09 '22 17:09 piyushn-stripe

Unit Test Results (build & test)

571 tests ±0   571 :heavy_check_mark: ±0   0 :zzz: ±0   0 :x: ±0
141 suites ±0   141 files ±0
14m 27s :stopwatch: −1s

Results for commit 2a8dd022. ± Comparison against base commit 5303ecbd.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Sep 09 '22 18:09 github-actions[bot]

@piyushn-stripe Can you enable pushes to your branch? We would like to add a feature flag around this so that users aren't forced into caching, where updates may be seen with a delay. I have the changes written up, but I'm getting an error when attempting to push. There should be a checkbox to allow maintainers to push updates.
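A minimal sketch of the feature flag being described, assuming hypothetical names (FlaggedLineageCache, queryGraphStore) rather than DataHub's actual configuration classes: both the cache read and the write-back are guarded by the same boolean, so operators who prefer freshness over latency can turn caching off entirely.

```java
import java.util.HashMap;
import java.util.Map;

class FlaggedLineageCache {
    private final boolean cacheEnabled; // e.g. bound from service config
    private final Map<String, String> cache = new HashMap<>();

    FlaggedLineageCache(boolean cacheEnabled) {
        this.cacheEnabled = cacheEnabled;
    }

    String getLineage(String urn) {
        if (cacheEnabled) {
            String hit = cache.get(urn);
            if (hit != null) {
                return hit; // flag on: serve repeated calls from memory
            }
        }
        String result = queryGraphStore(urn);
        if (cacheEnabled) {
            cache.put(urn, result); // write back only when the flag is on
        }
        return result;
    }

    int cachedEntries() {
        return cache.size();
    }

    // Stand-in for the expensive ES-backed graph query.
    private String queryGraphStore(String urn) {
        return "lineage-for-" + urn;
    }

    public static void main(String[] args) {
        FlaggedLineageCache on = new FlaggedLineageCache(true);
        on.getLineage("urn:li:dataJob:a");
        FlaggedLineageCache off = new FlaggedLineageCache(false);
        off.getLineage("urn:li:dataJob:a");
        System.out.println(on.cachedEntries() + " " + off.cachedEntries());
    }
}
```

With the flag off, the cache map stays empty and every call reflects the latest graph state, which is the opt-out behavior the comment asks for.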

RyanHolstien avatar Sep 21 '22 00:09 RyanHolstien

@RyanHolstien - I'm trying to find where this setting is, but I don't see it in my fork settings. I did check this option while creating my PR, though:

If checked, users with write access to datahub-project/datahub can add new commits to your piyushn/fix_lineage_search_caching branch. You can always change this setting later.

Feel free to just open a PR if that's easier. It's a 1-line change, so you could recreate it from scratch and I can close this out.

piyushn-stripe avatar Sep 21 '22 01:09 piyushn-stripe

Closing, as we raised another PR #6006 and merged this in.

shirshanka avatar Sep 26 '22 06:09 shirshanka