
Using Amundsen's APIs as a bridge rather than interacting with Neo4j?

Open hashhar opened this issue 4 years ago • 5 comments

Amundsen's APIs (specifically the metadataservice) might be a good integration point for the backend, since they abstract away the Atlas/Neo4j/other backends and don't depend on the schema of the backing data.

If the goal is to be able to run completely offline with a local copy of the backing data, then I can understand that too; but if depending on connectivity/access to Amundsen isn't a concern, it might be worthwhile.
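For example, fetching a table's metadata through the metadataservice could be as simple as something like this (a rough sketch only; the exact endpoint path and response shape depend on the Amundsen version, so treat them as assumptions):

```python
# Rough sketch: fetch a table's metadata through Amundsen's metadataservice
# instead of querying Neo4j/Atlas directly. The endpoint path and response
# shape are assumptions and may differ between Amundsen versions.
import requests

METADATA_SERVICE_URL = "http://amundsen-metadata:5002"  # hypothetical host


def get_table_metadata(table_uri: str) -> dict:
    """Fetch metadata for one table, e.g. 'hive://gold.core/my_table'."""
    response = requests.get(f"{METADATA_SERVICE_URL}/table/{table_uri}")
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    metadata = get_table_metadata("hive://gold.core/my_table")
    print(metadata.get("columns", []))
```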

hashhar avatar Jun 04 '20 17:06 hashhar

Thanks for the suggestion, @hashhar - that's super interesting! My only hesitation with going in this direction to start was that I didn't want to heavily clog the metadata endpoint, but I'll scope this out and let you know. Are you guys on Atlas?

rsyi avatar Jun 05 '20 02:06 rsyi

Yes @rsyi, I'm using the Atlas backend. I haven't looked into the code yet, but I'm assuming the existing code essentially dumps all data from the remote Neo4j to a local Neo4j instance using databuilder.

If my assumption is correct, then you're right that this would mean either that each interaction needs to make a call to the metadata service, or that you'd need to effectively re-implement a data store and queries over the metadata API responses.

I think another possible way to handle this is to have people write their own databuilder jobs that do the same thing for Atlas as you're doing for Neo4j, but I'll need to check whether exporting and importing is possible via Atlas. Maybe the folks at ING WBAA have an idea. You could also accept PRs implementing databuilder jobs for whatever backends people want to support.

Not sure which approach makes sense though.
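If the databuilder route wins out, the wiring could look roughly like this (a sketch only: AtlasMetadataExtractor and MarkdownLoader are hypothetical placeholder names, and the config keys are assumptions):

```python
# Sketch of a databuilder-style job that pulls table metadata from Atlas and
# hands it to metaframe's loader. AtlasMetadataExtractor and MarkdownLoader
# are hypothetical names, and the config keys are assumptions.
from pyhocon import ConfigFactory

from databuilder.job.job import DefaultJob
from databuilder.task.task import DefaultTask

from my_extractors.atlas import AtlasMetadataExtractor  # hypothetical module
from metaframe.loaders import MarkdownLoader            # hypothetical module

job_config = ConfigFactory.from_dict({
    "extractor.atlas.host": "http://atlas:21000",  # assumed config keys
    "extractor.atlas.user": "admin",
    "extractor.atlas.password": "admin",
})

task = DefaultTask(extractor=AtlasMetadataExtractor(), loader=MarkdownLoader())
DefaultJob(conf=job_config, task=task).launch()
```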

hashhar avatar Jun 05 '20 02:06 hashhar

The code actually doesn't dump into a local Neo4j instance (it stores all metadata as text), but your point is otherwise right on the money! Because I'm storing the data locally and searching over it there, I can't go through the metadataservice to access the data for each table -- I have to dump it all.

I went with this architecture primarily for:

  1. Speed: hitting the search service and metadata endpoints is slower than searching over a local directory (and they're unavailable offline; see the sketch after this list)
  2. Flexibility: while I like Amundsen, I want metaframe to be able to access databases directly, in case Amundsen is living in a hard-to-access walled garden or users aren't using Amundsen at all.
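To make point 1 concrete: because everything lives in plain text files, "search" is basically just a filesystem scan, something like the sketch below (the ~/.metaframe/metadata path and .md extension are assumptions about the layout):

```python
# Sketch of the "search over a local directory" idea: a grep-like scan of the
# text dump. The ~/.metaframe/metadata layout and .md extension are assumptions.
from pathlib import Path

METADATA_DIR = Path.home() / ".metaframe" / "metadata"


def search_local_metadata(keyword: str):
    """Yield (file, line) pairs whose text contains the keyword."""
    for path in METADATA_DIR.rglob("*.md"):
        for line in path.read_text(errors="ignore").splitlines():
            if keyword.lower() in line.lower():
                yield path, line


if __name__ == "__main__":
    for path, line in search_local_metadata("customer"):
        print(f"{path}: {line.strip()}")
```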

And I'm very open to contributions! I'm currently in the process of writing a more extensive tutorial explaining how to create custom ETL jobs, and I have a rough draft here: https://docs.metaframe.sh/custom-etl. In short, any Extractor object that returns TableMetadata objects is really easy to slot in.
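For example, a bare-bones extractor could look roughly like this (a sketch assuming the amundsen-databuilder Extractor/TableMetadata interfaces; argument names and order may differ slightly between versions):

```python
# Minimal custom extractor sketch: yields databuilder TableMetadata records.
# Constructor arguments are passed positionally because their names differ
# slightly between databuilder versions.
from typing import Optional

from databuilder.extractor.base_extractor import Extractor
from databuilder.models.table_metadata import ColumnMetadata, TableMetadata


class StaticTableExtractor(Extractor):
    """Toy extractor that emits a single hard-coded table."""

    def init(self, conf) -> None:
        self._records = iter([
            TableMetadata(
                "hive",                # database
                "gold",                # cluster
                "core",                # schema
                "users",               # table name
                "Example table emitted by a custom extractor.",
                [ColumnMetadata("id", "primary key", "bigint", 0)],
            )
        ])

    def extract(self) -> Optional[TableMetadata]:
        # Return one record per call; None signals that we're done.
        return next(self._records, None)

    def get_scope(self) -> str:
        return "extractor.static_table"
```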

I took a quick look at the metadata endpoints, and they actually don't seem too bad. But if you (or anyone) wants to give this a try, I'd be happy to help out/walk you through the code. :)

rsyi avatar Jun 05 '20 04:06 rsyi

I'll be able to look at this over the next weekend. I think it's much better to write an extractor for Atlas rather than for Amundsen, since people using Atlas without Amundsen will also get the feature for free.

The initial dump into text files via the metadata service might also not be feasible even for moderately large catalogs.

hashhar avatar Jun 05 '20 11:06 hashhar

Awesome! Let me know if you need any help/clarity. You could even just DM me on the Amundsen Slack; happy to talk there as well.

rsyi avatar Jun 05 '20 16:06 rsyi