Discuss: Support a LlamaIndex connector for MeltanoHub, which has 500+ open source Singer data connectors

Open aaronsteers opened this issue 2 years ago • 1 comments

A generic interface into hub.meltano.com would be great. In that paradigm, the source connectors are called "extractors" or "taps".

There are a few different ways we could create generic connection interfaces, which I can highlight below...

Generally though, each connector would need:

  1. The connector {variant}/{name} string combo, and/or pip_url of the connector.
  2. The config info for the connector, which is generally passed as a JSON file, but which could be defined by users in a Python dictionary object or an array of key-value pairs.
  3. Optionally: the stream and property selection rules, either as a glob of inclusion/selection rules, or as a Singer "catalog" JSON artifact. Perhaps not needed in a V1, but these could let users pick and choose which datasets and/or properties they are interested in.
    • A simple "V1" MVP might ask for a single stream name, or an array of stream names.
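The three inputs listed above could be bundled into one small object. The following is only a sketch of what that might look like: `SingerTapSpec`, its field names, and the `hub_id` helper are hypothetical and do not correspond to any existing LlamaIndex or Meltano API, and the stream name is illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SingerTapSpec:
    """Hypothetical container for the connector inputs (not an existing API)."""

    name: str                        # connector name, e.g. "tap-asana"
    variant: Optional[str] = None    # Hub variant, e.g. "singer"
    pip_url: Optional[str] = None    # what to pass to `pip install`
    config: dict[str, Any] = field(default_factory=dict)
    streams: list[str] = field(default_factory=list)  # V1: plain stream names

    def __post_init__(self) -> None:
        # The connector must be identifiable by variant/name and/or pip_url.
        if not self.variant and not self.pip_url:
            raise ValueError("provide a variant, a pip_url, or both")

    @property
    def hub_id(self) -> Optional[str]:
        """The "{variant}/{name}" combo, if a variant was given."""
        return f"{self.variant}/{self.name}" if self.variant else None


spec = SingerTapSpec(
    name="tap-asana",
    variant="singer",
    pip_url="tap-asana",
    streams=["projects"],  # illustrative stream name
)
```

Keeping stream selection as a plain list of names matches the simple "V1" idea; glob rules or a full Singer catalog could be added later as extra optional fields.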

An example:

tap-asana - Meltano Hub

Connector info:

  • name: tap-asana, variant: singer
  • metadata: https://github.com/meltano/hub/blob/main/_data/meltano/extractors/tap-asana/singer-io.yml or (raw)
  • pip_url: tap-asana

Sample config:

import os

# Read credentials from the environment rather than hard-coding them.
asana_config = {
    "client_id": os.environ.get("TAP_ASANA_CLIENT_ID"),
    "client_secret": os.environ.get("TAP_ASANA_CLIENT_SECRET"),
    "refresh_token": os.environ.get("TAP_ASANA_REFRESH_TOKEN"),
}
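Since taps generally expect their config as a JSON file, a wrapper would need to serialize the dictionary before invoking the tap. A rough sketch, assuming the tap is already installed and on the PATH and accepts the standard Singer `--config` flag; `build_tap_command` and `run_tap` are hypothetical helper names:

```python
import json
import subprocess
import tempfile
from pathlib import Path


def build_tap_command(executable: str, config: dict, workdir: Path) -> list[str]:
    """Write the config dict to a JSON file and return the command to run."""
    config_path = workdir / "config.json"
    config_path.write_text(json.dumps(config))
    return [executable, "--config", str(config_path)]


def run_tap(executable: str, config: dict) -> list[str]:
    """Run the tap and return its stdout as a list of lines (hypothetical helper)."""
    # The temporary directory, including the config file holding the
    # credentials, is deleted once the tap has finished.
    with tempfile.TemporaryDirectory() as tmp:
        command = build_tap_command(executable, config, Path(tmp))
        result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

A call such as `run_tap("tap-asana", asana_config)` would then return the raw Singer output. A real implementation would probably stream stdout rather than buffer it, since some taps emit large volumes of data.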

Processing Singer output

Singer outputs data as a series of JSON lines, generally one record per line, which should be easy for libraries to parse generically.
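As a rough illustration of how little parsing is needed: in the Singer format each line is a JSON message with a `type` of `RECORD`, `SCHEMA`, or `STATE`, and `RECORD` messages carry the stream name and the row itself. The function name `iter_singer_records` and the sample data below are made up for this sketch.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_singer_records(
    lines: Iterable[str], streams: Optional[set[str]] = None
) -> Iterator[tuple[str, dict]]:
    """Yield (stream, record) pairs from Singer output, skipping other messages."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message.get("type") != "RECORD":
            continue  # SCHEMA and STATE messages are ignored in this sketch
        stream = message["stream"]
        if streams is None or stream in streams:
            yield stream, message["record"]


# Made-up sample output, not captured from a real tap.
sample_output = [
    '{"type": "SCHEMA", "stream": "projects", "schema": {"properties": {}}, "key_properties": []}',
    '{"type": "RECORD", "stream": "projects", "record": {"name": "Launch"}}',
    '{"type": "RECORD", "stream": "tasks", "record": {"name": "Write docs"}}',
    '{"type": "STATE", "value": {}}',
]
records = list(iter_singer_records(sample_output, streams={"projects"}))
```

Each yielded record is a plain dictionary, so turning it into a document for an LLM application is a matter of choosing which fields to render as text.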

List of connectors:

https://hub.meltano.com/extractors

This isn't a full list, since many connectors are being created that aren't yet on the Hub, but it gives a good idea of the existing depth and breadth of the ecosystem.

How to list on LlamaHub

To avoid spamming the index, we could list this as a single item on LlamaHub: either as "MeltanoHub Singer Taps", or "Singer Extractors" generically, or similar.

Thinking about the "right" abstraction layer

I think this could be really powerful, since it could plug LlamaIndex, LangChain, and other GPT-like applications into a broad ecosystem of already existing connectors.

Since the vast majority of Singer connectors are already pip installable, this should fit well with the existing paradigms that LlamaIndex is using.

I may have some cycles to contribute to this integration, but I first wanted to log this issue here to assess the level of interest and to discuss any potential pitfalls or "gotchas" that others might see.

aaronsteers, May 08 '23 20:05

We'd be eager to chat about how we could help with this. The wider Singer community has put a lot of work into extractors, and if we could make it easier to use them for LLM applications, that'd be a huge win for everyone!

tayloramurphy, May 11 '23 21:05