
More data fetcher providers wanted

Open · asafc opened this issue 3 years ago · 3 comments

What are fetch providers?

FetchProviders are the components OPAL uses to fetch data from sources on demand.

Fetch providers are designed to be extensible, and you can easily create more fetch providers to enable OPAL to fetch data from your own unique sources (e.g. a SaaS service, a new DB, your own proprietary solution, ...)

We have a guide on how to write new fetch providers

We have a fully-working example fetch provider for Postgres

Help wanted

More data fetch providers are planned, but we would love the community to take an active part by helping develop them.

Ideas for missing providers: DBs of any kind (e.g. Postgres, MongoDB), popular 3rd-party SaaS providers (e.g. Stripe, Salesforce)

Reach out to us, we are happy to guide you in the process.

asafc · Apr 14 '21

While writing my own fetch providers, I'm having trouble creating anything that is agnostic of the actual data schema.

The problem is that this data has to be transformed into something more useful and performant for OPA: the number of data updates is minuscule compared to the number of queries, so it's important that the data in OPA is as optimized as possible for the policies. A prime example is keyed access into objects instead of arrays.
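To make the keyed-access point concrete, here is a minimal sketch (with made-up field names) of re-shaping an array of records into an object keyed by id, so a policy can do a direct lookup like `data.users["u1"]` instead of iterating the array:

```python
# Sketch: re-key an array of records so OPA policies get O(1)
# lookups (data.users[user_id]) instead of scanning an array.
# The "id" field name is illustrative -- adapt to your schema.

def key_by(records, key="id"):
    """Turn [{"id": "u1", ...}, ...] into {"u1": {...}, ...}."""
    return {record[key]: record for record in records}

users = [
    {"id": "u1", "role": "admin"},
    {"id": "u2", "role": "viewer"},
]
users_by_id = key_by(users)
# users_by_id["u1"]["role"] == "admin"
```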

The best I came up with was to use different JSON transformation libraries which execute based on some dynamic ruleset, instead of hardcoding the transformations. But I'm concerned this could have a significant impact on performance for larger sets of data -- yet to verify.

Any ideas?

jake1098 · Oct 19 '21

Hey @jyoussefzadeh, that's a valid concern!

In general:

  • Communication with OPA is read-heavy, so you are right that you'd want to make sure the data is optimized for reads.
  • Since OPAL is asyncio-based (it runs on a reactor model), you are right to be concerned that a CPU-bound task can jam the reactor (delay queued tasks).
  • Regarding specific libraries: I've heard good things about pyjq for transformation and about simdjson for performance, but I haven't used them myself.
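The reactor-jamming concern above has a standard mitigation: offload the CPU-bound transform to an executor via `run_in_executor`, so the event loop stays free. This is a generic asyncio sketch, not an OPAL API; `heavy_transform` is a stand-in for your real transformation:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def heavy_transform(raw):
    # Stand-in for an expensive JSON transformation.
    return {item["id"]: item for item in raw}

async def transform_without_blocking(raw):
    # Offload blocking work so the event loop keeps serving other
    # queued tasks. For truly CPU-bound pure-Python work, a
    # ProcessPoolExecutor sidesteps the GIL as well (but needs an
    # `if __name__ == "__main__"` guard on some platforms).
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as pool:
        return await loop.run_in_executor(pool, heavy_transform, raw)

result = asyncio.run(transform_without_blocking([{"id": "u1", "role": "admin"}]))
# result == {"u1": {"id": "u1", "role": "admin"}}
```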

OPAL has two types of data updates:

  • Initial/complete data sources (OPAL_DATA_CONFIG_SOURCES), fetched by OPAL clients that start with an empty cache, or after network disconnects.
  • Dynamic (delta) updates sent via the publish-updates API.
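For reference, the initial data-sources config is a JSON document along these lines (shape paraphrased from the OPAL docs; field names can vary between versions, and the URL, topic, and destination path here are placeholders):

```json
{
  "config": {
    "entries": [
      {
        "url": "https://example.com/acl",
        "topics": ["policy_data"],
        "dst_path": "/acl"
      }
    ]
  }
}
```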

Performance concerns:

  • For dynamic delta updates I would not be concerned about data size: usually not a lot changes in a single update, so your JSON transform should finish very quickly.
  • For initial/complete data sources (OPAL_DATA_CONFIG_SOURCES) I would be a bit more concerned about the volume of data. You might have tons of users (good for you :)) and they might have a lot of authorization-supporting data.

I think you have a trade-off and maybe a valid hybrid approach:

  • Use a custom fetcher with CPU-heavy JSON transformations for delta updates.
  • For the initial data sources, use the built-in HTTP fetcher and create your own API proxy service that caches and transforms the raw data from the actual sources. That way the OPAL client will not block on huge data transforms.
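The proxy-service idea above can be as small as a cached transform sitting in front of the raw source. Here's a framework-agnostic sketch (the TTL cache and the transform are illustrative, not an OPAL API); you would serve `cache.get()` from whatever HTTP endpoint the built-in fetcher points at:

```python
import time

class CachedTransform:
    """Cache the transformed payload so repeated OPAL client
    (re)connects don't redo the heavy JSON transform each time."""

    def __init__(self, fetch_raw, transform, ttl_seconds=60):
        self._fetch_raw = fetch_raw   # callable returning raw source data
        self._transform = transform   # CPU-heavy reshaping for OPA
        self._ttl = ttl_seconds
        self._cached = None
        self._expires_at = 0.0

    def get(self):
        now = time.monotonic()
        if self._cached is None or now >= self._expires_at:
            self._cached = self._transform(self._fetch_raw())
            self._expires_at = now + self._ttl
        return self._cached

# Example wiring: any HTTP handler can return cache.get() to
# OPAL's built-in http fetcher.
cache = CachedTransform(
    fetch_raw=lambda: [{"id": "u1", "role": "admin"}],
    transform=lambda rows: {row["id"]: row for row in rows},
)
snapshot = cache.get()
```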

asafc · Oct 19 '21

Adding on to @asafc's great answer: I think it makes a lot of sense, and it is definitely okay to have proprietary FetchProviders that aren't generic.

One of the main reasons we chose Python here is to make writing extensions like FetchProviders super easy; so it will be applicable even for niche cases.

That said, I can suggest two ways to help you make your FetchProviders more generic:

  1. Use the FetcherConfig (i.e. event.config) to pass the proprietary schema you need as part of the event. You can pass any JSON data you want in the config field of the FetchEvent, and use it to give your generic code instructions for handling the data format - e.g. a schema or template.

  2. Use inheritance. Polymorphism in Python is really easy and stable, so you can create a generic fetch provider and then derive from it to create a proprietary one. For example, you can have a generic RedisFetchProvider that enables you to read data from any Redis DB, and then a specific MySpecialRedisFetchProvider that uses the methods provided by the parent class to read specific predefined values from Redis and format them in a specific way.

Here's an example combining both methods: it inherits from the built-in HttpFetchProvider and extends it to also receive the end format for OPA as a string template, passed as part of the event config. (Note: I didn't test this code.)

# Imports below assume OPAL's built-in HTTP fetch provider module;
# exact import paths may differ between OPAL versions.
from aiohttp import ClientResponse
from opal_common.fetcher.events import FetchEvent
from opal_common.fetcher.providers.http_fetch_provider import (
    HttpFetcherConfig,
    HttpFetchEvent,
    HttpFetchProvider,
)

class HttpTemplateConfig(HttpFetcherConfig):
    """
    Config for HttpTemplateFetchProvider-s, adding a string template.
    """
    template: str = None

class HttpTemplateFetchEvent(FetchEvent):
    # names the provider class that should handle this event
    fetcher: str = "HttpTemplateFetchProvider"
    config: HttpTemplateConfig = None

class HttpTemplateFetchProvider(HttpFetchProvider):

    def __init__(self, event: HttpTemplateFetchEvent) -> None:
        # annotation only: narrows the event type for type-checkers
        self._event: HttpTemplateFetchEvent
        if event.config is None:
            event.config = HttpTemplateConfig()
        super().__init__(event)

    async def _process_(self, res: ClientResponse):
        # get the data from the HTTP response (the parent's _process_
        # is async, so it must be awaited)
        data = await super()._process_(res)
        # render the template (note: the result should be a valid
        # JSON object after the rendering)
        return self._event.config.template.format(**data)

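To illustrate the `template.format(**data)` step at the end with a made-up template: literal JSON braces must be doubled, since `str.format` treats single braces as replacement fields.

```python
import json

# Hypothetical template wrapping fetched fields into the document
# shape a policy expects. {{ and }} render as literal { and }.
template = '{{"user": "{name}", "role": "{role}"}}'
data = {"name": "alice", "role": "admin"}

rendered = template.format(**data)
doc = json.loads(rendered)  # must parse as valid JSON after rendering
```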
orweis · Oct 19 '21