[Components] google_cloud_document_ai

Open pipedream-component-development opened this issue 6 months ago • 1 comments

google_cloud_document_ai

URLs

  • https://cloud.google.com/document-ai/docs/reference/rest

Webhook Sources

new-processed-document-instant

Description

Emit new event when a document has been fully processed by a processor. Requires the processor ID and webhook configuration on the Google Cloud Document AI processor endpoint.

new-processor-instant

Description

Emit new event when a new document processor is created in the user's Google Cloud project. Requires project ID and service account credentials with appropriate permissions.

new-processor-version-instant

Description

Emit new event when a new version of a processor is created. Useful for tracking changes in document parsing configurations. Requires processor ID.

Actions

process-document

Description

Submit a document for processing using a specified processor. Requires processor ID and document content or Cloud Storage URI.
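
As a rough sketch of what this action would send, the request for the REST `processors.process` method could be assembled as below. The helper name `buildProcessRequest` and its parameter names are hypothetical, and the regional endpoint pattern `{location}-documentai.googleapis.com` is an assumption to verify against the REST reference linked above. The sketch only builds the request; it does not call the API or handle authentication.

```javascript
// Hypothetical helper: builds (but does not send) a Document AI process request.
// Exactly one of `content` (raw bytes or string) or `gcsUri` must be provided.
function buildProcessRequest({ projectId, location, processorId, content, gcsUri, mimeType }) {
  if (!content === !gcsUri) {
    throw new Error("Provide exactly one of content or gcsUri");
  }
  const name = `projects/${projectId}/locations/${location}/processors/${processorId}`;
  // Assumption: regional endpoint of the form <location>-documentai.googleapis.com
  const url = `https://${location}-documentai.googleapis.com/v1/${name}:process`;
  const body = content
    ? { rawDocument: { content: Buffer.from(content).toString("base64"), mimeType } }
    : { gcsDocument: { gcsUri, mimeType } };
  return { method: "POST", url, body };
}
```

The returned object would still need an `Authorization` header carrying an access token, which is the part this issue is blocked on.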

list-processor-types

Description

Retrieve available processor types that can be deployed within a Google Cloud project. Useful for dynamically populating processor options. Requires location and project ID.

create-processor

Description

Create a new processor instance in a specific location. Requires processor type, display name, project ID, and location.
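
A minimal sketch of the request this action would issue, assuming the REST `processors.create` method takes the parent location in the path and a processor body with `type` and `displayName`. The helper name `buildCreateProcessorRequest` is hypothetical, and the endpoint pattern should be checked against the REST reference linked above.

```javascript
// Hypothetical helper: builds (but does not send) a create-processor request.
function buildCreateProcessorRequest({ projectId, location, type, displayName }) {
  for (const [key, value] of Object.entries({ projectId, location, type, displayName })) {
    if (!value) throw new Error(`Missing required field: ${key}`);
  }
  const parent = `projects/${projectId}/locations/${location}`;
  return {
    method: "POST",
    // Assumption: regional endpoint of the form <location>-documentai.googleapis.com
    url: `https://${location}-documentai.googleapis.com/v1/${parent}/processors`,
    body: { type, displayName },
  };
}
```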

Important.

The Google Cloud Document AI SDK doesn't take credentials as a parameter directly; instead, it is done through the following mechanism:

  1. It will look for an environment variable named GOOGLE_APPLICATION_CREDENTIALS.
  2. That environment variable has to point to a path on the local system that contains a service account's keys with proper permissions for Document AI.
  3. In Pipedream, this almost implies that for every workflow execution, a file containing the service account's keys needs to be created under "/tmp", and the resulting path needs to be the value of the environment variable.
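
The steps above can be sketched roughly as follows in a Node.js code step. This is not the workflow example stored in 1PW; the helper name `exposeServiceAccountKey` and the file naming are hypothetical, and the key JSON is assumed to be available as a string (for example from a workspace-level secret).

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: writes the service account key JSON to a temp file
// and points GOOGLE_APPLICATION_CREDENTIALS at that file.
function exposeServiceAccountKey(keyJson, dir = os.tmpdir()) {
  JSON.parse(keyJson); // fail early if the key is not valid JSON
  const keyPath = path.join(dir, `sa-key-${process.pid}.json`);
  // Restrict permissions, since the file contains a private key.
  fs.writeFileSync(keyPath, keyJson, { mode: 0o600 });
  process.env.GOOGLE_APPLICATION_CREDENTIALS = keyPath;
  return keyPath;
}
```

The helper has to run before the SDK client is constructed, and again on every execution, because the file under "/tmp" is not guaranteed to survive between runs.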

I left a workflow example showing this in 1PW.

Additional context: Google has made it difficult to use the Document AI product outside GCP, basically putting up guardrails such that it is much more easily run inside GCP. It provides a documented way to use it from other hyperscalers, namely AWS and Azure; for on-premises and other cloud providers you are left with service account keys, whereby the SDK uses "application default credentials", which expect the keys to be pointed to by an environment variable.

sergio-eliot-rodriguez avatar May 31 '25 03:05 sergio-eliot-rodriguez

@sergio-eliot-rodriguez Is there any other way to set up the authentication for these components? Using an environment variable works for handwritten code steps, but environment variables aren't available to prewritten actions & triggers. From our docs:

Image

Trying to use the SDK in actions results in this error:

Image

michelle0927 avatar Jun 04 '25 20:06 michelle0927

No, I'm sorry; the components will be blocked for now then. These are the only ways that Google supports authenticating to Document AI:

  1. Be hosted on GCP, so Document AI runs in a trusted environment
  2. Be hosted on Azure or AWS, where GCP supports a trusted connection to those hyperscalers.
  3. For on-premises or other cloud providers, you have to use service account keys, whereby the SDK looks for creds in the environment variable.

This is the reference from Google Docs: https://cloud.google.com/docs/authentication/set-up-adc-on-premises

To create a service account key and make it available to ADC:

  1. Create a service account with the roles your application needs, and a key for that service account, by following the instructions in Creating a service account key.
  2. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON file that contains your credentials. This variable applies only to your current shell session, so if you open a new session, set the variable again.

Example: Linux or macOS

Example: Windows

Note: When you set the GOOGLE_APPLICATION_CREDENTIALS environment variable, ADC checks this location first, then checks other locations only if necessary.

sergio-eliot-rodriguez avatar Jun 05 '25 16:06 sergio-eliot-rodriguez

This is an alternate location where ADC checks for credentials, but it still implies the use of environment variables:

A credential file created by using the gcloud auth application-default login command

You can provide credentials to ADC by running the gcloud auth application-default login command. This command creates a JSON file containing the credentials you provide (either from your user account or from impersonating a service account) and places it in a well-known location on your file system. The location depends on your operating system:

  • Linux, macOS: $HOME/.config/gcloud/application_default_credentials.json
  • Windows: %APPDATA%\gcloud\application_default_credentials.json

sergio-eliot-rodriguez avatar Jun 05 '25 16:06 sergio-eliot-rodriguez

And finally, apparently it is possible to attach a service account to any GCP resource, and then ADC will use a metadata server to get permissions. There are chances that this unblocks the component development, so allow me some time while I deep dive on this: https://cloud.google.com/iam/docs/attach-service-accounts#attaching-to-resources

Search order

ADC searches for credentials in the following locations:

  1. GOOGLE_APPLICATION_CREDENTIALS environment variable
  2. A credential file created by using the gcloud auth application-default login command
  3. The attached service account, returned by the metadata server
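
To make the search order concrete, here is a simplified model of it. This is not the auth library's actual code; the function name `resolveAdcSource`, its parameters, and the returned labels are all hypothetical and only mirror the three locations quoted above.

```javascript
// Simplified, hypothetical model of the ADC search order quoted above.
// Returns a label for the first location that would supply credentials, or null.
function resolveAdcSource({ env = {}, wellKnownFileExists = false, hasAttachedServiceAccount = false } = {}) {
  // 1. Explicit key file path in the environment variable
  if (env.GOOGLE_APPLICATION_CREDENTIALS) return "GOOGLE_APPLICATION_CREDENTIALS";
  // 2. File created by `gcloud auth application-default login`
  if (wellKnownFileExists) return "gcloud application-default file";
  // 3. Attached service account, served by the metadata server (GCP only)
  if (hasAttachedServiceAccount) return "metadata server";
  return null;
}
```

In a prewritten Pipedream action none of the three inputs is available, so the model returns null, which corresponds to the SDK failing to find credentials.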

Update: You can't attach a service account to just any GCP resource, only to some (see next comment).

sergio-eliot-rodriguez avatar Jun 05 '25 16:06 sergio-eliot-rodriguez

I checked: attaching a service account to Document AI is not an option. Here is why.

The GCP documentation page "Attach service accounts to resources" says:

For some Google Cloud resources, you can specify a user-managed service account that the resource uses as its default identity. This process is known as attaching the service account to the resource.

The key is "For some" Google Cloud resources. Which resources are available for attaching a service account?

Well, according to the documentation, these ones:

  • AI Platform Prediction
  • AI Platform Training
  • App Engine standard environment
  • App Engine flexible environment
  • Cloud Composer
  • Cloud Run functions
  • Cloud Life Sciences
  • Cloud Run
  • Cloud Scheduler
  • Cloud Source Repositories
  • Compute Engine
  • Dataproc
  • Google Kubernetes Engine
  • Notebooks
  • Pub/Sub

This list is from the "Attach the service account to the new resource" section, and Document AI is not a resource type that you can attach a service account to.

More discussion: When you set up a Document AI resource in GCP, it is automatically assigned a built-in service account, so GCP won't let you attach a user-managed service account; it forces you to use the built-in one.

I tried it: when I remove GOOGLE_APPLICATION_CREDENTIALS from my Pipedream workspace, the Document AI Node.JS SDK errors out:

Image

sergio-eliot-rodriguez avatar Jun 06 '25 09:06 sergio-eliot-rodriguez

Next steps.

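The Document AI Ruby SDK allows you to specify the JSON file where it should look for service account keys:

    require "google/cloud/document_ai"

    # Point the client at an explicit service account key file.
    Google::Cloud::DocumentAI.configure do |config|
      config.credentials = "path/to/keyfile.json"
    end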

per this SO article.

However, the Document AI Node.JS SDK does not. So let's see if eventually either the Node.JS SDK will let us specify the key file, or some other alternative comes up.
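
One alternative that may be worth checking as part of these next steps: Google's Node.js client libraries generally accept client options such as `credentials` or `keyFilename` in the client constructor. Whether `@google-cloud/documentai` honors them has not been confirmed in this thread, so treat the sketch below as an assumption to verify. The helper name `buildClientOptions` is hypothetical, and the regional `apiEndpoint` pattern is likewise assumed.

```javascript
// Hypothetical helper: turns a service account key (JSON string) into
// client options, so no file or environment variable is needed.
function buildClientOptions(keyJson, location = "us") {
  const key = JSON.parse(keyJson);
  return {
    projectId: key.project_id,
    credentials: {
      client_email: key.client_email,
      private_key: key.private_key,
    },
    // Assumption: regional endpoint of the form <location>-documentai.googleapis.com
    apiEndpoint: `${location}-documentai.googleapis.com`,
  };
}

// Assumed usage (not executed here; requires the @google-cloud/documentai package):
// const { DocumentProcessorServiceClient } = require("@google-cloud/documentai").v1;
// const client = new DocumentProcessorServiceClient(buildClientOptions(keyJson, "us"));
```

If this works, prewritten actions could read the key from a connected account's stored fields instead of relying on GOOGLE_APPLICATION_CREDENTIALS.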

sergio-eliot-rodriguez avatar Jun 06 '25 09:06 sergio-eliot-rodriguez