feat: add GitHub Pages connector

Open melmathari opened this issue 5 months ago • 4 comments

GitHub Pages connector

Description

This PR introduces a new GitHub Pages connector and integrates it into both the backend and frontend of Onyx.

Test

  • ✅ Prettier applied on web files
  • ✅ Pre-commit hooks (black, reorder-python-imports, autoflake, ruff, prettier) all passed
  • ✅ mypy type checks passed on modified backend files

Demo

Watch the video

Related Issue / Claim

Closes #2282

Creating a GitHub PAT for the GitHub Pages connector

  1. Generate a fine-grained personal access token.
  2. Configure:
    • Token name: Onyx GitHub Pages
    • Expiration: No expiration (recommended for connectors)
    • Resource owner: user/org that owns the repo
    • Repository access: All repositories (or select specific repos)
  3. Permissions:
    • Contents → Read-only
    • Metadata → Read-only
  4. Copy and store the token securely.

Using the token in Onyx

  • In the GitHub Pages connector config, paste the PAT into the GitHub access token field.
  • Provide:
    • repo_owner (e.g. melmathari)
    • repo_name (e.g. GitHub-pages)
  • Save and validate the connector.
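
As a rough illustration of what validating these values might involve, the sketch below builds an authenticated request to GitHub's REST endpoint `GET /repos/{owner}/{repo}`. The helper name `build_repo_request` and the placeholder token are hypothetical and not taken from this PR. Only the endpoint, the `Authorization: Bearer` header, and the `application/vnd.github+json` media type come from GitHub's REST API. The request is built but not sent, so the example runs without network access.

```python
from urllib.parse import quote
from urllib.request import Request

GITHUB_API = "https://api.github.com"


def build_repo_request(repo_owner: str, repo_name: str, token: str) -> Request:
    """Build (but do not send) a GET request for the repository metadata.

    Hypothetical helper for illustration only; the connector in this PR
    may validate credentials differently.
    """
    if not (repo_owner and repo_name and token):
        raise ValueError("repo_owner, repo_name and token are all required")
    # Percent-encode each path segment so unusual characters cannot alter the URL.
    url = (
        f"{GITHUB_API}/repos/"
        f"{quote(repo_owner, safe='')}/{quote(repo_name, safe='')}"
    )
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }
    return Request(url, headers=headers, method="GET")


# Placeholder token; a real fine-grained PAT would be pasted into the connector form.
req = build_repo_request("melmathari", "GitHub-pages", "github_pat_example")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the actual check. A 200 response suggests the token can read the repository, while a 401 or 404 usually points to a bad token or missing repository access.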

/claim #2282

  • [ ] This PR should be backported
  • [x] [Optional] Override Linear Check

Summary by cubic

Adds a GitHub Pages connector that indexes HTML/Markdown from a repo’s Pages site via the GitHub API and exposes it as a load-state connector in the app. Implements the flow requested in Linear #2282.

  • New Features

    • Backend GitHub Pages connector with checkpointing, rate-limit handling, and credential validation
    • Supports gh-pages, configured Pages branch, or default branch; converts repo paths to Pages URLs
    • Parses HTML/Markdown using existing file processing utilities; includes title extraction and metadata
    • New enum, factory mapping, and Slack icon for DocumentSource.GITHUB_PAGES
  • Frontend

    • New connector config with fields: repo_owner, repo_name; advanced option: include_readme
    • Uses existing GitHub access token credential template
    • Added icon, source metadata, types, and inclusion in load-state and auto-sync sources
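
The branch selection and path-to-URL conversion mentioned in the summary could look roughly like the sketch below. It is not the PR's code: the function names, the branch precedence (gh-pages, then the configured Pages branch, then the default branch), and the Markdown-to-`.html` mapping are all assumptions made for illustration, and custom domains are ignored. It relies only on the standard `https://<owner>.github.io/<repo>/` layout for project sites.

```python
from typing import Optional, Sequence
from urllib.parse import quote


def pick_pages_branch(
    branches: Sequence[str],
    configured_branch: Optional[str],
    default_branch: str,
) -> str:
    """Pick the branch to index. The precedence here is an assumption."""
    if "gh-pages" in branches:
        return "gh-pages"
    if configured_branch and configured_branch in branches:
        return configured_branch
    return default_branch


def pages_url(repo_owner: str, repo_name: str, path: str) -> str:
    """Map a repository file path to its likely GitHub Pages URL."""
    owner = repo_owner.lower()
    base = f"https://{owner}.github.io/"
    # A repo named <owner>.github.io is a user/org site served from the root;
    # any other repo is a project site served under /<repo>/.
    if repo_name.lower() != f"{owner}.github.io":
        base += f"{quote(repo_name, safe='')}/"
    path = path.lstrip("/")
    # Assumption: Markdown sources are rendered to .html by the site build.
    if path.lower().endswith(".md"):
        path = path[:-3] + ".html"
    # index.html is served at its directory URL.
    if path == "index.html":
        path = ""
    elif path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return base + quote(path)


print(pick_pages_branch(["main", "gh-pages"], None, "main"))
print(pages_url("melmathari", "GitHub-pages", "docs/index.html"))
```

Keeping both functions free of network calls makes the mapping easy to unit-test separately from the GitHub API client.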

melmathari • Sep 09 '25 18:09

Someone is attempting to deploy a commit to the Danswer Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] • Sep 09 '25 18:09

@Weves Open to feedback, and I appreciate you looking into this. I am not sure whether this PR covers all the requirements, so I might need some assistance.

melmathari • Sep 10 '25 13:09

@Weves fyi, appreciate your time.

melmathari • Oct 09 '25 07:10

@Weves fyi, appreciate your time.

melmathari • Nov 06 '25 14:11