DeepAudit

Feat/gitea support

Open vinland100 opened this issue 2 weeks ago • 7 comments

User description

gitea support


PR Type

Enhancement


Description

  • Add comprehensive Gitea repository support alongside GitHub and GitLab

    • New get_gitea_branches() and get_gitea_files() functions for API integration
    • Gitea token configuration in backend settings and user config
    • Branch fetching and file retrieval with proper authentication
  • Update frontend to support Gitea as repository platform option

    • Add Gitea to repository type selections and platform constants
    • Implement Gitea-specific UI styling and icons
    • Update system configuration to manage Gitea tokens
  • Improve repository URL parsing and branch fallback mechanism

    • Refactor .git suffix removal logic for consistency across platforms
    • Enhance branch discovery with fallback to main/master branches
  • Update project configuration and documentation

    • Add Gitea token environment variable to backend config
    • Update README and setup instructions for Gitea support
    • Improve Docker and frontend build configuration
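The branch fallback mentioned in the description could be sketched as follows. This is a minimal illustration, not the PR's code: `pick_branch` is a hypothetical helper, and falling back to the first available branch when neither `main` nor `master` exists is an assumption.

```python
from typing import List, Optional


def pick_branch(requested: Optional[str], available: List[str]) -> str:
    """Return the requested branch if it exists, else fall back to main/master."""
    if requested and requested in available:
        return requested
    # Fallback order described in the PR summary: main first, then master.
    for candidate in ("main", "master"):
        if candidate in available:
            return candidate
    # Assumption: if neither default exists, use whatever branch comes first.
    if available:
        return available[0]
    raise ValueError("repository has no branches")
```

A caller would pass the branch stored on the project together with the list returned by the platform's branch endpoint.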

Diagram Walkthrough

flowchart LR
  A["Gitea Repository"] -->|"API v1"| B["gitea_api()"]
  B -->|"Get Branches"| C["get_gitea_branches()"]
  B -->|"Get Files"| D["get_gitea_files()"]
  C -->|"Branch List"| E["Project Branches Endpoint"]
  D -->|"File List"| F["Scan Task"]
  G["Gitea Token Config"] -->|"Backend Settings"| B
  G -->|"User Config"| B
  H["Frontend UI"] -->|"Select Gitea"| I["Create/Edit Project"]
  I -->|"Repository Type"| J["Backend Processing"]

File Walkthrough

Relevant files
Enhancement
9 files
projects.py
Add Gitea token handling and branch retrieval                       
+11/-5   
project.py
Update repository type to include gitea                                   
+1/-1     
scanner.py
Implement Gitea API integration and file retrieval             
+125/-21
projectTypes.ts
Add Gitea to repository platform options                                 
+17/-15 
index.ts
Add gitea to RepositoryPlatform type union                             
+1/-1     
projectUtils.ts
Add Gitea platform label mapping                                                 
+1/-0     
Dockerfile
Implement build-time placeholder for API URL                         
+7/-2     
SystemConfig.tsx
Add Gitea token input field to system config                         
+22/-4   
Projects.tsx
Add Gitea option to project creation and editing                 
+3/-0     
Configuration changes
5 files
config.py
Add GITEA_TOKEN configuration setting                                       
+3/-0     
.python-version
Downgrade Python version from 3.13 to 3.12                             
+1/-1     
pyproject.toml
Update UV dependency configuration format                               
+2/-2     
docker-compose.yml
Fix formatting and add frontend API URL environment           
+6/-4     
.env.example
Update default API base URL to /api/v1                                     
+1/-1     
Bug fix
1 file
docker-entrypoint.sh
Fix API URL injection for Docker deployment                           
+7/-2     
Documentation
2 files
env.example
Add GITEA_TOKEN environment variable documentation             
+5/-0     
README.md
Update documentation to mention Gitea support                       
+20/-5   

vinland100, Dec 16 '25 10:12

@vinland100 is attempting to deploy a commit to the tsinghuaiiilove-2257's projects Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot], Dec 16 '25 10:12

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🔴
SSRF vulnerability

Description: The get_gitea_branches() function constructs API URLs using unsanitized user input from
repo_url without proper validation of the URL scheme or domain, potentially enabling SSRF
attacks to internal services.
scanner.py [173-192]

Referred Code
async def get_gitea_branches(repo_url: str, token: str = None) -> List[str]:
    """Get the branch list of a Gitea repository."""
    parsed = urlparse(repo_url)
    base = f"{parsed.scheme}://{parsed.netloc}"

    # Extract owner and repo: the path is usually /owner/repo.git or /owner/repo
    path = parsed.path.strip('/')
    if path.endswith('.git'):
        path = path[:-4]
    parts = path.split('/')
    if len(parts) < 2:
        raise Exception("Invalid Gitea repository URL format")

    owner, repo = parts[0], parts[1]

    branches_url = f"{base}/api/v1/repos/{owner}/{repo}/branches"
    branches_data = await gitea_api(branches_url, token)

    return [b["name"] for b in branches_data]
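One way to narrow the SSRF surface described above is to check the scheme and host before any API URL is built. The sketch below is an assumption-laden illustration: `assert_public_http_url` is a hypothetical helper that does not appear in the PR, it only inspects IP literals and the name `localhost`, and it does not resolve DNS names.

```python
import ipaddress
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def assert_public_http_url(repo_url: str) -> None:
    """Raise ValueError for non-http(s) URLs or obviously internal hosts."""
    parsed = urlparse(repo_url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {parsed.scheme!r}")
    host = parsed.hostname
    if not host:
        raise ValueError("URL has no host")
    if host == "localhost" or host.endswith(".localhost"):
        raise ValueError("loopback hosts are not allowed")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name rather than an IP literal; resolution is not checked here.
        return
    if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
        raise ValueError("internal addresses are not allowed")
```

Self-hosted Gitea instances often live on private networks, so a real deployment would more likely need a configurable host allowlist than a blanket block like this one.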

Token injection risk

Description: The gitea_api() function uses user-controlled token in Authorization header without
validation, potentially allowing token injection or unauthorized API access if token
contains malicious content.
scanner.py [102-118]

Referred Code
async def gitea_api(url: str, token: str = None) -> Any:
    """Call the Gitea API."""
    headers = {"Content-Type": "application/json"}
    t = token or settings.GITEA_TOKEN
    if t:
        headers["Authorization"] = f"token {t}"

    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.get(url, headers=headers)
        if response.status_code == 401:
            raise Exception("Gitea API 401: please configure GITEA_TOKEN or check repository permissions")
        if response.status_code == 403:
            raise Exception("Gitea API 403: please check repository permissions/rate limits")
        if response.status_code != 200:
            raise Exception(f"Gitea API {response.status_code}: {url}")
        return response.json()
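A token check before the header is built might look like the sketch below. It is illustrative only: `gitea_auth_headers` is a hypothetical helper, and the allowed character set is an assumption about what tokens look like, not a rule taken from Gitea.

```python
import re
from typing import Dict, Optional

# Assumption: tokens consist of plain ASCII letters, digits, and a few separators.
_TOKEN_RE = re.compile(r"[A-Za-z0-9._\-]+")


def gitea_auth_headers(token: Optional[str]) -> Dict[str, str]:
    """Build request headers, rejecting tokens with unexpected characters."""
    headers = {"Accept": "application/json"}
    if not token:
        return headers  # anonymous request, e.g. for a public repository
    if not _TOKEN_RE.fullmatch(token):
        # Do not echo the token itself in the error message.
        raise ValueError("token contains unexpected characters")
    headers["Authorization"] = f"token {token}"
    return headers
```

Rejecting whitespace and control characters up front keeps a malformed value from ever reaching the `Authorization` header.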

Credential exposure risk

Description: Gitea token is retrieved from encrypted user config and system settings without proper
validation before use, potentially exposing sensitive credentials through logs or error
messages if decryption fails silently.
projects.py [662-677]

Referred Code
projects_gitea_token = settings.GITEA_TOKEN

SENSITIVE_OTHER_FIELDS = ['githubToken', 'gitlabToken', 'giteaToken']

if config and config.other_config:
    import json
    other_config = json.loads(config.other_config)
    for field in SENSITIVE_OTHER_FIELDS:
        if field in other_config and other_config[field]:
            decrypted_val = decrypt_sensitive_data(other_config[field])
            if field == 'githubToken':
                github_token = decrypted_val
            elif field == 'gitlabToken':
                gitlab_token = decrypted_val
            elif field == 'giteaToken':
                projects_gitea_token = decrypted_val
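The per-field `if`/`elif` chain above could also be written as a lookup table with explicit failure handling. This is a sketch under assumptions: `load_tokens` and `FIELD_TO_PLATFORM` are hypothetical names, and the `decrypt` callable stands in for the project's `decrypt_sensitive_data`, whose behavior on failure is not shown in the PR.

```python
import json
from typing import Callable, Dict, Optional

FIELD_TO_PLATFORM = {
    "githubToken": "github",
    "gitlabToken": "gitlab",
    "giteaToken": "gitea",
}


def load_tokens(
    other_config_json: Optional[str],
    defaults: Dict[str, Optional[str]],
    decrypt: Callable[[str], str],
) -> Dict[str, Optional[str]]:
    """Overlay decrypted user tokens on system defaults without leaking values."""
    tokens = dict(defaults)
    if not other_config_json:
        return tokens
    try:
        other_config = json.loads(other_config_json)
    except json.JSONDecodeError:
        return tokens
    if not isinstance(other_config, dict):
        return tokens
    for field, platform in FIELD_TO_PLATFORM.items():
        encrypted = other_config.get(field)
        if not encrypted:
            continue
        try:
            tokens[platform] = decrypt(encrypted)
        except Exception:
            # Keep the default; the encrypted value is never logged or re-raised.
            continue
    return tokens
```

Keeping the failure path silent about the value itself addresses the concern that a failed decryption could surface credentials in logs or error messages.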
Path traversal risk

Description: The get_gitea_files() function constructs file URLs using unsanitized path components from
API responses without validation, potentially allowing path traversal or access to
unauthorized files if the Gitea API response is compromised.
scanner.py [290-329]

Referred Code
async def get_gitea_files(repo_url: str, branch: str, token: str = None, exclude_patterns: List[str] = None) -> List[Dict[str, str]]:
    """Get the file list of a Gitea repository."""
    parsed = urlparse(repo_url)
    base = f"{parsed.scheme}://{parsed.netloc}"

    path = parsed.path.strip('/')
    if path.endswith('.git'):
        path = path[:-4]
    parts = path.split('/')
    if len(parts) < 2:
        raise Exception("Invalid Gitea repository URL format")

    owner, repo = parts[0], parts[1]

    # Gitea tree API: GET /repos/{owner}/{repo}/git/trees/{sha}?recursive=1
    # The branch name can be used directly as the sha
    tree_url = f"{base}/api/v1/repos/{owner}/{repo}/git/trees/{quote(branch)}?recursive=1"
    tree_data = await gitea_api(tree_url, token)

    files = []
    for item in tree_data.get("tree", []):


 ... (clipped 19 lines)
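Paths returned by the tree API could be checked before they are spliced into a download URL. The following is a minimal sketch, assuming a hypothetical helper `safe_tree_path`; the rejection rules are illustrative rather than something the PR implements.

```python
from urllib.parse import quote


def safe_tree_path(path: str) -> str:
    """Return a URL-quoted repository path, or raise ValueError if it looks unsafe."""
    if not path or path.startswith("/") or "\\" in path or "\x00" in path:
        raise ValueError(f"unsafe path: {path!r}")
    # Reject empty, current-directory, and parent-directory components.
    if any(part in ("", ".", "..") for part in path.split("/")):
        raise ValueError(f"unsafe path: {path!r}")
    # quote() keeps "/" by default but escapes spaces, "?", "#", and similar.
    return quote(path)
```

For example, `safe_tree_path("src/my file.py")` yields `src/my%20file.py`, while `../etc/passwd` is rejected.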
Credential leakage risk

Description: Gitea token is added to HTTP headers without proper sanitization or validation, and the
token is passed through file_info dictionary which could be logged or exposed in error
messages, potentially leaking credentials.
scanner.py [483-497]

Referred Code

if repo_type == "gitlab":
     token_to_use = extracted_token or gitlab_token
     if token_to_use:
         headers["PRIVATE-TOKEN"] = token_to_use
elif repo_type == "gitea":
     token_to_use = extracted_token or gitea_token
     if token_to_use:
         headers["Authorization"] = f"token {token_to_use}"
elif repo_type == "github":
     # GitHub raw URLs are also downloaded directly; public repos usually need no token, private ones do
     # GitHub raw user content url: raw.githubusercontent.com
     if github_token:
         headers["Authorization"] = f"Bearer {github_token}"
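If request headers are ever logged or attached to error messages, a redaction step keeps the token out of the output. The sketch below assumes a hypothetical helper `redact_headers`; the set of sensitive header names is limited to the two headers that appear in the code above.

```python
from typing import Dict

SENSITIVE_HEADERS = {"authorization", "private-token"}


def redact_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of headers that is safe to log, with secrets masked."""
    return {
        name: ("***" if name.lower() in SENSITIVE_HEADERS else value)
        for name, value in headers.items()
    }
```

The original dictionary is left untouched, so the real headers can still be sent with the request.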

Ticket Compliance
🎫 No ticket provided
  • [ ] Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed


🔴
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status:
Inconsistent variable name: Variable projects_gitea_token is named inconsistently with github_token and
gitlab_token (it carries an extra projects_ prefix that the other two lack).

Referred Code


Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Token exposure risk: Log statement prints token configuration status which could indirectly reveal sensitive
information about authentication setup.

Referred Code
print(f"[Branch] GitHub Token: {'configured' if github_token else 'not configured'}, GitLab Token: {'configured' if gitlab_token else 'not configured'}, Gitea Token: {'configured' if projects_gitea_token else 'not configured'}")


Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Missing null validation: Functions get_gitea_branches() and get_gitea_files() do not explicitly validate if token
parameter is None before using it in API calls, which may cause issues if token
configuration is missing.

Referred Code
async def get_gitea_branches(repo_url: str, token: str = None) -> List[str]:
    """Get the branch list of a Gitea repository."""
    parsed = urlparse(repo_url)
    base = f"{parsed.scheme}://{parsed.netloc}"

    # Extract owner and repo: the path is usually /owner/repo.git or /owner/repo
    path = parsed.path.strip('/')
    if path.endswith('.git'):
        path = path[:-4]
    parts = path.split('/')
    if len(parts) < 2:
        raise Exception("Invalid Gitea repository URL format")

    owner, repo = parts[0], parts[1]

    branches_url = f"{base}/api/v1/repos/{owner}/{repo}/branches"
    branches_data = await gitea_api(branches_url, token)

    return [b["name"] for b in branches_data]


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
URL parsing validation: The get_gitea_branches() and get_gitea_files() functions parse repository URLs but lack
comprehensive validation of URL structure and components before constructing API calls.

Referred Code
async def get_gitea_branches(repo_url: str, token: str = None) -> List[str]:
    """Get the branch list of a Gitea repository."""
    parsed = urlparse(repo_url)
    base = f"{parsed.scheme}://{parsed.netloc}"

    # Extract owner and repo: the path is usually /owner/repo.git or /owner/repo
    path = parsed.path.strip('/')
    if path.endswith('.git'):
        path = path[:-4]
    parts = path.split('/')
    if len(parts) < 2:
        raise Exception("Invalid Gitea repository URL format")

    owner, repo = parts[0], parts[1]

    branches_url = f"{base}/api/v1/repos/{owner}/{repo}/branches"
    branches_data = await gitea_api(branches_url, token)

    return [b["name"] for b in branches_data]


  • [ ] Update
Compliance status legend
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

PR Code Suggestions ✨

Explore these optional code suggestions:

Category    Suggestion    Impact
Possible issue
Fix undefined variable in token handling
Suggestion Impact: The commit directly implements the suggested fix by replacing 'extracted_token' with 'file_info.get('token')' on lines 193 and 198 for both GitLab and Gitea token handling, exactly as suggested.

code diff:

                     if repo_type == "gitlab":
-                         token_to_use = extracted_token or gitlab_token
+                         token_to_use = file_info.get('token') or gitlab_token
                          if token_to_use:
                              headers["PRIVATE-TOKEN"] = token_to_use
                     elif repo_type == "gitea":
-                         token_to_use = extracted_token or gitea_token
+                         token_to_use = file_info.get('token') or gitea_token
                          if token_to_use:
                              headers["Authorization"] = f"token {token_to_use}"

Fix a NameError by replacing the undefined extracted_token variable with
file_info.get('token') to correctly retrieve the token for API requests.

backend/app/services/scanner.py [484-491]

 if repo_type == "gitlab":
-     token_to_use = extracted_token or gitlab_token
+     token_to_use = file_info.get('token') or gitlab_token
      if token_to_use:
          headers["PRIVATE-TOKEN"] = token_to_use
 elif repo_type == "gitea":
-     token_to_use = extracted_token or gitea_token
+     token_to_use = file_info.get('token') or gitea_token
      if token_to_use:
          headers["Authorization"] = f"token {token_to_use}"

[Suggestion processed]

Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a NameError bug introduced in the PR where an undefined variable extracted_token is used, and provides the correct fix by using file_info.get('token').

High
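The corrected token selection can be viewed in isolation as a small function. This is a sketch rather than the PR's code: `build_download_headers` and the `fallback_tokens` dictionary are hypothetical, standing in for the `github_token`, `gitlab_token`, and `gitea_token` variables used in `scanner.py`.

```python
from typing import Dict, Optional


def build_download_headers(
    repo_type: str,
    file_info: Dict[str, Optional[str]],
    fallback_tokens: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Pick the per-file token if present, else the configured one, per platform."""
    headers: Dict[str, str] = {}
    if repo_type == "gitlab":
        token = file_info.get("token") or fallback_tokens.get("gitlab")
        if token:
            headers["PRIVATE-TOKEN"] = token
    elif repo_type == "gitea":
        token = file_info.get("token") or fallback_tokens.get("gitea")
        if token:
            headers["Authorization"] = f"token {token}"
    elif repo_type == "github":
        token = fallback_tokens.get("github")
        if token:
            headers["Authorization"] = f"Bearer {token}"
    return headers
```

Because the token is read from `file_info` instead of an outer variable, the function has no hidden dependency that could raise a `NameError`.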
Escape special characters in environment variable
Suggestion Impact: The suggestion was directly implemented in the commit. The code now escapes special characters (&, /, |) in the API_URL variable before using it in the sed command, exactly as suggested.

code diff:

+ESCAPED_API_URL=$(echo "${API_URL}" | sed 's/[&/|]/\\&/g')
+find /usr/share/nginx/html -name '*.js' -exec sed -i "s|__API_BASE_URL__|${ESCAPED_API_URL}|g" {} \;

Escape special characters in the API_URL variable before using it in the sed
command to prevent potential script failures.

frontend/docker-entrypoint.sh [10-12]

 # Replace the placeholder in all JS files
 # Note: the path here must be where nginx actually stores the files
-find /usr/share/nginx/html -name '*.js' -exec sed -i "s|__API_BASE_URL__|${API_URL}|g" {} \;
+ESCAPED_API_URL=$(echo "${API_URL}" | sed 's/[&/|]/\\&/g')
+find /usr/share/nginx/html -name '*.js' -exec sed -i "s|__API_BASE_URL__|${ESCAPED_API_URL}|g" {} \;

[Suggestion processed]

Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies a potential issue where special characters in the API_URL could break the sed command and provides a robust solution to prevent it.

Low
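To see what the suggested `sed 's/[&/|]/\\&/g'` expression does, the same transformation can be mirrored in Python. This is only an illustration of the escaping rule; `escape_sed_replacement` is a hypothetical name and is not part of the entrypoint script.

```python
import re


def escape_sed_replacement(value: str) -> str:
    """Prefix each of '&', '/', and '|' with a backslash, as the sed expression does."""
    return re.sub(r"[&/|]", lambda match: "\\" + match.group(0), value)
```

Note that the character class covers only `&`, `/`, and `|`, so a literal backslash in the URL would pass through unescaped.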
High-level
Refactor repository URL parsing logic
Suggestion Impact: The commit directly implements the suggested refactoring by:
  1. Creating a new utility function `parse_repository_url` (imported from `app.utils.repo_utils`)
  2. Replacing all duplicated URL parsing code in `get_github_branches`, `get_github_files`, `get_gitea_branches`, `get_gitea_files`, `get_gitlab_branches`, and `get_gitlab_files` functions
  3. Using the unified function to extract owner, repo, base_url, and project_path information
  4. Removing the manual string splitting and urlparse logic that was duplicated across all these functions

code diff:

+from app.utils.repo_utils import parse_repository_url
 from app.models.audit import AuditTask, AuditIssue
 from app.models.project import Project
 from app.services.llm.service import LLMService
@@ -149,17 +150,8 @@
 
 async def get_github_branches(repo_url: str, token: str = None) -> List[str]:
     """Get the branch list of a GitHub repository."""
-    match = repo_url.rstrip('/')
-    if match.endswith('.git'):
-        match = match[:-4]
-    if 'github.com/' in match:
-        parts = match.split('github.com/')[-1].split('/')
-        if len(parts) >= 2:
-            owner, repo = parts[0], parts[1]
-        else:
-            raise Exception("Invalid GitHub repository URL format")
-    else:
-        raise Exception("Invalid GitHub repository URL format")
+    repo_info = parse_repository_url(repo_url, "github")
+    owner, repo = repo_info['owner'], repo_info['repo']
     
     branches_url = f"https://api.github.com/repos/{owner}/{repo}/branches?per_page=100"
     branches_data = await github_api(branches_url, token)
@@ -172,20 +164,11 @@
 
 async def get_gitea_branches(repo_url: str, token: str = None) -> List[str]:
     """Get the branch list of a Gitea repository."""
-    parsed = urlparse(repo_url)
-    base = f"{parsed.scheme}://{parsed.netloc}"
-    
-    # Extract owner and repo: the path is usually /owner/repo.git or /owner/repo
-    path = parsed.path.strip('/')
-    if path.endswith('.git'):
-        path = path[:-4]
-    parts = path.split('/')
-    if len(parts) < 2:
-         raise Exception("Invalid Gitea repository URL format")
-    
-    owner, repo = parts[0], parts[1]
-    
-    branches_url = f"{base}/api/v1/repos/{owner}/{repo}/branches"
+    repo_info = parse_repository_url(repo_url, "gitea")
+    base_url = repo_info['base_url'] # This is {base}/api/v1
+    owner, repo = repo_info['owner'], repo_info['repo']
+    
+    branches_url = f"{base_url}/repos/{owner}/{repo}/branches"
     branches_data = await gitea_api(branches_url, token)
     
     return [b["name"] for b in branches_data]
@@ -194,7 +177,6 @@
 async def get_gitlab_branches(repo_url: str, token: str = None) -> List[str]:
     """Get the branch list of a GitLab repository."""
     parsed = urlparse(repo_url)
-    base = f"{parsed.scheme}://{parsed.netloc}"
     
     extracted_token = token
     if parsed.username:
@@ -203,14 +185,11 @@
         elif parsed.username and not parsed.password:
             extracted_token = parsed.username
     
-    path = parsed.path.strip('/')
-    if path.endswith('.git'):
-        path = path[:-4]
-    if not path:
-        raise Exception("Invalid GitLab repository URL format")
-    
-    project_path = quote(path, safe='')
-    branches_url = f"{base}/api/v4/projects/{project_path}/repository/branches?per_page=100"
+    repo_info = parse_repository_url(repo_url, "gitlab")
+    base_url = repo_info['base_url']
+    project_path = quote(repo_info['project_path'], safe='')
+    
+    branches_url = f"{base_url}/projects/{project_path}/repository/branches?per_page=100"
     branches_data = await gitlab_api(branches_url, extracted_token)
     
     return [b["name"] for b in branches_data]
@@ -219,17 +198,8 @@
 async def get_github_files(repo_url: str, branch: str, token: str = None, exclude_patterns: List[str] = None) -> List[Dict[str, str]]:
     """Get the file list of a GitHub repository."""
     # Parse the repository URL
-    match = repo_url.rstrip('/')
-    if match.endswith('.git'):
-        match = match[:-4]
-    if 'github.com/' in match:
-        parts = match.split('github.com/')[-1].split('/')
-        if len(parts) >= 2:
-            owner, repo = parts[0], parts[1]
-        else:
-            raise Exception("Invalid GitHub repository URL format")
-    else:
-        raise Exception("Invalid GitHub repository URL format")
+    repo_info = parse_repository_url(repo_url, "github")
+    owner, repo = repo_info['owner'], repo_info['repo']
     
     # Fetch the repository file tree
     tree_url = f"https://api.github.com/repos/{owner}/{repo}/git/trees/{quote(branch)}?recursive=1"
@@ -251,7 +221,6 @@
 async def get_gitlab_files(repo_url: str, branch: str, token: str = None, exclude_patterns: List[str] = None) -> List[Dict[str, str]]:
     """Get the file list of a GitLab repository."""
     parsed = urlparse(repo_url)
-    base = f"{parsed.scheme}://{parsed.netloc}"
     
     # Extract the token from the URL (if present)
     extracted_token = token
@@ -262,16 +231,12 @@
             extracted_token = parsed.username
     
     # Parse the project path
-    path = parsed.path.strip('/')
-    if path.endswith('.git'):
-        path = path[:-4]
-    if not path:
-        raise Exception("Invalid GitLab repository URL format")
-    
-    project_path = quote(path, safe='')
+    repo_info = parse_repository_url(repo_url, "gitlab")
+    base_url = repo_info['base_url'] # {base}/api/v4
+    project_path = quote(repo_info['project_path'], safe='')
     
     # Fetch the repository file tree
-    tree_url = f"{base}/api/v4/projects/{project_path}/repository/tree?ref={quote(branch)}&recursive=true&per_page=100"
+    tree_url = f"{base_url}/projects/{project_path}/repository/tree?ref={quote(branch)}&recursive=true&per_page=100"
     tree_data = await gitlab_api(tree_url, extracted_token)
     
     files = []
@@ -279,7 +244,7 @@
         if item.get("type") == "blob" and is_text_file(item["path"]) and not should_exclude(item["path"], exclude_patterns):
             files.append({
                 "path": item["path"],
-                "url": f"{base}/api/v4/projects/{project_path}/repository/files/{quote(item['path'], safe='')}/raw?ref={quote(branch)}",
+                "url": f"{base_url}/projects/{project_path}/repository/files/{quote(item['path'], safe='')}/raw?ref={quote(branch)}",
                 "token": extracted_token
             })
     
@@ -289,40 +254,23 @@
 
 async def get_gitea_files(repo_url: str, branch: str, token: str = None, exclude_patterns: List[str] = None) -> List[Dict[str, str]]:
     """Get the file list of a Gitea repository."""
-    parsed = urlparse(repo_url)
-    base = f"{parsed.scheme}://{parsed.netloc}"
-    
-    path = parsed.path.strip('/')
-    if path.endswith('.git'):
-        path = path[:-4]
-    parts = path.split('/')
-    if len(parts) < 2:
-         raise Exception("Invalid Gitea repository URL format")
-    
-    owner, repo = parts[0], parts[1]
+    repo_info = parse_repository_url(repo_url, "gitea")
+    base_url = repo_info['base_url']
+    owner, repo = repo_info['owner'], repo_info['repo']
     
     # Gitea tree API: GET /repos/{owner}/{repo}/git/trees/{sha}?recursive=1
     # The branch name can be used directly as the sha
-    tree_url = f"{base}/api/v1/repos/{owner}/{repo}/git/trees/{quote(branch)}?recursive=1"
+    tree_url = f"{base_url}/repos/{owner}/{repo}/git/trees/{quote(branch)}?recursive=1"
     tree_data = await gitea_api(tree_url, token)
     
     files = []
     for item in tree_data.get("tree", []):
          # Gitea API returns 'type': 'blob' for files
         if item.get("type") == "blob" and is_text_file(item["path"]) and not should_exclude(item["path"], exclude_patterns):
-             # Gitea raw file URL: {base}/{owner}/{repo}/raw/branch/{branch}/{path}
-             # Or the API: /repos/{owner}/{repo}/contents/{filepath}?ref={branch} (get content, base64)
-             # Using the raw URL may be more convenient here, but note that private repositories may need a token to access raw
-             # Gitea raw URL usually works with token in header or query param. 
-             # Standard Gitea: GET /repos/{owner}/{repo}/raw/{filepath}?ref={branch} (API) returns raw content? 
-             # Actually Gitea raw url: {base}/{owner}/{repo}/raw/branch/{branch}/{path} or /raw/tag or /raw/commit
-            
-            # Use the API raw endpoint: GET /repos/{owner}/{repo}/raw/{filepath}?ref={branch} ==> actually /repos/{owner}/{repo}/raw/{path} (ref via query param?)
-            # Per the docs, Gitea API v1 /repos/{owner}/{repo}/raw/{filepath} accepts a ref query param
-            # URL: {base}/api/v1/repos/{owner}/{repo}/raw/{quote(item['path'])}?ref={branch}
+            # Use the API raw endpoint: GET /repos/{owner}/{repo}/raw/{filepath}?ref={branch}
              files.append({
                 "path": item["path"],
-                "url": f"{base}/api/v1/repos/{owner}/{repo}/raw/{quote(item['path'])}?ref={quote(branch)}",
+                "url": f"{base_url}/repos/{owner}/{repo}/raw/{quote(item['path'])}?ref={quote(branch)}",
                 "token": token # pass the token along so fetch_file_content can use it
             })
     
@@ -482,11 +430,11 @@
                     # Use the extracted token or the user-configured token
                     
                     if repo_type == "gitlab":
-                         token_to_use = extracted_token or gitlab_token
+                         token_to_use = file_info.get('token') or gitlab_token
                          if token_to_use:
                              headers["PRIVATE-TOKEN"] = token_to_use
                     elif repo_type == "gitea":
-                         token_to_use = extracted_token or gitea_token
+                         token_to_use = file_info.get('token') or gitea_token
                          if token_to_use:
                              headers["Authorization"] = f"token {token_to_use}"
                     elif repo_type == "github":
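The diff imports `parse_repository_url` from `app.utils.repo_utils` but does not show its body. A minimal sketch of what such a helper might look like is given below, based only on the keys the callers read (`owner`, `repo`, `base_url`, `project_path`) and the inline comments about `/api/v1` and `/api/v4`. The function name `parse_repository_url_sketch`, the GitHub `base_url` value, and the error handling are assumptions.

```python
from typing import Dict
from urllib.parse import urlparse

# API prefixes taken from the comments in the diff ({base}/api/v1, {base}/api/v4).
_API_SUFFIX = {"gitea": "/api/v1", "gitlab": "/api/v4"}


def parse_repository_url_sketch(repo_url: str, repo_type: str) -> Dict[str, str]:
    """Split a repository URL into owner, repo, project_path, and API base URL."""
    parsed = urlparse(repo_url)
    if not parsed.scheme or not parsed.hostname:
        raise ValueError("repository URL must be absolute")
    path = parsed.path.strip("/")
    if path.endswith(".git"):
        path = path[:-4]
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        raise ValueError("expected a path of the form /owner/repo")
    # Rebuild the host without any user:password@ prefix embedded in the URL.
    host = parsed.hostname + (f":{parsed.port}" if parsed.port else "")
    if repo_type == "github":
        base_url = "https://api.github.com"  # assumption: fixed public API host
    elif repo_type in _API_SUFFIX:
        base_url = f"{parsed.scheme}://{host}{_API_SUFFIX[repo_type]}"
    else:
        raise ValueError(f"unsupported repository type: {repo_type!r}")
    return {
        "owner": parts[0],
        "repo": parts[1],
        "project_path": "/".join(parts),  # GitLab may use nested groups
        "base_url": base_url,
    }
```

Centralizing the `.git` suffix handling and path splitting in one place is what lets the six caller functions in the diff shrink to a couple of lines each.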

# File: backend/app/services/scanner.py
@@ -9,6 +9,7 @@
 from urllib.parse import urlparse, quote
 from sqlalchemy.ext.asyncio import AsyncSession
 
+from app.utils.repo_utils import parse_repository_url
 from app.models.audit import AuditTask, AuditIssue
 from app.models.project import Project
 from app.services.llm.service import LLMService
@@ -149,17 +150,8 @@
 
 async def get_github_branches(repo_url: str, token: str = None) -> List[str]:
     """获取GitHub仓库分支列表"""
-    match = repo_url.rstrip('/')
-    if match.endswith('.git'):
-        match = match[:-4]
-    if 'github.com/' in match:
-        parts = match.split('github.com/')[-1].split('/')
-        if len(parts) >= 2:
-            owner, repo = parts[0], parts[1]
-        else:
-            raise Exception("GitHub 仓库 URL 格式错误")
-    else:
-        raise Exception("GitHub 仓库 URL 格式错误")
+    repo_info = parse_repository_url(repo_url, "github")
+    owner, repo = repo_info['owner'], repo_info['repo']
     
     branches_url = f"https://api.github.com/repos/{owner}/{repo}/branches?per_page=100"
     branches_data = await github_api(branches_url, token)
@@ -172,20 +164,11 @@
 
 async def get_gitea_branches(repo_url: str, token: str = None) -> List[str]:
     """获取Gitea仓库分支列表"""
-    parsed = urlparse(repo_url)
-    base = f"{parsed.scheme}://{parsed.netloc}"
-    
-    # 提取Owner和Repo: path通常是 /owner/repo.git 或 /owner/repo
-    path = parsed.path.strip('/')
-    if path.endswith('.git'):
-        path = path[:-4]
-    parts = path.split('/')
-    if len(parts) < 2:
-         raise Exception("Gitea 仓库 URL 格式错误")
-    
-    owner, repo = parts[0], parts[1]
-    
-    branches_url = f"{base}/api/v1/repos/{owner}/{repo}/branches"
+    repo_info = parse_repository_url(repo_url, "gitea")
+    base_url = repo_info['base_url'] # This is {base}/api/v1
+    owner, repo = repo_info['owner'], repo_info['repo']
+    
+    branches_url = f"{base_url}/repos/{owner}/{repo}/branches"
     branches_data = await gitea_api(branches_url, token)
     
     return [b["name"] for b in branches_data]
@@ -194,7 +177,6 @@
 async def get_gitlab_branches(repo_url: str, token: str = None) -> List[str]:
     """获取GitLab仓库分支列表"""
     parsed = urlparse(repo_url)
-    base = f"{parsed.scheme}://{parsed.netloc}"
     
     extracted_token = token
     if parsed.username:
@@ -203,14 +185,11 @@
         elif parsed.username and not parsed.password:
             extracted_token = parsed.username
     
-    path = parsed.path.strip('/')
-    if path.endswith('.git'):
-        path = path[:-4]
-    if not path:
-        raise Exception("GitLab 仓库 URL 格式错误")
-    
-    project_path = quote(path, safe='')
-    branches_url = f"{base}/api/v4/projects/{project_path}/repository/branches?per_page=100"
+    repo_info = parse_repository_url(repo_url, "gitlab")
+    base_url = repo_info['base_url']
+    project_path = quote(repo_info['project_path'], safe='')
+    
+    branches_url = f"{base_url}/projects/{project_path}/repository/branches?per_page=100"
     branches_data = await gitlab_api(branches_url, extracted_token)
     
     return [b["name"] for b in branches_data]
@@ -219,17 +198,8 @@
 async def get_github_files(repo_url: str, branch: str, token: str = None, exclude_patterns: List[str] = None) -> List[Dict[str, str]]:
     """获取GitHub仓库文件列表"""
     # 解析仓库URL
-    match = repo_url.rstrip('/')
-    if match.endswith('.git'):
-        match = match[:-4]
-    if 'github.com/' in match:
-        parts = match.split('github.com/')[-1].split('/')
-        if len(parts) >= 2:
-            owner, repo = parts[0], parts[1]
-        else:
-            raise Exception("GitHub 仓库 URL 格式错误")
-    else:
-        raise Exception("GitHub 仓库 URL 格式错误")
+    repo_info = parse_repository_url(repo_url, "github")
+    owner, repo = repo_info['owner'], repo_info['repo']
     
     # 获取仓库文件树
     tree_url = f"https://api.github.com/repos/{owner}/{repo}/git/trees/{quote(branch)}?recursive=1"
@@ -251,7 +221,6 @@
 async def get_gitlab_files(repo_url: str, branch: str, token: str = None, exclude_patterns: List[str] = None) -> List[Dict[str, str]]:
     """获取GitLab仓库文件列表"""
     parsed = urlparse(repo_url)
-    base = f"{parsed.scheme}://{parsed.netloc}"
     
     # 从URL中提取token(如果存在)
     extracted_token = token
@@ -262,16 +231,12 @@
             extracted_token = parsed.username
     
     # 解析项目路径
-    path = parsed.path.strip('/')
-    if path.endswith('.git'):
-        path = path[:-4]
-    if not path:
-        raise Exception("GitLab 仓库 URL 格式错误")
-    
-    project_path = quote(path, safe='')
+    repo_info = parse_repository_url(repo_url, "gitlab")
+    base_url = repo_info['base_url'] # {base}/api/v4
+    project_path = quote(repo_info['project_path'], safe='')
     
     # 获取仓库文件树
-    tree_url = f"{base}/api/v4/projects/{project_path}/repository/tree?ref={quote(branch)}&recursive=true&per_page=100"
+    tree_url = f"{base_url}/projects/{project_path}/repository/tree?ref={quote(branch)}&recursive=true&per_page=100"
     tree_data = await gitlab_api(tree_url, extracted_token)
     
     files = []
@@ -279,7 +244,7 @@
         if item.get("type") == "blob" and is_text_file(item["path"]) and not should_exclude(item["path"], exclude_patterns):
             files.append({
                 "path": item["path"],
-                "url": f"{base}/api/v4/projects/{project_path}/repository/files/{quote(item['path'], safe='')}/raw?ref={quote(branch)}",
+                "url": f"{base_url}/projects/{project_path}/repository/files/{quote(item['path'], safe='')}/raw?ref={quote(branch)}",
                 "token": extracted_token
             })
     
@@ -289,40 +254,23 @@
 
 async def get_gitea_files(repo_url: str, branch: str, token: str = None, exclude_patterns: List[str] = None) -> List[Dict[str, str]]:
     """获取Gitea仓库文件列表"""
-    parsed = urlparse(repo_url)
-    base = f"{parsed.scheme}://{parsed.netloc}"
-    
-    path = parsed.path.strip('/')
-    if path.endswith('.git'):
-        path = path[:-4]
-    parts = path.split('/')
-    if len(parts) < 2:
-         raise Exception("Gitea 仓库 URL 格式错误")
-    
-    owner, repo = parts[0], parts[1]
+    repo_info = parse_repository_url(repo_url, "gitea")
+    base_url = repo_info['base_url']
+    owner, repo = repo_info['owner'], repo_info['repo']
     
     # Gitea tree API: GET /repos/{owner}/{repo}/git/trees/{sha}?recursive=1
     # 可以直接使用分支名作为sha
-    tree_url = f"{base}/api/v1/repos/{owner}/{repo}/git/trees/{quote(branch)}?recursive=1"
+    tree_url = f"{base_url}/repos/{owner}/{repo}/git/trees/{quote(branch)}?recursive=1"
     tree_data = await gitea_api(tree_url, token)
     
     files = []
     for item in tree_data.get("tree", []):
          # Gitea API returns 'type': 'blob' for files
         if item.get("type") == "blob" and is_text_file(item["path"]) and not should_exclude(item["path"], exclude_patterns):
-             # Gitea raw file URL: {base}/{owner}/{repo}/raw/branch/{branch}/{path}
-             # Or the API: /repos/{owner}/{repo}/contents/{filepath}?ref={branch} (get content, base64)
-             # A raw URL may be more convenient here, but note that private repositories may need a token to access raw content
-             # Gitea raw URL usually works with token in header or query param. 
-             # Standard Gitea: GET /repos/{owner}/{repo}/raw/{filepath}?ref={branch} (API) returns raw content? 
-             # Actually Gitea raw url: {base}/{owner}/{repo}/raw/branch/{branch}/{path} or /raw/tag or /raw/commit
-            
-            # Use the API raw endpoint: GET /repos/{owner}/{repo}/raw/{filepath}?ref={branch} ==> actually /repos/{owner}/{repo}/raw/{path} (ref passed via query param?)
-            # Per the docs, Gitea API v1 /repos/{owner}/{repo}/raw/{filepath} accepts a ref query param
-            # URL: {base}/api/v1/repos/{owner}/{repo}/raw/{quote(item['path'])}?ref={branch}
+            # Use the API raw endpoint: GET /repos/{owner}/{repo}/raw/{filepath}?ref={branch}
              files.append({
                 "path": item["path"],
-                "url": f"{base}/api/v1/repos/{owner}/{repo}/raw/{quote(item['path'])}?ref={quote(branch)}",
+                "url": f"{base_url}/repos/{owner}/{repo}/raw/{quote(item['path'])}?ref={quote(branch)}",
                 "token": token # Pass the token along so fetch_file_content can use it
             })
     
@@ -482,11 +430,11 @@
                     # Use the extracted token or the user-configured token
                     
                     if repo_type == "gitlab":
-                         token_to_use = extracted_token or gitlab_token
+                         token_to_use = file_info.get('token') or gitlab_token
                          if token_to_use:
                              headers["PRIVATE-TOKEN"] = token_to_use
                     elif repo_type == "gitea":
-                         token_to_use = extracted_token or gitea_token
+                         token_to_use = file_info.get('token') or gitea_token
                          if token_to_use:
                              headers["Authorization"] = f"token {token_to_use}"
                     elif repo_type == "github":
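
The hunk above prefers the token carried on each file entry and falls back to the configured one, then sets a platform-specific header. A minimal sketch of that selection, assuming a hypothetical helper named `build_auth_headers` (it is not part of this PR) and covering only the GitLab and Gitea branches that are visible in the diff:

```python
from typing import Dict, Optional


def build_auth_headers(
    repo_type: str,
    file_token: Optional[str],
    fallback_token: Optional[str],
) -> Dict[str, str]:
    """Return the auth header for one file request (hypothetical helper)."""
    # Prefer the token attached to the file entry, as the diff does with
    # file_info.get('token'); otherwise use the configured token.
    token = file_token or fallback_token
    if not token:
        # No token available: send the request unauthenticated.
        return {}
    if repo_type == "gitlab":
        # GitLab reads personal access tokens from the PRIVATE-TOKEN header.
        return {"PRIVATE-TOKEN": token}
    if repo_type == "gitea":
        # Gitea accepts "Authorization: token <value>".
        return {"Authorization": f"token {token}"}
    # Other platforms are intentionally left out of this sketch.
    raise ValueError(f"Unsupported repository type in this sketch: {repo_type!r}")
```

Keeping the choice in one function would make the fallback order easy to test in isolation, for example `build_auth_headers("gitea", None, gitea_token)`.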

The URL parsing logic for GitHub, GitLab, and Gitea is duplicated and
inconsistent. This should be refactored into a single, unified utility function
that handles all supported repository providers.

Examples:

backend/app/services/scanner.py [219-233]
async def get_github_files(repo_url: str, branch: str, token: str = None, exclude_patterns: List[str] = None) -> List[Dict[str, str]]:
    """Get the list of files in a GitHub repository."""
    # Parse the repository URL
    match = repo_url.rstrip('/')
    if match.endswith('.git'):
        match = match[:-4]
    if 'github.com/' in match:
        parts = match.split('github.com/')[-1].split('/')
        if len(parts) >= 2:
            owner, repo = parts[0], parts[1]

 ... (clipped 5 lines)
backend/app/services/scanner.py [290-302]
async def get_gitea_files(repo_url: str, branch: str, token: str = None, exclude_patterns: List[str] = None) -> List[Dict[str, str]]:
    """Get the list of files in a Gitea repository."""
    parsed = urlparse(repo_url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    
    path = parsed.path.strip('/')
    if path.endswith('.git'):
        path = path[:-4]
    parts = path.split('/')
    if len(parts) < 2:

 ... (clipped 3 lines)

Solution Walkthrough:

Before:

# In backend/app/services/scanner.py

async def get_github_files(repo_url, ...):
    # Manual string splitting to get owner/repo
    match = repo_url.rstrip('/')
    if match.endswith('.git'):
        match = match[:-4]
    parts = match.split('github.com/')[-1].split('/')
    owner, repo = parts[0], parts[1]
    # ... use owner/repo

async def get_gitea_files(repo_url, ...):
    # urlparse to get owner/repo
    parsed = urlparse(repo_url)
    path = parsed.path.strip('/')
    if path.endswith('.git'):
        path = path[:-4]
    parts = path.split('/')
    owner, repo = parts[0], parts[1]
    # ... use owner/repo

# ... and similar logic is repeated for GitLab and all get_*_branches functions.

After:

# In a new utility file or within scanner.py

from typing import Dict
from urllib.parse import urlparse

def parse_repository_url(repo_url: str, repo_type: str) -> Dict[str, str]:
    """Parses a repository URL and returns its components."""
    repo_url = repo_url.rstrip('/')
    if repo_url.endswith('.git'):
        repo_url = repo_url[:-4]
    
    parsed = urlparse(repo_url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"
    path_parts = parsed.path.strip('/').split('/')

    if repo_type == "github":
        owner, repo = path_parts[-2], path_parts[-1]
        return {"base_url": "https://api.github.com", "owner": owner, "repo": repo}
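    elif repo_type == "gitea":
        owner, repo = path_parts[0], path_parts[1] # Simplified for example
        return {"base_url": f"{base_url}/api/v1", "owner": owner, "repo": repo}
    elif repo_type == "gitlab":
        # GitLab serves its REST API under /api/v4 and addresses projects by full path
        return {"base_url": f"{base_url}/api/v4", "project_path": "/".join(path_parts)}
    # ...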

# In backend/app/services/scanner.py
async def get_github_files(repo_url, ...):
    repo_info = parse_repository_url(repo_url, "github")
    # ... use repo_info['owner'] and repo_info['repo']
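
As a rough, runnable illustration of the walkthrough above (a sketch, not the code that landed in this PR), the version below returns `owner`/`repo` for GitHub and Gitea and a `project_path` for GitLab, mirroring the `repo_info['project_path']` key used in the diff. It assumes GitHub means github.com only, that the URL carries no embedded credentials, and that `ValueError` is an acceptable way to report bad input; all three are assumptions made for the example.

```python
from typing import Dict
from urllib.parse import quote, urlparse


def parse_repository_url(repo_url: str, repo_type: str) -> Dict[str, str]:
    """Split a repository URL into the pieces each platform's API needs."""
    cleaned = repo_url.strip().rstrip("/")
    if cleaned.endswith(".git"):
        cleaned = cleaned[:-4]

    parsed = urlparse(cleaned)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"Not an absolute repository URL: {repo_url!r}")

    origin = f"{parsed.scheme}://{parsed.netloc}"
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"Expected at least owner/repo in: {repo_url!r}")

    if repo_type == "github":
        # Assumption: public github.com, whose API lives on a fixed host.
        return {"base_url": "https://api.github.com", "owner": parts[0], "repo": parts[1]}
    if repo_type == "gitea":
        return {"base_url": f"{origin}/api/v1", "owner": parts[0], "repo": parts[1]}
    if repo_type == "gitlab":
        # GitLab projects can sit in nested groups, so keep the whole path.
        return {"base_url": f"{origin}/api/v4", "project_path": "/".join(parts)}
    raise ValueError(f"Unsupported repository type: {repo_type!r}")


# The GitLab project path must be percent-encoded (slashes included)
# before it is placed in an API URL, as the diff does with quote(..., safe='').
gitlab_info = parse_repository_url("https://gitlab.example.com/group/sub/proj.git", "gitlab")
encoded_project = quote(gitlab_info["project_path"], safe="")
```

Returning the API base URL from the parser is what lets callers such as `get_gitea_files` build endpoints with `f"{base_url}/repos/{owner}/{repo}/..."` instead of repeating the `/api/v1` or `/api/v4` prefix.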

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies significant duplication and inconsistency in URL parsing across the GitHub, GitLab, and newly added Gitea functions. The proposed refactoring would improve code maintainability and structure.

Medium

#22

vinland100 avatar Dec 16 '25 11:12 vinland100

@lintsinghua, could you take a look at this?

vinland100 avatar Dec 17 '25 03:12 vinland100

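@lintsinghua, could you take a look at this?

Sure, but it will be a while before this can be reviewed and merged. Some features are still under development.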

lintsinghua avatar Dec 17 '25 03:12 lintsinghua

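@lintsinghua, could you take a look at this?

Sure, but it will be a while before this can be reviewed and merged. Some features are still under development.

Okay, thanks for your hard work.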

vinland100 avatar Dec 17 '25 03:12 vinland100