mindsdb icon indicating copy to clipboard operation
mindsdb copied to clipboard

Adding keyword search

Open fshabashev opened this issue 10 months ago • 1 comments

Description

Please include a summary of the change and the issue it solves.

Fixes #issue_number

Type of change

(Please delete options that are not relevant)

  • [ ] 🐛 Bug fix (non-breaking change which fixes an issue)
  • [ ] ⚡ New feature (non-breaking change which adds functionality)
  • [ ] 📢 Breaking change (fix or feature that would cause existing functionality not to work as expected)
  • [ ] 📄 This change requires a documentation update

Verification Process

To ensure the changes are working as expected:

  • [ ] Test Location: Specify the URL or path for testing.
  • [ ] Verification Steps: Outline the steps or queries needed to validate the change. Include any data, configurations, or actions required to reproduce or see the new functionality.

Additional Media:

  • [ ] I have attached a brief loom video or screenshots showcasing the new functionality or change.

Checklist:

  • [ ] My code follows the style guidelines(PEP 8) of MindsDB.
  • [ ] I have appropriately commented on my code, especially in complex areas.
  • [ ] Necessary documentation updates are either made or tracked in issues.
  • [ ] Relevant unit and integration tests are updated or added.

fshabashev avatar Jun 18 '25 02:06 fshabashev

Review Summary

🏷️ Draft Comments (4)

Skipped posting 4 drafted comments based on your review threshold. Feel free to update them here.

mindsdb/integrations/handlers/pgvector_handler/pgvector_handler.py (2)

0146-0160: add_full_text_index does not check for existing GIN indexes before creation, causing unnecessary index creation attempts and potential performance degradation on large tables.

Scores:

  • Production Impact: 2
  • Fix Specificity: 3
  • Urgency Impact: 2
  • Total Score: 7

Reason for filtering: The commitable suggestion is technically flawed and would break the code. The suggestion removes 'IF NOT EXISTS' from the CREATE INDEX statement but then tries to check for existing indexes manually. However, the manual check queries pg_indexes but doesn't handle the case where the table itself doesn't exist yet, which would cause the query to fail. Additionally, the existing code already uses 'CREATE INDEX IF NOT EXISTS' which is the standard PostgreSQL way to handle this scenario safely. The suggestion introduces unnecessary complexity and potential failure points.

Analysis: While the concern about performance is valid, the suggested fix is problematic and the current code already handles the scenario appropriately with IF NOT EXISTS clause.


0155-0158: add_full_text_index constructs SQL queries using unsanitized table_name and column_name, allowing SQL injection and unauthorized command execution.

Scores:

  • Production Impact: 3
  • Fix Specificity: 2
  • Urgency Impact: 3
  • Total Score: 8

Reason for filtering: The commitable suggestion uses .isidentifier() which is inappropriate for SQL identifiers. Python's .isidentifier() method checks for valid Python variable names, not SQL identifiers. SQL identifiers can contain characters that aren't valid Python identifiers (like hyphens, spaces when quoted, etc.). This would incorrectly reject valid SQL table/column names. Additionally, the table_name has already been processed through self._check_table() which applies proper namespacing and validation. The security concern is valid but the proposed fix is technically incorrect and would break legitimate use cases.

Analysis: Security concern is legitimate but the proposed fix using Python's .isidentifier() is inappropriate for SQL context and would break valid SQL identifiers. A proper SQL injection fix would require different validation or parameterized queries.


mindsdb/interfaces/knowledge_base/controller.py (2)

1012-1013: add_full_text_index is called unconditionally if keyword_search_enabled is True, but not all vector store handlers may implement this method, leading to a possible AttributeError at runtime.

Scores:

  • Production Impact: 4
  • Fix Specificity: 5
  • Urgency Impact: 3
  • Total Score: 12

Reason for filtering: This is a valid technical issue. The code at lines 1012-1013 calls add_full_text_index without checking if the method exists on the vector store handler. This could cause AttributeError at runtime if a handler doesn't implement this method. The suggested fix using hasattr is appropriate and safe.

Analysis: High production impact because AttributeError would crash the knowledge base creation process. The fix is very specific and immediately applicable. Medium urgency since it only occurs when keyword_search_enabled=True.


1012-1013: keyword_search_enabled parameter is not validated or restricted, allowing an attacker to trigger add_full_text_index on arbitrary tables, potentially leading to privilege escalation or denial of service if abused.

Scores:

  • Production Impact: 1
  • Fix Specificity: 1
  • Urgency Impact: 1
  • Total Score: 3

Reason for filtering: This comment makes several incorrect assumptions: 1) It assumes self.session.user.is_admin exists without evidence in the codebase, 2) It assumes this is a security vulnerability without proper context about the application's security model, 3) The suggested fix would likely break existing functionality by adding authorization checks that may not be appropriate at this layer, 4) The vector_table_name.isidentifier() check is overly restrictive and could break legitimate table names.

Analysis: The commitable suggestion would likely break the code by referencing non-existent attributes and adding inappropriate authorization logic. The security concern is speculative without proper context about the application's threat model and existing security controls.