windmill icon indicating copy to clipboard operation
windmill copied to clipboard

Specific asset tables

Open diegoimbert opened this issue 2 weeks ago • 3 comments

Detect which specific tables were accessed from a ducklake/datatable/db resource when parsing assets.

diegoimbert avatar Dec 09 '25 16:12 diegoimbert

Deploying windmill with  Cloudflare Pages  Cloudflare Pages

Latest commit: c694d49
Status: ✅  Deploy successful!
Preview URL: https://e6677f37.windmill.pages.dev
Branch Preview URL: https://di-specific-asset-tables-2.windmill.pages.dev

View logs

Claude finished @diegoimbert's task in 3m 43s —— View job


Code Review Complete ✅

This PR introduces specific table tracking for database assets (Ducklake, DataTable, and database resources), enabling Windmill to track not just which databases are accessed, but which specific tables within those databases. This is a significant enhancement to the asset tracking system.


Summary of Changes

Backend Changes

  • Asset Parsers Enhanced: SQL, Python, and TypeScript parsers now detect table-level access patterns
  • Access Type Tracking: Distinguishes between Read (R), Write (W), and Read-Write (RW) operations
  • Database Support: Handles ATTACH, USE, and dot-notation syntax (e.g., db.table_name)
  • Path Format: Assets now use hierarchical paths like ducklake_name/table_name or u/user/pg_db/table_name

Frontend Changes

  • UI Updates: Components updated to display and manage specific tables
  • DB Manager: Opens directly to specific tables when clicked from asset lists
  • Asset Display: Improved icons and formatting for table-specific assets
  • Path Parsing: Logic added to extract base database and specific table from asset paths

Code Quality Assessment

✅ Strengths

  1. Comprehensive Testing: All three parsers (SQL, Python, TypeScript) include extensive unit tests covering edge cases
  2. Consistent Design: Similar patterns used across all three language parsers
  3. Type Safety: Strong typing in both Rust backend and TypeScript frontend
  4. Backward Compatibility: Handles assets without specific tables (falls back to database-level tracking)
  5. Access Type Merging: Smart logic to combine R/W access types when same table accessed multiple times

⚠️ Areas for Improvement

I've left inline comments on specific issues. Key concerns:

1. Path Validation & Security

  • No validation that table names don't contain / characters, which could create malformed paths
  • String interpolation in SQL construction (Python/TS parsers) should be reviewed for edge cases
  • Consider sanitizing user input from parsed code before using in paths

2. Access Type Merging Logic (Priority: Medium)

The current logic in asset_parser.rs:46-52 treats (Some(R), None)None. This loses information. Consider:

(None, Some(x)) | (Some(x), None) => Some(x),  // Preserve known access

3. SQL Parser Stack Management (Priority: High)

The Query statement pushes R twice (lines 210 and 344), which could cause incorrect access type detection in nested queries. This needs verification.

4. Frontend Path Parsing

Multiple locations split asset.path by / without validating the expected format:

  • ExploreAssetButton.svelte:61-62 - assumes 3+ segments for resources
  • Similar patterns in ducklake/datatable handlers
  • Should add defensive validation or use a helper function

5. Missing Asset Types in Frontend

lib.ts:parseAssetFromString() doesn't handle ducklake:// or datatable:// prefixes, though these exist in the backend parser.

6. Performance Considerations

  • TypeScript parser clones entire var_identifiers HashMap on every block scope entry
  • API query uses subqueries that could be optimized with JOINs for large datasets
  • Consider adding database indexes if not present

Security Review

✅ No Critical Issues Found

  • Asset paths are validated through parse_asset_syntax()
  • Database queries use parameterized queries (sqlx)
  • No SQL injection vectors identified
  • Proper workspace isolation maintained

💡 Recommendations

  1. Add input sanitization for table names extracted from user code
  2. Document path format expectations to prevent malformed paths
  3. Consider rate limiting asset parsing operations if exposed to untrusted code

Testing Recommendations

The PR includes good unit tests, but consider adding:

  1. Integration tests for full parse → store → retrieve flow
  2. Edge case tests:
    • Table names with special characters (table-name, table.name)
    • Very long paths
    • Unicode table names
    • Nested queries with multiple access types
  3. Performance tests with large codebases
  4. Frontend E2E tests for the DB Manager specific table navigation

Best Practices Adherence

✅ Follows Windmill Guidelines

  • Clean code with single responsibilities
  • Avoids duplication (mostly - see inline comments)
  • Rust performance optimizations applied correctly
  • Proper error handling throughout

⚠️ Minor Deviations

  • Some code duplication in frontend path parsing (see inline comments)
  • Could extract more helper functions for better testability
  • Some SQL query complexity that could be simplified

Reproduction Instructions for Testers

To test the specific asset tables feature, navigate to the script editor and create a new Python, TypeScript, or SQL script. Write code that accesses a database with specific tables - for example, in Python use db = wmill.datatable('mydata') followed by db.query('SELECT * FROM users'), or in SQL use ATTACH 'ducklake://mydl' AS dl; SELECT * FROM dl.customers;. After saving the script, go to the Assets page in the workspace navigation. You should see the specific tables listed (like mydata/users or mydl/customers) rather than just the database names. Each asset should show whether it's being read, written, or both. Click the "Manage" button next to any database asset to open the DB Manager directly to that specific table. In the Flow editor, when you reference a script with database assets, the flow's asset graph should display the specific tables being accessed with appropriate icons and access indicators.


Recommendation

Approve with minor revisions. The implementation is solid and well-tested. The issues identified are mostly minor improvements and edge case handling. The critical issue with SQL parser stack management (Query statement) should be verified, but likely works correctly in practice. Consider addressing the inline comments before merging, particularly:

  1. Fix the Query statement double-push issue if confirmed
  2. Add path validation in frontend
  3. Consider the access type merging logic improvement
  4. Add missing ducklake/datatable parsing in frontend

Overall, this is a valuable feature that significantly improves asset tracking granularity! 🎉

claude[bot] avatar Dec 10 '25 13:12 claude[bot]

Claude finished @diegoimbert's task in 3m 21s —— View job


Code Review Complete ✅

This PR introduces specific table tracking for database assets, enabling Windmill to track not just which databases are accessed, but which specific tables within those databases. This is a valuable enhancement to the asset tracking system.


Summary of Changes

Backend Changes

  • Asset Parsers Enhanced: SQL, Python, and TypeScript parsers now detect table-level access patterns
  • Access Type Tracking: Distinguishes between Read (R), Write (W), and Read-Write (RW) operations
  • Database Support: Handles ATTACH, USE, and dot-notation syntax (e.g., db.table_name)
  • Path Format: Assets now use hierarchical paths like ducklake_name/table_name or u/user/pg_db/table_name

Frontend Changes

  • UI Updates: Components updated to display and manage specific tables
  • DB Manager: Opens directly to specific tables when clicked from asset lists
  • Asset Display: Improved icons and formatting for table-specific assets
  • Path Parsing: Logic added to extract base database and specific table from asset paths

Code Quality Assessment

✅ Strengths

  1. Comprehensive Testing: All three parsers (SQL, Python, TypeScript) include extensive unit tests covering edge cases
  2. Consistent Design: Similar patterns used across all three language parsers
  3. Type Safety: Strong typing in both Rust backend and TypeScript frontend
  4. Backward Compatibility: Handles assets without specific tables (falls back to database-level tracking)
  5. Smart Access Type Merging: Logic to combine R/W access types when same table accessed multiple times

⚠️ Areas for Improvement

Building on the previous review, here are additional concerns and observations:

1. Access Type Merging Logic (Priority: HIGH)

The current logic in asset_parser.rs:46-52 treats (Some(R), None)None, which loses information. This is problematic because:

  • If an asset is first detected with R access, then later with ambiguous access (None), we lose the known R
  • This makes the tracking less useful than it could be

Recommendation: Change the merge logic to preserve known access types:

(None, None) => None,
(None, Some(x)) | (Some(x), None) => Some(x),  // Preserve known access
(Some(R), Some(W)) | (Some(W), Some(R)) => Some(RW),
// ... rest unchanged

2. SQL Parser Stack Management (Priority: HIGH - Needs Verification)

In windmill-parser-sql/src/asset_parser.rs:

  • Line 210: Statement::Query(_) pushes R in pre_visit_statement
  • Line 344: pre_visit_query also pushes R
  • This means nested queries have R pushed twice but only popped once in post_visit_query (line 352)
  • The post_visit_statement (line 333) pops the first one

This could cause incorrect access type detection in nested queries. For example:

INSERT INTO table1 
SELECT * FROM (SELECT * FROM table2)  -- nested query

The inner SELECT might incorrectly inherit W access from the INSERT context due to stack mismanagement.

Recommendation: Verify this behavior with nested query tests, or restructure to avoid double-pushing.

3. Path Validation & Security (Priority: MEDIUM)

Several areas lack input validation:

  • Backend: No validation that table names don't contain / characters before constructing paths like {db}/{table}

    • Example: If user code references table/with/slashes, the path becomes malformed
    • Occurs in: sql/asset_parser.rs:74, sql/asset_parser.rs:87, py/asset_parser.rs:197, ts/asset_parser.rs:190
  • Frontend: Path splitting assumes specific formats without validation

    • ExploreAssetButton.svelte:61-62: Assumes resource paths have ≥3 segments
    • If path is u/user (only 2 segments), pathParts[3] is undefined
    • Similar issues in ducklake/datatable handlers (lines 72-78)

Recommendation:

  • Backend: Sanitize table names before path construction (replace / with _ or reject)
  • Frontend: Add defensive checks or extract to a helper function

4. Frontend Code Duplication (Priority: LOW)

The pattern of splitting asset.path and extracting components is repeated in ExploreAssetButton.svelte:

  • Lines 61-62 (resources)
  • Lines 72-74 (ducklake)
  • Lines 76-78 (datatable)

Recommendation: Extract to a helper function:

function parseAssetPath(path: string, minSegments: number = 1): {
    basePath: string;
    specificTable?: string;
} {
    const parts = path.split('/');
    return {
        basePath: parts.slice(0, minSegments).join('/'),
        specificTable: parts[minSegments]
    };
}

5. Missing Asset Types in Frontend (Priority: MEDIUM)

lib.ts:parseAssetFromString() doesn't handle ducklake:// or datatable:// prefixes:

  • Lines 53-62 only handle res://, $res:, and s3://
  • But backend supports ducklake:// and datatable:// (see parse_asset_syntax in asset_parser.rs:89-92)

Recommendation: Add cases for ducklake and datatable parsing.

6. TypeScript Parser Performance (Priority: LOW)

windmill-parser-ts/src/asset_parser.rs:82:

let saved_var_identifiers = self.var_identifiers.clone();

This clones the entire HashMap on every block scope entry. For deeply nested code with many blocks, this could be inefficient.

Recommendation: Consider using a stack-based approach or reference counting if this becomes a performance bottleneck.

7. API Query Performance (Priority: LOW)

windmill-api/src/assets.rs:47-48:

AND (asset.usage_kind <> 'flow' OR asset.usage_path = ANY(SELECT path FROM flow WHERE workspace_id = $1))
AND (asset.usage_kind <> 'script' OR asset.usage_path = ANY(SELECT path FROM script WHERE workspace_id = $1))

Uses subqueries in the WHERE clause. For large datasets, this could be slow.

Recommendation: Consider using JOINs instead:

LEFT JOIN flow ON asset.usage_kind = 'flow' AND asset.usage_path = flow.path AND flow.workspace_id = $1
LEFT JOIN script ON asset.usage_kind = 'script' AND asset.usage_path = script.path AND script.workspace_id = $1
WHERE asset.workspace_id = $1
  AND (asset.kind <> 'resource' OR resource.path IS NOT NULL)
  AND (asset.usage_kind <> 'flow' OR flow.path IS NOT NULL)
  AND (asset.usage_kind <> 'script' OR script.path IS NOT NULL)

Also ensure indexes exist on flow.path and script.path.


Security Review

✅ No Critical Issues Found

  • Asset paths are validated through parse_asset_syntax()
  • Database queries use parameterized queries (sqlx)
  • No SQL injection vectors identified
  • Proper workspace isolation maintained

💡 Recommendations

  1. Add input sanitization for table names extracted from user code (especially for / characters)
  2. Document path format expectations to prevent malformed paths
  3. Consider rate limiting asset parsing operations if exposed to untrusted code

Testing Recommendations

The PR includes good unit tests, but consider adding:

  1. Integration tests for full parse → store → retrieve flow
  2. Edge case tests:
    • Table names with special characters (table-name, table.name, table/name)
    • Very long paths (stress test)
    • Unicode table names
    • Nested queries with multiple access types (to verify stack management)
    • Queries that reference same table multiple times with different access patterns
  3. Performance tests with large codebases (100+ asset references)
  4. Frontend E2E tests for the DB Manager specific table navigation

Adherence to Best Practices

✅ Follows Windmill Guidelines

  • Clean code with single responsibilities
  • Avoids duplication (mostly - some frontend duplication noted)
  • Rust performance optimizations applied correctly
  • Proper error handling throughout
  • No blocking operations in async contexts

⚠️ Minor Deviations

  • Some code duplication in frontend path parsing
  • Could extract more helper functions for better testability
  • SQL query complexity could be simplified

Additional Observations

  1. Good defensive programming: The parsers handle missing cases gracefully and fall back to unknown access types rather than failing
  2. Excellent test coverage: Each parser has comprehensive tests demonstrating various scenarios
  3. Consistent API: The asset format (kind/path/table) is consistent across all three languages
  4. Developer experience: The feature integrates well with existing UI components (DB Manager opening to specific tables is a nice touch)

Reproduction Instructions for Testers

To test the specific asset tables feature, navigate to the script editor and create a new Python, TypeScript, or SQL script. Write code that accesses a database with specific tables - for example, in Python use db = wmill.datatable('mydata') followed by db.query('SELECT * FROM users'), or in SQL use ATTACH 'ducklake://mydl' AS dl; SELECT * FROM dl.customers;, or in TypeScript use let sql = wmill.datatable('dt') followed by await sql\SELECT * FROM friends`.fetch(). After saving the script, go to the Assets page in the workspace navigation. You should see the specific tables listed (like mydata/usersormydl/customers`) rather than just the database names. Each asset should show whether it's being read, written, or both using R, W, or RW indicators. Click the "Manage" button next to any database asset to open the DB Manager directly to that specific table - the table should be pre-selected in the table list on the left sidebar. In the Flow editor, when you reference a script with database assets, the flow's asset graph should display the specific tables being accessed with appropriate icons and access indicators, distinguishing between read operations (shown with read indicators) and write operations (shown with write indicators).


Recommendation

Approve with minor revisions. The implementation is solid and well-tested. The issues identified are mostly minor improvements and edge case handling. The critical issues should be addressed:

Must Fix Before Merge:

  1. Verify SQL parser stack management for nested queries (or add test confirming it works)
  2. ⚠️ Consider the access type merging logic - current behavior might be intentional but loses information

Should Fix (Can be follow-up):

  1. Add path validation/sanitization for table names with /
  2. Add frontend path parsing validation
  3. Add missing ducklake/datatable parsing in frontend parseAssetFromString
  4. Extract frontend path parsing helper

Nice to Have:

  1. Optimize TypeScript parser HashMap cloning
  2. Optimize API query with JOINs instead of subqueries

Overall, this is a valuable feature that significantly improves asset tracking granularity! The code quality is high, testing is comprehensive, and the implementation follows established patterns. The concerns raised are mostly about edge cases and potential optimizations rather than fundamental issues. 🎉


claude[bot] avatar Dec 10 '25 15:12 claude[bot]