unstract
Optimization and fix of file execution - parallel execution
What
- UN-2410
- Skips processing of duplicate input files based on file hash, even if the file names differ.
- Catches and handles unknown exceptions during file processing initialization (this is a safety net for scenarios that shouldn't normally occur).
- Uses file hashes from fsspec metadata (instead of computing content hashes) for source connectors during file listing (ETL/TASK).
- Optimizes file listing by reducing multiple fsspec API calls; it now uses a single `listdir` call to retrieve metadata more efficiently.
- Updates the Google Drive connector to use fsspec metadata for file hashes (the `checksum` field).
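The listing optimization above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes an fsspec-style filesystem whose `listdir(path, detail=True)` returns one metadata dict per entry with at least `name` and `type` keys. The helper `list_files_with_metadata`, the `FakeFS` stand-in, and the sample paths are hypothetical names introduced only for the example.

```python
from typing import Any


def list_files_with_metadata(fs: Any, directory: str) -> dict[str, dict[str, Any]]:
    """Return {path: metadata} for every file directly under `directory`.

    A single detailed listing replaces a name-only listing followed by
    one metadata lookup per file.
    """
    entries = fs.listdir(directory, detail=True)
    return {e["name"]: e for e in entries if e.get("type") == "file"}


class FakeFS:
    """Hypothetical stand-in for an fsspec filesystem that counts listing calls."""

    def __init__(self, entries: list[dict[str, Any]]) -> None:
        self._entries = entries
        self.calls = 0

    def listdir(self, path: str, detail: bool = True) -> list[Any]:
        self.calls += 1
        return self._entries if detail else [e["name"] for e in self._entries]


fake_fs = FakeFS(
    [
        {"name": "inbox/a.pdf", "size": 10, "type": "file", "checksum": "abc"},
        {"name": "inbox/b.pdf", "size": 10, "type": "file", "checksum": "abc"},
        {"name": "inbox/archive", "size": 0, "type": "directory"},
    ]
)
files = list_files_with_metadata(fake_fs, "inbox")
```

With a real backend, the metadata keys available in each entry (including whether a hash is present at all) depend on the connector, so the `checksum` key in the sample data should be read as an assumption.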
Why
- Prevents reprocessing the same file content uploaded under different names, avoiding redundant or unnecessary file execution.
- Improves performance and reliability of file listing, especially for cloud-based sources.
How
- Deduplication logic is based on the file hash, obtained from metadata or from a content hash.
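One way to picture the deduplication step is the sketch below. It is an assumption-laden illustration rather than the PR's implementation: the function names, the `checksum` metadata key, the `read_bytes` callback, and the choice of SHA-256 for the content-hash fallback are all hypothetical.

```python
import hashlib
from typing import Any, Callable, Iterable


def resolve_file_hash(meta: dict[str, Any], read_bytes: Callable[[str], bytes]) -> str:
    """Prefer the hash carried in listing metadata; otherwise hash the content."""
    checksum = meta.get("checksum")  # hypothetical metadata key
    if checksum:
        return str(checksum)
    # Fallback: compute a content hash (SHA-256 chosen only for illustration).
    return hashlib.sha256(read_bytes(meta["name"])).hexdigest()


def skip_duplicates(
    entries: Iterable[dict[str, Any]], read_bytes: Callable[[str], bytes]
) -> list[str]:
    """Return file names in listing order, keeping the first file per hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for meta in entries:
        file_hash = resolve_file_hash(meta, read_bytes)
        if file_hash in seen:
            continue  # same content already queued under another name
        seen.add(file_hash)
        unique.append(meta["name"])
    return unique


# Sample data: a.pdf and b.pdf share content; c.pdf carries a metadata hash.
contents = {"a.pdf": b"same", "b.pdf": b"same", "c.pdf": b"other"}
entries = [{"name": "a.pdf"}, {"name": "b.pdf"}, {"name": "c.pdf", "checksum": "xyz"}]
unique = skip_duplicates(entries, contents.__getitem__)
```

Note that a metadata hash and a locally computed content hash are only comparable if they use the same algorithm, so a sketch like this can miss duplicates when one copy has a metadata hash and the other does not.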
Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
- No
Database Migrations
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
Screenshots
Checklist
I have read and understood the Contribution Guidelines.
| filepath | function | passed | SUBTOTAL |
|---|---|---|---|
| `runner/src/unstract/runner/clients/test_docker.py` | `test_logs` | 1 | 1 |
| `runner/src/unstract/runner/clients/test_docker.py` | `test_cleanup` | 1 | 1 |
| `runner/src/unstract/runner/clients/test_docker.py` | `test_cleanup_skip` | 1 | 1 |
| `runner/src/unstract/runner/clients/test_docker.py` | `test_client_init` | 1 | 1 |
| `runner/src/unstract/runner/clients/test_docker.py` | `test_get_image_exists` | 1 | 1 |
| `runner/src/unstract/runner/clients/test_docker.py` | `test_get_image` | 1 | 1 |
| `runner/src/unstract/runner/clients/test_docker.py` | `test_get_container_run_config` | 1 | 1 |
| `runner/src/unstract/runner/clients/test_docker.py` | `test_get_container_run_config_without_mount` | 1 | 1 |
| `runner/src/unstract/runner/clients/test_docker.py` | `test_run_container` | 1 | 1 |
| `runner/src/unstract/runner/clients/test_docker.py` | `test_get_image_for_sidecar` | 1 | 1 |
| `runner/src/unstract/runner/clients/test_docker.py` | `test_sidecar_container` | 1 | 1 |
| TOTAL | | 11 | 11 |
Quality Gate passed
Issues
0 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code