Refactor extraction process. Closes #622.
Description
This PR introduces a complete rework of the extraction process. The idea is to improve testability, extensibility and maintainability by following some best practices:
- repository pattern: repositories handle data access without containing any processing logic
- single responsibility: every class in the process has one clear and recognizable responsibility
- dependency injection: dependencies are injected through constructors which makes testing much easier
- strategy pattern: makes it easier to add new "special treatment" for honeypots
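As a rough illustration of how the repository pattern, single responsibility and constructor injection fit together, here is a minimal sketch. It is not the code from this PR: the `Ioc` dataclass, its fields, the `InMemoryIocRepository` test double and the merge rule are all hypothetical. Only the names `IocProcessor`, `IocRepository`, `add_ioc`, `get_ioc_by_name`, `save` and `merge_iocs` are taken from the process flow, and their signatures are assumed.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class Ioc:
    # Hypothetical stand-in for the real IOC model.
    name: str
    attack_count: int = 1
    honeypots: set = field(default_factory=set)


class IocRepository(Protocol):
    # Repository pattern: data access only, no processing logic.
    def get_ioc_by_name(self, name: str) -> Optional[Ioc]: ...
    def save(self, ioc: Ioc) -> Ioc: ...


class InMemoryIocRepository:
    """Test double that keeps IOCs in a dict instead of a database."""

    def __init__(self) -> None:
        self._store: dict = {}

    def get_ioc_by_name(self, name: str) -> Optional[Ioc]:
        return self._store.get(name)

    def save(self, ioc: Ioc) -> Ioc:
        self._store[ioc.name] = ioc
        return ioc


class IocProcessor:
    """Single responsibility: decide whether to create or merge an IOC."""

    def __init__(self, repo: IocRepository) -> None:
        # Dependency injection: the repository is passed in, not created here.
        self._repo = repo

    def add_ioc(self, ioc: Ioc) -> Ioc:
        existing = self._repo.get_ioc_by_name(ioc.name)
        if existing is None:
            return self._repo.save(ioc)
        return self._repo.save(self.merge_iocs(existing, ioc))

    @staticmethod
    def merge_iocs(existing: Ioc, new: Ioc) -> Ioc:
        # Assumed merge rule: sum the counters, union the honeypot names.
        existing.attack_count += new.attack_count
        existing.honeypots |= new.honeypots
        return existing


# The processor can be exercised without a database.
repo = InMemoryIocRepository()
processor = IocProcessor(repo)
processor.add_ioc(Ioc("198.51.100.7", honeypots={"Cowrie"}))
merged = processor.add_ioc(Ioc("198.51.100.7", honeypots={"Log4pot"}))
```

Because the processor only depends on the two repository methods, a test can swap in the in-memory double and check the merge behaviour in isolation.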
The new process flow looks like this:
sequenceDiagram
participant Job as ExtractionJob
participant Pipeline as ExtractionPipeline
participant Elastic as ElasticRepository
participant Factory as StrategyFactory
participant Strategy as ExtractionStrategy
participant Processor as IocProcessor
participant Repo as IocRepository
Job->>Pipeline: execute()
Pipeline->>Elastic: search(minutes_back)
Elastic-->>Pipeline: hits[]
loop Each honeypot
Pipeline->>Factory: get_strategy(honeypot)
Factory-->>Pipeline: strategy
Pipeline->>Strategy: extract_from_hits(hits)
Strategy->>Strategy: iocs_from_hits(hits)
loop Each IOC
Strategy->>Processor: add_ioc(ioc)
Processor->>Repo: get_ioc_by_name(name)
alt IOC exists
Processor->>Processor: merge_iocs()
Processor->>Repo: save(ioc)
else New IOC
Processor->>Repo: save(ioc)
end
end
end
Pipeline->>Pipeline: UpdateScores()
A single ExtractionPipeline instance orchestrates the extraction for all available honeypots. It uses the ElasticRepository to retrieve a list of all honeypot hits from a certain time window. For each honeypot it gets the corresponding ExtractionStrategy, which contains all the extraction logic that is specific to a certain type of honeypot (e.g. Cowrie). The ExtractionStrategy uses this logic to create IOC objects and hands them to the IocProcessor, which is responsible for - well - processing them so they can be written to the database via the IocRepository.
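The flow described above could be sketched roughly as follows. This is a simplified illustration under several assumptions, not the implementation in this PR: hits are assumed to be dicts with `type` and `src_ip` keys, IOCs are plain dicts, `GenericExtractionStrategy`, `FakeElasticRepository` and `RecordingProcessor` are hypothetical helpers, the Cowrie rule is a placeholder, and the score update step is left out.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class ExtractionStrategy(ABC):
    def __init__(self, honeypot, processor):
        self.honeypot = honeypot
        self.processor = processor

    @abstractmethod
    def iocs_from_hits(self, hits):
        """Turn raw hits into IOC records (honeypot-specific logic)."""

    def extract_from_hits(self, hits):
        # Shared part of every strategy: hand each IOC to the processor.
        for ioc in self.iocs_from_hits(hits):
            self.processor.add_ioc(ioc)


class GenericExtractionStrategy(ExtractionStrategy):
    def iocs_from_hits(self, hits):
        # Assumed hit layout: a dict with a "src_ip" key.
        return [{"name": h["src_ip"], "honeypot": self.honeypot} for h in hits]


class CowrieExtractionStrategy(GenericExtractionStrategy):
    def iocs_from_hits(self, hits):
        # Placeholder "special treatment" (hypothetical rule):
        # also record the session id of each hit.
        iocs = super().iocs_from_hits(hits)
        for ioc, hit in zip(iocs, hits):
            ioc["session"] = hit.get("session")
        return iocs


class StrategyFactory:
    def __init__(self, processor):
        self.processor = processor
        # New special cases are added by registering another class here.
        self._special = {"Cowrie": CowrieExtractionStrategy}

    def get_strategy(self, honeypot):
        cls = self._special.get(honeypot, GenericExtractionStrategy)
        return cls(honeypot, self.processor)


class ExtractionPipeline:
    def __init__(self, elastic_repo, factory):
        self.elastic_repo = elastic_repo
        self.factory = factory

    def execute(self, minutes_back=10):
        # Group the hits of the time window by honeypot type.
        hits_by_honeypot = defaultdict(list)
        for hit in self.elastic_repo.search(minutes_back):
            hits_by_honeypot[hit["type"]].append(hit)
        for honeypot, hits in hits_by_honeypot.items():
            self.factory.get_strategy(honeypot).extract_from_hits(hits)
        return len(hits_by_honeypot)


class FakeElasticRepository:
    def search(self, minutes_back):
        # Made-up example data using documentation IP addresses.
        return [
            {"type": "Cowrie", "src_ip": "192.0.2.1", "session": "abc"},
            {"type": "Heralding", "src_ip": "192.0.2.2"},
        ]


class RecordingProcessor:
    def __init__(self):
        self.seen = []

    def add_ioc(self, ioc):
        self.seen.append(ioc)


processor = RecordingProcessor()
pipeline = ExtractionPipeline(FakeElasticRepository(), StrategyFactory(processor))
honeypots_processed = pipeline.execute(minutes_back=10)
```

In this sketch the pipeline never needs to know which strategy it received, so supporting another honeypot with special handling only means adding one class and one factory entry.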
Key changes (functional)
- Sensors are now extracted in every extraction run. No extra job needed.
- General honeypots that are not in the database yet are automatically added and extracted (until disabled manually).
Next steps
- Thoroughly test the new process in a production-like environment. Although I wrote a lot of tests, we might still find some bugs, as the extraction process is quite complex. This should be done before we merge the changes to main.
- Create a honeypot exclusion list, which contains all honeypots that we do not want to have in our database (e.g. Ddospot) and stop them from being extracted.
- Remove the hard-coded "general honeypots".
- Refactor the Cowrie extraction process (=CowrieExtractionStrategy) and write tests for it.
- Write end-to-end pipeline tests. This should be done after Cowrie extraction is refactored.
- Use the repositories for other purposes as well (e.g. scoring).
(I will open separate issues / PRs for them.)
Related issues
- Closes #530
- Closes #606
- Closes #622
Type of change
- [x] Bug fix (non-breaking change which fixes an issue).
Checklist
- [x] I have read and understood the rules about how to Contribute to this project.
- [x] The pull request is for the branch develop.
- [x] I have added documentation of the new features.
- [x] Linters (Black, Flake, Isort) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
- [x] I have added tests for the feature/bug I solved. All the tests (new and old ones) gave 0 errors.
- [ ] If changes were made to an existing model/serializer/view, the docs were updated and regenerated (check CONTRIBUTE.md).
- [ ] If the GUI has been modified:
  - [ ] I have provided a screenshot of the result in the PR.
  - [ ] I have created new frontend tests for the new component or updated existing ones.
Important Rules
- If you fail to complete the Checklist properly, your PR won't be reviewed by the maintainers.
- If your changes decrease the overall test coverage (you will know after the Codecov CI job is done), you should add the required tests to fix the problem.
- Every time you make changes to the PR and you think the work is done, you should explicitly ask for a review. After being reviewed and having received a "change request", you should explicitly ask for a review again once you have made the requested changes.
Sorry @mlodic for the huge amount of changes in a single PR. But don't be scared, most of the lines are doc strings and tests anyway. :D