Refactor extraction process. Closes #622.

Open regulartim opened this issue 1 week ago • 1 comment

Description

This PR introduces a complete rework of the extraction process. The idea is to improve testability, extensibility and maintainability by following some best practices:

  • repository pattern: repositories handle data access without containing any processing logic
  • single responsibility: every class in the process has one clear and recognizable responsibility
  • dependency injection: dependencies are injected through constructors which makes testing much easier
  • strategy pattern: makes it easier to add new "special treatment" for honeypots
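To illustrate how the repository pattern and constructor injection work together, here is a minimal sketch. It is not the code from this PR: the `Ioc` fields, the `IocRepositoryLike` protocol, the `InMemoryIocRepository` test double and the merge rule are all assumptions made for illustration. Only the names `IocProcessor`, `add_ioc`, `get_ioc_by_name`, `save` and `merge_iocs` are borrowed from the description.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass
class Ioc:
    # Hypothetical, heavily simplified stand-in for the real IOC model.
    name: str
    attack_count: int = 1
    honeypots: set[str] = field(default_factory=set)


class IocRepositoryLike(Protocol):
    # Data access only: no processing logic lives in a repository.
    def get_ioc_by_name(self, name: str) -> Optional[Ioc]: ...

    def save(self, ioc: Ioc) -> Ioc: ...


class InMemoryIocRepository:
    """Test double that keeps IOCs in a dict instead of a database."""

    def __init__(self) -> None:
        self._store: dict[str, Ioc] = {}

    def get_ioc_by_name(self, name: str) -> Optional[Ioc]:
        return self._store.get(name)

    def save(self, ioc: Ioc) -> Ioc:
        self._store[ioc.name] = ioc
        return ioc


class IocProcessor:
    """Processing logic only; the repository is injected via the constructor."""

    def __init__(self, ioc_repo: IocRepositoryLike) -> None:
        self._ioc_repo = ioc_repo

    def add_ioc(self, ioc: Ioc) -> Ioc:
        existing = self._ioc_repo.get_ioc_by_name(ioc.name)
        if existing is not None:
            ioc = self._merge_iocs(existing, ioc)
        return self._ioc_repo.save(ioc)

    @staticmethod
    def _merge_iocs(existing: Ioc, new: Ioc) -> Ioc:
        # Assumed merge rule for illustration: sum counts, union honeypots.
        existing.attack_count += new.attack_count
        existing.honeypots |= new.honeypots
        return existing
```

Because the processor only depends on the two repository methods, a test can pass in the in-memory double and never touch a database.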

The new process flow looks like this:

sequenceDiagram
    participant Job as ExtractionJob
    participant Pipeline as ExtractionPipeline
    participant Elastic as ElasticRepository
    participant Factory as StrategyFactory
    participant Strategy as ExtractionStrategy
    participant Processor as IocProcessor
    participant Repo as IocRepository
    
    Job->>Pipeline: execute()
    Pipeline->>Elastic: search(minutes_back)
    Elastic-->>Pipeline: hits[]
    
    loop Each honeypot
        Pipeline->>Factory: get_strategy(honeypot)
        Factory-->>Pipeline: strategy
        Pipeline->>Strategy: extract_from_hits(hits)
        Strategy->>Strategy: iocs_from_hits(hits)
        
        loop Each IOC
            Strategy->>Processor: add_ioc(ioc)
            Processor->>Repo: get_ioc_by_name(name)
            alt IOC exists
                Processor->>Processor: merge_iocs()
                Processor->>Repo: save(ioc)
            else New IOC
                Processor->>Repo: save(ioc)
            end
        end
    end
    
    Pipeline->>Pipeline: UpdateScores()

A single ExtractionPipeline instance orchestrates the extraction for all available honeypots. It uses the ElasticRepository to fetch a list of all honeypot hits from a given time window. For each honeypot, it gets the corresponding ExtractionStrategy, which contains all the extraction logic that is specific to a certain type of honeypot (e.g. Cowrie). The ExtractionStrategy uses this logic to create IOC objects and hands them to the IocProcessor, which is responsible for - well - processing them so they can be written to the database via the IocRepository.
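The flow described above can be sketched roughly as follows. This is a simplified illustration, not the implementation in this PR: the hit fields (`type`, `src_ip`, `url`), the `GenericExtractionStrategy` and `FakeElasticRepository` classes, the extra URL handling for Cowrie, and the plain list standing in for the IocProcessor are all assumptions. The score update at the end of the pipeline is left out.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class ExtractionStrategy(ABC):
    def __init__(self, honeypot: str, sink: list) -> None:
        self.honeypot = honeypot
        self._sink = sink  # assumption: a list stands in for the IocProcessor

    @abstractmethod
    def iocs_from_hits(self, hits: list) -> list:
        """Honeypot-specific logic that turns raw hits into IOC values."""

    def extract_from_hits(self, hits: list) -> None:
        for ioc in self.iocs_from_hits(hits):
            self._sink.append((self.honeypot, ioc))


class GenericExtractionStrategy(ExtractionStrategy):
    def iocs_from_hits(self, hits: list) -> list:
        # Default treatment: one IOC per distinct source IP.
        return sorted({hit["src_ip"] for hit in hits})


class CowrieExtractionStrategy(GenericExtractionStrategy):
    def iocs_from_hits(self, hits: list) -> list:
        # Hypothetical "special treatment": also collect URLs seen in hits.
        urls = {hit["url"] for hit in hits if "url" in hit}
        return super().iocs_from_hits(hits) + sorted(urls)


class StrategyFactory:
    def __init__(self, sink: list) -> None:
        self._sink = sink
        self._special = {"Cowrie": CowrieExtractionStrategy}

    def get_strategy(self, honeypot: str) -> ExtractionStrategy:
        # Fall back to the generic strategy for honeypots without special handling.
        strategy_cls = self._special.get(honeypot, GenericExtractionStrategy)
        return strategy_cls(honeypot, self._sink)


class FakeElasticRepository:
    """Test double returning canned hits instead of querying Elasticsearch."""

    def __init__(self, hits: list) -> None:
        self._hits = hits

    def search(self, minutes_back: int) -> list:
        return list(self._hits)


class ExtractionPipeline:
    def __init__(self, elastic_repo, factory: StrategyFactory) -> None:
        self._elastic_repo = elastic_repo
        self._factory = factory

    def execute(self, minutes_back: int) -> int:
        # Group hits by honeypot, then let each strategy process its group.
        hits_by_honeypot = defaultdict(list)
        for hit in self._elastic_repo.search(minutes_back):
            hits_by_honeypot[hit["type"]].append(hit)
        for honeypot, hits in hits_by_honeypot.items():
            self._factory.get_strategy(honeypot).extract_from_hits(hits)
        return len(hits_by_honeypot)  # number of honeypots processed
```

In this sketch, adding special handling for another honeypot only requires a new strategy class and one more entry in the factory's mapping; the pipeline itself stays unchanged.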

Key changes (functional)

  • Sensors are now extracted in every extraction run. No extra job needed.
  • General honeypots that are not in the database yet are automatically added and extracted (until disabled manually).
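The second point could behave roughly like the sketch below. It only illustrates the idea; the `Honeypot` dataclass, its `active` flag, the `InMemoryHoneypotRepository` and the `honeypots_to_extract` helper are hypothetical names, not the actual models or functions in this PR.

```python
from dataclasses import dataclass


@dataclass
class Honeypot:
    # Hypothetical stand-in for the honeypot database model.
    name: str
    active: bool = True


class InMemoryHoneypotRepository:
    def __init__(self) -> None:
        self._honeypots: dict[str, Honeypot] = {}

    def get_or_create(self, name: str) -> Honeypot:
        # Unknown honeypots are added as active the first time they are seen.
        return self._honeypots.setdefault(name, Honeypot(name))

    def disable(self, name: str) -> None:
        # Models the manual "disable" step mentioned above.
        self.get_or_create(name).active = False


def honeypots_to_extract(repo: InMemoryHoneypotRepository, seen_names: list) -> list:
    # Keep every honeypot seen in the hits unless it was disabled manually.
    return [name for name in seen_names if repo.get_or_create(name).active]
```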

Next steps

  • Thoroughly test the new process in a production-like environment. Although I wrote a lot of tests, we might still find some bugs, as the extraction process is quite complex. This should be done before we merge the changes to main.
  • Create a honeypot exclusion list, which contains all honeypots that we do not want to have in our database (e.g. Ddospot) and stop them from being extracted.
  • Remove the hard-coded "general honeypots".
  • Refactor the Cowrie extraction process (=CowrieExtractionStrategy) and write tests for it.
  • Write end-to-end pipeline tests. This should be done after Cowrie extraction is refactored.
  • Use the repositories for other purposes as well (e.g. scoring).

(I will open separate issues / PRs for them.)

Related issues

  • Closes #530
  • Closes #606
  • Closes #622

Type of change

  • [x] Bug fix (non-breaking change which fixes an issue).

Checklist

  • [x] I have read and understood the rules about how to Contribute to this project.
  • [x] The pull request is for the branch develop.
  • [x] I have added documentation of the new features.
  • [x] Linters (Black, Flake, Isort) gave 0 errors. If you have correctly installed pre-commit, it does these checks and adjustments on your behalf.
  • [x] I have added tests for the feature/bug I solved. All the tests (new and old ones) gave 0 errors.
  • [ ] If changes were made to an existing model/serializer/view, the docs were updated and regenerated (check CONTRIBUTE.md).
  • [ ] If the GUI has been modified:
    • [ ] I have provided a screenshot of the result in the PR.
    • [ ] I have created new frontend tests for the new component or updated existing ones.

Important Rules

  • If you fail to complete the Checklist properly, your PR won't be reviewed by the maintainers.
  • If your changes decrease the overall test coverage (you will know after the Codecov CI job is done), you should add the required tests to fix the problem.
  • Every time you make changes to the PR and you think the work is done, you should explicitly ask for a review. After your PR has been reviewed and has received a "change request", you should explicitly ask for a review again once you have made the requested changes.

regulartim, Dec 18 '25 08:12

Sorry @mlodic for the huge amount of changes in a single PR. But don't be scared, most of the lines are docstrings and tests anyway. :D

regulartim, Dec 18 '25 14:12