Incorrect Entity Extraction from PDF by ImportDocument Connector
Description
When enriching an external report via the ImportExternalReference and ImportDocument connectors, the PDF parsing produces incorrect entity associations in the Analyst Workbench/Draft. Specifically, common terms such as "will" are misclassified as Threat Actors, "victim" as Identities, and "threat intelligence" as Malware, indicating flawed entity recognition and extraction logic during document ingestion.
Environment
6.7.9
Reproducible Steps
Steps to create the smallest reproducible scenario:
- Create a new report with an external reference. In testing, I used the following report: https://cloud.google.com/blog/topics/threat-intelligence/voice-phishing-data-extortion
- Enrich the external reference, running the ImportExternalReference connector, and wait for all actions to complete, including the ImportDocument connector.
- Navigate to the Data tab and click on the entry in the Analyst Workbench section.
- Under the Entities section, observe that the ImportDocument connector's parsing of the PDF produced several objects that should not have been included. In testing, these included the words "will" as a Threat Actor, "victim" as an Identity, and "threat intelligence" as Malware, among others.
Expected Output
The parsing of objects is more refined, so objects that should not be created are not created.
Actual Output
The parsing of objects is unrefined, and objects that should not be created are created.
@ericWadeFord I don't think it's a bug, but rather an unwanted behavior. ImportDoc does not create knowledge out of nowhere; it simply parses the doc and looks for corresponding data in your platform. Therefore, if "will" is detected as a threat actor, it means your platform had already ingested it as a threat actor.
So I feel that what you're looking for is maybe a way to either define a blacklist of entities that you don't want to ingest, or a way to easily convert entities that were wrongly ingested. And maybe a better way to control the sources that bring you data that is not up to your standards. Am I correct?
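To make the behavior described in this comment concrete, here is a minimal sketch of name-based matching with a blacklist filter. It is not the ImportDocument connector's actual code: `KnownEntity`, `match_known_entities`, the `denylist` parameter, and the sample entity "ExampleActor" are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownEntity:
    """Hypothetical stand-in for an entity already stored in the platform."""
    name: str
    entity_type: str


def match_known_entities(text, known_entities, denylist=frozenset()):
    """Return the known entities whose name occurs in `text` as a whole
    word or phrase, skipping names on the denylist (case-insensitive)."""
    denied = {name.lower() for name in denylist}
    matches = []
    for entity in known_entities:
        if entity.name.lower() in denied:
            continue  # blacklisted name: never propose it
        pattern = r"\b" + re.escape(entity.name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            matches.append(entity)
    return matches


# Illustrative data only: "will" exists as a threat actor because some
# earlier source ingested it that way.
known = [
    KnownEntity("will", "Threat-Actor"),
    KnownEntity("threat intelligence", "Malware"),
    KnownEntity("ExampleActor", "Threat-Actor"),
]
text = "ExampleActor will call the victim; see our threat intelligence blog."

without_filter = [e.name for e in match_known_entities(text, known)]
with_filter = [
    e.name
    for e in match_known_entities(text, known, {"will", "threat intelligence"})
]
print(without_filter)
print(with_filter)
```

Without the filter, every stored name that appears in the text is proposed, including the common word "will". With the filter, only "ExampleActor" remains, which is the effect a blacklist option would have.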
@nino-filigran, I was not aware of this behavior. I will reach out and get the reporter's desired course of action.
Correct for all of the above
@nino-filigran, there is one aspect I need clarification on.
You mentioned that the ImportDoc does not create data out of nowhere. Can you explain what happens when the Draft shows the operation as "Create"?
For instance, the following screenshot shows the creation of an Individual entity with the name "victim".
However, this entity does not exist in the platform, hence the create operation. If the entity did exist, it would be an update operation.
Sorry, I have not been 100% specific. It does create data, but only for some entities. You can find the doc here: https://github.com/OpenCTI-Platform/connectors/tree/master/internal-import-file/import-document
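The create-versus-update distinction discussed in this thread can be sketched as follows. This is an assumption-laden illustration, not the connector's logic: `draft_operation`, `existing`, and `creatable_types` are hypothetical names, and the set of entity types the connector will actually create is documented at the link above rather than here.

```python
from typing import Optional, Set, Tuple


def draft_operation(
    name: str,
    entity_type: str,
    existing: Set[Tuple[str, str]],
    creatable_types: Set[str],
) -> Optional[str]:
    """Return the operation a draft would show for a parsed entity.

    "update" if (type, name) is already in the platform, "create" if it is
    missing but its type is one that may be created, otherwise None.
    """
    key = (entity_type, name.lower())
    if key in existing:
        return "update"
    if entity_type in creatable_types:
        return "create"
    return None  # unknown entity of a type that is only matched, not created


# Illustrative state: "will" was previously ingested as a threat actor.
existing = {("Threat-Actor", "will")}
# Hypothetical set; the thread only shows that an Individual was created.
creatable = {"Individual"}

print(draft_operation("will", "Threat-Actor", existing, creatable))
print(draft_operation("victim", "Individual", existing, creatable))
print(draft_operation("threat intelligence", "Malware", existing, creatable))
```

Under these assumptions, "will" is an update because it already exists, "victim" is a create because it does not exist and its type is creatable, and an unknown name of a non-creatable type is not proposed at all.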