
Incorrect Entity Extraction from PDF by ImportDocument Connector

Open ericWadeFord opened this issue 4 months ago • 5 comments

Description

When enriching an external report via the ImportExternalReference and ImportDocument connectors, the PDF parsing produces incorrect entity associations in the Analyst Workbench/Draft. Specifically, common terms such as "will" are misclassified as Threat Actors, "victim" as Identities, and "threat intelligence" as Malware, indicating flawed entity recognition and extraction logic during document ingestion.

Environment

6.7.9

Reproducible Steps

Steps to create the smallest reproducible scenario:

  1. Create a new report with an external reference. In testing, I used the following report: https://cloud.google.com/blog/topics/threat-intelligence/voice-phishing-data-extortion
  2. Enrich the external reference, running the ImportExternalReference connector, and wait for all actions to complete, including the ImportDocument connector.
  3. Navigate to the Data tab and click on the entry in the Analyst Workbench section.
  4. Under the Entities section, observe that the PDF parsed by the ImportDocument connector yields several objects that should not have been included. In testing, these included "will" as a Threat Actor, "victim" as an Identity, and "threat intelligence" as Malware, among others.

Expected Output

Parsing is more precise: common words such as "will", "victim", and "threat intelligence" are not proposed as entities.

Actual Output

Parsing is too broad: common words such as "will", "victim", and "threat intelligence" are proposed as entities in the Draft.

ericWadeFord Aug 08 '25 13:08

@ericWadeFord I don't think it's a bug, but rather an unwanted behavior. ImportDoc does not create knowledge out of nowhere, but simply parses the doc & looks for corresponding data in your platform. Therefore, if "will" is detected as a threat actor, it means your platform has already ingested it as a threat actor.

So I feel that what you're looking for is maybe a way to either define a blacklist of entities that you don't want to ingest, or a way to easily convert entities that are wrongly ingested. And maybe a better way to control the sources bringing you data that is not up to your standards. Am I correct?
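
To make the behavior described above concrete, here is a minimal sketch of name-based matching with an optional blacklist. It is not the connector's actual code: `KNOWN_ENTITIES`, `DENYLIST`, `match_entities`, and the "ExampleActor" entry are all hypothetical names invented for illustration, and whole-word, case-insensitive matching is an assumption.

```python
import re

# Hypothetical snapshot of entity names already present in the platform.
KNOWN_ENTITIES = {
    "will": "Threat-Actor",
    "victim": "Identity",
    "threat intelligence": "Malware",
    "ExampleActor": "Intrusion-Set",
}

# Hypothetical blacklist of names that should never be matched.
DENYLIST = {"will", "victim", "threat intelligence"}


def match_entities(text, known, denylist=frozenset()):
    """Return sorted (name, type) pairs for known names found in text as whole words."""
    blocked = {d.lower() for d in denylist}
    found = []
    for name, entity_type in known.items():
        if name.lower() in blocked:
            continue  # skip names the analyst has excluded
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append((name, entity_type))
    return sorted(found)


text = "The victim will receive a call; threat intelligence links this to ExampleActor."
print(match_entities(text, KNOWN_ENTITIES))            # all four names match
print(match_entities(text, KNOWN_ENTITIES, DENYLIST))  # only ExampleActor remains
```

With matching like this, any common word that was once ingested as an entity name will be found in almost every report, which is why a blacklist (or cleaning up the wrongly ingested entities) would address the symptom.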

nino-filigran Aug 18 '25 07:08

@nino-filigran, I was not aware of this behavior. I will reach out and get the reporter's desired course of action.

ericWadeFord Aug 18 '25 12:08

Correct for all of the above

ericWadeFord Dec 11 '25 19:12

@nino-filigran, there is one aspect I need clarification on.

You mentioned that the ImportDoc does not create data out of nowhere. Can you explain what happens when the Draft shows the operation as "Create"?

For instance, the following screenshot shows the creation of an Individual entity with the name "victim":

Image

However, this entity does not exist in the platform, hence the create operation. If the entity did exist, it would be an update operation.
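
The reasoning above can be sketched as follows. This is only my understanding of how the Draft labels operations, not the platform's implementation; `draft_operation` and the `existing` set are hypothetical, and matching on a lowercased name plus entity type is an assumption.

```python
def draft_operation(name, entity_type, existing):
    """Return 'update' if (name, type) is already known, otherwise 'create'."""
    key = (name.lower(), entity_type)
    return "update" if key in existing else "create"


# Hypothetical platform content: one entity, and no Individual named "victim".
existing = {("examplegroup", "Threat-Actor")}

print(draft_operation("victim", "Individual", existing))          # create
print(draft_operation("ExampleGroup", "Threat-Actor", existing))  # update
```

Under that assumption, a "Create" operation for "victim" means the entity was not found in the platform and is being newly proposed by the import.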

ericWadeFord Dec 11 '25 19:12

Sorry, I have not been 100% specific. It does create data, but only for some entities. You can find the doc here: https://github.com/OpenCTI-Platform/connectors/tree/master/internal-import-file/import-document

nino-filigran Dec 12 '25 08:12