cdxgen Occurrence line numbers are off

Running cdxgen with the research profile on the hyades-apiserver repository yields occurrences with incorrect line numbers.

git clone https://github.com/DependencyTrack/hyades-apiserver.git cdxgen-hyades-apiserver
cd cdxgen-hyades-apiserver

# For reproducibility. This is the latest commit as of creation of this issue.
git reset --hard cf2744a829bf97d61fe42c80a019d52e5fb56098

docker run --rm \
  -e CDXGEN_DEBUG_MODE=debug \
  -v /tmp:/tmp \
  -v $(pwd):/app:rw \
  --pull always \
  -t ghcr.io/cyclonedx/cdxgen:master \
  -o /app/bom.json -t java --profile research . -p

jq '.' bom.json > bom-formatted.json

Result: bom-formatted.json

For the component alpine-common, the first 3 occurrences listed are:

{
  "location": "src/main/java/org/dependencytrack/auth/Permissions.java#36"
},
{
  "location": "src/main/java/org/dependencytrack/common/ClusterInfo.java#91"
},
{
  "location": "src/main/java/org/dependencytrack/common/HttpClientPool.java#200"
}

Line 36 in Permissions.java is the definition of an enum field.
ClusterInfo.java only has 85 lines.
HttpClientPool.java only has 137 lines.

Perhaps there is some sort of offset which is miscalculated?
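Locations like the ones above can be spotted mechanically by comparing each occurrence's line number with the length of the file on disk. The following is a minimal sketch, assuming the BOM layout shown above (top-level components with evidence.occurrences[].location of the form path#line); the function names are hypothetical and nested components are not handled:

```python
import json
from pathlib import Path


def split_location(location):
    """Split 'path/to/File.java#36' into ('path/to/File.java', 36)."""
    path, _, line = location.rpartition("#")
    if not path or not line.isdigit():
        # No usable '#<line>' suffix: report the raw value without a line.
        return location, None
    return path, int(line)


def out_of_range_occurrences(bom, repo_root):
    """Yield (component name, location, line count) for every occurrence
    whose line number is larger than the number of lines in the file."""
    root = Path(repo_root)
    for component in bom.get("components", []):
        occurrences = component.get("evidence", {}).get("occurrences", [])
        for occ in occurrences:
            path, line = split_location(occ.get("location", ""))
            source = root / path
            if line is None or not source.is_file():
                continue  # nothing to compare against
            with source.open(encoding="utf-8", errors="replace") as fh:
                line_count = sum(1 for _ in fh)
            if line > line_count:
                yield component.get("name"), occ["location"], line_count


def load_bom(bom_path):
    """Read a CycloneDX JSON BOM from disk."""
    with open(bom_path, encoding="utf-8") as fh:
        return json.load(fh)
```

Calling out_of_range_occurrences(load_bom("bom.json"), ".") from the repository root would list entries such as the ClusterInfo.java and HttpClientPool.java ones. It cannot detect a wrong line number that still falls inside the file.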

Mar 04 '25 11:03 nscuro

Interesting. Something is terribly wrong somewhere. There are no properties named "Namespaces", so I would expect no occurrence evidence for these. cdx:bom:componentNamespaces appears repeatedly under metadata.component. Let me investigate.

Mar 04 '25 11:03 prabhu

A BOM I generated yesterday had similar oddities, although I am failing to reproduce it now. I used the same cdxgen command as above, but it was run on the repository in a state where the project was already built (i.e. generated-sources populated and JAR in target/ directory).

bom-formatted2.json

What's interesting in this one is that it includes occurrences in target/generated-sources, for example:

{
  "location": "target/generated-sources/org/cyclonedx/proto/v1_6/Advisory.java#124"
}

The kafka-clients component has many incorrect occurrences assigned to it, amounting to several thousand.

Again, I'm failing to reproduce this now, but perhaps sharing the BOM helps.
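To quantify this kind of skew in a BOM, the occurrences can be counted per component, optionally restricted to a path prefix such as target/generated-sources/. This is a rough sketch under the same assumption about the BOM layout (top-level components with evidence.occurrences); the function name is hypothetical:

```python
from collections import Counter


def occurrence_counts(bom, path_prefix=""):
    """Count occurrences per component name.

    Only locations starting with path_prefix are counted; the default
    empty prefix matches every location.
    """
    counts = Counter()
    for component in bom.get("components", []):
        occurrences = component.get("evidence", {}).get("occurrences", [])
        for occ in occurrences:
            if occ.get("location", "").startswith(path_prefix):
                counts[component.get("name")] += 1
    return counts
```

For example, occurrence_counts(bom, "target/generated-sources/").most_common(5) would show which components attract the most occurrences from generated code.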

Mar 04 '25 12:03 nscuro

@nscuro cdxgen reuses existing app.atom and slices JSON files for performance reasons. Excluding directories for evidence purposes is a feature that still needs to be added. There are some default excludes in atom, but it looks like generated-sources is not among them.

Mar 04 '25 13:03 prabhu

Ah, that makes sense. Ok so when I nuke the atom and slices files, the linked BOM is reproducible. Should I raise a separate issue for the erroneous occurrence assignments, e.g. on kafka-clients?

Mar 04 '25 14:03 nscuro

No problem. Looking into this issue now. Thank you so much for checking!

Mar 04 '25 15:03 prabhu

@nscuro Thank you so much for flagging this issue.

Regarding Enums appearing in occurrence evidence for alpine-common.

This is correct behaviour! Those enums and internal types are used in annotations and logging functions in the code base. If there is a vulnerability in alpine-common, we need to track all the internal types that are passing untainted via those external libraries. While occurrence evidence is comprehensive (over-tainted), the reachables slices are precise (see attached zip).

src/main/java/org/dependencytrack/auth/Permissions.java#36 - This file and line number are correct.
src/main/java/org/dependencytrack/common/ClusterInfo.java#91 - The file is correct, but the line number is wrong. Line 91 belongs to a different file and method. However, line 63 in ClusterInfo contains the identical call, Config.isUnitTestsEnabled(). I suspect a bug in evinser.js that gets the line number wrong. I will keep this issue open and continue investigating.

Regarding generated-sources for kafka-clients. This is fixed with this atom PR.

I noticed that you were running --profile research via docker. Unfortunately, hyades is a large codebase so atom needs to be invoked in java mode to get reachables slices working (callstack evidence). The correct usage of evinse is as follows (4 steps):

git clone https://github.com/DependencyTrack/hyades-apiserver.git cdxgen-hyades-apiserver
cd cdxgen-hyades-apiserver
git reset --hard cf2744a829bf97d61fe42c80a019d52e5fb56098
mvn compile

# Run cdxgen in deep mode.
cdxgen -t java --deep -o bom.json $(pwd)

# Run atom in java mode
# During reachables slicing, bom.json would get used to compute the purls
# When we run reachables slicing first, app.atom would include data dependencies.
# Such a rich atom file is useful for both reachables and usages slicing.
<atom dir>/atom.sh -J-Xmx24G reachables -l java -o app.atom -s reachables.slices.json $(pwd)

# Usages slicing will be faster since the app.atom file would get reused.
<atom dir>/atom.sh -J-Xmx24G usages -l java -o app.atom -s usages.slices.json $(pwd)

# Now run evinse
evinse -l java -i bom.json -o bom.evinse.json

--profile research tries to automate these four steps. However, I have not figured out why the performance of atom is much slower when invoked from node.js processes. (Technically cdxgen node, invoking atom node, invoking java).
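The four manual steps can also be chained in a small driver script. The sketch below only assembles the command lines shown above and runs them in order; atom_dir stands in for the <atom dir> placeholder, the function names are hypothetical, and the project is assumed to have been compiled already (mvn compile):

```python
import subprocess
from pathlib import Path


def evinse_pipeline(project_dir, atom_dir, heap="24G"):
    """Return the four commands of the manual workflow as argv lists."""
    project = str(Path(project_dir).resolve())
    atom = str(Path(atom_dir) / "atom.sh")
    return [
        # 1. cdxgen in deep mode
        ["cdxgen", "-t", "java", "--deep", "-o", "bom.json", project],
        # 2. reachables slicing (creates app.atom)
        [atom, f"-J-Xmx{heap}", "reachables", "-l", "java",
         "-o", "app.atom", "-s", "reachables.slices.json", project],
        # 3. usages slicing (reuses app.atom)
        [atom, f"-J-Xmx{heap}", "usages", "-l", "java",
         "-o", "app.atom", "-s", "usages.slices.json", project],
        # 4. evinse
        ["evinse", "-l", "java", "-i", "bom.json", "-o", "bom.evinse.json"],
    ]


def run_pipeline(project_dir, atom_dir):
    """Run each step inside the project directory, stopping at the first failure."""
    for argv in evinse_pipeline(project_dir, atom_dir):
        subprocess.run(argv, cwd=project_dir, check=True)
```

Because the commands run as separate processes started from Python rather than from node.js, this does not say anything about the slowdown described above; it only saves retyping the steps.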

Please review the attached zip file and let me know your thoughts.

Archive.zip

Mar 05 '25 00:03 prabhu

Thanks @prabhu, appreciate the thorough response.

Regarding Enums appearing in occurrence evidence for alpine-common.

This is correct behaviour! Those enums and internal types are used in annotations and logging functions in the code base.

That makes sense; however, I believe it can be confusing when presented this way in occurrences (as shown by my own misunderstanding). Here I'd much rather expect "obvious", coarser usages of the library in question, perhaps even limited to one occurrence per file. As you say, reachables (or callstack info in CycloneDX lingo) are where the detail is / should be. I am a total noob in this area so please take what I say with a big grain of salt.
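The "one occurrence per file" idea can be approximated on the consumer side by post-processing the occurrences array. This is a hypothetical sketch, not something cdxgen does; it keeps the lowest line number seen for each file and drops entries without a path#line location:

```python
def one_occurrence_per_file(occurrences):
    """Collapse occurrences to a single entry per file.

    For each file the lowest line number is kept. Entries whose location
    does not end in '#<number>' are skipped.
    """
    first_line = {}
    for occ in occurrences:
        path, _, line = occ["location"].rpartition("#")
        if not path or not line.isdigit():
            continue
        number = int(line)
        if path not in first_line or number < first_line[path]:
            first_line[path] = number
    return [
        {"location": f"{path}#{number}"}
        for path, number in sorted(first_line.items())
    ]
```

The result is sorted by file path, so the original ordering of the array is not preserved.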

Unfortunately, hyades is a large codebase so atom needs to be invoked in java mode to get reachables slices working (callstack evidence).

Ah, classic case of RTFM! This is on me, I should've checked the docs first. Thanks for the hint, I'll give this a try.

I have not figured out why the performance of atom is much slower when invoked from node.js processes.

Is this problem just about slowness, or will it plainly not work using the --profile research way?

Mar 05 '25 11:03 nscuro

@nscuro, that's good feedback. The usages slices specification is designed with semantics in mind. Translating that rich structure into a single array for occurrence evidence limits its potential, but it still serves some higher-level use cases, such as building a heat map or training ML models that have a limited context window.

The java-node situation could definitely be improved. It just needs someone to run debuggers and investigate why the java process is unable to utilize all cores and threads. There is a default timeout of 10 minutes, so reachables slicing often doesn't finish, requiring this 4-step workaround.

Mar 05 '25 12:03 prabhu

With PR https://github.com/CycloneDX/cdxgen/pull/1672, I have improved the line numbers. They are a bit coarser now for Java and Jar types, which should at least lead to less confusion around line numbers.

"occurrences": [
  {
    "location": "src/main/java/org/dependencytrack/auth/Permissions.java#70"
  },
  {
    "location": "src/main/java/org/dependencytrack/common/ClusterInfo.java#62"
  },
  {
    "location": "src/main/java/org/dependencytrack/common/HttpClientPool.java#123"
  },
  {
    "location": "src/main/java/org/dependencytrack/common/HttpClientPool.java#86"
  },

I hope this helps!

bom-full.json

Mar 05 '25 15:03 prabhu

Thanks, can confirm the line numbers appear to be correct now!

Also, excluding the generated-sources directory has largely resolved the incorrect assignment of Protobuf occurrences to kafka-clients.

I am still seeing minor ambiguities though. For kafka-clients, it's listing these occurrences:

{
  "location": "src/main/java/org/dependencytrack/model/mapping/PolicyProtoMapper.java#109"
},
{
  "location": "src/main/java/org/dependencytrack/parser/dependencytrack/NotificationModelConverter.java#107"
},
{
  "location": "src/main/java/org/dependencytrack/parser/dependencytrack/NotificationModelConverter.java#300"
},
{
  "location": "src/main/java/org/dependencytrack/policy/cel/CelPolicyEngine.java#221"
},
{
  "location": "src/main/java/org/dependencytrack/policy/cel/CelVulnerabilityPolicyEvaluator.java#105"
},
{
  "location": "src/main/java/org/dependencytrack/policy/cel/persistence/CelPolicyDao.java#296"
},
{
  "location": "src/main/java/org/dependencytrack/policy/cel/persistence/CelPolicyDao.java#306"
}

These do not call any Kafka code; however, they could be called from our own Kafka consumer/producer logic. Note that kafka-clients itself does not call these classes, though. Is this expected behavior of atom?

I know some projects choose to include generated Protobuf code in their src directory, and commit it to their Git repos. While ignoring the generated-sources directory mostly worked for us, I wonder if other projects can still run into this.
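For projects that commit generated Protobuf code under src, a path-based exclude would not help, but a consumer could filter occurrences by looking at the file header instead. The sketch below is a consumer-side workaround, not an atom or cdxgen feature; it assumes the generated Java files carry the usual "Generated by the protocol buffer compiler" comment near the top, and the function names are hypothetical:

```python
from pathlib import Path

# Assumption: protoc-generated Java sources carry this text in a
# comment within the first few lines of the file.
GENERATED_MARKER = "Generated by the protocol buffer compiler"


def looks_generated(source_file, max_lines=5):
    """Return True if the marker appears in the first max_lines lines."""
    with Path(source_file).open(encoding="utf-8", errors="replace") as fh:
        for _, line in zip(range(max_lines), fh):
            if GENERATED_MARKER in line:
                return True
    return False


def drop_generated(occurrences, repo_root):
    """Return the occurrences whose file does not look generated.

    Occurrences pointing at files that cannot be found are kept.
    """
    kept = []
    for occ in occurrences:
        location = occ["location"]
        path = location.rpartition("#")[0] or location
        source = Path(repo_root) / path
        if source.is_file() and looks_generated(source):
            continue
        kept.append(occ)
    return kept
```

This only removes occurrences located in generated files. It does not address occurrences in hand-written classes such as the mappers listed above.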

Mar 10 '25 10:03 nscuro

@nscuro, let me think this through. You are asking for only one kind of usages. Perhaps, instead of dumbing down the research profile, we can simplify appsec profile to do what you are after?

Mar 10 '25 13:03 prabhu

You are asking for only one kind of usages. Perhaps, instead of dumbing down the research profile, we can simplify appsec profile to do what you are after?

Not necessarily asking for it, just trying to make sense of the data. Very much possible my initial expectation was just wrong 😅

For context, we just added support for ingestion of component occurrences in Dependency-Track. I've been using cdxgen to test this with "real" data since it's pretty much the only tool out there that even produces occurrences at this point!

Mar 11 '25 10:03 nscuro

Wow!!! Can't wait to try the new Dependency-Track. You guys are amazing!

Mar 11 '25 10:03 prabhu