syft parse known fields in packages, allowing leniency

As shown in https://github.com/anchore/syft/issues/327, the parser needs to be lenient with some fields. This leniency is not only meant for Python packages; it should be applied elsewhere too.

The way the PKG-INFO parser works is by looking at each line and making the following assumptions:

if the line is empty, it skips to the next line
if the line starts with whitespace, it captures it as a "field body continuation"
if the line doesn't start with whitespace, it looks for a : to split the line into a field and a value

The problem with this approach is that it tries to discover fields programmatically, assuming that any line containing at least one : is a valid field, which is not the case.
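
To make the failure mode concrete, here is a minimal Go sketch of the line-by-line assumptions listed above. It is not syft's actual parser: the function name `naiveParse` and the sample input are hypothetical and only illustrate how a line of free text containing a : ends up recorded as a field.

```go
package main

import (
	"bufio"
	"strings"
)

// naiveParse is a hypothetical sketch of the assumptions described in this
// issue. It is NOT syft's real implementation.
func naiveParse(input string) map[string]string {
	fields := make(map[string]string)
	lastKey := ""
	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.TrimSpace(line) == "":
			// Empty line: skip to the next line.
			continue
		case line[0] == ' ' || line[0] == '\t':
			// Leading whitespace: treat as a continuation of the previous field.
			if lastKey != "" {
				fields[lastKey] += "\n" + strings.TrimSpace(line)
			}
		default:
			// Anything else containing a ':' is assumed to be "field: value".
			idx := strings.Index(line, ":")
			if idx >= 0 {
				lastKey = strings.TrimSpace(line[:idx])
				fields[lastKey] = strings.TrimSpace(line[idx+1:])
			}
		}
	}
	return fields
}
```

Calling `naiveParse("Name: demo\nSee the docs at https://example.com\n")` records a bogus field named `See the docs at https`, because the URL happens to contain a colon.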

There is also no need to capture multi-line descriptions. And lastly, there are packages out there that have fully invalid representations of the format (as seen in the linked issue).

In the case of Python, the fields are known, so this proposal is to prefer the known fields that are useful for syft rather than programmatically discovering every single field in the package.

The PR (https://github.com/anchore/syft/pull/328) doesn't really fix the underlying programmatic discovery of fields; rather, it warns about seemingly invalid fields. The spec says "Summary" is a single line, so the parser should consume only one line and ignore the rest. That behavior would prevent unnecessary hard errors.
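
A rough Go sketch of the proposed behavior follows. The names `knownFields` and `parseKnownFields` are hypothetical, and the field list is only an illustrative subset of Python package metadata; which fields syft actually needs is an assumption here. The idea is to keep the first line of each known field and silently ignore everything else, including continuation lines and malformed content.

```go
package main

import (
	"bufio"
	"strings"
)

// knownFields is an illustrative subset of metadata fields; the exact set
// syft cares about is an assumption made for this sketch.
var knownFields = map[string]bool{
	"Name":    true,
	"Version": true,
	"Summary": true,
	"License": true,
	"Author":  true,
}

// parseKnownFields is a hypothetical lenient parser: it never returns an
// error, it only extracts single-line values for known fields.
func parseKnownFields(input string) map[string]string {
	fields := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		line := scanner.Text()
		// Skip empty lines and continuation lines (multi-line bodies are ignored).
		if line == "" || line[0] == ' ' || line[0] == '\t' {
			continue
		}
		idx := strings.Index(line, ":")
		if idx < 0 {
			continue // not a "field: value" line, ignore instead of failing
		}
		key := strings.TrimSpace(line[:idx])
		if !knownFields[key] {
			continue // unknown or invalid field name, ignore
		}
		if _, seen := fields[key]; seen {
			continue // keep the first occurrence only
		}
		fields[key] = strings.TrimSpace(line[idx+1:])
	}
	return fields
}
```

With this approach, a "Summary" that spills onto extra lines yields only its first line, and text such as `broken line: with colon` is dropped rather than treated as a field or raised as an error.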

Feb 17 '21 15:02 alfredodeza

@anchore/tools I've moved this issue onto our internal board for consideration when we start looking at future enhancements to make

Aug 18 '22 20:08 spiffcs

I think we've taken this stance, but maybe not universally so. As written this is a little too broad to be actionable.

Jun 20 '24 20:06 wagoodman