intelmq
Regression on parsing invalid URLs
As a continuation of #2377, we have a regression on parsing invalid URLs. Previously, urllib
was much more liberal in processing URLs; now it rejects many more cases.
We use it to sanitize URLs, and html_parser
is an example of a bot that relies on the liberal behavior in its tests:
https://github.com/certtools/intelmq/blob/61c45acfb8cc60e1419abe7c57691561ef9ee072/intelmq/tests/bots/parsers/html_table/test_parser_column_split.py#L47
https://github.com/certtools/intelmq/blob/61c45acfb8cc60e1419abe7c57691561ef9ee072/intelmq/tests/bots/parsers/html_table/test_parser_column_split.py#L73-L80
In patched Python versions (e.g. 3.11.4), this URL is rejected. We need to either decide against allowing such URLs, or redesign our sanitization.
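To illustrate the behavior change, here is a minimal sketch rather than IntelMQ's actual sanitization code. The helper name `try_parse` and the sample URL are hypothetical. It assumes a URL whose host is a non-IP string in square brackets, which is one class of input that older `urllib.parse` versions accepted and patched versions reject with `ValueError`.

```python
from typing import Optional
from urllib.parse import ParseResult, urlparse


def try_parse(url: str) -> Optional[ParseResult]:
    """Return the parsed URL, or None if urllib rejects it.

    Hypothetical helper for illustration only; not part of IntelMQ.
    """
    try:
        return urlparse(url)
    except ValueError:
        # Patched Python versions validate bracketed hosts and raise here.
        return None


# Hypothetical sample: a bracketed host that is not an IP address.
# Older Python versions parse it; patched ones (e.g. 3.11.4) raise ValueError.
SAMPLE = "http://[example]/path"

if __name__ == "__main__":
    result = try_parse(SAMPLE)
    print("rejected" if result is None else f"accepted, netloc={result.netloc}")
```

A wrapper like this only makes the rejection explicit. Whether such URLs should then be dropped or handled by a more tolerant fallback is the design decision raised above.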
Temporarily, the test is skipped to unblock other work.