jsoup
Allow wildcards in whitelist attributes
HTML5 allows the use of data-foo, data-foo-bar, etc. to attach information to elements. These attributes are relatively harmless and should only contain text.
Currently, each data- attribute needs to be specified explicitly on a whitelist so that it's not removed by Jsoup.clean(). Can we add support for either:
- Wildcard attributes, e.g. Whitelist.relaxed().addAttributes("a", "data-*"), or
- A new function, like Whitelist.relaxed().allowDataAttributes("a")
Also wanted to add that I'm willing to contribute code to support this. Before doing so, I just want to make sure this change is acceptable and determine the best way to support it (options 1 or 2 above, or something entirely different). Thanks!
Hi @foo4u. Sorry for the late reply. I like option 2, because I can't think of another case where wildcards would be helpful. It would be great if you could write that.
I guess it would be more flexible if you implemented option 1 with regex patterns instead of only wildcards, e.g. Whitelist.relaxed().addAttributes("a", "data-.*"). Think about https://angularjs.org code, for example, which has attributes starting with "ng-". Also, almost every second ;-) a new JS framework appears, and these might require new attribute prefixes. Option 1 with regex support would be more future-proof.
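To make the regex idea concrete, here is a minimal standalone sketch of how attribute names could be checked against a pattern. It does not use jsoup at all: RegexAttributeMatcher and isAllowed are hypothetical names chosen for illustration, and jsoup's own API has no such class. The sketch assumes the whole attribute name must match and that matching is case-insensitive.

```java
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of jsoup: checks an attribute name against
// a single regular expression.
final class RegexAttributeMatcher {
    private final Pattern pattern;

    RegexAttributeMatcher(String regex) {
        // Assumption: attribute names are compared case-insensitively.
        this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    boolean isAllowed(String attributeName) {
        // matches() requires the entire name to match the pattern;
        // find() would also accept names that merely contain a match.
        return pattern.matcher(attributeName).matches();
    }

    public static void main(String[] args) {
        RegexAttributeMatcher dataOrNg = new RegexAttributeMatcher("(data|ng)-.*");
        System.out.println(dataOrNg.isAllowed("data-foo")); // true
        System.out.println(dataOrNg.isAllowed("ng-click")); // true
        System.out.println(dataOrNg.isAllowed("onclick"));  // false

        // An overly broad pattern admits everything, including event handlers.
        System.out.println(new RegexAttributeMatcher(".*").isAllowed("onclick")); // true
    }
}
```

The last line shows the trade-off: a regex is flexible enough to cover several prefixes at once, but a careless pattern can let through attributes a sanitizer would normally strip.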
OK, handling examples like that makes sense. I'd be OK with either a prefix or a regex matcher. The prefix match seems simple and unlikely to let anyone shoot themselves in the foot.
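For comparison, a prefix match could look roughly like the sketch below. This is only an illustration under stated assumptions, not jsoup code: PrefixAttributeMatcher and isAllowed are hypothetical names, a rule ending in "*" is treated as a prefix rule, and any other rule must match the attribute name exactly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, NOT part of jsoup: allows an attribute either by
// exact name or by a rule of the form "prefix*".
final class PrefixAttributeMatcher {
    private PrefixAttributeMatcher() {}

    static boolean isAllowed(String attributeName, List<String> rules) {
        String name = attributeName.toLowerCase(Locale.ROOT);
        for (String rule : rules) {
            String r = rule.toLowerCase(Locale.ROOT);
            if (r.endsWith("*")) {
                String prefix = r.substring(0, r.length() - 1);
                // Assumption: a bare "*" is rejected so one rule can never
                // allow every attribute, and the name must continue past
                // the prefix ("data-" alone is not accepted).
                if (!prefix.isEmpty()
                        && name.length() > prefix.length()
                        && name.startsWith(prefix)) {
                    return true;
                }
            } else if (name.equals(r)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList("href", "data-*", "aria-*");
        System.out.println(isAllowed("data-foo-bar", rules)); // true
        System.out.println(isAllowed("aria-label", rules));   // true
        System.out.println(isAllowed("onclick", rules));      // false
        System.out.println(isAllowed("onclick", Arrays.asList("*"))); // false
    }
}
```

Because the only special character is a trailing "*", there is no way to write a rule that accidentally matches in the middle of a name, which is the sense in which a prefix match is harder to misuse than a regex.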
Ok, will try to get a PR for prefix matching sent in a few weeks.
(Closing out old, dormant bugs. If you are still impacted by this, please reopen & vote.)
I am trying out jsoup to validate html pages. Works great so far.
Would have been awesome, if wildcards were possible with jsoup.
> Ok, will try to get a PR for prefix matching sent in a few weeks.
Hi @foo4u, have you ever prepared a PR?
I have prepared a PR: https://github.com/jhy/jsoup/pull/1871
(Reopening as mentioned in earlier close, there is renewed interest here.)
Is there any update on this? It looks like the changes requested in the PR were made back in February.
I have been watching these for some months now because we have a need to not strip out aria-* attributes.