Allow wildcards in whitelist attributes

Open • foo4u opened this issue 10 years ago • 11 comments

HTML5 allows custom data attributes such as data-foo and data-foo-bar to attach information to elements. These are relatively harmless and should only contain text.

Currently, each data- attribute needs to be specified explicitly on a whitelist so that it's not removed by Jsoup.clean(). Can we add support for either:

  1. Wildcard attributes, e.g. Whitelist.relaxed().addAttributes("a", "data-*") or
  2. A new function, like Whitelist.relaxed().allowDataAttributes("a")
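To make the request concrete, here is a minimal sketch in plain Java. It is an assumption-laden model, not jsoup's Whitelist class: ExactAllowList, its clean() method, and the allowDataAttributes() switch are hypothetical names, and per-element scoping (the "a" argument) is left out for brevity. It shows why exact-name matching forces every data- attribute to be listed, and how a switch like option 2 might behave.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of an attribute allow-list; NOT jsoup's Whitelist class.
class ExactAllowList {
    private final Set<String> allowed = new HashSet<>();
    private boolean allowData = false; // hypothetical switch modelling option 2

    ExactAllowList addAttributes(String... names) {
        allowed.addAll(Arrays.asList(names));
        return this;
    }

    ExactAllowList allowDataAttributes() {
        allowData = true;
        return this;
    }

    // Exact name match, plus an optional "data-" prefix check.
    boolean isAllowed(String name) {
        return allowed.contains(name) || (allowData && name.startsWith("data-"));
    }

    // Returns a copy of attrs holding only the allowed attributes, in original order.
    Map<String, String> clean(Map<String, String> attrs) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            if (isAllowed(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("href", "/x");
        attrs.put("data-foo", "1");
        attrs.put("data-foo-bar", "2");

        // Exact matching: data-foo-bar was not listed, so it is dropped.
        ExactAllowList exact = new ExactAllowList().addAttributes("href", "data-foo");
        System.out.println(exact.clean(attrs).keySet()); // [href, data-foo]

        // With the hypothetical switch, every data- attribute survives.
        ExactAllowList withData = new ExactAllowList().addAttributes("href").allowDataAttributes();
        System.out.println(withData.clean(attrs).keySet()); // [href, data-foo, data-foo-bar]
    }
}
```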

foo4u • Dec 02 '14 16:12

Also wanted to add that I'm willing to contribute code to support this. Before doing so, I just want to make sure this change is acceptable and determine the best way to support it (options 1 or 2 above, or something entirely different). Thanks!

foo4u • Dec 02 '14 16:12

Hi @foo4u. Sorry for the late reply. I like option 2, because I can't think of another case where wildcards would be helpful. It would be great if you wrote that.

jhy • Apr 03 '15 01:04

I guess it would be more flexible if you implemented option 1 with regex patterns instead of only wildcards, e.g. Whitelist.relaxed().addAttributes("a", "data-.*"). Think of AngularJS (https://angularjs.org) code, which has attributes starting with "ng-". Also, almost every second ;-) a new JS framework appears, and these might require new attribute prefixes. With regex support, option 1 would be more future-proof.
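As a rough illustration of the regex idea, the sketch below uses only java.util.regex. RegexAttributeMatcher and its allow() method are hypothetical names, not jsoup API, and it assumes the whole attribute name has to match the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of jsoup.
class RegexAttributeMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    // Registers a regex that an attribute name may match.
    RegexAttributeMatcher allow(String regex) {
        patterns.add(Pattern.compile(regex));
        return this;
    }

    boolean isAllowed(String attributeName) {
        for (Pattern p : patterns) {
            // matches() requires the entire name to match, unlike find().
            if (p.matcher(attributeName).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RegexAttributeMatcher matcher = new RegexAttributeMatcher()
                .allow("data-.*")  // HTML5 data attributes
                .allow("ng-.*");   // AngularJS-style attributes

        System.out.println(matcher.isAllowed("data-foo-bar")); // true
        System.out.println(matcher.isAllowed("ng-click"));     // true
        System.out.println(matcher.isAllowed("onclick"));      // false
    }
}
```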

remisbaima • Apr 03 '15 07:04

OK, handling examples like that makes sense. I'd be OK with either a prefix or a regex matcher. The prefix match seems simple and unlikely to let anyone shoot themselves in the foot.
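To illustrate the trade-off, here is a small plain-Java comparison. The footgun shown (checking a regex with find() instead of matches()) is an assumed example of how a regex matcher could be misused, not something raised in this thread, and PrefixVersusRegex is a hypothetical class, not jsoup code.

```java
import java.util.regex.Pattern;

// Hypothetical comparison for illustration; not part of jsoup.
class PrefixVersusRegex {

    // Prefix match: a plain string comparison with no pattern syntax to get wrong.
    static boolean allowedByPrefix(String name, String prefix) {
        return name.startsWith(prefix);
    }

    // A regex check written with find(), which matches anywhere inside the name.
    static boolean allowedBySloppyRegex(String name, String regex) {
        return Pattern.compile(regex).matcher(name).find();
    }

    public static void main(String[] args) {
        System.out.println(allowedByPrefix("data-id", "data-"));            // true
        System.out.println(allowedByPrefix("metadata-id", "data-"));        // false

        // find() locates "data-id" inside "metadata-id", so the name slips through.
        System.out.println(allowedBySloppyRegex("metadata-id", "data-.*")); // true
    }
}
```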

jhy • Apr 03 '15 16:04

Ok, will try to get a PR for prefix matching sent in a few weeks.

foo4u • Apr 20 '15 01:04

(Closing out old, dormant bugs. If you are still impacted by this, please reopen & vote.)

jhy • Nov 15 '17 00:11

I am trying out jsoup to validate HTML pages. It works great so far. It would be awesome if wildcards were possible with jsoup.

swapab • Feb 11 '20 11:02

> Ok, will try to get a PR for prefix matching sent in a few weeks.

Hi @foo4u, have you ever prepared a PR?

promiselaoliu • Dec 23 '22 00:12

I have prepared a PR: https://github.com/jhy/jsoup/pull/1871

promiselaoliu • Dec 23 '22 05:12

(Reopening as mentioned in earlier close, there is renewed interest here.)

jhy • Jan 24 '23 08:01

Is there any update on this? It looks like the changes requested in the PR were made back in February.

I have been watching these for some months now because we need to keep aria-* attributes from being stripped out.

irandamay • Jun 28 '23 17:06