XRay
Remove images from posts containing a photo
Discussion from IRC:
- https://chat.indieweb.org/dev/2017-12-23#t1513996649268600
- https://chat.indieweb.org/dev/2018-01-04#t1515107952060700
Encountered two blockers working on this:
- In a simple example of an img tag inside an e-content tag, the parsers are using the img tag as an implied photo property. This seems wrong to me. It means XRay sees a post like the following as a photo post, and would remove the img tag from the content, which is definitely not the right thing to do.
<div class="h-entry"><p class="e-content p-name">Hello World <img src="example.jpg"></p></div>
{
"type": [
"h-entry"
],
"properties": {
"name": [
"Hello World http://example.com/example.jpg"
],
"content": [
{
"html": "Hello World <img src=\"http://example.com/example.jpg\">",
"value": "Hello World http://example.com/example.jpg"
}
],
"photo": [
"http://example.com/example.jpg"
]
}
}
- At the point that XRay is sanitizing the HTML value, the Microformats parser has already converted the HTML to plaintext.
For example, XRay sees this object and runs the HTML sanitizer on the HTML value:
{
"html": "Hello World <img src=\"http://example.com/example.jpg\">",
"value": "Hello World http://example.com/example.jpg"
}
This means I can't remove the img tag from the plaintext value, since it has already been parsed. I think my only solution is going to be to create my own plaintext value out of the sanitized HTML. Unfortunately, that is not a straightforward process, as demonstrated by the relatively long function that does this in the PHP parser. However, it might be the technically better option anyway, since XRay can't be sure exactly what method was used to generate the plaintext value from the original HTML.
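A minimal sketch of that approach, written in Python with only the standard library purely for illustration (XRay itself is PHP). The names `PhotoStripper` and `strip_photos` are hypothetical and not part of XRay. It assumes the HTML has already been sanitized, that the `src` values have already been resolved to absolute URLs, and that images which are kept contribute nothing to the plaintext:

```python
from html import escape
from html.parser import HTMLParser


class PhotoStripper(HTMLParser):
    """Rebuild HTML without <img> tags whose src is a known photo URL,
    and collect a plaintext value from the remaining text nodes."""

    def __init__(self, photo_urls):
        super().__init__(convert_charrefs=True)
        self.photo_urls = set(photo_urls)
        self.html_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img" and dict(attrs).get("src") in self.photo_urls:
            return  # drop the image that is already exposed as a photo
        # Re-emit the start tag exactly as it appeared in the input.
        self.html_parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: the raw start tag text already includes "/>".
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.html_parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped; escape it again for the HTML output.
        self.html_parts.append(escape(data, quote=False))
        self.text_parts.append(data)


def strip_photos(html, photo_urls):
    """Return (html, plaintext) with the given photo images removed."""
    parser = PhotoStripper(photo_urls)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    new_html = "".join(parser.html_parts).strip()
    # Collapse runs of whitespace in the plaintext value.
    new_text = " ".join("".join(parser.text_parts).split())
    return new_html, new_text


content_html = 'Hello World <img src="http://example.com/example.jpg">'
print(strip_photos(content_html, ["http://example.com/example.jpg"]))
```

Because the plaintext is derived from the same pass that removes the image, the HTML and plaintext values cannot disagree about whether the photo is present. Comments, doctypes, and block-level whitespace rules are ignored here, which is part of why a real implementation ends up much longer.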
Another question/problem: what should I do when the img tag in the e-content contains alt text? That alt text will have already been brought into the plaintext values for e-content and maybe even the p-name.
<div class="h-entry"><p class="e-content p-name">Hello World <img src="example.jpg" class="u-photo" alt="An Example Photo"></p></div>
{
"type": [
"h-entry"
],
"properties": {
"name": [
"Hello World An Example Photo"
],
"photo": [
"http://example.com/example.jpg"
],
"content": [
{
"html": "Hello World <img src=\"http://example.com/example.jpg\" class=\"u-photo\" alt=\"An Example Photo\">",
"value": "Hello World An Example Photo"
}
]
}
}
(example)
This is filed as an issue on the parsing spec here: https://github.com/microformats/microformats2-parsing/issues/16
Ideally the parsing spec would not have included that alt text in the plaintext values in the first place.
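To make the difference concrete, here is a rough Python sketch (standard library only) of a plaintext extractor with a switch for image substitution. It only approximates the substitution visible in the JSON examples above, where an img is replaced by its alt text or, failing that, its src. It is not the parsers' actual algorithm, and `TextExtractor` and `plaintext` are hypothetical names:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, optionally substituting alt/src for <img>."""

    def __init__(self, replace_images):
        super().__init__(convert_charrefs=True)
        self.replace_images = replace_images
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img" and self.replace_images:
            attributes = dict(attrs)
            alt = attributes.get("alt")
            # Prefer the alt text; fall back to the src URL when alt is absent.
            self.parts.append(alt if alt is not None else attributes.get("src") or "")

    def handle_data(self, data):
        self.parts.append(data)


def plaintext(html, replace_images=True):
    """Return a whitespace-collapsed plaintext value for an HTML fragment."""
    parser = TextExtractor(replace_images)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


content_html = (
    'Hello World <img src="http://example.com/example.jpg" '
    'class="u-photo" alt="An Example Photo">'
)
print(plaintext(content_html, replace_images=True))
print(plaintext(content_html, replace_images=False))
```

With substitution on, the alt text ends up in the value just as in the parsed output above; with it off, only the surrounding text remains, which is the value XRay would want once the photo has been moved out of the content.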
Current status: blocked. Breaking this out into separate issues so they can be tracked.
whoops, not actually closing this because I haven't committed the code that actually does that logic yet.
An example feed where a photo property is created, but not removed from the content: http://feeds2.feedburner.com/thenextweb