wordpress-activitypub icon indicating copy to clipboard operation
wordpress-activitypub copied to clipboard

Unnecessary preg_replace?

Open edent opened this issue 2 years ago • 19 comments

I'm not sure that this line is needed:

https://github.com/Automattic/wordpress-activitypub/blob/db0f9c1b516b0755744d891ca595d572722c86cf/includes/class-shortcodes.php#L218

If I've understood correctly, the trim() removes the leading & trailing whitespace. Then the preg_replace removes a single character inside the remainder of the text.

I see the need for the time, but not the preg_replace.

In my tests, removing the line didn't make any difference to the content of the ActivityPub JSON.

edent avatar Sep 21 '23 21:09 edent

Maybe Mastodon has changed something in the meantime, but in the early days they had a nl2br or something similar and that generated a lot of space between paragraphs.

pfefferle avatar Sep 22 '23 07:09 pfefferle

That may still be the case. But I think this regex would only remove the first newline it encountered.

edent avatar Sep 22 '23 08:09 edent

It will replace it in the complete content: https://onlinephp.io/c/3e435

pfefferle avatar Sep 22 '23 08:09 pfefferle

I must stop doing regex in my head.... :laughing:

That said, I'm not sure it changes the presentation on Mastodon which, in my experience, tends to eat whitespace. But perhaps others aren't so forgiving?

edent avatar Sep 22 '23 16:09 edent

I also remember it being an issue with Mastodon specifically. Used a callback to overwrite, well, pretty much the entire post (or object) content to better suit my custom post types ... and I ran into the same thing.

janboddez avatar Sep 24 '23 19:09 janboddez

Shouldn't there be exceptions inside <pre>, as generated by for example code, verse and preformatted Gutenberg blocks?

aslakr avatar Jan 11 '24 20:01 aslakr

Shouldn't there be exceptions inside \<pre\>, as generated by for example code, verse and preformatted Gutenberg blocks?

Yes, I've run into this a couple times already :-)

Update: ran a quick experiment using the following ...

$content = preg_replace( '~<pre(?:[^>]*)>.*?</pre>(*SKIP)(*FAIL)|\r|\n|\t~s', '', $content );

That works, when I print $content to error_log(), at least. Because it looks like Mastodon itself then goes on to strip the newlines ...?

So I added $content = nl2br( $content, false );, which replaces the leftover newlines with <br>, but ... no dice.

janboddez avatar Jan 11 '24 21:01 janboddez

Yes, trying to parse HTML using regex might be a bit optimistic.

Newer versions of the Gutenberg-editor seems to replace \n with <br> https://github.com/WordPress/gutenberg/pull/55999 which creates it own problems https://github.com/westonruter/syntax-highlighting-code-block/issues/790. Also, it wouldn't help classic editor or manually edited <pre>-blocks.

Glitch seems to replace newlines with <br>

https://github.com/glitch-soc/mastodon/blob/cbc951627c88000494a18fdf61473b63c9d7f669/app/lib/advanced_text_formatter.rb#L10-L14

It would be nice to preserve the tabs in code blocks.

aslakr avatar Jan 11 '24 23:01 aslakr

I guess one need to change/remove the following:

https://github.com/Automattic/wordpress-activitypub/blob/b243ca93012701bf4b982549c5ac848ef416e429/includes/class-shortcodes.php#L237

https://github.com/Automattic/wordpress-activitypub/blob/b243ca93012701bf4b982549c5ac848ef416e429/includes/transformer/class-post.php#L620

https://github.com/Automattic/wordpress-activitypub/blob/b243ca93012701bf4b982549c5ac848ef416e429/includes/transformer/class-post.php#L620

Considering that Mastodon and others now support <pre> etc. is it really nessecery to remove newlines and tabs?

aslakr avatar Jan 12 '24 07:01 aslakr

Considering that Mastodon and others now support <pre> etc. is it really nessecery to remove newlines and tabs?

I'm seeing additional newlines in a lot of (other people's) federated WordPress posts, so I'm thinking: Yes.

My solution -- to the necessary preg_replace() calls seemingly not working (apparently when the block editor is involved) -- has been to use a custom filter.

Works wonders. I've just now updated it to not strip these characters from inside pre as outlined above, and that works as well (when I print $content to the debug log, it looks exactly as it should).

Except there's something stripping them after all, I'm guessing on the Mastodon side? (Which doesn't make a whole lot of sense, seeing as it introduces newlines elsewhere. Although ... clients are inconsistent here, too.)

My one remaining issue is that comments don't get filtered the same way.

janboddez avatar Jan 12 '24 09:01 janboddez

My solution -- to the necessary preg_replace() calls seemingly not working (apparently when the block editor is involved) -- has been to use a custom filter.

Still makes me think the preg_replace() should come after the apply_filters( 'the_content', ....

janboddez avatar Jan 12 '24 09:01 janboddez

Considering that Mastodon and others now support <pre> etc. is it really nessecery to remove newlines and tabs?

I'm seeing additional newlines in a lot of (other people's) federated WordPress posts, so I'm thinking: Yes.

Probably.

But then, how Mastodon (and others] handles whitespace (space, tabs and newlines) in non-pre context might be more of problem that should be solved on the reciving end (i.e. Mastodon)?

Anyway, I primary just would like to be able to have:

Screenshot of Mastodon showing a code sample with spaces, tabs and newlines

instead of:

Screenshot of Mastodon showing a code sample without tabs and newlines

aslakr avatar Jan 12 '24 14:01 aslakr

Works wonders. I've just now updated it to not strip these characters from inside pre as outlined above, and that works as well (when I print $content to the debug log, it looks exactly as it should).

Haha, I'm now pretty sure I have a second, older callback cleaning up the remaining newline characters. ~~Gonna try in a sec.~~ Edit: Yes! (https://indieweb.social/@[email protected]/111731925084505754) ~~A single newline too many at the end of the pre, I can live with that (for now).~~

[H]ow Mastodon (and others] handles whitespace (space, tabs and newlines) in non-pre context might be more of problem that should be solved on the reciving end (i.e. Mastodon)?

As someone who's written a couple feed readers (and thus has had to parse and sanitize others' markup): it's a pain. What's worked for me, for "cleaning up" random bits of HTML, is strip as much whitespace chars as possible and then run the result through WordPress' wpautop(). Mastodon takes a different approach, and how some 3rd-party clients treat whatever Mastodon gives them yet another one ... Not much we can do, I'm afraid. But then I've mostly gotten it to work (through a bit of a hack, i.e., custom PHP).

janboddez avatar Jan 12 '24 20:01 janboddez

OK, so this seems to work for me (i.e., this is essentially what's currently in my custom activitypub_the_content callback, where I also use my own list of allowed tags):

$content = apply_filters( 'the_content', $post->post_content );

$content = wp_kses( $content, $allowed_tags );

// Strip whitespace, but ignore `pre` elements' contents.
$content = preg_replace( '~<pre[^>]*>.*?</pre>(*SKIP)(*FAIL)|\r|\n|\t~s', '', $content );

// Trim newlines off of (`code` elements inside) `pre` elements.
if ( preg_match_all( '~<pre[^>]*><code[^>]*>(.*?)</code></pre>~s', $content, $matches ) ) {
	foreach ( $matches[1] as $match ) {
		$content = str_replace( $match, trim( $match ), $content );
	}
}

And then I return that. (I completely discard what the plugin generates based on the [ap_content] etc. template in its settings! Note that I started doing this mainly because I want to return a different "content" for different post types. E.g., include a title for longer posts, but not include one for "notes.")

Still gotta test in Tusky, but looks good in Mastodon's web UI.

My custom "ActivityPub add-on plugin": https://gist.github.com/janboddez/58a2f3d2c86717cd799048af651fa6b4#file-activitypub-php (Note: there's a few things in there that are no longer need, gotta clean up one time.)

janboddez avatar Jan 12 '24 20:01 janboddez

Feel free to open a PR if you have a better solution that the actual!

pfefferle avatar Jan 12 '24 21:01 pfefferle

Does Wordpress have a build in function to do something similar to:

tidy --indent no --wrap --vertical-space yes --show-body-only yes

I suspect one can't rely on https://www.php.net/manual/en/book.tidy.php being available? And the new HTML API can't yet do something similar? Maybe later.

In meantime replacing /[\n\r\t]/ with @janboddez' ~<pre[^>]*>.*?</pre>(*SKIP)(*FAIL)|\r|\n|\t~s seems to work.

aslakr avatar Jan 13 '24 17:01 aslakr

Tusky still rendering an extra empty "paragraph" at the start of the blockquote in https://indieweb.social/@[email protected]/111772889651276747. Fairly sure it's a Tusky issue; the culprit seems to be a space between the opening <blockquote> and <p> tags, which I think is there because this particular post uses a "Classic" block. Can't think of a reliable workaround other than maybe https://github.com/derUli/html-minifier, which is probably not something you want to include in the plugin itself, as it is yet another dependency.

janboddez avatar Jan 18 '24 08:01 janboddez

Feel free to open a PR if you have a better solution that the actual!

I'll have another look at this, I think we should (?) at least try to sanitize/filter comments and posts more or less the same way.

Additional filtering is probably best left up to site authors/developers.

janboddez avatar Jan 18 '24 08:01 janboddez

Case anyone's still interested (or even following), I moved my activitypub_the_content callback to https://github.com/janboddez/share-on-mastodon-addon/blob/8e56c5c5b5416c1ba5e19dc9544042cca3ba4c83/includes/class-activitypub.php, i.e., another and previously unrelated "add-on" plugin of mine. This allowed me to add the aforementioned dependency (a so-called mu-plugin is nice on its own but doesn't work all that well once multiple files and folders are involved).

Not saying y'all should start using it, quite the contrary!

But: if anyone wanted to implement a custom content filter that strips all superfluous whitespace (and thereby having Mastodon, Tusky, and, hopefully, all other clients, render posts just like they would "native" Mastodon posts) while respecting text inside pre elements, feel free to borrow from the code.

Now if only I can also filter comments this way, I'm going to be very happy. 8-)

janboddez avatar Jan 18 '24 08:01 janboddez

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 18 '24 01:05 github-actions[bot]