Simple-Instant-Articles-for-Facebook

Reformatting content

Open mattheu opened this issue 9 years ago • 10 comments

Facebook is a bit picky about its content: if the article doesn't match the required markup, it complains. Sometimes it throws the offending markup out, sometimes it fixes it (but shows a warning).

Examples:

  • All images have to be top-level elements and must be wrapped in a <figure> element. Otherwise they are not shown.
  • All headings must be h1 or h2. Lower-level headings are changed to h2. This works OK but shows a warning.
  • Empty <p> tags show warnings. wpautop seems to like adding these.

So what do we do about this?

I'm not sure whether the plugin should proactively 'fix' these issues or whether it should be left up to the user to filter their content and ensure it meets the specification.

We've taken the decision to fix it, and are using DOMDocument to parse the content. I think this is a more robust way to handle the modification of the content, but again, it might not be appropriate for the plugin.

We're doing something along these lines:

    /**
     * Filter the_content:
     * convert to a DOMDocument object,
     * allow anything to hook in and modify it,
     * then convert back to HTML and return.
     */
    public function reformat_post_content( $post_content ) {

        $dom = new \DOMDocument();

        // Parse post content to generate DOM document.
        // Use loadHTML as it doesn't need to be well-formed to load.
        $loaded = @$dom->loadHTML( '<html><body>' . $post_content . '</body></html>' );

        // Stop if the DOM wasn't generated. Return the content unchanged,
        // because a filter callback must always return a value.
        if ( ! $loaded ) {
            return $post_content;
        }

        // Allow stuff to hook in here and modify the $dom object directly.
        do_action_ref_array( 'simple_fb_reformat_post_content', array( &$dom ) );

        // Convert back to HTML.
        $html      = '';
        $body_node = $dom->getElementsByTagName( 'body' )->item( 0 );

        foreach ( $body_node->childNodes as $node ) {
            $html .= $dom->saveHTML( $node );
        }

        return $html;
    }
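
As a rough illustration of what a callback on that action could look like, here is a minimal sketch that handles the heading and empty-paragraph cases from the list above. The function name `example_fb_fix_headings_and_empty_paragraphs` is hypothetical and is not part of the plugin; the sketch assumes the callback receives the DOMDocument as its only argument.

```php
<?php
/**
 * Hypothetical callback (not part of the plugin): turn h3-h6 into h2
 * and drop paragraphs that contain nothing but whitespace.
 */
function example_fb_fix_headings_and_empty_paragraphs( \DOMDocument $dom ) {
    $xpath = new \DOMXPath( $dom );

    // Replace each h3-h6 with a new h2, moving the child nodes across.
    // Attributes on the old heading are not copied in this sketch.
    foreach ( $xpath->query( '//h3 | //h4 | //h5 | //h6' ) as $heading ) {
        $h2 = $dom->createElement( 'h2' );
        while ( $heading->firstChild ) {
            $h2->appendChild( $heading->firstChild );
        }
        $heading->parentNode->replaceChild( $h2, $heading );
    }

    // Remove paragraphs with no child elements and no non-whitespace text.
    // A paragraph holding only a non-breaking space is NOT matched here.
    foreach ( $xpath->query( '//p[not(*) and not(normalize-space())]' ) as $p ) {
        $p->parentNode->removeChild( $p );
    }
}

// Register the hook only when WordPress is loaded.
if ( function_exists( 'add_action' ) ) {
    add_action( 'simple_fb_reformat_post_content', 'example_fb_fix_headings_and_empty_paragraphs' );
}
```

Keeping each fix in its own callback means a site can unhook the ones that don't suit its content.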

I'd be interested in hearing your thoughts on this. It's one of those things: choose NOT to support it and the issue will always come up. But it's also a difficult thing to try and do well for everybody's different content.

We've done a bunch of work on this, so if it's something you're interested in I can whip up a pull request.

mattheu avatar Nov 18 '15 17:11 mattheu

I was hesitant to go down this route. I added a method to promote tags here, but this might be handy for doing large scale content swaps too... Something like move the header, update the gallery location and so on.

whyisjake avatar Nov 18 '15 20:11 whyisjake

If you wanted to whip up a PR, that would be awesome @mattheu.

whyisjake avatar Nov 18 '15 20:11 whyisjake

@mattheu - I'd love to see the work you did on formatting images

jetlej avatar Feb 17 '16 16:02 jetlej

Yeah, it's all a bit hacky!

We need to catch a few cases. I believe galleries and caption shortcodes are handled already, but this leaves image HTML in the post content - which is the default created when you insert an image.

We need to wrap img elements in <figure>, and also ensure they are top-level elements and not within paragraphs or divs. Using the action from the code I posted above to manipulate the DOM object, the code I used is this:

    // Get all images that are not children of figure already.
    foreach ( $xpath->query( '//img[not(parent::figure)]' ) as $node ) {

        $figure   = $dom->createElement( 'figure' );
        $top_node = $node;

        // If image node is not a direct child of the body, we need to move it there.
        // Recurse up the tree looking for highest level parent/grandparent node.
        while ( $top_node->parentNode && 'body' !== $top_node->parentNode->nodeName ) {
            $top_node = $top_node->parentNode;
        }

        // Insert after the parent/grandparent node.
        // Workaround to handle the fact only insertBefore exists.
        try {
            $top_node->parentNode->insertBefore( $figure, $top_node->nextSibling );
        } catch ( \Exception $e ) {
            $top_node->parentNode->appendChild( $figure );
        }

        $figure->appendChild( $node );
    }

Hope this is helpful

mattheu avatar Feb 17 '16 16:02 mattheu

@mattheu - Extremely helpful. Thank you so much!

jetlej avatar Feb 17 '16 16:02 jetlej

@mattheu - Where is the value of $xpath coming from? The simple_fb_reformat_post_content action you originally posted only includes the $dom arg.

Here's the function I created that hooks into reformat_post_content (which was added as a filter to the_content):

    function fb_instant_format_images(&$args){

        $dom = $args[0];

        // Get all images that are not children of figure already.
        foreach ( $xpath->query( '//img[not(parent::figure)]' ) as $node ) {

            $figure   = $dom->createElement( 'figure' );
            $top_node = $node;

            // If image node is not a direct child of the body, we need to move it there.
            // Recurse up the tree looking for highest level parent/grandparent node.
            while ( $top_node->parentNode && 'body' !== $top_node->parentNode->nodeName ) {
                $top_node = $top_node->parentNode;
            }

            // Insert after the parent/grandparent node.
            // Workaround to handle the fact only insertBefore exists.
            try {
                $top_node->parentNode->insertBefore( $figure, $top_node->nextSibling );
            } catch ( \Exception $e ) {
                $top_node->parentNode->appendChild( $figure );
            }

            $figure->appendChild( $node );
        }
    }
    add_action('simple_fb_reformat_post_content', 'fb_instant_format_images');

jetlej avatar Feb 17 '16 17:02 jetlej

Do you get figure markup from add_theme_support('html5')?

whyisjake avatar Feb 17 '16 17:02 whyisjake

@whyisjake - I have added that to my theme, but I still have <img>'s inside markup like this: <p><a><img/></a></p>

jetlej avatar Feb 17 '16 17:02 jetlej

Oh, it's just $xpath = new \DOMXPath( $dom );
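
Putting the pieces from this thread together, a complete callback might look roughly like the sketch below. The function name `example_fb_wrap_images_in_figures` is hypothetical. Two things here are assumptions worth checking against your own setup: do_action_ref_array() passes the array items as separate arguments, so the callback receives the DOMDocument itself rather than an array, and insertBefore() appends when its reference node is null, so the try/catch fallback is left out.

```php
<?php
/**
 * Hypothetical callback (not part of the plugin): move every <img> that is
 * not already inside a <figure> to the top level, wrapped in a new <figure>.
 */
function example_fb_wrap_images_in_figures( \DOMDocument $dom ) {
    $xpath = new \DOMXPath( $dom );

    // Get all images that are not children of figure already.
    foreach ( $xpath->query( '//img[not(parent::figure)]' ) as $node ) {
        $figure   = $dom->createElement( 'figure' );
        $top_node = $node;

        // Walk up the tree to the ancestor that is a direct child of <body>.
        while ( $top_node->parentNode && 'body' !== $top_node->parentNode->nodeName ) {
            $top_node = $top_node->parentNode;
        }

        // Skip the image if no <body> ancestor was found.
        if ( ! $top_node->parentNode ) {
            continue;
        }

        // Insert the figure right after the top-level ancestor.
        // A null nextSibling makes insertBefore() append instead.
        $top_node->parentNode->insertBefore( $figure, $top_node->nextSibling );

        // Move the image into the figure.
        $figure->appendChild( $node );
    }
}

// Register the hook only when WordPress is loaded.
if ( function_exists( 'add_action' ) ) {
    add_action( 'simple_fb_reformat_post_content', 'example_fb_wrap_images_in_figures' );
}
```

For markup like <p><a><img/></a></p>, this moves the image into a top-level <figure> but leaves the now-empty <p><a></a></p> behind, so that would still need a separate cleanup pass.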

mattheu avatar Feb 17 '16 17:02 mattheu

:+1: This is a huge issue for us and is blocking forward progress. We still have some images being wrapped in <p> tags. I'm going to attempt the modification suggested above, but so far I've had a lot of trouble attempting DOM manipulation.

tdlm avatar Mar 11 '16 01:03 tdlm