php-mf2

Experimental language parsing

Open voxpelli opened this issue 9 years ago • 8 comments

It would be valuable to get a working proof of concept of language parsing built for one of the mf2 parsers. The php-mf2 library and the JavaScript one are two good candidates for that.

The discussion around language parsing is happening here: http://microformats.org/wiki/microformats2-parsing-brainstorming#Parse_language_information

There's a similar issue for the JavaScript mf2 parser here: https://github.com/glennjones/microformat-shiv/issues/22 The original PR that created a proof of concept for an old version of the JavaScript mf2 parser can be found here: https://github.com/glennjones/microformat-node/pull/23

To implement language parsing in php-mf2, one can probably use the fact that a DOMNode has a parentNode property (see docs) and traverse the document tree upwards until one reaches the first lang= attribute or the root of the tree. That gives the language of a node (apart from defaults that may have been specified in, e.g., the HTTP response; see HTML5 docs), and one can then decide whether to add the language attribute or not.
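The upward traversal described here could be sketched roughly as follows. This is only an illustration: `findLanguage()` is a hypothetical helper, not part of php-mf2, and it ignores the HTTP-level and `<meta>` defaults mentioned above. It assumes an empty `lang=""` means "unknown language" and stops the search there.

```php
<?php
// Hypothetical helper (not php-mf2 API): walk up via parentNode until an
// element with a lang attribute is found or the top of the tree is reached.
function findLanguage(DOMNode $node): ?string
{
    for ($current = $node; $current !== null; $current = $current->parentNode) {
        if ($current instanceof DOMElement && $current->hasAttribute('lang')) {
            $lang = trim($current->getAttribute('lang'));
            // Assumption: an empty lang attribute means "unknown", so stop here.
            return $lang !== '' ? $lang : null;
        }
    }
    return null; // no lang attribute anywhere above this node
}

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML(
    '<div class="h-entry" lang="sv">'
    . '<h1 class="p-name">En svensk titel</h1>'
    . '<div class="e-content" lang="en">With an <em>english</em> summary</div>'
    . '</div>'
);

$em = $doc->getElementsByTagName('em')->item(0);
$h1 = $doc->getElementsByTagName('h1')->item(0);

echo findLanguage($em), "\n"; // en (nearest lang is on the e-content div)
echo findLanguage($h1), "\n"; // sv (inherited from the h-entry div)
```

The nearest ancestor wins, so the `em` inside the English `e-content` resolves to `en` even though the enclosing `h-entry` is marked `sv`.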

Update: As @gRegorLove pointed out on IRC, it may be hard to add the proposed output without breaking backwards compatibility. The new output would therefore have to be introduced either as a new major version or, probably preferably, as an opt-in feature flag for now. Those who want to use language data right away can enable the flag, while those who prefer to wait for a future major version before supporting the new output can do so.
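A minimal sketch of what such an opt-in flag might look like is below. Everything in it is an assumption for illustration: the `SketchParser` class, its `$lang` property, the `parseEmbedded()` method and the `lang` output key are hypothetical and are not taken from php-mf2. The sketch also only looks at the element's own `lang` attribute and leaves out the `html` part of the output.

```php
<?php
// Hypothetical sketch of an opt-in feature flag; not php-mf2's actual API.
final class SketchParser
{
    /** Off by default, so existing consumers see unchanged output. */
    public bool $lang = false;

    /** Simplified e-* parsing: returns the text value, plus lang if opted in. */
    public function parseEmbedded(DOMElement $el): array
    {
        $result = ['value' => $el->textContent];

        // Only emit the new key when the caller explicitly asked for it.
        if ($this->lang && $el->hasAttribute('lang')) {
            $result['lang'] = $el->getAttribute('lang'); // key name is an assumption
        }
        return $result;
    }
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<div class="e-content" lang="en">With an <em>english</em> summary</div>');
$el = $doc->getElementsByTagName('div')->item(0);

$parser = new SketchParser();
$default = $parser->parseEmbedded($el); // flag off: no language key

$parser->lang = true;
$optedIn = $parser->parseEmbedded($el); // flag on: language key present
```

With the flag off the result has the same shape as before, which is the backwards-compatibility property the update is after.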

voxpelli May 07 '16 20:05

I'm interested in working on this, as I'm trying to add mf2 parsing to https://github.com/fguillot/picoFeed and it currently supports language detection for XML feeds.

Recent conversation: https://indiewebcamp.com/irc/2016-05-07#t1462646589527

A tricky scenario that @voxpelli raised with nested p-* and languages specific to them: https://indiewebcamp.com/irc/2016-05-07#t1462651125104

gRegorLove May 07 '16 20:05

@gRegorLove I'm looking at the parsed result and it looks like it's including an html-lang property in the wrong place.

<div class="h-entry" lang="sv" id="postfrag123">
  <h1 class="p-name">En svensk titel</h1>
  <div class="e-content" lang="en">With an <em>english</em> summary</div>
  <div class="e-content">Och <em>svensk</em> huvudtext</div>
</div>

{
    "type": [
        "h-entry"
    ],
    "properties": {
        "name": [
            "En svensk titel"
        ],
        "content": [
            {
                "html": "With an <em>english<\/em> summary",
                "value": "With an english summary",
                "html-lang": "en"
            },
            {
                "html": "Och <em>svensk<\/em> huvudtext",
                "value": "Och svensk huvudtext",
                "html-lang": "sv"
            }
        ],
        "html-lang": "sv"
    }
}

The html-lang property in the content is correct, but there's also an html-lang property inside properties which isn't what's described on the brainstorming page.
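In mf2 parsed output, each entry under `properties` is expected to be an array of values, so the stray `"html-lang": "sv"` string is easy to detect from the consumer side. The check below is a rough sketch: `findNonArrayProperties()` is a hypothetical helper written for this example, and the JSON is a trimmed copy of the output above.

```php
<?php
// Hypothetical helper: list property names whose value is not an array.
function findNonArrayProperties(array $item): array
{
    $bad = [];
    foreach ($item['properties'] ?? [] as $name => $values) {
        if (!is_array($values)) {
            $bad[] = $name;
        }
    }
    return $bad;
}

// Trimmed copy of the parsed result shown above.
$json = <<<'JSON'
{
    "type": ["h-entry"],
    "properties": {
        "name": ["En svensk titel"],
        "content": [
            {"html": "With an <em>english<\/em> summary", "value": "With an english summary", "html-lang": "en"}
        ],
        "html-lang": "sv"
    }
}
JSON;

$item = json_decode($json, true);
$bad = findNonArrayProperties($item);

print_r($bad); // lists "html-lang", the only property that is not an array
```

The `html-lang` inside the `content` value is not flagged, because it sits inside a property value rather than directly under `properties`.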

aaronpk May 27 '17 14:05

Yeah ... I had to solve this locally as well yesterday (it kept breaking iteration over the properties because the value isn't an array).

jkphl May 27 '17 15:05

I am moving the language parsing behind a feature flag until this is sorted out. That way you can opt in to have the language parsing happen, but must be aware that it's still experimental.

aaronpk May 27 '17 15:05

Ok. I'm generally interested as other formats support languages as well. Still working on implementing it though.

jkphl May 27 '17 15:05

Oops. I'll add some explicit tests for that and work on the fix.

gRegorLove May 27 '17 21:05

Fixed in #124!

I'll push out a new release with this change once #112 is done too!

aaronpk May 27 '17 22:05

@aaronpk ~~Before you push out a new release, will need to switch back to "html-lang" per https://chat.indieweb.org/microformats/2017-05-30/1496166813294000~~

Edit: disregard. Per later conversation, "lang" doesn't appear at the same level as any mf properties in the parsed results, so shouldn't cause conflicts.

gRegorLove May 30 '17 19:05