Figure out mf2 h-feed authorship
Source: HTML
Target: Atom/XML
Example: https://granary.io/url?input=html&output=atom&url=https://news.indieweb.org/en
Expected feed author: IndieNews en @ https://news.indieweb.org/en
Actual feed author: The first h-card on the page
Note: The feed id and title are correct, but the <author> element is not.
Suggested solution: Granary should follow the representative-h-card-parsing algorithm, and if no h-card is found then use <title> and page URL as the author, instead of incorrectly assuming the first h-card is the page's author.
thanks for filing! granary currently uses the authorship algorithm to find the feed author, but that's evidently for posts, not feeds. so i guess you're right, maybe i should use the h-feed's p-author mf2 property first, and if not provided, fall back to representative h-card.
lots more discussion on this recently on #indieweb-dev and on #microformats, but no conclusion. basically, we don't yet have an "authoritative" way to determine an h-feed's author, at least if it doesn't have an explicit p-author property. representative h-card and authorship algorithm are both related, but neither is the exact answer.
@tantek's comments here are perhaps the closest thing to a conclusion: basically, we still need to do some research and come up with an algorithm. we don't necessarily have the "right" one just yet.
snarfed, h-feed authorship is an interesting problem and worth researching & brainstorming properly rather than seeing if h-entry approaches “just work”, because that may be overdoing it. Better to collect examples (links, analysis) of h-feed elements that you’re trying to parse and analyze them to figure out a minimum algorithm based on examples. The “XML approach” would be to assume / require authors/publishers always use an author property and then “just” look for that. While a good starting point, it’s obviously a bad approach to optimize for developer convenience rather than researching reasonable real world examples and making sure to handle them. It’s also a bad approach to “just try” some other similar algorithm to see if it “just works”, as you’re likely making all sorts of bad assumptions by doing so. So I disagree with both “just use representative h-card” and “just use h-entry authorship but for h-feed”. There’s no shortcut here. If you want a good algorithm it has to start with documenting & analyzing real world publishing examples.
i'm not necessarily going to take on researching and creating this new h-feed authorship algorithm, but i will take two todos here:
- file an issue somewhere to track its need, maybe in microformats/microformats2-parsing
- switch granary to do something better, or at least "less wrong," in the meantime
Here is the algorithm I am using to parse feed author in the wild, quoted from indieweb/authorship/issues/4:
- If h-feed with p-author, author is p-author.
- If h-feed with u-url, and that URL has h-card matching u-url, author is that h-card.
- If h-feed with u-url, and that URL has no h-card matching u-url, author URL is u-url and name is page <title>.
- If h-feed with no u-url or p-author, author URL is page URL and name is page <title>.
- If no h-feed then no feed author.
This would at least fix the example feed parsing for this issue, setting the author to be "IndieNews en @ news.indieweb.org/en"
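For anyone who wants to experiment, here is a rough sketch of those five steps in Python. It is not granary's code and not a settled algorithm, just one reading of the list above. It assumes the page has already been parsed into the standard mf2 JSON structure (a dict with an `items` list, as mf2py produces). The names `feed_author`, `find_first`, `find_hcard_with_url`, and the optional `fetch` callable (which would return a `(parsed, title)` pair for another URL) are all hypothetical. It only looks at top-level items and their `children`, not at h-cards nested inside properties.

```python
def find_first(items, mf2_type):
    """Depth-first search through items and their children for a type."""
    for item in items:
        if mf2_type in item.get("type", []):
            return item
        found = find_first(item.get("children", []), mf2_type)
        if found is not None:
            return found
    return None


def find_hcard_with_url(items, url):
    """Return the first h-card whose u-url values include url, else None."""
    for item in items:
        urls = item.get("properties", {}).get("url", [])
        if "h-card" in item.get("type", []) and url in urls:
            return item
        found = find_hcard_with_url(item.get("children", []), url)
        if found is not None:
            return found
    return None


def synthetic_hcard(url, name):
    """Build a minimal h-card-shaped dict from a URL and a page title."""
    return {"type": ["h-card"], "properties": {"url": [url], "name": [name]}}


def feed_author(parsed, page_url, page_title, fetch=None):
    """Sketch of the five-step list; fetch is a hypothetical
    callable(url) -> (parsed_mf2, title) for pages other than this one."""
    hfeed = find_first(parsed.get("items", []), "h-feed")
    if hfeed is None:
        return None  # no h-feed, so no feed author

    props = hfeed.get("properties", {})
    if props.get("author"):
        return props["author"][0]  # explicit p-author wins

    if props.get("url"):
        feed_url = props["url"][0]
        if feed_url == page_url or fetch is None:
            # Assumption: without a fetcher, reuse the current page.
            feed_parsed, feed_title = parsed, page_title
        else:
            feed_parsed, feed_title = fetch(feed_url)
        hcard = find_hcard_with_url(feed_parsed.get("items", []), feed_url)
        if hcard is not None:
            return hcard
        return synthetic_hcard(feed_url, feed_title)

    # h-feed with neither u-url nor p-author: fall back to the page itself.
    return synthetic_hcard(page_url, page_title)
```

With an h-feed that has no p-author and no u-url, as in the IndieNews example, this falls through to the last step and returns the page URL and title rather than an h-card that happens to appear in one of the entries.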
i've taken a stab at this in 8e190da85b053c0bff287cc237806d880233ef40, but it's an ugly refactoring and nowhere near usable yet, and i don't see a clear path to get it merged. open to other thoughts or attempts!