
Schema.org structured data support (example with org-mode file).

Open MorphicResonance opened this issue 3 years ago • 5 comments

Processing metadata from org-mode files is part of my attempt to create org-structured, "content first", search-engine-friendly web pages. But the question is broader and concerns all users. With soupault we chose the conventional way of HTML formatting for design and abandoned templates, their variations, and other opinionated hints. The widely used, conventional vocabulary for machine-readable content markup is developed by schema.org. After some progress, I think these are the simple requirements for a plugin that should transfer meta tags into the <head> section of a web page.

  1. Data for meta tags such as the title and meta description.
  2. Machine-readable data for search engine robots as JSON-LD.

The good news is that it is likely possible to convert the input text directly, without having to write a separate YAML block for JSON-LD. I talked to the developers of Stencila; they will take care of it.

So only the first task, extracting data for the meta tags, remains for the plugin. The second item is handled by the converter. So:

  • If the user does not want JSON-LD in the web page, only microdata formatting, then the second task is not needed. Just extract the meta tags and run the converter to HTML.
  • If the user chooses the JSON-LD format, it has to be taken from the converter's output and put into a defined section of the page (usually the <head>): extract meta tags --> run the converter to convert the input into JSON --> run the converter again to convert the input into HTML.

My case is without JSON-LD and works with microdata; I am just noting the JSON-LD case. This is the input file:

#+begin_example

#+meta_title: this is a title of the page
#+meta_description: this is a metadescription of the page
#+title: A simple Org Mode article for testing
#+author: Nokome Bentley

* Introduction

A simple Org Mode article for testing. When making changes please note
that test snapshots based on this fixture may need to be updated.

* Methods

This is the methods section.

* Results

The results include a table (Table 1).

| Group | Value |
|-------+-------|
| A     | 1.1   |
| B     | 2.2   |

* Discussion

This is the discussion section.

#+end_example

The plugin should take "this is a title of the page" from #+meta_title: and "this is a metadescription of the page" from #+meta_description:. If #+meta_title: does not exist, take the value from #+title: (in that case the web page title and the article title will be identical).

Then delete the #+meta_... lines completely and leave everything else as is (#+title: should stay). The other properties will be used by the converter for microdata markup.

Then place the title and meta description values into the corresponding tags of the page:

<head>
....
<title>{{meta_title}}</title>
<meta name="description" content="{{meta_description}}" />
....
</head>
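
The steps above can be sketched in a few lines. This is only an illustration in Python, not a soupault plugin: the names extract_meta and render_head are hypothetical, and the placeholder syntax is taken from the <head> snippet above.

```python
import html
import re

# Hypothetical helpers for illustration; not part of soupault or Stencila.
# Matches whole "#+meta_<name>: value" lines, including the trailing newline.
META_LINE = re.compile(r"^#\+meta_(\w+):[ \t]*(.*)$\n?", re.MULTILINE | re.IGNORECASE)
TITLE_LINE = re.compile(r"^#\+title:[ \t]*(.*)$", re.MULTILINE | re.IGNORECASE)

HEAD_TEMPLATE = (
    "<title>{{meta_title}}</title>\n"
    '<meta name="description" content="{{meta_description}}" />'
)


def extract_meta(org_source):
    """Return (meta, cleaned_source).

    #+meta_* lines are removed from the source; #+title: is left in place
    and only used as a fallback for the page title.
    """
    meta = {key.lower(): value.strip() for key, value in META_LINE.findall(org_source)}
    if "title" not in meta:
        fallback = TITLE_LINE.search(org_source)
        if fallback:
            meta["title"] = fallback.group(1).strip()
    cleaned = META_LINE.sub("", org_source)
    return meta, cleaned


def render_head(meta, template=HEAD_TEMPLATE):
    """Fill the two placeholders, escaping values for use in HTML."""
    out = template.replace("{{meta_title}}", html.escape(meta.get("title", "")))
    return out.replace("{{meta_description}}", html.escape(meta.get("description", "")))
```

The cleaned source would then be passed to the converter unchanged, so the remaining keywords such as #+title: and #+author: stay available for microdata markup.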

This is a basic version of the plugin, since converting from org-mode to HTML with Stencila is still in development. But it is clear how the plugin should be written; I don't think there will be much difference from the above.

MorphicResonance avatar Nov 05 '21 21:11 MorphicResonance

I haven't forgotten your request.

Please remind me: the title field should go to the page <title> in its <head>, but what exactly do you want to do with the other fields?

Ideally, I'd like to see examples of source pages in the Org format and hand-written mockups of output pages you want to produce from them.

dmbaturin avatar Feb 05 '22 07:02 dmbaturin

Since soupault requires a single file as a potential page in the input, the structure of the input org file should be similar:

  1. A block of meta tags and data for the web page, escaped from pandoc.
  2. The body: paragraphs of text and other complex HTML blocks formatted as schema objects and escaped from pandoc. Escaping from pandoc is done by wrapping our data in these tags:
#+BEGIN_EXPORT html
our data, possibly HTML-formatted, escaped from pandoc.
May include microdata with schema objects.
Video, audio, and other objects for embedding into the article body of the page template.
#+END_EXPORT

So let's provide our data in the input .org file:

-----start of org file------

#+BEGIN_EXPORT html
<site-meta-data>
#+title: post 1 title
#+subtitle: Post 1 subtitle
#+description: Post 1 description
#+author: Billy
#+date: 2021-11-03
#+datepublished: 2021-06-02
#+usertags: fish, animal
#+summary: Post 1 summary
#+id: 1-test1com
</site-meta-data>
#+END_EXPORT

Fish are aquatic, craniate, gill-bearing animals that lack limbs with digits. 
They form a sister group to the tunicates, together forming the olfactores. 
Included in this definition are the living hagfish, lampreys, and cartilaginous and bony fish as well as various extinct related groups. 
Around 99% of living fish species are ray-finned fish, belonging to the class Actinopterygii, with over 95% belonging to the teleost subgrouping.
sentence.
** test heading 1
text 1
*** heading 2
text 2
Inermis indoctum vis in, has soleat complectitur te.

#+BEGIN_EXPORT html
        <div itemprop="video" itemscope itemtype="https://schema.org/VideoObject">
          <video controls poster="/video/big_buck_bunny.jpg">
            <source itemprop="contentUrl" type="video/mp4" src="/video/big_buck_bunny.mp4">
            <source itemprop="contentUrl" type="video/webm" src="/video/big_buck_bunny.webm">
            I’m sorry, your browser doesn’t support HTML5 video in MP4 with H.264 or WebM with VP8/VP9.
          </video>
          <p><small>Video copyright 2008, Blender Foundation / www.bigbuckbunny.org.</small></p>
          <meta itemprop="name" content="Video example">
          <meta itemprop="description" content="An example HTML5 video file.">
          <meta itemprop="duration" content="T60S">
          <meta itemprop="uploadDate" content="2018-09-21T10:44:26Z">
          <meta itemprop="thumbnailUrl" content="/video/big_buck_bunny.jpg">
        </div>
#+END_EXPORT
** conslusiom
bye

------end of input org file---------

So after processing by pandoc, org headings become HTML headings, escaped HTML stays as it was, and paragraphs become HTML-formatted <p>..</p>.

To let soupault detect and extract the data, I wrapped it in the special tags <site-meta-data></site-meta-data>. I would like soupault to extract it as an array and insert the different pieces, according to their purpose, into the meta tags of the page template and into the page body as well.

Basically, the block between <site-meta-data></site-meta-data> contains data that is inserted into different places of the template, usually not into the article body, because we write the body sequentially with the various embedded schema objects.

So if we apply the following template to the .org file as a web page
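
A minimal sketch of that extraction step, written in Python only for illustration (split_site_meta is a hypothetical name, and it assumes the block is wrapped exactly as in the example above: #+BEGIN_EXPORT html, then <site-meta-data>, then one #+key: value entry per line):

```python
import re

# Hypothetical helper for illustration; soupault itself does not provide it.
# Matches the whole export block that wraps <site-meta-data>...</site-meta-data>.
BLOCK = re.compile(
    r"#\+BEGIN_EXPORT html\s*\n<site-meta-data>\n(.*?)</site-meta-data>\s*\n"
    r"#\+END_EXPORT[ \t]*\n?",
    re.DOTALL | re.IGNORECASE,
)
# One "#+key: value" entry per line inside the block.
ENTRY = re.compile(r"^#\+(\w+):[ \t]*(.*)$", re.MULTILINE)


def split_site_meta(org_source):
    """Return (fields, body): the metadata as a dict and the source without the block."""
    match = BLOCK.search(org_source)
    if match is None:
        return {}, org_source
    fields = {key.lower(): value.strip() for key, value in ENTRY.findall(match.group(1))}
    body = org_source[: match.start()] + org_source[match.end():]
    return fields, body
```

The returned dict plays the role of the "array" mentioned above, and the remaining body can go to pandoc without the metadata block.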

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  </head>

and let soupault append some of our extracted data as meta tags for the web page, the title and meta description go into the <head> section. The dates #+datepublished: 2021-06-02 and #+date: 2021-11-03 go into the schema itemprops datePublished and dateModified. "Post 1 summary" from #+summary: goes into the "introduction" section as itemprop="text". Billy from #+author: goes into itemprop author -> name.
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<title>post 1 title</title>
<meta name="description" content="Post 1 description" />
  </head>
<!-- .. some template frontmatter. --> 

<!-- Article -->             
<article itemprop="blogPost" itemscope itemtype="http://schema.org/BlogPosting">

  <header>
    <h1 itemprop="name headline">post 1 title</h1>
    <p class="post-meta">
      by <span itemprop="author" itemscope itemtype="http://schema.org/Person">
        <span itemprop="name">Billy</span>
      </span>
      <meta itemprop="datePublished" content="2021-06-02">      
      <time itemprop="dateModified" datetime="2021-11-03">Nov 3, 2021</time>
    </p>
  </header>

  <div class="b0" itemprop="articleBody">
    <section class="introduction"><p itemprop="text">Post 1 summary</p></section> 
    <section class="content">
<p>Fish are aquatic, craniate, gill-bearing animals that lack limbs with digits. They form a sister group to the tunicates, together forming the olfactores. Included in this definition are the living hagfish, lampreys, and cartilaginous and bony fish as well as various extinct related groups. Around 99% of living fish species are ray-finned fish, belonging to the class Actinopterygii, with over 95% belonging to the teleost subgrouping. sentence.</p>
<h2 id="test-heading-1">test heading 1</h2>
<p>text 1</p>
<h3 id="heading-2">heading 2</h3>
<p>text 2</p>
<p>Inermis indoctum vis in, has soleat complectitur te.</p>

     <div itemprop="video" itemscope itemtype="https://schema.org/VideoObject">
          <video controls poster="/video/big_buck_bunny.jpg">
            <source itemprop="contentUrl" type="video/mp4" src="/video/big_buck_bunny.mp4">
            <source itemprop="contentUrl" type="video/webm" src="/video/big_buck_bunny.webm">
            I’m sorry, your browser doesn’t support HTML5 video in MP4 with H.264 or WebM with VP8/VP9.
          </video>
          <p><small>Video copyright 2008, Blender Foundation / www.bigbuckbunny.org.</small></p>
          <meta itemprop="name" content="Video example">
          <meta itemprop="description" content="An example HTML5 video file.">
          <meta itemprop="duration" content="T60S">
          <meta itemprop="uploadDate" content="2018-09-21T10:44:26Z">
          <meta itemprop="thumbnailUrl" content="/video/big_buck_bunny.jpg">
        </div>
<h2 id="conslusion">conslusiom</h2>
<p>bye</p>
</section>
  </div>

  <footer class="b3">
    <p class="post-meta">
      share buttons
    </p>
  </footer>
</article>
...
<footer></footer>

So we use the template to control the content style and insert our own data into it. Note that not all the data from the input (id, usertags) was used in this template, so the user should be able to define which named parts of the extracted array soupault inserts into the template, and where.
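
One way to picture such a user-defined selection is a small mapping from extracted field names to template placeholders. The sketch below is Python for illustration only; FIELD_MAP, the placeholder names, and the {{...}} syntax are assumptions, not an existing soupault option.

```python
import html

# Hypothetical user configuration: extracted field name -> template placeholder.
# Fields that are not listed (here: id, usertags) are simply not inserted.
FIELD_MAP = {
    "title": "title",
    "description": "description",
    "author": "author",
    "datepublished": "date_published",
    "date": "date_modified",
    "summary": "summary",
}


def render(template, fields, field_map=FIELD_MAP):
    """Replace each mapped {{placeholder}} with the escaped field value."""
    out = template
    for field, placeholder in field_map.items():
        value = html.escape(fields.get(field, ""))
        out = out.replace("{{" + placeholder + "}}", value)
    return out
```

With this shape, where a value ends up is decided by the placeholder position in the template, and which values are used is decided by the mapping.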

MorphicResonance avatar Feb 06 '22 22:02 MorphicResonance

Could you confirm or deny the following: an org-mode metadata entry will always start with #+, will always contain a string, and will always end with a newline? That is, will #\+(.*)\n be a safe regex for extracting metadata entries?
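
For reference, a quick check of that regex against a small sample, in Python purely as an illustration. The stricter pattern named STRICT is an assumption about what "metadata entry" should mean (line start, keyword, colon), not something soupault defines.

```python
import re

# The regex as proposed above: any "#+" followed by the rest of the line.
LOOSE = re.compile(r"#\+(.*)\n")
# Assumed stricter variant: "#+" at line start, a keyword, then a colon.
STRICT = re.compile(r"^#\+(\w+):[ \t]*(.*)$", re.MULTILINE)

sample = (
    "#+title: post 1 title\n"
    "#+BEGIN_EXPORT html\n"
    "<p>raw</p>\n"
    "#+END_EXPORT\n"
    "text with #+ inside\n"
)

# The loose pattern also picks up block delimiters and "#+" in running text.
loose_matches = LOOSE.findall(sample)
# The strict pattern keeps only "keyword: value" lines.
strict_matches = STRICT.findall(sample)
```

In this sample the loose pattern also matches the #+BEGIN_EXPORT and #+END_EXPORT lines, so anchoring at the line start and requiring the colon may be the safer choice if export blocks appear in the same file.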

Since soupault 4.0.0 supports a pre-parse hook, it's now possible to reimplement various types of front matter with that hook. Since that hook works on the page source before it's parsed and before it's decided whether it will be indexed or not, it will also have to produce text.

Does something like this look good to you? I assume the plugin should always put the rendered HTML before the page body. Let me know what you think.

[hooks.pre-parse]
  file = "hooks/org-mode-metadata.lua"
  template = """
    <h1 id="post-title">{{title}}</h1>
    ...
  """

dmbaturin avatar Apr 06 '22 09:04 dmbaturin

Yes, meta tags always start with #+ and end with a newline. Names are separated from values by a colon (:). I don't see how it can be done with a pre-parse hook, since we need to extract the values for the meta tags, save them in some kind of global variables, delete those lines, and then send the values into the HTML tree. So the pre-parse hook works only for deleting the lines with meta tags. I see the variant with rendering as a unified version of pandoc's "in the middle" Lua filters. But it is just the same dance with fake tags that I wrote about a long time ago.

MorphicResonance avatar Jul 29 '22 21:07 MorphicResonance

Yes, meta tags always start with #+ and end with a newline. Names are separated from values by a colon (:). I don't see how it can be done with a pre-parse hook, since we need to extract the values for the meta tags, save them in some kind of global variables, delete those lines, and then send the values into the HTML tree. So the pre-parse hook works only for extracting/deleting the lines with meta tags. I see the variant with rendering as a unified version of pandoc's "in the middle" Lua filters. But it is just the same dance with fake tags that I wrote about a long time ago.

MorphicResonance avatar Aug 05 '22 00:08 MorphicResonance