Excerpt Possibly Incorrect

Open andrew-d opened this issue 12 years ago • 19 comments

Hello,

The "excerpt" feature doesn't seem to work as I would expect - not sure if this is intended or not. Specifically, it seems to only work if the first <p> tag starts at the very beginning of the post. Here's an example - if you take the regex here, here's what you get:

>>> re.search(r'\A.*?(?:<p>(.+?)</p>)?', "<p>Excerpt at beginning</p> more text <p>blah</p>", re.M | re.S).groups()
('Excerpt at beginning',)
>>> re.search(r'\A.*?(?:<p>(.+?)</p>)?', "Stuff before the <p>Excerpt at beginning</p> more text <p>blah</p>", re.M | re.S).groups()
(None,)

If this isn't intended, I can try and fix it, if you want.

Also, a bit of a side-note: the regex is case-sensitive, so it doesn't work with upper-case <p> tags. This isn't so important, though.

>>> re.search(r'\A.*?(?:<p>(.+?)</p>)?', "<P>Excerpt at beginning</P> more text <p>blah</p>", re.M | re.S).groups()
(None,)
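
For what it's worth, the None results above follow from how the pattern is built: the lazy .*? is happy to match nothing, and the trailing ? makes the whole <p> group optional, so the search succeeds at position 0 without ever reaching a later paragraph. A minimal sketch of one possible fix, assuming the goal is simply "first paragraph anywhere, in any case" (the first_paragraph helper is hypothetical, not mynt's code):

```python
import re

# Hypothetical helper, not mynt's actual implementation.
# Drops the \A anchor and the optional group so the search can advance,
# and adds re.I so <P> is accepted as well as <p>.
# \b keeps the pattern from matching <pre>; [^>]* allows attributes.
FIRST_PARAGRAPH = re.compile(r'<p\b[^>]*>(.+?)</p>', re.S | re.I)


def first_paragraph(html):
    """Return the inner HTML of the first <p> element, or None if absent."""
    match = FIRST_PARAGRAPH.search(html)
    return match.group(1) if match else None
```

The trade-off, as far as I can tell, is that a post with any paragraph at all can no longer have an empty excerpt.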

Thanks!

andrew-d avatar Mar 18 '12 21:03 andrew-d

It definitely needs improvement.

Ideally you'd just be able to define the excerpt in a non hideous manner. Something like an inline Jinja tag or using the yaml front matter are ideas that come to mind but are equally as bad if not worse.

The issue with it only grabbing a paragraph if it's at the start is it may not be convenient with the layout of your posts. The upside being you have more control on what excerpt contains.

With it grabbing the first paragraph it sees no matter where it is the issue becomes losing the ability to have an empty excerpt. The upside being it may fit better with the layout of your posts.

I think rather than just tweaking the regex a better solution altogether is needed and I'm open to any suggestions.

uhnomoli avatar Mar 19 '12 02:03 uhnomoli

I think a simple solution would be using the YAML frontmatter. I looked into writing a Jinja2 extension, but I can think of too many problems for me to like that solution. Specifically - it takes the excerpt away from the actual document markup and makes it a part of the templating system, which doesn't seem like a good idea to me.

If there was something like "excerpt_delim" in the frontmatter, mynt could parse the generated code and set the excerpt to be everything before that delimiter, and then strip the delimiter from the generated HTML.

The issue I see with this solution is that the delimiter then needs to be chosen by the user, and you run into two issues:

  • If the delimiter-stripping is done before the markdown is rendered (i.e. you render the excerpt and the whole post separately), you run into the issue of stuff like footnotes, and so on, not working.
  • If the delimiter-stripping is done after the markdown is rendered, you get issues like the delimiter accidentally conflicting with the HTML that's generated

Not to mention, I'm not a huge fan of making the user choose the delimiter. I'll put some more thought into this - it's a nontrivial problem, for sure.

andrew-d avatar Mar 19 '12 02:03 andrew-d

Yeah, I was thinking of a similar idea, sort of hijacking a misaka renderer callback. Something like using the fenced code block but supplying a special value for the language to identify it as an excerpt. Not sure how I feel about it as it's still rather hackish and is misaka specific, which defeats the idea of mynt being renderer/parser independent.

Maybe the use of a rare HTML tag like <hr> to mark the end of the excerpt? Then provide a setting that's configurable on a post by post basis that would disable said behavior for that post in the case someone wants to actually use an <hr>?

uhnomoli avatar Mar 19 '12 03:03 uhnomoli

Yeah, I'm a fan of the parser-independence, so I'm not a fan of anything that ties into misaka / jinja2.

I came up with another idea, too. Something like this:

---
...
excerpt_end: "of the excerpt."
---

This is the first paragraph of the post.  It should contain some content, some other introductory statements, and then a hook to convince the reader to continue reading.  And this is the end of the excerpt.

This text will be shown in the main post, and is not part of the excerpt.

The idea being, the post will be parsed, and then mynt looks for the first instance of the stuff in quotes ("of the excerpt."), and grabs everything before that as the excerpt. It might throw a warning if it notices two instances of the search string, like in the above example at the end of the 2nd paragraph. The advantage of this: it's simple, and completely independent of how the content is parsed / rendered. Perhaps not the most elegant, though.
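
A rough sketch of what I mean, assuming the search runs over the post content as a plain string and that the marker text itself is kept as the tail of the excerpt (that second part is my assumption; the function name excerpt_up_to is made up for illustration):

```python
import warnings


def excerpt_up_to(content, excerpt_end):
    """Return content up to and including the first occurrence of excerpt_end.

    Hypothetical helper for illustration only. Returns None when the marker
    text does not occur, and warns when it occurs more than once.
    """
    index = content.find(excerpt_end)
    if index == -1:
        return None
    if content.count(excerpt_end) > 1:
        warnings.warn("excerpt_end text occurs more than once; using the first")
    # Keep the marker text, since it is part of the author's own sentence.
    return content[:index + len(excerpt_end)]
```

Nothing here depends on the renderer, which is the appeal; the cost is that the user has to pick a phrase for every post.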

Also, I'd hesitate to use <hr>; I use that in my website for layout purposes, for example. Maybe something that no self-respecting person would ever use, like <marquee> or <blink>?

andrew-d avatar Mar 19 '12 04:03 andrew-d

Not really sure I like the excerpt_end idea. Requires a bit of work on the user's part to have to define that for every post as there isn't really a sane default and it's really not all that intuitive.

As for the <hr> idea, it'd only matter if you used <hr>s in your posts (i.e. the Markdown file). The reason I liked it was that it makes sense semantically, and Markdown (as well as most markup languages) has its own syntax for <hr>s, so it's pretty clean.

A problem I could see with just using some random tag/string is that other renderers may be configured to strip certain/invalid HTML. Although I can see the <hr> not being very elegant either, as some people may use <hr>s in their posts, which would leave them unable to use excerpts.

Tricky problem indeed :s

uhnomoli avatar Mar 19 '12 04:03 uhnomoli

Yeah, good point on the lack of a sane default for excerpt_end.

As for the <hr> - using that makes a lot of sense, the more I think about it. Especially since something like reStructuredText can make one by using "4 or more repeated punctuation characters", according to the documentation. I wouldn't mind seeing an option to have the parser leave the generated <hr> in the text, as opposed to stripping it out. Stripping it should probably be the default, though.

You'd have to handle <hr>, <hr/> and <hr />, case-insensitively, but it's also fairly simple - I like it :)

andrew-d avatar Mar 19 '12 04:03 andrew-d

If you want to do it that way, you probably want something like this:

excerpt = re.search(r'\A(.*?)<hr ?/?>', content, re.M | re.S | re.I).group(1)

If you want it fancy and accepting <hr> and <hr /> but not <hr >, then this would work:

excerpt = re.search(r'\A(.*?)<hr(?:(?: )?/)?>', content, re.M | re.S | re.I).group(1)
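
One caveat with both one-liners: re.search returns None when the post has no <hr>, so calling .group(1) directly would raise AttributeError. (re.M is also a no-op here, since the pattern uses \A and no ^ or $.) Here's a slightly fuller sketch that guards against that and strips the <hr> by default, with an option to leave it in. The split_on_hr name and the keep_hr flag are invented for this example, not anything in mynt:

```python
import re

# Accepts <hr>, <hr/> and <hr />, case-insensitively, but not <hr >.
HR = re.compile(r'<hr(?: ?/)?>', re.I)


def split_on_hr(content, keep_hr=False):
    """Return (excerpt, body) split at the first <hr>.

    Hypothetical helper. excerpt is None when no <hr> is present, in which
    case body is the content unchanged. Unless keep_hr is true, the first
    <hr> is removed from the returned body.
    """
    match = HR.search(content)
    if match is None:
        return None, content
    excerpt = content[:match.start()]
    if keep_hr:
        return excerpt, content
    return excerpt, excerpt + content[match.end():]
```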

andrew-d avatar Mar 19 '12 04:03 andrew-d

Hrm, just thought of an issue with the <hr> idea. Recently on request, and it makes sense, I changed excerpt to not be wrapped in HTML. Bit trickier with this method to grab the excerpt not wrapped in HTML and depending on the method could lead to some confusion.

Not sure if you're any more familiar than I am (not very) with any of the popular CMSs, but do you know how they handle excerpts? Any worthy ideas to pull from there?

uhnomoli avatar Mar 19 '12 05:03 uhnomoli

I'm not incredibly familiar with the way that other CMSs do it, but some quick Googling gives me this: http://codex.wordpress.org/Excerpt

Specifically, it lets you specify the excerpt yourself, or it grabs the first 55 words from the post, or it looks for the <!--more--> HTML comment. Since Markdown actually passes HTML comments through (it does when I try it here, anyway), I like that idea. It's non-intrusive, and seems to be relatively parser-independent. reST seems to have issues with the double-dashes, so perhaps something like <!-- excerpt_break --> would work? So, in Markdown:

This is the excerpt!

<!-- excerpt_break --> 

This is the rest of the post's content.

EDIT: And the regex would be something like: r'\A(.*?)<!-- *?excerpt_break *?-->'
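
Spelled out a little more, and assuming the renderer really does pass the comment through untouched (worth checking per parser). Note that the pattern needs re.S so that . can cross the newlines between paragraphs; excerpt_from_comment is just a name I picked for the sketch:

```python
import re

# Same idea as the regex above; " *" allows any amount of padding inside
# the comment, including none.
EXCERPT_BREAK = re.compile(r'\A(.*?)<!-- *excerpt_break *-->', re.S)


def excerpt_from_comment(html):
    """Return everything before the excerpt_break comment, or None if absent.

    Hypothetical helper, shown for illustration only.
    """
    match = EXCERPT_BREAK.search(html)
    return match.group(1).strip() if match else None
```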

andrew-d avatar Mar 19 '12 05:03 andrew-d

Yeah, that's pretty much the same thing as the <hr> just with a comment instead. I guess it is a bit better as it doesn't rely on someone not using a tag, not sure how most parsers handle HTML comments though. Also still has the issue of having to grab the excerpt not wrapped in HTML and that could lead to some confusion depending on the implementation.

Would be nice to get some more input on this. I'm about to push out 0.2 which adds a default theme and init, watch, and serve subcommands which should make mynt a bit more accessible so maybe some newcomers will see this and have some ideas.

For now I think the <hr> or HTML comment idea is the best, though I'm not sure what the least confusing manner of grabbing the excerpt not wrapped in HTML would be.

uhnomoli avatar Mar 19 '12 07:03 uhnomoli

Ok, maybe it's just me, but I'm not sure I understand what you mean by "not wrapped in HTML". Can you clarify what you mean?

andrew-d avatar Mar 19 '12 07:03 andrew-d

I mean the excerpt should never contain any block level elements.

uhnomoli avatar Mar 19 '12 07:03 uhnomoli

It would also be great to allow the possibility to set the start of the excerpt as well. Using comments that would be easy to accomplish with:

<!-- excerpt_start -->
My excerpt.
<!-- excerpt_end -->

If there is no excerpt_start, parse from the beginning of the document until the excerpt_end. If there is no excerpt_end, parse x words from the beginning (where x could be a setting with a default value of 50 words or something).
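
Something along these lines, as a naive sketch only. The function name excerpt_between and the max_words parameter are placeholders, the word count is a plain whitespace split (so HTML tags would count as words), and when only excerpt_start is present I've assumed the words are counted from that marker:

```python
import re

START = re.compile(r'<!-- *excerpt_start *-->')
END = re.compile(r'<!-- *excerpt_end *-->')


def excerpt_between(content, max_words=50):
    """Return the text between the start and end comments.

    Hypothetical helper. Without a start marker the excerpt begins at the
    top of the document; without an end marker it falls back to the first
    max_words whitespace-separated words.
    """
    start = START.search(content)
    begin = start.end() if start else 0
    end = END.search(content, begin)
    if end:
        return content[begin:end.start()].strip()
    return ' '.join(content[begin:].split()[:max_words])
```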

xintron avatar Sep 20 '12 05:09 xintron

The issue with using comments, as discussed above, is grabbing the excerpt not wrapped in HTML. I imagine the only time you'd want to use the start/end comments would be when you want an excerpt that's longer than 1 paragraph. Which means the excerpt would have to be wrapped in HTML. So then sometimes you'd be getting an excerpt that is wrapped in HTML and sometimes not.

The default is, and probably always will be, the first paragraph of the post. If someone wants the first x words of the post they can just use the truncate filter: {{ post.content|truncate(x) }}.

uhnomoli avatar Sep 20 '12 18:09 uhnomoli

The issue I wanted to lift was to give the possibility to set the start for the excerpt since you might not want it to be read from the beginning of the post but start at a sentence in the middle or anywhere in the post.

xintron avatar Sep 21 '12 07:09 xintron

I guess that just seems odd to me because I always viewed the excerpt serving as a summary and it doesn't make much sense to have a summary of a post anywhere other than at the beginning.

uhnomoli avatar Sep 21 '12 08:09 uhnomoli

But don't you usually put the summary at the end of a post instead of in the header? Anyway, it might be good to keep this in mind when discussing a solution for the excerpt.

xintron avatar Sep 21 '12 08:09 xintron

I need to stop using GitHub when I'm just about to go to sleep or have just woken up like now :s

Sorry, summary was somewhat the wrong word and I'm drawing a blank for a better one. I feel the excerpt serves to give the reader a brief overview of what the post is about to help them decide if they want to read the entire post or not. I don't know why a paragraph in the middle of a post would do a better job of that than the first.

I suppose an argument could be made for using the last paragraph. Would an option to choose between the first or last paragraph be sufficient?

This issue pretty much is the discussion of a solution :p I'm open to any and all ideas. We've only really had 2 solutions come up and I'm not really fond of either. I think to really tackle this issue, we need more information. Specifically, where do people want the excerpt to come from and how do they want that excerpt to be handled.

uhnomoli avatar Sep 21 '12 19:09 uhnomoli

Hello, I'm glad to find a discussion about this already.

Regarding automatic excerpt: It is not common to write blog post summaries anywhere. The first paragraph however is commonly a good representation as it should be an introduction.

When in doubt I suggest to reproduce http://jekyllrb.com/docs/posts/#post-excerpts behavior.

I actually came to report that .excerpt produces "None" when the first line of the post contains a header (which might be bad style, which is probably why it didn't get noticed) and produces an image if the first line of the post contains an image (which might be desired though). These issues are currently showing up on pyladies.com/blog

EDIT: I just realized that if the first line is surrounded by "*" (making it italic), the paragraph also produces "None".

qubodup avatar Oct 05 '14 16:10 qubodup