ftr-site-config Undocumented patterns

I'm trying to add some validation tests for the site config files to catch mistakes in them, and I found undocumented patterns in some of the files (I'll follow up with fixes for other files).

Here is the list:

convert_double_br_tags, for example .blog.163.com.txt
strip_comments, for example .blogspot.com.txt
move_into, for example 500px.com.txt
autodetect_next_page, for example 5by5.tv.txt
dissolve, for example acroswing.fr.txt
native_ad_clue, for example arstechnica.com.txt
footnotes, for example blogs.msdn.com.txt
wrap_in, for example blogs.smithsonianmag.com.txt
if_page_contains, for example gamasutra.com.tx
single_page_link_in_feed, for example techmeme.com.txt
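
A check of this kind could be sketched roughly as below. This is only an illustration: the `DOCUMENTED` set is a deliberately incomplete, hypothetical allow-list, the sample config values are made up, and the `name(arg): value` form is assumed from directives such as `replace_string`.

```python
import re

# Hypothetical and incomplete allow-list, for illustration only.
DOCUMENTED = {
    "title", "body", "strip", "replace_string",
    "single_page_link", "single_page_link_in_feed", "test_url",
}

# Matches "name: value" and "name(arg): value" at the start of a line.
DIRECTIVE_RE = re.compile(r"^\s*([a-z_]+)(?:\([^)]*\))?\s*:")


def unknown_directives(config_text, documented=DOCUMENTED):
    """Return the sorted directive names that are not in `documented`."""
    unknown = set()
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        match = DIRECTIVE_RE.match(line)
        if match and match.group(1) not in documented:
            unknown.add(match.group(1))
    return sorted(unknown)


# Made-up sample config, not taken from the repository.
SAMPLE = """\
# sample site config
title: //h1
body: //div[@id='post']
replace_string(foo): bar
dissolve: //a[@class='dico']
strip_comments: yes
test_url: http://example.com/article
"""

print(unknown_directives(SAMPLE))
```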

I was wondering whether these patterns are obsolete, new, unused, etc. I can't find them in the documentation or in the current open source version of Full-Text RSS. Were they introduced in the current version of Full-Text RSS? (That would mean we can't see how they are handled.)

Let me know :slightly_smiling_face:

Feb 01 '17 13:02 j0k3r

Hey, thanks for the list. Most of these are carried over from Instapaper when I imported their site rules. They no longer have them public, but it used to be open for anyone to contribute (like this repository). I didn't implement all their directives, so most of these will just be ignored. Here's the list from Instapaper (at least I think all of these are from them, some might be users experimenting/guessing):

convert_double_br_tags
strip_comments
move_into
autodetect_next_page
dissolve
footnotes
wrap_in

Of these, I'd like to implement dissolve. I think that removes the containing element without removing the contents. Would've been useful for that French site which had special links for regular words (linked to a dictionary I think). We ended up with a somewhat hacky solution. But dissolve would've come in useful.
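
To make that reading of dissolve concrete, here is a minimal sketch, assuming it simply unwraps each matched element and leaves its text and children in place. It is not the Instapaper or Full-Text RSS implementation; the `dissolve` function name and the sample markup are hypothetical, and Python's `xml.etree.ElementTree` only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, index, text):
    """Attach `text` just before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def dissolve(root, path):
    """Remove each element matching `path` but keep its contents in place."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for target in root.findall(path):
        parent = parents.get(target)
        if parent is None:
            continue  # the root element itself cannot be dissolved
        index = list(parent).index(target)
        children = list(target)
        # Leading text goes where the element used to start.
        _append_text(parent, index, target.text)
        # Move the children up, preserving their order.
        for offset, child in enumerate(children):
            parent.insert(index + offset, child)
            parents[child] = parent
        parent.remove(target)
        # Trailing text follows the last moved child (or the leading text).
        _append_text(parent, index + len(children), target.tail)
    return root


# Hypothetical markup: a regular word wrapped in a dictionary link.
page = ET.fromstring('<p>Un <a href="/dico/mot">mot</a> simple.</p>')
dissolve(page, ".//a")
print(ET.tostring(page, encoding="unicode"))  # <p>Un mot simple.</p>
```

Under this assumption only the matched wrapper disappears; its parent and its children both stay in the document.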

These others are implemented in Full-Text RSS:

native_ad_clue Introduced in Full-Text RSS 3.4. Used to identify if a given article is a native ad. Ad Detector has a lot of rules.

if_page_contains Introduced in Full-Text RSS 3.5. This is only used with single_page_link at the moment. Added to make single_page_link directives conditional. Sometimes these rules use XPath functions like concat, like in the example you linked to:

  single_page_link: concat(//meta[@property="og:url"]/@content, '?print=1')
  if_page_contains: //a[contains(@class, "articleNav")]

Here, single_page_link will always return a string, so even if the meta element doesn't exist, you'll get '?print=1'. For some sites, the single page view is only available on multi-page articles. When constructing URLs like this, we need a way to make it conditional. Otherwise we'd end up redirecting to a non-existent page, or simply unnecessarily requesting another page when the current one contains everything we need. So that's what if_page_contains does at the moment.
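
The logic described above can be sketched as follows. This is an illustration, not the Full-Text RSS code: the function names and sample pages are hypothetical, and because `xml.etree.ElementTree` does not support the XPath `concat` and `contains` functions, both are emulated in plain Python.

```python
import xml.etree.ElementTree as ET


def build_single_page_url(page):
    """Emulates concat(//meta[@property="og:url"]/@content, '?print=1')."""
    meta = page.find(".//meta[@property='og:url']")
    base = meta.get("content", "") if meta is not None else ""
    # Like concat(), this always yields a string, even with no meta element.
    return base + "?print=1"


def page_contains_article_nav(page):
    """Emulates //a[contains(@class, "articleNav")]."""
    return any("articleNav" in a.get("class", "") for a in page.iter("a"))


def resolve_single_page_link(page):
    """Return the single-page URL only when the condition matches."""
    if not page_contains_article_nav(page):
        return None  # keep the current page, no extra request
    return build_single_page_url(page)


# Hypothetical pages: one multi-page article, one single-page article.
MULTI = ET.fromstring(
    '<html><head><meta property="og:url" content="http://example.com/a"/></head>'
    '<body><a class="articleNav next" href="?page=2">Next</a></body></html>'
)
SINGLE = ET.fromstring("<html><head/><body><p>Whole article.</p></body></html>")

print(resolve_single_page_link(MULTI))
print(build_single_page_url(SINGLE))
print(resolve_single_page_link(SINGLE))
```

Without the condition, the second page would still produce the bare string `?print=1`, which is the case if_page_contains is meant to guard against.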

single_page_link_in_feed This one should be documented, but it's not widely used. Basically the same as single_page_link but applied to the original feed item's description. So safe to ignore if the input URL is not a feed. See this question and our help page.

Feb 01 '17 23:02 fivefilters

Of these, I'd like to implement dissolve.

It might be a good idea. From what I understand, it'll flatten the target node? Like:

<ul>
  <li>
    <div>my text</div>
  </li>
</ul>

If I have dissolve: //ul/li, it'll turn the node into:

    <div>my text</div>

Am I right?

Thanks for the explanation of the patterns implemented in Full-Text RSS.

As for the unused ones, maybe we can just remove them from the site config files to avoid confusion?

convert_double_br_tags
strip_comments
move_into
autodetect_next_page
footnotes
wrap_in

Feb 02 '17 09:02 j0k3r