ftr-site-config
Undocumented patterns
I'm trying to add some validation tests for the site config files to catch mistakes in them, and I found some undocumented patterns in a few of them (after this, I'll submit some fixes for other files).
Here is the list:
- convert_double_br_tags, for example .blog.163.com.txt
- strip_comments, for example .blogspot.com.txt
- move_into, for example 500px.com.txt
- autodetect_next_page, for example 5by5.tv.txt
- dissolve, for example acroswing.fr.txt
- native_ad_clue, for example arstechnica.com.txt
- footnotes, for example blogs.msdn.com.txt
- wrap_in, for example blogs.smithsonianmag.com.txt
- if_page_contains, for example gamasutra.com.txt
- single_page_link_in_feed, for example techmeme.com.txt
I was wondering whether these patterns are obsolete, new, unused, etc. I can't find them in the documentation or in the current open source version of Full-Text RSS. Were they introduced in the current version of Full-Text RSS (which would mean we can't see how they are handled)?
Let me know :slightly_smiling_face:
Hey, thanks for the list. Most of these are carried over from Instapaper when I imported their site rules. They no longer have them public, but it used to be open for anyone to contribute (like this repository). I didn't implement all their directives, so most of these will just be ignored. Here's the list from Instapaper (at least I think all of these are from them, some might be users experimenting/guessing):
- convert_double_br_tags
- strip_comments
- move_into
- autodetect_next_page
- dissolve
- footnotes
- wrap_in
Of these, I'd like to implement dissolve. I think that removes the containing element without removing the contents. Would've been useful for that French site which had special links for regular words (linked to a dictionary I think). We ended up with a somewhat hacky solution. But dissolve would've come in useful.
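To make that reading of dissolve concrete, here is a minimal Python sketch, assuming dissolve means "drop the matched element but keep its text and children in place". The `dissolve` function, its tag-name argument (instead of a full XPath expression), and the sample markup are all hypothetical; this is not how Instapaper or Full-Text RSS implement it.

```python
import xml.etree.ElementTree as ET

def dissolve(root, tag):
    """Remove every <tag> element below root, keeping its text and children.

    Hypothetical helper: matches by tag name only, and never dissolves
    the root element itself (it has no parent to merge into).
    """
    # ElementTree has no parent pointers, so collect (parent, child) pairs first.
    pairs = [(p, c) for p in root.iter() for c in list(p) if c.tag == tag]
    # Work backwards so nested matches are merged before their ancestors.
    for parent, elem in reversed(pairs):
        idx = list(parent).index(elem)
        lead = elem.text or ""
        children = list(elem)
        tail = elem.tail or ""
        # Text that followed the element now follows its last child,
        # or joins the leading text when there are no children.
        if children:
            children[-1].tail = (children[-1].tail or "") + tail
        else:
            lead += tail
        # Leading text attaches to whatever came just before the element.
        if idx == 0:
            parent.text = (parent.text or "") + lead
        else:
            prev = parent[idx - 1]
            prev.tail = (prev.tail or "") + lead
        parent.remove(elem)
        for offset, child in enumerate(children):
            parent.insert(idx + offset, child)
    return root

# Dictionary-style link around a regular word, as on the French site.
page = ET.fromstring('<p>Le <a href="/dict/chat">chat</a> dort.</p>')
print(ET.tostring(dissolve(page, "a"), encoding="unicode"))  # <p>Le chat dort.</p>
```

Under this assumption only the matched element disappears; anything around it stays where it was.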
These others are implemented in Full-Text RSS:
native_ad_clue
Introduced in Full-Text RSS 3.4. Used to identify if a given article is a native ad. Ad Detector has a lot of rules.
if_page_contains
Introduced in Full-Text RSS 3.5. This is only used with single_page_link at the moment; it was added to make single_page_link directives conditional. Sometimes these rules use XPath functions like concat, as in the example you linked to:
single_page_link: concat(//meta[@property="og:url"]/@content, '?print=1')
if_page_contains: //a[contains(@class, "articleNav")]
Here, single_page_link will always return a string, so even if the meta element doesn't exist, you'll get '?print=1'. For some sites, the single page view is only available on multi-page articles. When constructing URLs like this, we need a way to make it conditional. Otherwise we'd end up redirecting to a non-existent page, or simply unnecessarily requesting another page when the current one contains everything we need. So that's what if_page_contains does at the moment.
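The behaviour described above can be sketched in Python. This only illustrates the logic and is not the Full-Text RSS code: the function names are hypothetical, and because the standard library's ElementTree does not support XPath functions such as concat or contains, both expressions are emulated in plain Python on well-formed sample markup.

```python
import xml.etree.ElementTree as ET

def single_page_url(page, suffix="?print=1"):
    """Emulates concat(//meta[@property="og:url"]/@content, '?print=1')."""
    meta = page.find(".//meta[@property='og:url']")
    base = meta.get("content", "") if meta is not None else ""
    # Like concat(), this always yields a string, even without the meta element.
    return base + suffix

def page_contains_article_nav(page):
    """Emulates if_page_contains: //a[contains(@class, "articleNav")]."""
    return any("articleNav" in a.get("class", "") for a in page.iter("a"))

def resolve_single_page_link(page):
    """Return the single-page URL only when the condition matches, else None."""
    if not page_contains_article_nav(page):
        return None  # keep the current page, no extra request
    return single_page_url(page)

multi = ET.fromstring(
    '<html><head><meta property="og:url" content="https://example.com/story"/></head>'
    '<body><a class="articleNav next" href="?page=2">2</a></body></html>'
)
single = ET.fromstring("<html><head/><body><p>Whole article here.</p></body></html>")

print(single_page_url(single))           # ?print=1
print(resolve_single_page_link(single))  # None
print(resolve_single_page_link(multi))   # https://example.com/story?print=1
```

Without the condition, the one-page article would still produce the bare '?print=1' string and trigger a pointless or broken request.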
single_page_link_in_feed
This one should be documented, but it's not widely used. It's basically the same as single_page_link, but applied to the original feed item's description, so it's safe to ignore if the input URL is not a feed. See this question and our help page.
> Of these, I'd like to implement dissolve.
It might be a good idea. From what I understand, it'll flatten the target node? Like:
<ul>
<li>
<div>my text</div>
</li>
</ul>
If I have dissolve: //ul/li, it'll turn the node into:
<div>my text</div>
Am I right?
Thanks for the explanation of the patterns implemented in Full-Text RSS.
For the unused ones, maybe we can just remove them from the site config files to avoid confusion?
- convert_double_br_tags
- strip_comments
- move_into
- autodetect_next_page
- footnotes
- wrap_in