ttrss_plugin-af_feedmod
Add post-processing or more specific content selection
N24.de has additional content inside their main article DIV. There should be some filter or a more specific way of selecting the desired content to use.
I expect this can be done with a more extensively defined XPath query. Below are some examples (not from N24.de) which might be useful. Today is a slow news day, so I don't know yet whether tt-rss works well with these queries. Based on XPath validators, they should.
edit: ahh, unfortunately this does not seem to work with your code. So far you only use the first entry from the query, instead of adding all of them to the article text.
Selecting several specific divs / tags:
  //h1 | //h2 | //h3
  //div[@id='artikelKolom']/div[@class='zaktxt clear']/div[@class='zak_normal'] | //div[@id='artikelKolom']/p
Note: sequence matters when doing it like this! //h1 | //h2 | //h3 will first show all h1's, followed by all h2's and then all h3's.
  //div[@id='artikelKolom']/*[contains(@class,'zaktxt') or name()='p']
Note: sequence does not seem to matter here; the order follows the order in the file.
Select all divs with certain classes. The divs do not need to share a parent:
  //div[@class='content illustrated' or @class='post-body']
  //div[contains(@class,'illustration top')] | //div[contains(@class,'post-body')]
  //div[contains(@class,'illustration top') or contains(@class,'post-body')]
Note: not sure whether sequence matters.
Select all children of div id='artikelKolom', except children with div class='broodtxt' or div class='bannercenter ...':
  //div[@id='artikelKolom']/*[@class!='broodtxt']
  //div[@id='artikelKolom']/*[not(@class='broodtxt')]
  //div[@id='artikelKolom']/*[not(contains(@class, 'broodtxt'))]
  //div[@id='artikelKolom']/*[not(contains(@class, 'broodtxt')) and not(contains(@class, 'bannercenter'))]
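The exclusion variants above are not equivalent: in XPath 1.0, @class!='broodtxt' only matches elements that actually have a class attribute, while the not(...) forms also keep elements without one. The sketch below emulates both behaviours in plain Python (standard-library ElementTree, not the plugin's PHP) so the difference is easy to see. The sample markup and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on the class names quoted above.
SAMPLE = """
<div id="artikelKolom">
  <div class="zaktxt clear">Lead</div>
  <p>Body paragraph without a class attribute</p>
  <div class="broodtxt">Breadcrumb</div>
  <div class="bannercenter wide">Advert</div>
</div>
"""

def keep_children(container, excluded=("broodtxt", "bannercenter")):
    """Emulate /*[not(contains(@class,'broodtxt')) and not(contains(@class,'bannercenter'))]."""
    kept = []
    for child in container:
        css_class = child.get("class", "")  # a missing attribute behaves like ""
        if not any(word in css_class for word in excluded):
            kept.append(child)
    return kept

def keep_children_neq(container, value="broodtxt"):
    """Emulate /*[@class!='broodtxt']: elements without a class never match."""
    return [c for c in container
            if c.get("class") is not None and c.get("class") != value]

root = ET.fromstring(SAMPLE)
print([c.text for c in keep_children(root)])      # the lead and the paragraph
print([c.text for c in keep_children_neq(root)])  # the lead and the advert
```

Note how the != variant silently drops the plain p element, which is usually the content one wants to keep.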
I think it'll get too complicated if you need to "puzzle" the result together like this. Also it'll get worse when the source changes its layout (like N24 did some days ago).
Maybe I'll implement a blacklist which will remove certain XPath elements from the result. I think this is more robust.
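A rough sketch of the blacklist idea, written in Python with the standard library rather than in the plugin's PHP: select one main node, then remove every descendant matched by a list of unwanted paths. The markup, the paths and the function name extract_with_blacklist are assumptions for illustration only, and ElementTree supports just a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup.
SAMPLE = """
<html><body>
  <div id="article">
    <h1>Headline</h1>
    <p>First paragraph.</p>
    <div class="related">Related links</div>
    <p>Second paragraph.</p>
    <div class="social"><span>Share</span></div>
  </div>
</body></html>
"""

def extract_with_blacklist(document, main_path, blacklist):
    """Select one main node, then drop every descendant matched by a blacklist path."""
    main = document.find(main_path)
    if main is None:
        return None
    # ElementTree elements do not know their parent, so build a child -> parent map.
    parent_of = {child: parent for parent in main.iter() for child in parent}
    for path in blacklist:
        for unwanted in main.findall(path):
            # Note: remove() also discards the text that follows the removed element.
            parent_of[unwanted].remove(unwanted)
    return main

doc = ET.fromstring(SAMPLE)
article = extract_with_blacklist(
    doc,
    ".//div[@id='article']",
    [".//div[@class='related']", ".//div[@class='social']"],
)
print([child.tag for child in article])  # ['h1', 'p', 'p']
```

The main query stays simple, and a layout change on the source site only requires adjusting the blacklist entries.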
A blacklist would be really nice :D Also, I have a big problem with welt.de ... their feed URL links to an overview page... there should be a rewrite of the source URL like: http://www.welt.de/?config=articleidfromurl&artid=115415142 should be http://www.welt.de/article115415142
It would be fantastic to see these features :D
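The requested rewrite could look roughly like this. It is a Python sketch using only the standard library, not part of the plugin; the function name rewrite_welt_url is hypothetical, and the target pattern is taken solely from the single example URL above.

```python
from urllib.parse import urlsplit, parse_qs

def rewrite_welt_url(url):
    """Turn the overview-style link into a direct article link if it carries an artid."""
    parts = urlsplit(url)
    art_ids = parse_qs(parts.query).get("artid")
    if not art_ids or not art_ids[0].isdigit():
        return url  # leave anything unexpected untouched
    return f"{parts.scheme}://{parts.netloc}/article{art_ids[0]}"

print(rewrite_welt_url("http://www.welt.de/?config=articleidfromurl&artid=115415142"))
# → http://www.welt.de/article115415142
```

URLs without a numeric artid parameter are returned unchanged, so the rewrite would be safe to apply to every link in the feed.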
Hi,
is there a way to use all entries from the query, instead of adding only the first one to the article text?
div[@class='news-single-item']/p ==> only returns the first found p content
div[@id='news-single-item']/*[not(div[@class='comments'])] ==> doesn't work :(
Thank you for your answer.
Kasad
Yes, but you need to make some changes to the init.php file. I did this last weekend and this week it seems to work as expected. See https://github.com/bfly75/ttrss_plugin-af_feedmod.
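To illustrate the difference between using only the first match and concatenating all matches, here is a small Python sketch with the standard library's ElementTree. It does not reproduce the actual change to init.php; the sample markup and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical page using the class name from the question above.
SAMPLE = """
<html><body>
  <div class="news-single-item">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
    <div class="comments"><p>A reader comment.</p></div>
  </div>
</body></html>
"""

QUERY = ".//div[@class='news-single-item']/p"

def first_match(document, query):
    """Behaviour described in the thread: only the first node is used."""
    node = document.find(query)
    return "" if node is None else ET.tostring(node, encoding="unicode").strip()

def all_matches(document, query):
    """Desired behaviour: serialize every matching node and join the results."""
    return "".join(
        ET.tostring(node, encoding="unicode").strip()
        for node in document.findall(query)
    )

doc = ET.fromstring(SAMPLE)
print(first_match(doc, QUERY))  # only the first paragraph
print(all_matches(doc, QUERY))  # both article paragraphs, without the comment
```

Because the query selects only direct p children, the paragraph inside the comments div is left out without needing a not(...) predicate.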
Wow, thank you very much - this works awesome :D
I think post-processing should also rip out (at least) id, class and style attributes from the content. Some pages I fetch using feedmod have elements with ids such as "overlay" in them that pick up tt-rss's styling, making things look wonky.
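A minimal sketch of such an attribute cleanup, in Python with the standard library. It assumes the fetched fragment is well-formed XML (real-world HTML would need an HTML-aware parser), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Attributes that can pick up the reader's own styling.
STRIP = ("id", "class", "style")

def strip_presentational_attributes(fragment):
    """Remove id, class and style from every element in a well-formed fragment."""
    root = ET.fromstring(fragment)
    for element in root.iter():
        for name in STRIP:
            element.attrib.pop(name, None)  # ignore attributes that are absent
    return ET.tostring(root, encoding="unicode")

cleaned = strip_presentational_attributes(
    '<div id="overlay" class="box"><p style="color:red">Text</p><a href="/x">link</a></div>'
)
print(cleaned)  # → <div><p>Text</p><a href="/x">link</a></div>
```

Other attributes such as href and src are left alone, so links and images keep working.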
@bfly75: Thanks for that modification! @mbirth: You should consider incorporating bfly75's modification, maybe by creating a new type (e.g. xpath-all-matches).
I just merged changes from @rangerer which add a new "cleanup" option to remove unwanted parts from the main XPath node. He has also provided a lot of examples.
Another thing this one should do: make all URLs absolute (i.e. fully qualified, including "http://www.example.org/"), because, as in #22, relative images are not shown.
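Resolving relative links against the article URL could be sketched like this in Python, using only the standard library. The fragment is assumed to be well-formed XML, only a href and img src are handled, and the function name absolutize and the example URLs are made up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Which attribute carries the URL for each tag handled in this sketch.
URL_ATTRIBUTES = {"a": "href", "img": "src"}

def absolutize(fragment, base_url):
    """Rewrite relative href/src values so they are fully qualified."""
    root = ET.fromstring(fragment)
    for element in root.iter():
        attr = URL_ATTRIBUTES.get(element.tag)
        if attr and element.get(attr):
            # urljoin leaves already-absolute URLs unchanged.
            element.set(attr, urljoin(base_url, element.get(attr)))
    return ET.tostring(root, encoding="unicode")

html = ('<div><img src="/images/a.png" /><a href="more.html">more</a>'
        '<a href="http://other.example/x">x</a></div>')
result = absolutize(html, "http://www.example.org/news/story.html")
print(result)
```

Root-relative paths resolve against the host, while document-relative ones resolve against the directory of the article URL.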
Hi,
after my ttrss crashed I couldn't use the version of bfly75 any longer. Could you please add his way to display more than one div?
Greetings K