ttrss_plugin-af_feedmod

Add post-processing or more specific content selection

Open mbirth opened this issue 11 years ago • 11 comments

N24.de has additional content inside their main article DIV. There should be some filter or a more specific way of selecting the desired content to use.

mbirth avatar Apr 09 '13 20:04 mbirth

I expect this can be done with a more detailed XPath query. Below are some examples (not from N24.de) that might be useful. Today is a slow news day, so I don't know yet whether tt-rss works well with these queries. According to XPath validators, they should.

edit: ahh, unfortunately this does not seem to work with your code. So far you only use the first entry from the query, instead of adding all of them to the article text.

Selecting several specific divs / tags:
//h1 | //h2 | //h3
//div[@id='artikelKolom']/div[@class='zaktxt clear']/div[@class='zak_normal'] | //div[@id='artikelKolom']/p
Note: sequence matters when doing it like this! //h1 | //h2 | //h3 will show first all h1's, followed by all h2's and then all h3's
//div[@id='artikelKolom']/*[contains(@class,'zaktxt') or name()='p']
Note: sequence does not seem to matter, sequence is based on sequence in file

Select all div's with certain classes. No need for the div's to have the same parent:
//div[@class='content illustrated' or @class='post-body']
//div[contains(@class,'illustration top')] | //div[contains(@class,'post-body')]
//div[contains(@class,'illustration top') or contains(@class,'post-body')]
Note: not sure whether sequence matters

Select all children of div id='artikelKolom', except children with div class='broodtxt' or div class='bannercenter ...':
//div[@id='artikelKolom']/*[@class!='broodtxt']
//div[@id='artikelKolom']/*[not(@class='broodtxt')]
//div[@id='artikelKolom']/*[not(contains(@class, 'broodtxt'))]
//div[@id='artikelKolom']/*[not(contains(@class, 'broodtxt')) and not(contains(@class, 'bannercenter'))]

bfly75 avatar Apr 14 '13 12:04 bfly75

I think it'll get too complicated if you need to "puzzle" the result together like this. Also it'll get worse when the source changes its layout (like N24 did some days ago).

Maybe I'll implement a blacklist which will remove certain XPath elements from the result. I think this is more robust.
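The blacklist idea could look roughly like the sketch below. It is written in Python with the standard library rather than in the plugin's PHP, and it is only an illustration: the sample markup, the class names `ad` and `related`, and the helper `remove_blacklisted` are all made up, and the input is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed article node; a real page would need an HTML parser.
PAGE = """
<div id="article">
  <p>Article text.</p>
  <div class="ad">Advertisement</div>
  <div class="body"><p>More text.</p><div class="related">Related links</div></div>
</div>
"""

def remove_blacklisted(main, blacklist):
    """Delete every node under `main` that matches one of the blacklist queries."""
    # ElementTree keeps no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in main.iter() for child in parent}
    for query in blacklist:
        for node in main.findall(query):
            parent = parents.get(node)
            if parent is not None:
                # Note: remove() also drops any text that directly follows the node.
                parent.remove(node)
    return main

main = ET.fromstring(PAGE.strip())
remove_blacklisted(main, [".//div[@class='ad']", ".//div[@class='related']"])

# Collapse whitespace so the remaining text is easy to inspect.
remaining_text = " ".join("".join(main.itertext()).split())
print(remaining_text)
```

The main query stays simple (one node), and each unwanted part is a separate, independent entry, so a layout change on the source site would usually mean touching one blacklist entry instead of rebuilding a long combined query.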

mbirth avatar Apr 14 '13 13:04 mbirth

A blacklist would be really nice :D Also, I have a big problem with welt.de ... their feed URL links to an overview page... there should be a rewrite of the source URL, like: http://www.welt.de/?config=articleidfromurl&artid=115415142 should be http://www.welt.de/article115415142

It would be fantastic to see these features :D
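A rough sketch of such a rewrite, in Python with the standard library. The function name `rewrite_welt_url` is made up, and the rule is derived only from the single example URL pair above, so other welt.de link formats may need different handling.

```python
from urllib.parse import urlsplit, parse_qs

def rewrite_welt_url(url):
    """Map the overview-style link to the direct article link, if it matches."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # Only rewrite links that follow the pattern from the example above.
    if query.get("config") == ["articleidfromurl"] and "artid" in query:
        article_id = query["artid"][0]
        if article_id.isdigit():
            return f"{parts.scheme}://{parts.netloc}/article{article_id}"
    return url  # leave anything else untouched

print(rewrite_welt_url("http://www.welt.de/?config=articleidfromurl&artid=115415142"))
```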

Kasad avatar Apr 19 '13 09:04 Kasad

Hi,

is there a way to use all entries from the query, instead of adding only the first one to the article text?

div[@class='news-single-item']/p ==> only returns the first found p content

div[@id='news-single-item']/*[not(div[@class='comments'])] ==> doesn't work :(

Thank you for your answer.

Kasad

Kasad avatar Apr 21 '13 10:04 Kasad

Yes, but you need to make some changes to the init.php file. I did this last weekend and this week it seems to work as expected. See https://github.com/bfly75/ttrss_plugin-af_feedmod.
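The idea, sketched in Python with the standard library (this is not the actual init.php change; the sample markup and the helpers `extract_first` and `extract_all` are made up for illustration, and the input is assumed to be well-formed XML):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; a real page would need an HTML parser.
PAGE = """
<html><body>
  <div class="news-single-item">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

def extract_first(root, query):
    """Serialize only the first match (the behaviour asked about above)."""
    node = root.find(query)
    return "" if node is None else ET.tostring(node, encoding="unicode").strip()

def extract_all(root, query):
    """Serialize every match and join them in document order."""
    return "".join(
        ET.tostring(node, encoding="unicode").strip()
        for node in root.findall(query)
    )

root = ET.fromstring(PAGE.strip())
query = ".//div[@class='news-single-item']/p"

first_only = extract_first(root, query)
all_matches = extract_all(root, query)
print(first_only)
print(all_matches)
```

So the difference is only whether the code stops at the first node of the result or loops over the whole result and appends each node to the article text.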

bfly75 avatar Apr 21 '13 10:04 bfly75

Wow, thank you very much - this works awesome :D

Kasad avatar Apr 21 '13 11:04 Kasad

I think post-processing should also rip out (at least) id, class and style attributes from the content. Some pages I fetch using feedmod have elements with ids such as "overlay" in them that pick up tt-rss's styling, making things look wonky.
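Something along these lines, sketched in Python with the standard library. The helper `strip_attributes` and the sample fragment are made up, and the fragment is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Attributes that can pick up the reader's own styling.
STRIP = ("id", "class", "style")

def strip_attributes(root, names=STRIP):
    """Remove the given attributes from `root` and all of its descendants."""
    for element in root.iter():
        for name in names:
            element.attrib.pop(name, None)  # no error if the attribute is absent
    return root

# Hypothetical fetched fragment with an id that would clash with the reader's CSS.
fragment = ET.fromstring(
    '<div id="overlay" class="box">'
    '<p style="color:red">Text <a href="/x" class="l">link</a></p>'
    '</div>'
)
cleaned = ET.tostring(strip_attributes(fragment), encoding="unicode")
print(cleaned)
```

Other attributes such as href and src are left alone, so links and images keep working.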

uusijani avatar Apr 26 '13 12:04 uusijani

@bfly75: Thanks for that modification! @mbirth: You should consider incorporating bfly75's modification, maybe by creating a new type (e.g. xpath-all-matches).

tbar avatar May 07 '13 07:05 tbar

I just merged changes from @rangerer which add a new "cleanup" option to remove unwanted parts from the main XPath node. He has also provided a lot of examples.

mbirth avatar Jun 20 '13 10:06 mbirth

Another thing this one should do: make all URLs absolute (i.e. fully qualified, including "http://www.example.org/"), because, as in #22, images with relative URLs are not shown.
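A rough sketch of the idea in Python with the standard library. The helper `absolutize`, the attribute list and the sample URLs are made up for illustration, and the fragment is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Attributes assumed to carry URLs; a fuller version might cover more of them.
URL_ATTRIBUTES = ("src", "href")

def absolutize(root, base_url):
    """Resolve every src/href under `root` against the article's own URL."""
    for element in root.iter():
        for name in URL_ATTRIBUTES:
            if name in element.attrib:
                element.set(name, urljoin(base_url, element.get(name)))
    return root

# Hypothetical article fragment with relative, root-relative and absolute URLs.
fragment = ET.fromstring(
    '<div><img src="images/a.png"/>'
    '<a href="/news/1">more</a>'
    '<a href="http://other.example/x">external</a></div>'
)
absolutize(fragment, "http://www.example.org/articles/story.html")

urls = [el.get("src") or el.get("href") for el in fragment.iter() if el.attrib]
print(urls)
```

URLs that are already absolute pass through unchanged, so the step can be applied to every fetched article without checking first.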

mbirth avatar Jul 30 '13 14:07 mbirth

Hi,

After my tt-rss crashed, I could no longer use bfly75's version. Could you please add his way of displaying more than one div?

Greetings K

Kasad avatar Aug 13 '13 21:08 Kasad