
[WIP] Add publisher pravda (Ukrainska Pravda)

Open bucheben opened this issue 2 months ago • 6 comments

WIP, because I require external input to complete this PR.

Ukrainska Pravda is a news site that publishes most articles in Ukrainian, English, and Russian. The sitemap leads me directly to the NewsMap page. There, the different language versions of the same article are grouped together and distinguished by the hreflang attribute. Does Fundus support this, or should I simply select a single language for now?

The primary issue for me right now is how to extract the article body, as this is the part on which the tutorial focuses the least.

This English article, for instance, has the entire article body (only!) in the easily accessible precomputed.ld data. Meanwhile, this Ukrainian article spreads the body out over many separate tags. This Ukrainian finance article suddenly places the author somewhere completely different.

Was I unfortunate with my choice of news site, or am I missing too much domain knowledge in DOM navigation?

bucheben, Oct 23 '25 12:10

@bucheben Thanks so much for your work so far!

Was I just unlucky with my choice of news site, or am I missing some key knowledge about DOM navigation?

Regarding the sitemap: you definitely picked a challenging one 😅, but I’m glad you did! It actually seems like a great opportunity to extend Fundus with some useful new functionality, so thank you for choosing this outlet.

For now, I’d suggest continuing to use the existing Fundus implementation, as you’ve already done, and simply include the sitemap as well.

As for the article body: after looking through a few examples, it seems that despite the different languages, the articles all share the same layout. For example, in all three cases, the paragraphs can be selected using this CSS selector:

div.post_news_text > p
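As a rough illustration of what that selector matches, here is a minimal sketch using only Python's standard library. The HTML snippet is made up for the example, and the ElementTree path is only an approximation of the CSS selector (it compares the class attribute exactly). Inside a Fundus parser the selector would be used as-is rather than translated like this.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet that mimics the layout described above.
SAMPLE_HTML = """
<html><body>
  <div class="post_news_text">
    <p>First paragraph.</p>
    <blockquote><p>Nested quote, not a direct child.</p></blockquote>
    <p>Second paragraph.</p>
  </div>
  <div class="sidebar"><p>Unrelated teaser.</p></div>
</body></html>
"""

root = ET.fromstring(SAMPLE_HTML)

# Rough equivalent of the CSS selector "div.post_news_text > p":
# only <p> elements that are direct children of the div are matched.
paragraphs = [
    "".join(p.itertext()).strip()
    for p in root.findall(".//div[@class='post_news_text']/p")
]

print(paragraphs)
```

The child combinator (`>`) is what keeps the nested quote and the sidebar teaser out of the result.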

This Ukrainian finance article suddenly places the author somewhere completely different.

If possible, try to extract the author information from the article’s metadata — for example, from the ld or meta tags located in the precomputed section.

Let me know if that resolves your issue or if there’s anything else I can help with!

MaxDall, Oct 24 '25 09:10

Thanks for the quick response! Your CSS selector works nicely. I'm not sure what I did when I concluded that the article body was only in the ld section of that one article.

The retrieval of the author is odd, to say the least. For instance, the two most recent articles are the same one in Ukrainian and Russian.

Russian article `self.precomputed.ld.__dict__`
{'NewsArticle': {'@context': 'http://schema.org', 'name': 'Египет стал главным покупателем украинского зерна', '@type': 'NewsArticle', 'mainEntityOfPage': {'@type': 'WebPage', '@id': 'https://epravda.com.ua/rus/biznes/egipet-stal-glavnym-pokupatelem-ukrainskogo-zerna-813311/'}, 'headline': 'Египет стал главным покупателем украинского зерна', 'datePublished': '2025-10-24 15:10:00', 'dateModified': '2025-10-24 15:10:00', 'image': {'@type': 'ImageObject', 'url': '?q=90&w=1920', 'height': 1200, 'width': 1200}, 'author': {'@type': 'Organization', 'name': 'Экономическая правда', 'alternateName': 'Экономическая правда'}, 'description': 'Египет стал основным импортером украинского зерна в октябре.', 'publisher': {'type': 'Organization', 'name': 'Экономическая правда', 'logo': {'@type': 'ImageObject', 'url': 'https://epravda.com.ua/epravda/i/ep_logo.svg', 'image': 'https://epravda.com.ua/epravda/i/ep_logo.svg', 'width': 100, 'height': 100}}}, 'BreadcrumbList': {'@context': 'http://schema.org', '@type': 'BreadcrumbList', 'itemListElement': [{'@type': 'ListItem', 'position': 1, 'item': {'@id': '/', 'name': 'Экономическая правда'}}, {'@type': 'ListItem', 'position': 2, 'item': {'@id': 'Бизнес', 'name': 'https://epravda.com.ua/rus/biznes/'}}, {'@type': 'ListItem', 'position': 3, 'item': {'@id': 'https://epravda.com.ua/rus/biznes/egipet-stal-glavnym-pokupatelem-ukrainskogo-zerna-813311/', 'name': 'Египет стал главным покупателем украинского зерна'}}]}, 'ProfilePage': {'@context': 'https://schema.org', '@type': 'ProfilePage', 'mainEntity': {'@type': 'Person', 'identifier': 5383, 'image': 'https://img.epravda.com.ua/epravda/journalist/images/doc/f/0/46204/f0552447071e174c53fb0ccc8e3e9693.jpeg', 'description': 'редактор стрічки новин', 'name': 'Андрій Муравський'}}, '_LinkedDataMapping__xml': None}
Ukrainian article `self.precomputed.ld.__dict__`
{'NewsArticle': {'@context': 'http://schema.org', 'name': 'Гринчук запевняє, що держкомпанії запустять  400 МВт розподіленої генерації до кінця року', '@type': 'NewsArticle', 'mainEntityOfPage': {'@type': 'WebPage', '@id': 'https://epravda.com.ua/energetika/skilki-rozpodilenoji-generaciji-derzhkompaniji-zapustyat-do-kincya-2025-roku-813310/'}, 'headline': 'Гринчук запевняє, що держкомпанії запустять  400 МВт розподіленої генерації до кінця року', 'datePublished': '2025-10-24 14:50:00', 'dateModified': '2025-10-24 14:50:00', 'image': {'@type': 'ImageObject', 'url': 'https://img.epravda.com.ua/epravda/images/doc/2/f/55588/2fb1bb061339220fc50832e157b70fb2.jpeg?q=90&w=1920', 'height': 672.1649484536083, 'width': 1200}, 'author': {'@type': 'Organization', 'name': 'Економічна правда', 'alternateName': 'Економічна правда'}, 'description': 'До кінця 2025 року державні компанії планують встановити ще 400 МВт розподіленої газової генерації.', 'publisher': {'type': 'Organization', 'name': 'Економічна правда', 'logo': {'@type': 'ImageObject', 'url': 'https://epravda.com.ua/epravda/i/ep_logo.svg', 'image': 'https://epravda.com.ua/epravda/i/ep_logo.svg', 'width': 100, 'height': 100}}}, 'BreadcrumbList': {'@context': 'http://schema.org', '@type': 'BreadcrumbList', 'itemListElement': [{'@type': 'ListItem', 'position': 1, 'item': {'@id': '/', 'name': 'Економічна правда'}}, {'@type': 'ListItem', 'position': 2, 'item': {'@id': 'Енергетика', 'name': 'https://epravda.com.ua/energetika/'}}, {'@type': 'ListItem', 'position': 3, 'item': {'@id': 'https://epravda.com.ua/energetika/skilki-rozpodilenoji-generaciji-derzhkompaniji-zapustyat-do-kincya-2025-roku-813310/', 'name': 'Гринчук запевняє, що держкомпанії запустять  400 МВт розподіленої генерації до кінця року'}}]}, 'ProfilePage': {'@context': 'https://schema.org', '@type': 'ProfilePage', 'mainEntity': {'@type': 'Person', 'identifier': 2172, 'image': 
'https://img.epravda.com.ua/epravda/journalist/images/doc/0/4/2250/041f5fe-victor-volokhita-160.jpg', 'description': 'Редактор новин "Економічної правди"\r\n<br>\r\n<br>В ЕП з травня 2024 року. До цього останні 10 років працював у виданні "Наші гроші".', 'name': 'Віктор Волокіта'}}, '_LinkedDataMapping__xml': None}

Both articles appear to list Андрій Муравський as the author on the website. In the Russian version, I can simply retrieve the author with self.precomputed.ld.xpath_search('ProfilePage/mainEntity/name'). In the Ukrainian version, that name doesn't appear in the ld at all; it lists Віктор Волокіта in the same place instead. Having written all this out, I suspect this might just be an error on the news site's side(?) :/
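To make the mismatch concrete, here is a small sketch that walks the same ProfilePage/mainEntity/name path through plain dictionaries trimmed down from the two dumps above. `get_path` is a hypothetical helper written for this example; it is not how Fundus's xpath_search is implemented.

```python
from typing import Any, Optional


def get_path(data: dict, path: str) -> Optional[Any]:
    """Follow a slash-separated key path through nested dicts (hypothetical helper)."""
    current: Any = data
    for key in path.split("/"):
        if not isinstance(current, dict) or key not in current:
            return None  # path does not exist in this document
        current = current[key]
    return current


# Trimmed-down copies of the two ld dumps shown above.
russian_ld = {
    "ProfilePage": {"mainEntity": {"@type": "Person", "name": "Андрій Муравський"}}
}
ukrainian_ld = {
    "ProfilePage": {"mainEntity": {"@type": "Person", "name": "Віктор Волокіта"}}
}

path = "ProfilePage/mainEntity/name"
russian_author = get_path(russian_ld, path)
ukrainian_author = get_path(ukrainian_ld, path)

# The same path yields different people for the two versions.
print(russian_author, ukrainian_author, russian_author == ukrainian_author)
```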

However, this seems to be mostly working for now. Further testing turned up news articles that break the parser entirely, e.g., https://www.pravda.com.ua/news/2025/10/24/8004300/. This link is listed in the newsmap and appears perfectly normal, but it actually redirects to another site, eurointegration.com.ua, which has a different layout. Can I filter these somehow, even though their URLs appear normal in the newsmap?

On another note, none of the news articles I looked at had clear subheadings. Some do have <bold> lines. However, as in this article, some bold lines are not really subheadings, so for now I have not set a subheadings selector. The articles do have a summary, but I'm stumped on how to extract it, as it is stored in an attribute rather than as the text content of an element.
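For reference, reading a value out of an attribute rather than an element's text can be sketched with the standard library as follows. Which element and attribute hold the summary on this site is not stated here, so the meta tag and the sample HTML below are purely assumptions for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaContentParser(HTMLParser):
    """Collects the `content` attribute of the first <meta name=...> tag that matches."""

    def __init__(self, meta_name: str) -> None:
        super().__init__()
        self.meta_name = meta_name
        self.content: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.content is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.meta_name:
            # The value lives in an attribute, not in the element's text.
            self.content = attributes.get("content")


# Made-up page; the real site may store the summary elsewhere.
SAMPLE_HTML = (
    '<html><head><meta name="description" content="A one-sentence summary.">'
    "</head><body><p>Body text.</p></body></html>"
)

parser = MetaContentParser("description")
parser.feed(SAMPLE_HTML)
summary = parser.content
print(summary)
```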

When I run pytest, it throws a bunch of errors. Perhaps because I'm on a newer Python version (3.13) than the project.

ERROR tests/test_filter.py - DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Con...

This prevents any tests from running, because these count as errors during test collection.
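For context, the warning refers to the long-deprecated ast.Str node class, whose replacement is ast.Constant. Where exactly ast.Str is used (the project itself or one of its dependencies) is not clear from the message, so the following is only a generic sketch of the replacement pattern, with a made-up helper name.

```python
import ast


def string_literals(source: str) -> list:
    """Return all string literals in `source` without touching ast.Str."""
    tree = ast.parse(source)
    return [
        node.value
        for node in ast.walk(tree)
        # ast.Constant covers strings, numbers, None, etc., so check the value type.
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    ]


found = string_literals('x = "a"\ny = 1\nz = "b"\n')
print(sorted(found))
```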

mypy src also runs into errors in unrelated files:

src/fundus/parser/utility.py:505: error: List item 9 has incompatible type "tuple[str, str, str, str]"; expected "tuple[str, str] | tuple[str, str, str]"  [list-item]
src/fundus/parser/utility.py:507: error: List item 11 has incompatible type "tuple[str, str, str, str]"; expected "tuple[str, str] | tuple[str, str, str]"  [list-item]

(I haven't added the sitemap yet, will do)

bucheben, Oct 24 '25 13:10

For instance, the two most recent articles are the same one in Ukrainian and Russian.

I’d rely on the most trustworthy source, in this case, the author listed directly on the page. You can extract it using a CSS selector such as span.post_news_author.

Additionally, I’ve noticed that many problematic articles come from a different (though similar) domain: https://epravda.com.ua/ instead of the original https://www.pravda.com.ua/. This likely stems from the alternative articles. If that’s the case, focus only on those from https://www.pravda.com.ua/.

This link is listed in the newsmap and looks normal, but it actually redirects to another site, eurointegration.com.ua, which has a different layout. Can I filter these somehow, even though their URLs appear normal in the newsmap?

Yes, you can use the url_filter parameter in Publisher, for example:

DieWelt = Publisher(
    name="Die Welt",
    ...
    url_filter=regex_filter("/Anlegertipps-|/videos?[0-9]{2}|/mediathek/"),
)

This filters out URLs containing specific substrings (e.g., Anlegertipps). Fundus filters work inversely to Python's built-in filter(): a URL is dropped when the filter evaluates to True. To allow URLs based on a substring rather than exclude them, wrap the filter in the Fundus inverse function:

from fundus.scraping.filter import inverse
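To illustrate the inverted semantics, here is a self-contained sketch. The `regex_filter` and `inverse` functions below are simplified stand-ins written for this example so that it runs without Fundus installed; they only mirror the behaviour described above and are not the library's actual code. The eurointegration URL is made up.

```python
import re
from typing import Callable, List

URLFilter = Callable[[str], bool]


def regex_filter(pattern: str) -> URLFilter:
    """Stand-in: the returned filter is True (= drop the URL) when the pattern matches."""
    compiled = re.compile(pattern)
    return lambda url: bool(compiled.search(url))


def inverse(url_filter: URLFilter) -> URLFilter:
    """Stand-in: flip a filter so that matching URLs are kept instead of dropped."""
    return lambda url: not url_filter(url)


# Drop everything that is NOT hosted on www.pravda.com.ua.
only_pravda = inverse(regex_filter(r"^https?://www\.pravda\.com\.ua/"))

urls: List[str] = [
    "https://www.pravda.com.ua/news/2025/10/24/8004300/",
    "https://epravda.com.ua/rus/biznes/egipet-stal-glavnym-pokupatelem-ukrainskogo-zerna-813311/",
    "https://www.eurointegration.com.ua/news/example/",  # hypothetical URL
]

# A URL survives only if the filter evaluates to False for it.
kept = [url for url in urls if not only_pravda(url)]
print(kept)
```

In this sketch the filter only ever sees the URL string it is given, so it decides purely on the listed address.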

When I run pytest, it throws a bunch of errors. Perhaps because I'm on a newer Python version (3.13) than the project.

Thanks for pointing that out! It seems Fundus currently has compatibility issues with Python versions above 3.12. I’ll open an issue to investigate further, but for now, I recommend using an older Python version.

mypy src also runs into errors in unrelated files:

That was a known issue on our end (#806) and should be resolved once you merge the latest master branch into your branch.

MaxDall, Oct 29 '25 13:10

With these changes, I'm now happy with the state of the publisher.

bucheben, Oct 29 '25 16:10

Hm I messed up the history somehow

bucheben, Oct 30 '25 10:10

fixed it :)

bucheben, Oct 30 '25 10:10