Add new publisher “The Mirror”

Open TingC99 opened this issue 1 year ago • 2 comments

Hey, I'm trying to add a new publisher, "The Mirror", but I'm having trouble running this step: `python -m scripts.generate_parser_test_files -p TheMirror`. It keeps showing 0%, as if it is stuck and not making any progress:

TheMirror: 0%| | 0/1 [00:00<?, ?it/s]

Does anyone have any ideas?

TingC99 avatar Apr 26 '24 15:04 TingC99

Hi, thanks for adding The Mirror :). From what I can see, you are still missing a function to extract the topics. The Mirror seems to support the meta tags keywords and news_keywords. The publishing date is not extracted either; you can save yourself some trouble here and use the meta tag parsely-pub-date. The actual content does not seem to be extracted either. The script won't run properly because it is set up to fetch an article for which every attribute you implemented is extracted successfully. If, for example, the publishing date is missing, it skips that article and tries the next one. That has been happening over and over for you. I would recommend running this script:


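from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.uk.TheMirror)

# only_complete=False also yields articles with missing attributes,
# so you can see which fields are not being extracted yet
for article in crawler.crawl(max_articles=30, only_complete=False):
    print(article.title)
    print(article.html.responded_url)
    print(article.publishing_date)
    print(article.authors)
    print(article.topics)
    print(article.plaintext)
    print("------ New Article ------:\n")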

to verify your progress.
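To illustrate the meta tags mentioned above, here is a minimal standard-library sketch of reading keywords and parsely-pub-date from a page head. It is not the fundus parser API: the MetaTagCollector class and the sample HTML are made up for illustration, and a real parser in the repository would use the project's own helpers.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs into a dict (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name] = content


# Made-up article head, for illustration only.
SAMPLE_HTML = """
<html><head>
<meta name="keywords" content="Politics,Weather">
<meta name="news_keywords" content="Politics,Weather">
<meta name="parsely-pub-date" content="2024-04-28T13:00:00Z">
</head><body></body></html>
"""

collector = MetaTagCollector()
collector.feed(SAMPLE_HTML)

# Topics: split the comma-separated keywords and drop empty entries.
topics = [t.strip() for t in collector.meta.get("keywords", "").split(",") if t.strip()]
# Publishing date: still a raw string at this point and needs date parsing.
raw_publishing_date = collector.meta.get("parsely-pub-date")

print(topics)
print(raw_publishing_date)
```

If either value comes back empty for a real article, that attribute would count as missing and the test-file script would keep skipping articles.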

addie9800 avatar Apr 27 '24 18:04 addie9800

Thank you for your advice. I have added the topics and the publishing date, but I also ran into this error: `ValueError: Invalid isoformat string: '2024-04-28T13:00:00Z'`

But theoretically this should be in ISO 8601 format.

TingC99 avatar Apr 28 '24 14:04 TingC99

So, as it turns out, I misunderstood you. The parsing error occurs when running pytest and generating the test files, not when running fundus itself. The reason this was happening is that we were using datetime.datetime.fromisoformat() in the backend, which (before Python 3.11) does not accept the trailing Z that JavaScript appends by default to mark a UTC timestamp. Using dateutil, this problem is solved. I took the liberty of making some minor changes as well :) Thanks a lot for adding this.
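The trailing-Z failure described here can be reproduced and worked around with the standard library alone. This is a minimal sketch, not the change that was made in fundus; the helper name parse_iso_utc is hypothetical.

```python
from datetime import datetime, timezone


def parse_iso_utc(raw: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    # Before Python 3.11, datetime.fromisoformat() rejects the 'Z' suffix,
    # so rewrite it as the equivalent explicit offset first.
    if raw.endswith("Z"):
        raw = raw[:-1] + "+00:00"
    return datetime.fromisoformat(raw)


parsed = parse_iso_utc("2024-04-28T13:00:00Z")
print(parsed.isoformat())
```

For comparison, python-dateutil's parser.isoparse accepts the Z suffix directly, which is presumably why switching to it resolved the error.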

addie9800 avatar May 02 '24 14:05 addie9800

@addie9800 It seems like someone added the date by hand; otherwise I can't think of how the Z ended up there. I overwrote the test case.

MaxDall avatar May 02 '24 18:05 MaxDall

Yeah, you're right. That was sort of weird; this closes #505, though. It seemed a lot like that was the cause.

addie9800 avatar May 02 '24 20:05 addie9800