yarb
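RSS format issue
In yarb.py, when parseThread parses the RSS XML, some feeds put the updated_parsed field in the feed block rather than in the entries. In that case d ends up empty and the date lookup raises an error: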
'entries': [
]
...
'feed': {
    'title': 'Talkback Tech',
    'title_detail': {'type': 'text/plain', 'language': None, 'base': '', 'value': 'Talkback Tech'},
    'links': [
        {'rel': 'alternate', 'type': 'text/html', 'href': 'https://talkback.sh/tech/feed/'},
        {'href': 'https://talkback.sh/tech/feed/', 'rel': 'self', 'type': 'application/atom+xml'}
    ],
    'link': 'https://talkback.sh/tech/feed/',
    'subtitle': 'Latest technical resources on Talkback',
    'subtitle_detail': {'type': 'text/html', 'language': None, 'base': '', 'value': 'Latest technical resources on Talkback'},
    'language': 'en-us',
    'updated': 'Mon, 05 Aug 2024 03:08:08 +0000',
    'updated_parsed': time.struct_time(tm_year=2024, tm_mon=8, tm_mday=5, tm_hour=3, tm_min=8, tm_sec=8, tm_wday=0, tm_yday=218, tm_isdst=0)
},
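The fix adds a check on the d variable: when no entry-level date is found, d is taken from the feed block instead. In addition, some RSS feeds only list links published on the current day, so links published today and yesterday are now collected together to avoid missing the current day's items: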
...
 for entry in r.entries:
     d = entry.get('published_parsed') or entry.get('updated_parsed')
+    if(not d):
+        d = (r.feed.updated_parsed)
     yesterday = datetime.date.today()# + datetime.timedelta(-1)
     pubday = datetime.date(d[0], d[1], d[2])
-    if (pubday == yesterday) and filter(entry.title):
+    if (pubday == yesterday or datetime.date.today()+datetime.timedelta(-1) == pubday) and filter(entry.title):
         item = {entry.title: entry.link}
         # print(item)
         result |= item
...
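
The same logic can be sketched as two small helpers, which may be easier to test than the inline version. This is only an illustration, not code from yarb.py: the names entry_date and is_recent are hypothetical, and the entry and feed arguments are assumed to be dict-like objects supporting .get(), as in the entry.get(...) calls above. Unlike the patch, the sketch also returns None when the feed block has no updated_parsed either, instead of raising.

```python
import datetime
import time


def entry_date(entry, feed):
    """Return the entry's date, falling back to the feed-level timestamp.

    Hypothetical helper: returns None when neither the entry nor the
    feed block carries a parsed date.
    """
    d = (
        entry.get('published_parsed')
        or entry.get('updated_parsed')
        or feed.get('updated_parsed')
    )
    if not d:
        return None
    # The first three fields of a struct_time are year, month, day.
    return datetime.date(d[0], d[1], d[2])


def is_recent(pubday, today=None):
    """Return True if pubday is today or yesterday (hypothetical helper)."""
    if pubday is None:
        return False
    today = today or datetime.date.today()
    return pubday in (today, today - datetime.timedelta(days=1))


# Example with an entry that has no date of its own, as in the dump above.
feed = {'updated_parsed': time.struct_time((2024, 8, 5, 3, 8, 8, 0, 218, 0))}
pubday = entry_date({}, feed)
```

Inside the loop, the condition would then read something like is_recent(entry_date(entry, r.feed)) and filter(entry.title), assuming r.feed supports .get() in the same way the entries do.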