[FR]: Documentation on migration between QuiteRSS and RSSGuard
Hi,
I'm experimenting with migrating 6 years of RSS feed history from QuiteRSS to RSS Guard. This issue is a proposal to add a documentation chapter explaining how to migrate between these two particular programs.
If you don't mind, I will use this issue to document my migration, and afterwards probably open an associated PR :)
There are 4 elements in my current setup that I need to migrate:
- feed articles
- feeds (URL, name, metadata)
- folders (which correspond to categories in RSS Guard)
- filtering on feeds/folders (these are really different: QuiteRSS uses a dedicated filtering widget, RSS Guard uses JavaScript functions)
Fortunately, QuiteRSS uses an SQLite database with a similar schema.
So for feed articles and folders, I started by writing a Python script to copy data between the two databases, but I discovered it is much simpler to use the OPML format, which both applications support!
With icon loading, importing feeds and folders takes about 17 minutes, but the result is correct. Now I can work on importing the feed articles and the feed filtering.
Nothing in the documentation seems to mention OPML import, so I will cover that part here if you don't mind :)
So, for importing articles, I think I need a conversion table between feed IDs, like this:
To create it, I extracted the feed IDs and titles to CSV with SqliteBrowser, using:
- for QuiteRSS: `SELECT id, text FROM feeds;`
- for RSS Guard: `SELECT id, title FROM Feeds;`
Now I will write a small script that builds a conversion table between the two IDs by matching on the title.
Ok, I've made a quick Python script to create my conversion table in CSV:
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse, csv

if __name__ == "__main__":
    print("Welcome to my folder/categories migration tool between QuiteRSS and RSSGuard")
    parser = argparse.ArgumentParser()
    parser.add_argument("--quitefile", type=str, help="path to the QuiteRSS feed id/title CSV")
    parser.add_argument("--guardfile", type=str, help="path to the RSSGuard feed id/title CSV")
    args = parser.parse_args()
    print(args)

    rowsquite = []
    rowsguard = []
    rowsfinal = []
    rowsfinalfields = ['title', 'idquite', 'idguard']

    with open(args.quitefile, 'r', newline='') as rquitefile:
        for row in csv.reader(rquitefile, delimiter=',', quotechar='"'):
            rowsquite.append({'id': row[0], 'title': row[1]})

    with open(args.guardfile, 'r', newline='') as rguardfile:
        for row in csv.reader(rguardfile, delimiter=',', quotechar='"'):
            rowsguard.append({'id': row[0], 'title': row[1]})

    for i in rowsquite:
        for e in rowsguard:
            # match feeds by title, since the two databases use different IDs
            if i['title'] == e['title']:
                rowsfinal.append({'title': i['title'], 'idquite': i['id'], 'idguard': e['id']})
                break

    # write the conversion table once, after all matches are collected
    with open("conversionfeedsid.csv", "w", newline='') as f:
        w = csv.DictWriter(f, rowsfinalfields)
        w.writeheader()
        w.writerows(rowsfinal)
else:
    print("this is a CLI tool, please run it from a shell")
```
And I run it with: `python3 feedconversiontable.py --quitefile quiterssid.csv --guardfile rssguardid.csv`
Now I need to create another script that connects to the RSS Guard database and imports the news into the right feeds using this conversion table.
I started this second script, but I need to compare the tables in QuiteRSS and RSS Guard, so here is the table format comparison:
On the left, QuiteRSS; on the right, RSS Guard. I will not keep all the data; mainly, I need to transfer:
- feed id associated to article (I will use my previous conversion table for that)
- article title
- article text
- article url
- article author
- article creation date
- article html
- article categories (these don't seem to exist in RSS Guard?)
Ok, I've written the SQL query for the QuiteRSS database; it fetches most of the article data (but I have something like 400k articles to transfer):

```sql
SELECT feedId, title, link_href, author_name, description, published FROM news LIMIT 0,10;
```

I think I should page through the table with a LIMIT of 500 articles and transfer them in batches.
I'm going to try it with 500 articles, and if it works, I'll share the script here.
I tried a first import and it seems to work, but it's really slow. I'm currently adding an option to interrupt and resume the migration. Once it's done, I'll release the migration script.
QuiteRSS and RSS Guard both use SQLite.
So a fairly simple Python script with some SQLite queries should do the job. I recommend avoiding any intermediary formats like CSV.
It would be best if the script took two parameters, the path to the QuiteRSS DB file and the path to the RSS Guard DB file, and did all the necessary work.
Hi,
You're right, because my first idea of creating a conversion table between the two IDs was simply wrong.
OPML files do not do a straight copy of feed metadata on import/export.
When I import, for example, my feed for the Tor Project, QuiteRSS has my custom name "torprojet" and the feed URL "http://blog.torproject.org/blog/feed", while RSS Guard ends up with "Tor Project blog" and "https://blog.torproject.org/feed.xml".
So I also need to copy the feeds table. :/
Ok, so in QuiteRSS, categories and feeds live in the same table, and I first need to separate them. Here is a quick code snippet for that:
```python
# open the RSS Guard SQLite database (destination)
condest = sqlite3.connect(args.guarddb)
# open the QuiteRSS SQLite database (source)
con = sqlite3.connect(args.quitedb)

## first, import feeds from QuiteRSS and convert them into categories and feeds
# start by extracting the categories:
categoriesrows = con.execute("SELECT id, text, title, parentId FROM feeds WHERE xmlUrl IS NULL")
ordr = 1
for row in categoriesrows:
    try:
        querytext = """INSERT INTO "Categories"
            ("id", "parent_id", "title", "description", "account_id", "ordr")
            VALUES (?, ?, ?, ?, 1, ?);"""
        # parentId 0 in QuiteRSS means top-level, which is -1 in RSS Guard
        if int(row[3]) == 0:
            condest.execute(querytext, (row[0], -1, row[1], row[2], ordr))
        else:
            condest.execute(querytext, (row[0], row[3], row[1], row[2], ordr))
    except Exception as e:
        print(e)
        exit()
    ordr += 1
```
It seems to work fine so far; for feeds, I probably need to do something similar but with `WHERE xmlUrl IS NOT NULL`.
I've done the same thing for feeds and it seems to work (excluding some columns I don't understand):
```python
# continue by extracting the feeds
feedsrows = con.execute("SELECT id, text, title, parentId, xmlUrl FROM feeds WHERE xmlUrl IS NOT NULL")
for row in feedsrows:
    try:
        querytext = """INSERT INTO "main"."Feeds"
            ("id", "ordr", "title", "description", "category", "source", "update_type", "account_id", "custom_id")
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?);"""
        if int(row[3]) == 0:
            condest.execute(querytext, (row[0], ordr, row[1], row[2], -1, row[4], 1, 1, row[0]))
        else:
            condest.execute(querytext, (row[0], ordr, row[1], row[2], row[3], row[4], 1, 1, row[0]))
    except Exception as e:
        print(e)
        exit()
    ordr += 1
```
Ok, now I'll try to insert my entire database. If it works, I just need to test whether RSS Guard can handle this amount of data; if not, I will open other issues and feature requests in my quest for an optimized feed reader.
I think I found my error. I had written this SQL query:

```python
rowsquery = "SELECT feedId, title, link_href, author_name, description, published, deleted FROM news LIMIT {}, {};".format(steprows, steprows+500)
```

In SQLite, `LIMIT a, b` means `LIMIT b OFFSET a`, so each iteration returned `steprows + 500` rows instead of 500. I've fixed it with:

```python
rowsquery = "SELECT feedId, title, link_href, author_name, description, published, deleted FROM news LIMIT 500 OFFSET {};".format(steprows)
```

SQL has no mercy for me... ^^'
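To make the `LIMIT` semantics concrete, here is a small self-contained sketch against an in-memory database (the table and sizes are made up for illustration):

```python
# SQLite's "LIMIT a, b" means "LIMIT b OFFSET a" (skip a rows, return at
# most b rows), not "rows a through b".
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE news (id INTEGER)")
con.executemany("INSERT INTO news VALUES (?)", [(n,) for n in range(20)])

step = 5
# buggy pattern: the batch size grows with the offset
buggy = con.execute("SELECT id FROM news LIMIT {}, {}".format(step, step + 5)).fetchall()
print(len(buggy))   # 10 rows, not 5

# fixed pattern: constant batch size
fixed = con.execute("SELECT id FROM news LIMIT 5 OFFSET {}".format(step)).fetchall()
print(len(fixed))   # 5 rows
```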
It seems to work, and in only 20 minutes!
I'm verifying that my data are correct now; if everything checks out, I will need your advice on my script, Martin!
Here is my script:
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse, csv, sqlite3
from dateutil.parser import parse as catdateparse
import time

if __name__ == "__main__":
    print("Welcome to my news history migration script from QuiteRSS to RSSGuard")
    parser = argparse.ArgumentParser()
    parser.add_argument("--quitedb", type=str, help="path to the QuiteRSS SQLite database")
    parser.add_argument("--guarddb", type=str, help="path to the RSSGuard SQLite database")
    args = parser.parse_args()

    # open the RSS Guard SQLite database (destination)
    condest = sqlite3.connect(args.guarddb)
    # open the QuiteRSS SQLite database (source)
    con = sqlite3.connect(args.quitedb)

    ## first, import feeds from QuiteRSS and convert them into categories and feeds
    # start by extracting the categories:
    categoriesrows = con.execute("SELECT id, text, title, parentId FROM feeds WHERE xmlUrl IS NULL")
    ordr = 1
    for row in categoriesrows:
        try:
            querytext = """INSERT INTO "Categories"
                ("id", "parent_id", "title", "description", "account_id", "ordr")
                VALUES (?, ?, ?, ?, 1, ?);"""
            # parentId 0 in QuiteRSS means top-level, which is -1 in RSS Guard
            if int(row[3]) == 0:
                condest.execute(querytext, (row[0], -1, row[1], row[2], ordr))
            else:
                condest.execute(querytext, (row[0], row[3], row[1], row[2], ordr))
        except Exception as e:
            print(e)
            exit()
        ordr += 1

    # continue by extracting the feeds
    feedsrows = con.execute("SELECT id, text, title, parentId, xmlUrl FROM feeds WHERE xmlUrl IS NOT NULL")
    for row in feedsrows:
        try:
            querytext = """INSERT INTO "main"."Feeds"
                ("id", "ordr", "title", "description", "category", "source", "update_type", "account_id", "custom_id")
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?);"""
            if int(row[3]) == 0:
                condest.execute(querytext, (row[0], ordr, row[1], row[2], -1, row[4], 1, 1, row[0]))
            else:
                condest.execute(querytext, (row[0], ordr, row[1], row[2], row[3], row[4], 1, 1, row[0]))
        except Exception as e:
            print(e)
            exit()
        ordr += 1

    ## second, find out how many queries we need to do:
    countrows = con.execute("SELECT count(id) FROM news;").fetchone()[0]
    steprows = 0
    # loop until we have covered the whole news table
    while steprows < countrows:
        rowsquery = "SELECT feedId, title, link_href, author_name, description, published, deleted FROM news LIMIT 500 OFFSET {};".format(steprows)
        print(rowsquery)
        rows = con.execute(rowsquery)
        for row in rows:
            # inject QuiteRSS news into the RSS Guard Messages table;
            # skip articles that are marked as deleted
            if int(row[6]) == 1:
                continue
            try:
                # convert the timestamp
                timestamp = catdateparse(row[5])
                if timestamp.year <= 1971:
                    timestamp = time.time()
                else:
                    timestamp = timestamp.timestamp()
                querytext = """INSERT INTO Messages
                    ("is_read", "is_important", "is_deleted", "is_pdeleted", "feed",
                     "title", "url", "author", "date_created", "contents", "enclosures",
                     "score", "account_id", "custom_id", "custom_hash", "labels")
                    VALUES (0, 0, 0, 0, ?, ?, ?, ?, ?, ?, '[]', 0, 1, '', '', '.')"""
                if row[1] == '':
                    condest.execute(querytext, (row[0], "no title", row[2], row[3], timestamp, row[4]))
                else:
                    condest.execute(querytext, (row[0], row[1], row[2], row[3], timestamp, row[4]))
            except Exception as e:
                print(e)
                print(querytext)
                print((row[0], row[1], row[2], row[3], row[5], row[4]))
                exit()
        steprows += 500
    print("end of insert, start committing")
    condest.commit()
    print("end of commit, closing database connection")
    con.close()
    condest.close()
else:
    print("this is a CLI tool, please run it from a shell")
```
Do you have any advice on it before I start writing the documentation, @martinrotter? :)
P.S.: thanks for your time and advice; I was sure migrating to RSS Guard was a good idea once I read your documentation!
Nice script. I will test it.
I recommend this: create your own repository with the script. Once you do, post the repository URL here and I will create the RSS Guard docs for it.
Script suggestions:
- You assume that `account_id` for imported data is `1`. What will happen if the existing destination RSS Guard database already has some account with ID `1`? That account could potentially be of another type, like Nextcloud News (or another type supported by RSS Guard), in which case the DB data would likely get messed up. Your script should check the `Accounts` table first and determine the correct `account_id`.
- Your script should probably back up the destination DB file first.
- In some tables there is an `ordr` column which is used for the sorting order of those items. Note that the values of that column have to start from 0 and must increment continuously: `0, 1, 2, 3, ...`. If not, the sorting algorithms in RSS Guard would break.

Otherwise I do not see any big problems, and at least for a rough conversion the script is fine. :)
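A minimal sketch of the first two suggestions, assuming the `Accounts` table exposes an `id` column (the exact schema should be checked against a real RSS Guard database); the helper names are mine:

```python
import shutil
import sqlite3
import time

def backup_db(path):
    """Copy the destination database next to itself with a timestamp suffix."""
    backup_path = "{}.backup-{}".format(path, int(time.time()))
    shutil.copy2(path, backup_path)
    return backup_path

def existing_account_ids(condest):
    """List the account IDs already present, so the script can verify that
    the account_id it writes (1 in the current script) actually exists."""
    return [r[0] for r in condest.execute("SELECT id FROM Accounts")]
```

The migration script would call `backup_db(args.guarddb)` before its first `INSERT`, and abort (or ask the user) if `1 not in existing_account_ids(condest)`.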
Hi,
I've implemented the backup operation (the second suggestion), and I also noticed that you store icons base64-encoded in the feeds table, like QuiteRSS does.
So I tried to import the icons from QuiteRSS and noticed something odd:
for YouTube, the icon is this one: https://external-content.duckduckgo.com/ip3/youtube.com.ico and RSS Guard seems to convert this ICO file to PNG.
But here is the base64 icon in QuiteRSS:
I think the QuiteRSS icon files are not rendered correctly because RSS Guard handles PNG but not ICO.
Maybe an enhancement?
Ok, I've added a quick check for the account ID and completed that part.
For the third suggestion: I noticed there is a feature to sort feeds and categories alphabetically; is that the reordering you mean? I confess I'm not really comfortable with the `ordr` column. I tried to understand it, but especially for feeds I noticed that `ordr` is relative to the category, which means my script needs a dictionary of `ordr` counters keyed by category.
I wonder if the alphabetical reordering feature would be easier to use?
Following your advice, I've created my repository: https://github.com/darcosion/Migration_QuiteRSS_RSSGuard
Do you still want to create the documentation, or would you prefer to let me improve my English? :)
Ok, after the migration I tried a first fetch of my feeds today (spoiler: it did not go well).
Every feed has duplicate articles, like this:
So I need to understand how RSS Guard handles duplicates, and whether these duplicates come from a bad data migration or are something that a duplicate-detection filter should handle.
There is a very robust algorithm which determines whether messages are "the same" or not:
https://github.com/martinrotter/rssguard/blob/master/src/librssguard/database/databasequeries.cpp#L1610
The majority of feeds now provide a GUID for each message, and RSS Guard primarily relies on this GUID. It is stored in the `custom_id` column in the RSS Guard DB.
So, if you convert articles from QuiteRSS and they HAVE a GUID, you have to make sure the GUID is copied into the RSS Guard DB as well. It seems that QuiteRSS stores the GUID in the `guid` column of the `news` table.
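A sketch of what carrying the GUID over could look like as a fix-up pass over an already migrated database. The column names `news.guid` and `Messages.custom_id` come from the discussion above; matching rows on `feed` and `title` is my own simplification, not necessarily the thread's final approach:

```python
import sqlite3

def copy_guids(con, condest):
    """con: QuiteRSS connection, condest: RSS Guard connection.
    Copy each article's GUID into the matching Messages row."""
    for feed_id, title, guid in con.execute("SELECT feedId, title, guid FROM news"):
        condest.execute(
            "UPDATE Messages SET custom_id = ? WHERE feed = ? AND title = ?",
            (guid if guid else '', feed_id, title))  # empty string when no GUID
    condest.commit()
```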
> Following your advice, I've created my repository: https://github.com/darcosion/Migration_QuiteRSS_RSSGuard
> Do you still want to create the documentation, or would you prefer to let me improve my English? :)
Once your script works completely, I will update the docs.
> Ok, I've added a quick check for the account ID and completed that part.
> For the third suggestion: I noticed there is a feature to sort feeds and categories alphabetically; is that the reordering you mean? I confess I'm not really comfortable with the `ordr` column. I tried to understand it, but especially for feeds I noticed that `ordr` is relative to the category, which means my script needs a dictionary of `ordr` counters keyed by category. I wonder if the alphabetical reordering feature would be easier to use?
Yes, the `ordr` column is kept filled with values even when automatic alphabetical sorting is enabled. If it is disabled, then the values from `ordr` are used directly to sort the feed list entries.
`Categories`:

| id | ordr | parent | title |
|----|------|--------|-------|
| 1 | 0 | -1 | Svésedlice |
| 2 | 1 | -1 | IT |
| 3 | 2 | -1 | Ostatní |

`Feeds`:

| ordr | title | category_id |
|------|-------|-------------|
| 0 | Úřední deska \| svesedlice.cz | 1 |
| 1 | Aktuality \| svesedlice.cz | 1 |
| 2 | Hlášení rozhlasu \| svesedlice.cz | 1 |
| 0 | Arch Linux: Recent news updates | 2 |
| 1 | abclinuxu - aktuální články | 2 |
| 2 | abclinuxu - čerstvé zprávičky | 2 |
| 3 | Linux, GNU/Linux, free software... | 2 |
| 4 | RSS - Really Simple Syndication | 2 |
| 5 | Root.cz - zprávičky | 2 |
| 0 | D-FENS Weblog | 3 |
| 1 | KOSMAS.cz \| historie a fakta | 3 |
To explain the `ordr` values: they always start from 0 and always increment by 1, and they are always relative to the PARENT of the feed or category. The special value -1 means the item has no parent (the category or feed is top-level).
You can see we have 3 top-level categories with their order; we also have a bunch of feeds, each belonging to some category, and each group of feeds has its own order starting from 0.
You can make your order up when migrating, but you have to adhere to the above rules, otherwise the DB structure is not what RSS Guard expects and things will break. It is not that hard to write the code for this (I myself have no time to do that right now), but I agree that some level of Python/DB knowledge is needed.
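The per-parent numbering described above can be sketched with one counter per parent; the `(id, parent_id)` tuples here are made-up sample data:

```python
from collections import defaultdict

def assign_ordr(items):
    """items: iterable of (item_id, parent_id); returns {item_id: ordr},
    where each parent's children are numbered 0, 1, 2, ... independently."""
    counters = defaultdict(int)  # one counter per parent, each starting at 0
    ordrs = {}
    for item_id, parent_id in items:
        ordrs[item_id] = counters[parent_id]
        counters[parent_id] += 1
    return ordrs

# three feeds under category 1 and one under category 2
print(assign_ordr([(10, 1), (11, 1), (20, 2), (12, 1)]))
# {10: 0, 11: 1, 20: 0, 12: 2}
```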
> There is a very robust algorithm which determines whether messages are "the same" or not:
> https://github.com/martinrotter/rssguard/blob/master/src/librssguard/database/databasequeries.cpp#L1610
> The majority of feeds now provide a GUID for each message, and RSS Guard primarily relies on this GUID. It is stored in the `custom_id` column in the RSS Guard DB. So, if you convert articles from QuiteRSS and they HAVE a GUID, you have to make sure the GUID is copied into the RSS Guard DB as well. It seems that QuiteRSS stores the GUID in the `guid` column of the `news` table.
Hi,
I tried to import the GUID, but it doesn't seem to work entirely. Should I open another issue for that?
Hi Martin, I've fixed `ordr`, and now it mysteriously seems to solve the GUID bug \o/
Sometimes RSS feeds seem not to work; I reload the metadata and they work again :o
I've updated the script and added features for `guid` and `ordr`: https://github.com/darcosion/Migration_QuiteRSS_RSSGuard/commit/ef3613c0ca1b6f066ea0d6084b58465e328c2bae
I think it is stable; if you think there are other columns I should import, feel free to say so (because once I've finalized my migration, there is no turning back for me...) ;)
(I think I have a timestamp conversion bug; I will fix it and update the repo, nothing fancy.)
Wonderful! Thanks for your script. I have also been using QuiteRSS for several years and have saved thousands of messages. Now I can have all my messages in one place. Maybe...
English from now on, because your English is better than my French...
1. The script has a problem if run against an already-in-use RSS Guard DB. I'm getting a constraint violation when trying to import into my already-in-use `database.db`. You're importing the categories and feeds with their existing IDs, which makes sense because you need to keep the relationship with the messages intact, but this will inevitably lead to collisions in an already-populated DB.
```
$ python insertnews.py --quitedb /mnt/x/QuiteRSS/feeds.db --guarddb /mnt/x/RSS\ Guard\ 4/database/database.db
Welcome to my news history migration script from QuiteRSS to RSSGuard
RSS Guard backup
copy categories
{}
copy feeds
UNIQUE constraint failed: Feeds.id
```
Something like doing a `count(id)` and using that value as an offset to start renumbering `feeds.id` in a non-conflicting manner would work, but then you'd need to map each old `feed.id` to its new value and write the new `feed.id` into the corresponding messages.
I don't do Python, but maybe something like this would work better:

```
while feedrows < feedrows(max), do
    import feedrow
    if category
        insert with renumber
        continue
    if feed
        insert with renumber
        insert messages, fixing id
done
```
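In Python, the renumbering idea above might be sketched like this; the offset is taken from the highest existing feed ID (a variation on the `count(id)` suggestion), and the returned map would be consulted when inserting both feeds and messages. Function and variable names are illustrative:

```python
import sqlite3

def renumber_feeds(con, condest):
    """con: QuiteRSS connection, condest: RSS Guard connection.
    Returns {old_feed_id: new_feed_id}, with new IDs placed past any
    feed ID already present in the destination database."""
    offset = condest.execute("SELECT COALESCE(MAX(id), 0) FROM Feeds").fetchone()[0]
    id_map = {}
    for (old_id,) in con.execute("SELECT id FROM feeds WHERE xmlUrl IS NOT NULL"):
        offset += 1
        id_map[old_id] = offset
    return id_map
```

When inserting a message, its `feed` column would then be `id_map[row[0]]` rather than the raw QuiteRSS `feedId`.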
2. Spinning up a new profile got past that error, but you can probably already recognize the problem I had importing into this new DB: no categories...
```
$ python insertnews.py --quitedb /mnt/x/QuiteRSS/feeds.db --guarddb /mnt/x/RSS\ Guard\ 4/database/database.db
Welcome to my news history migration script from QuiteRSS to RSSGuard
RSS Guard backup
copy categories
{}
copy feeds
{234: 24, 0: 0, 75: 9, 2: 20, 1500: 37, 87: 30, 419: 28, 160: 2, 189: 14, 217: 23, 1187: 11, 256: 21, 264: 33, 1598: 26, 633: 355, 1512: 327, 1340: 14, 1692: 12, 1622: 25, 1763: 7, 1804: 5, 1808: 12, 1838: 8, 1848: 4, 1854: 2, 1858: 3, 1860: 6}
copy news
SELECT feedId, title, link_href, author_name, description, published, deleted, guid FROM news LIMIT 500 OFFSET 0;
<snipped>
SELECT feedId, title, link_href, author_name, description, published, deleted, guid FROM news LIMIT 500 OFFSET 71500;
end of insert, start committing
end of commit, closing database connection
```
It did import the messages into the DB, but no feeds, and they were all dumped in the root folder; not that useful... ;) In examining my QuiteRSS DB with DB Browser, I see that, for me, the `xmlUrl` column is empty, not NULL. `htmlUrl` is NULL, so I tried again with the following diff:

```diff
- categoriesrows = con.execute("SELECT id, text, title, parentId FROM feeds WHERE xmlUrl IS NULL")
+ categoriesrows = con.execute("SELECT id, text, title, parentId FROM feeds WHERE htmlUrl IS NULL")
```

and this time it worked. `xmlUrl` has a URL for all the actual feeds, so no edit was needed for the `feedsrows` conditional.
```
$ python insertnews.py --quitedb /mnt/x/QuiteRSS/feeds.db --guarddb /mnt/x/RSS\ Guard\ 4/database/database.db
Welcome to my news history migration script from QuiteRSS to RSSGuard
RSS Guard backup
copy categories
{0: 0, 2: 18, 1512: 0, 1804: 5}
copy feeds
{0: 1, 2: 39, 1512: 328, 1804: 11, 234: 24, 75: 9, 1500: 37, 87: 30, 419: 28, 160: 2, 189: 14, 217: 23, 1187: 11, 256: 21, 264: 33, 1598: 26, 633: 355, 1340: 14, 1692: 12, 1622: 25, 1763: 7, 1808: 12, 1838: 8, 1848: 4, 1854: 2, 1858: 3, 1860: 6}
copy news
SELECT feedId, title, link_href, author_name, description, published, deleted, guid FROM news LIMIT 500 OFFSET 0;
<snipped>
SELECT feedId, title, link_href, author_name, description, published, deleted, guid FROM news LIMIT 500 OFFSET 71500;
end of insert, start committing
end of commit, closing database connection
```
I am a sample set of 1, so I don't know whether `xmlUrl` or `htmlUrl` is the proper column to test. Or maybe `description`, since those are all NULL for me and there's no way to enter a description in the QuiteRSS app? Or `status`, since that never gets updated? I started using QuiteRSS around 2016-17, so maybe the schema changed between the time I started using it and the time you did?
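One hedged way to cover both schema variants seen in this thread (categories with `xmlUrl` NULL in one database, an empty string in another) is to treat "NULL or blank" as a category. A small demonstration against a throwaway table:

```python
import sqlite3

# category rows: xmlUrl is NULL *or* blank; feed rows: a non-blank xmlUrl
categories_sql = "SELECT id FROM feeds WHERE xmlUrl IS NULL OR TRIM(xmlUrl) = ''"
feeds_sql = "SELECT id FROM feeds WHERE xmlUrl IS NOT NULL AND TRIM(xmlUrl) <> ''"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE feeds (id INTEGER, xmlUrl TEXT)")
con.executemany("INSERT INTO feeds VALUES (?, ?)",
                [(1, None), (2, ''), (3, 'http://example.org/feed.xml')])
print([r[0] for r in con.execute(categories_sql)])  # [1, 2]
print([r[0] for r in con.execute(feeds_sql)])       # [3]
```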
3. Looking over the category output, it looks like the counts are 0-indexed, so the total is off (there are 27 rows):
`{0: 0, 2: 18, 1512: 0, 1804: 5}`
0+18+0+5 = 23, but 1+19+1+6 = 27.
The feeds output, however, adds up correctly to 1085:
`{0: 1, 2: 39, 1512: 328, 1804: 11, 234: 24, 75: 9, 1500: 37, 87: 30, 419: 28, 160: 2, 189: 14, 217: 23, 1187: 11, 256: 21, 264: 33, 1598: 26, 633: 355, 1340: 14, 1692: 12, 1622: 25, 1763: 7, 1808: 12, 1838: 8, 1848: 4, 1854: 2, 1858: 3, 1860: 6}`
A one-liner to do the math (paste the text between the braces as its input, then press Ctrl-D):

```
$ paste | tr ',' '\n' | awk -F":" '{sum+=$2;} END{print sum;}'
```
4. As you mentioned, the date conversion is off. Looking at the DB directly, you are converting the value into epoch seconds correctly, but you are writing it at seconds resolution (10 digits), while RSS Guard needs the value in milliseconds (13 digits). Multiply by 1000 (i.e. append three zeros) and it should be fine.
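The millisecond fix can be sketched as follows; the helper name is mine, and it assumes a timezone-aware datetime (a naive one would be interpreted in local time):

```python
from datetime import datetime, timezone

def to_rssguard_timestamp(dt):
    """Convert a datetime to epoch milliseconds (13 digits for recent dates),
    which is what the date_created column expects per the comment above."""
    return int(dt.timestamp() * 1000)

print(to_rssguard_timestamp(datetime(2021, 6, 15, 12, 0, tzinfo=timezone.utc)))
# 1623758400000  (13 digits)
```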
Best regards...
Thanks for your advice! I will definitely do some quick fixes for the epoch and xmlUrl issues. For the third point, I also noticed that some of my xmlUrl values had an empty string instead of NULL, but I fixed them directly in SqliteBrowser because it was four feeds out of a total of 4,000.
Again, thanks for the timestamp analysis; I hadn't had time to dig into it, and now I just need to patch and test ;)