feedBase
feedBase copied to clipboard
Issues with Unicode Titles
My subscription list is here http://feedbase.io/?username=thej
It has couple of feeds with title in Kannada. Feedbase prints ???
Example
<outline title="ವರ್ತಮಾನ - Vartamaana" text="ವರ್ತಮಾನ - Vartamaana" description="ವರ್ತಮಾನ - Vartamaana" type="rss" xmlUrl="http://www.vartamaana.com/feed/" htmlUrl="http://www.vartamaana.com"/>
when imported into feedbase has title ???
where as in a standard html it would print ok.
HTML
<span>ವರ್ತಮಾನ - Vartamaana</span>
RENDERING
ವರ್ತಮಾನ - Vartamaana
This feedlist opml is an export from liferea.
Thanks for the report.
Here's a little background.
It doesn't matter what the title is in the input OPML file, we don't even store it. What matters is what we get back when we read the feed. For that we use the node.js package feedParser. I believe it is unicode-aware.
I verified that the question marks are in the database (though seeing it in the feed list is basically verification of that, it's where the data comes from).
I'm going to have to do some more investigating. I have a standalone app that just calls feedParser for debugging situations like this.
I suspect it is an encoding issue that came up in River5 recently, which uses the same feed reading package.
OK, I think I have what may be a fix in feedBase.
Here's the test.
http://feedbase.io/readfeed?feedurl=http://kannadascience.blogspot.com/feeds/posts/default
Is that the correct title?
Dave
Yes. It's correct.
Excellent. Over the next few hours feedBase will re-read the feeds, and with any luck the names on your page should become correct. If you can, let me know either way, if it works or doesn't.
Thanks for your help debugging this. I was getting a lot of feedback on this problem. ;-)
Okay -- I've dug in some more and found that I only solved part of the problem, when I added the conversion of foreign charsets, which was quite tricky. Now some more trickiness is to come.
I've subscribed to the feed myself.
http://rashmidiaries.blogspot.com/feeds/posts/default
Then I hooked up the /ping call in feedBase to the exact mechanism that maintains the database, that reads each feed in turn, and updates the title, html link, description.
http://feedbase.io/ping?feedurl=http://rashmidiaries.blogspot.com/feeds/posts/default
As we saw earlier, it's correctly reading the feed over the net. However when I add it to the database, this is what I see.

There are the question marks.
I guess I assumed that all text in MySQL would be UTF-8. But apparently not so.
I'm going to link to this from Scripting News, asking for advice.
As far as I know (From my wordpress experience) while creating database you can set the characterset and collate
CREATE DATABASE feedbase_db CHARACTER SET utf8 COLLATE utf8_general_ci;
and then while creating table you can use
CREATE TABLE feeds (
) ENGINE=MyISAM DEFAULT CHARSET=utf8
This should solve both storing and sorting of Varchars.
In our case we can alter the table
ALTER TABLE feeds MODIFY title VARCHAR(250) CHARACTER SET utf8 COLLATE utf8_general_ci;
"If you use ALTER TABLE to convert a column from one character set to another, MySQL attempts to map the data values, but if the character sets are incompatible, there may be data loss."
In another thread I'm asking for advice on how to backup the database. I don't want to do anything radical that could lose data w/o a good backup.
On the other hand I think mainly we would lose data that we don't actually have yet.
So my next thing will be to get a good backup.
What he, and they on the other thread, said :)
- Do a full backup of the full database using either:
- mysqldump on the command line (from the server shell via ssh if you have access to, or from your desktop specifying the IP address of the server, if the mysql port is open to the world, which might not be the case)
- Sequel Pro on the Mac. It has a good conf interface that makes it mostly easy to configure the remote access to the database, especially if the MySQL port is not open on the server.. you can then connect via SSH. (you’ll set the SSH credential, and the MySQL login/pass, with the server localhost IP.
=> Sequel Pro (or any graphical client) will really show you the structure of the database (the encoding used etc…), will make it easy to import/export data, browse data etc… Very useful when you’re new to MySQL.
- You can then convert the tables to use utf8. (or even "utf8mb4" if you want to store Unicode AND emojis at some point :-) ) If the current Latin1 tables have something else than latin1 stored inside (like utf8..) the convertion might not work. MySQL assumes there is latin1 data stored in a latin1 column.. If so, if you can just rescan the feeds and re-store the data at that point, it would be the easiest path.
JY
Le 3 avr. 2018 à 19:53, Dave Winer [email protected] a écrit :
In another thread I'm asking for advice on how to backup the database. I don't want to do anything radical that could lose data w/o a good backup.
On the other hand I think mainly we would lose data that we don't actually have yet.
So my next thing will be to get a good backup.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.
FWIW, I like Sequel Pro for a lot of things, but not their export feature. It always makes a new INSERT statement for every row, which can take forever to import if you have a lot of data. mysqldump does it a lot more intelligently, with INSERT statements that insert thousands of rows.
I'm working on a Node utility that backs up the database to two JSON files, one for feeds and the other for subscriptions.
The feeds file is done, it's 10.7MB. I figure that's not a problem for a Node app even if it's 10 times that size. Much more than that and I guess it might be a problem. That's the way I'm going for now.
Backing it up to a Dropbox folder and a chronologic folder structure, so I will have redundant snapshots.
On Wed, Apr 4, 2018 at 11:33 AM Tim Habersack [email protected] wrote:
FWIW, I like Sequel Pro for a lot of things, but not their export feature. It always makes a new INSERT statement for every row, which can take forever to import if you have a lot of data. mysqldump does it a lot more intelligently, with INSERT statements that insert thousands of rows.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/scripting/feedBase/issues/22#issuecomment-378643674, or mute the thread https://github.com/notifications/unsubscribe-auth/ABm9O5dwQIwshtoRhzgsjIuHPYxMB5Wpks5tlOfBgaJpZM4S_7gn .
I have backup and restore working, so I have the same installation on my Mac desktop as I have on the server. So I can do the experiments here.
Here's the source for the backup and restore apps.
https://github.com/scripting/feedBase/blob/master/backup/
Dave
BTW, one of the fallouts of what I'm doing right now, is that it will be possible, without any changes to the source, to run your own feedBase. All you'll have to do is change values in config.json.
I just did the ALTER command on the feeds table and tried reading in
http://rashmidiaries.blogspot.com/feeds/posts/default
But got an error when I tried to update the table.
The Mac version of MySQL works somewhat differently than the Unix version. Here I get an error message, on Unix it just fails without a message.
This is the SQL command it runs:
replace into feeds (feedUrl, title, htmlUrl, description, whenUpdated, countSubs, ctSecs, code, ctErrors, ctConsecutiveErrors, ctChecks, whenLastError) values ('http://rashmidiaries.blogspot.com/feeds/posts/default', 'ರಶ್ಮಿ ಅಭಯ ಸಿಂಹ', 'http://rashmidiaries.blogspot.com/', '', '2018-04-05 15:51:44', 2, 0.407, 200, 0, 0, 13, NULL);
And the message:
ER_TRUNCATED_WRONG_VALUE_FOR_FIELD: Incorrect string value: '\xE0\xB2\xB0\xE0\xB2\xB6...' for column 'title' at row 1
The Alter command is --
ALTER TABLE feeds MODIFY title VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci;
I may just try recreating the database, since I can load this database entirely from the JSON.
Where are you running/typing these commands?
Le jeu. 5 avr. 2018 à 21:57, Dave Winer [email protected] a écrit :
I just did the ALTER command on the feeds table and tried reading in
http://rashmidiaries.blogspot.com/feeds/posts/default
But got an error when I tried to update the table.
The Mac version of MySQL works somewhat differently than the Unix version. Here I get an error message, on Unix it just fails without a message.
This is the SQL command it runs:
replace into feeds (feedUrl, title, htmlUrl, description, whenUpdated, countSubs, ctSecs, code, ctErrors, ctConsecutiveErrors, ctChecks, whenLastError) values (' http://rashmidiaries.blogspot.com/feeds/posts/default', 'ರಶ್ಮಿ ಅಭಯ ಸಿಂಹ', 'http://rashmidiaries.blogspot.com/', '', '2018-04-05 15:51:44', 2, 0.407, 200, 0, 0, 13, NULL);
And the message:
ER_TRUNCATED_WRONG_VALUE_FOR_FIELD: Incorrect string value: '\xE0\xB2\xB0\xE0\xB2\xB6...' for column 'title' at row 1
The Alter command is --
ALTER TABLE feeds MODIFY title VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci;
I may just try recreating the database, since I can load this database entirely from the JSON.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/scripting/feedBase/issues/22#issuecomment-379057935, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIFBAWnihz3VpM5TDz8Tjl5bR0-_-Y1ks5tlnc4gaJpZM4S_7gn .
-
It's all happening on my Mac desktop running the official MySQL.
-
The ALTER command is running at the mysql command line.
-
The "replace into feeds..." command is being run by feedBase running on the Mac in response to a /ping command.
For 3, can you check that the charset option for the connection is correct?
https://github.com/mysqljs/mysql/blob/master/Readme.md#connection-options
JY
Le jeu. 5 avr. 2018 à 22:06, Dave Winer [email protected] a écrit :
It's all happening on my Mac desktop running the official MySQL. 2.
The ALTER command is running at the mysql command line. 3.
The "replace into feeds..." command is being run by feedBase running on the Mac in response to a /ping command.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/scripting/feedBase/issues/22#issuecomment-379060306, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIFBBJo-18Ye3tssaNXAJHdzpYDUtvZks5tlnlFgaJpZM4S_7gn .
I did this --
"database": { "host": "localhost", "port": 3306, "user": "root", "password": "xxx", "charset": "utf8mb4", "connectionLimit": 10, "database": "feedbase", "raise_on_warnings": true }
Did the ping, same problem as before.
I have to take a break now, wil be back in about an hour.
I also have MySQL Workbench running on this machine, btw. I don't know how to use it, but I fumble around trying stuff out. Eventually I'll figure it out. ;-)

If you used utf8 for the alter, use utf8 for the connection charset too.
Or use utf8mb4 everywhere. (In the alter and the connection )
JY
Le jeu. 5 avr. 2018 à 22:23, Dave Winer [email protected] a écrit :
I did this --
"database": { "host": "localhost", "port": 3306, "user": "root", "password": "xxx", "charset": "utf8mb4", "connectionLimit": 10, "database": "feedbase", "raise_on_warnings": true }
Did the ping, same problem as before.
I have to take a break now, wil be back in about an hour.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/scripting/feedBase/issues/22#issuecomment-379064682, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIFBDPqJ-v-UmV3oV0ZIbTC69bI9A0Yks5tln0ngaJpZM4S_7gn .
-
do use utf8mb4 by the way. You don’t want a bug because of Poo. :)
https://mathiasbynens.be/notes/mysql-utf8mb4
-
in MySQL workbench, on the lower left you see « Schemas ». Click on the « sys » database and you’ll be able to list the tables. Click on the little « i » buttons next to table names. That will open a new tab where you can see the table structure. (Columns and their encodings. )
JY
Le jeu. 5 avr. 2018 à 22:24, Dave Winer [email protected] a écrit :
I also have MySQL Workbench running on this machine, btw. I don't know how to use it, but I fumble around trying stuff out. Eventually I'll figure it out. ;-)
[image: image] https://user-images.githubusercontent.com/1686843/38389905-c3046946-38ed-11e8-9451-284bd5a0f337.png
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/scripting/feedBase/issues/22#issuecomment-379065015, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIFBNbydrYbjco5ckT6VDInxqZZYwvoks5tln1rgaJpZM4S_7gn .
Use also utf8mb4_general_ci for the collation. (Note the mb4 part)
CHARACTER SET utf8mb4 COLLATE utf8mb4 _general_ci
JY
Let's review the commands I use to create the database and the tables. I have them documented. Let's get them really dialed in. Then I'll re-create the database, on my Mac, re-import the data, do some pings. Once it works, I'll do the same on the main server, and hopefully we'll have something we can build on for years to come.
Here are the docs for the tables.
https://github.com/scripting/feedBase/blob/master/docs/database.md#sql-commands-to-create-the-tables
And here are the docs for config.json:
https://github.com/scripting/feedBase/blob/master/docs/config.md
I already know that I have to add a charset value in the database sub-object.
I'm back at work this morning.
I did a search on encodings and came up with this article.
https://mathiasbynens.be/notes/mysql-utf8mb4
It seems to have it covered for updating an existing database. I'm still looking for an article that covers starting a new database with new tables with utf8mb4 encoding.
BTW, no question utf8mb4 is the proper encoding to use.
You can use a default charset and collation for a table:
CREATE TABLE t1
(
[... columns definitions...]
) DEFAULT CHARACTER SET utf8mb' COLLATE utf8mb4_general_ci;
https://dev.mysql.com/doc/refman/5.7/en/charset-examples.html
Not sure about the right collation to use, because people turn to utf8_unicode_ci or utf8_general_ci (when using the utf8 charset) :
https://stackoverflow.com/a/367725
Someone mentions:
Update: For newer versions, recommend utf8mb4 and utf8mb4_unicode_520_ci. These give you the rest of Chinese, plus improved collation
https://stackoverflow.com/questions/367711/what-is-the-best-collation-to-use-for-mysql-with-php/367725#comment59285198_367725
See the example 4 in the previous link on MySQL.com, you can also set a default charset/collation for the database (instead of repeating it at the tables level)
CREATE DATABASE d1
DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci;
USE d1;
CREATE TABLE t1
(
c1 CHAR(10)
);
But i prefer to be explicit at the table level, as it's also what mysqldump does when it dumps the CREATE statements.
More details here:
https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
Thank you JY! ;-)
But I'm confused. What is latin2? I'm using utf8mb4, right?
Yes yes, sorry. i copy pasted their example for the DEFAULT CHARACTER SET syntax, but do use utf8mb4 when seeing another charset like latin1 etc...
BTW, another question.
When I recreate the tables, I want to use TEXT type instead of VARCHAR.
I don't like having to specify the size limit.
It seems I can't use TEXT for a key field.
