tg-archive
tg-archive copied to clipboard
Store message content as HTML
Fixes #43
Messages are stored in the database as HTML. This preserves formatting such as bold, italic, underline, strikethrough, monospace and inline links.
Telegram links in a message to other messages (t.me/group/message_id) are replaced with their archival site version.
For example, t.me/example_group/12
becomes example_group/site/2022-02.html#12
.
Thanks. Will test this soon.
I tried this and it works, but I think the html template file should be edited to reflect the changes, as the html elements are not rendered.
The rss template though seems to be working just fine for now.
Actually nevermind - I was using the old template for the html website. The new template actually does work just fine.
Sorry, just got a chance to look at this. URLs aren't being rendered as hyerplinks anymore.
Fresh site created using --new
with this PR:
Current master
:
Are you sure you deleted the database and then synced again? Because otherwise you would be just applying the new code/template on the old raw text messages.
I used an existing database, but that shouldn't break existing links on existing installations. Re-syncing large channels may be impractical.
-
replace_msg_link()
can be renamed tourlize()
(like in the current version) and it can continue to convert non-<a>
URLs to links along with replacing Telegram group links like it is doing right now. -
This PR also involves changing the template, which means all existing installations will break after an upgrade, which isn't ideal. Have to come up with a way to avoid this.
I agree, starting over with large channels seems impractical. I could be missing something but what I understand is raw text should not be rendered without escaping and HTML cannot be escaped and I don't think it is possible to differentiate between raw text and HTML. I am unable to have a generalized urlize()
. This would also lead to both HTML and raw text being stored in the database, which doesn't sound nice to me.
I think we can have a 'formatted-message' config which is True by default, so that new sites preserve message formatting and the existing ones do not break. I will make the change and test it out. What do you guys think?
IF we make such a setting, I believe it should be set to True by default in NEW configurations by adding it to the example config.yaml file. However, if the setting does not exist in the config.yaml file (i.e. started with an old version) it should assume it is False.
Yeah, an html_messages: true
which is by default turned on for new setups should be fine.
Thanks for the feedback! Have made the change.
Almost works! One last quirk. Syncing with html_messages: True
saves HTML in the DB. If you then set it to False and rebuild the site, the HTML tags render as plaintext.
Well that makes sense because the setting is meant to be constant; otherwise normally all projects should be set to use the html one. The real point of the setting is to not break previous setups by preventing a mix of text and html messages.
If this behaviour is confusing, one solution would be to add description next to the option about its nature and that it should not be changed after the first sync.
Another solution would be to remove the option entirely, and pull all new messages as html, regardless of the previous ones. Then, when generating the html/rss templates, check each message to see if it is text or html, and handle it accordingly. I believe a good check would be to check the datatype, but really any method should suffice.
Actually now that I think about it, the second solution makes way more sense but for some reason I didn't think of it before.
Yep, the second option is better. The True/False
should only affect rendering HTML or escaped text.
Thanks @farzat, but with the option you suggest we need to determine whether a message is raw text or HTML. Consider these example messages,
msg.raw_text - <script src="main.js"/>
(requires escaping)
msg.text - <script src="main.js"/>
(already escaped, by Telethon I guess)
- This is html
msg.raw_text - This is html
msg.text - This is <em>html</em>
I think we can't reliably differentiate (as existing raw text messages could be like example 1) without storing some information:
- Currently, it is the
html_messages
config. Having instructions about its usage in the config file is definitely required as @farzat said. For switching, a user has to start over. This means complete site in one format. - A new message type -
html_message
alongsidemessage
in the messages table. The user can changehtml_messages
config. before syncing the site. This means mixed formats.
From these options, 1 seemed better to me because it is consistent. However, 2 is more flexible, especially for existing sites.
@knadh If we want to also have the ability to switch after syncing the site and then build, we will have to store both raw text (content) and HTML (content_html), right? This could increase the database size. Handling this for existing sites adds some cases as well. Storing HTML from now on and getting rid of tags is another way that came to my mind, but it is not a reliable option, consider example 1 raw text being synced before this change.
I don't think this can be the case. If you type out HTML tags, Telegram encodes it. The above message will come as <script src="main.js"/>
. Basically, the only valid HTML that we get should be HTML that Telegram has sanitized and generated (bold, italics etc.).
I guess Telegram returns MessageEntity
objects which is left for the client (Telethon) to handle and Telethon can give us raw text, HTML and Markdown using those objects. The raw_text
from Telethon for both the below messages is unescaped, whereas with text
it gets escaped.
This is with html_messages: True
Yes, as expected, html_messages: True
leads to message.text
being stored, which is already sanitized by Telethon. However, with html_messages: False
or in the absence of the config., unescaped raw text is stored and escaped during the build process, this is identical to the current master branch's behaviour. So, existing unescaped raw text messages cannot be differentiated from new HTML messages that this change will start storing unless we store some additional information like I described.
Ok I guess the way we could do this is just adding a new field in the database, for example specifying which version of tg-archive was used to sync this message. In our case, the type or content of the field doesn't matter, but its very existence suggests that this message was synced after this pull request, which means that the content is html, or otherwise plain text. This way, the option is also removed (all new messages are html), simplifying the config file.
This of course stems from my philosophy of limiting options to the user. Raw text support was brought up here simply for the sake of backwards compatability, so as long as we can keep backwards compatability without adding the option then the option shouldn't exist.
I think the example I chose of version number is especially useful because it will also be helpful in similar cases in the future as well. Empty fields represent versions older than whatever the current version is. However, as I mentioned in the last comment, for this very paritcular case, any field with whatever random content should also do the job.
I agree. Some information has to be stored somewhere which is ultimately due to the Telethon raw_text
behaviour that was discussed above.
If we store version number for future purposes other than this, it will require tracking which version introduced what feature/change. We have to check against a message's tg-archive version before doing something. I see how it can be used more generally, though.
Can you @knadh please check and confirm the message.raw_text
behaviour? It returns the content unescaped for me.
To summarize the options and their impact, we can have
- A config. option (
html_messages: True
), enforces a format on site level. New messages are in HTML if config. is present and True. Cannot be changed for a site without starting over. - A message type in messages table:
html_message
, enforces a format on message level. We can choose to allow swicthing before a sync usinghtml_messages
config. We can also choose to enforce HTML for existing sites or let them keep raw text unless they add the config manually. - tg-archive version for each message in the messages table, enforces a format on message level. New messages are in HTML by default for existing and new sites. No overrides.
- Start storing both. Possible to switch anytime instead of just before a sync. Larger databases.
The fourth option also has the advantage of keeping the raw text available in case we wanted to, for example, search the database. Personally I prefer the 3rd option the most (it's my idea after all), and then the 4th option.
Sorry, this discussion has been idle for a few months now. But I came back to this, a few days ago, thought of a different method - instead of the entire HTML message, storing only the message entities as JSON (e.g. {"bold": [[0, 5], [9, 2]]}
where 0 is the offset and 5 is the length and so on) and then reconstructing the entity objects and passing them to html.unparse() to get the HTML message. But different entities such as URLs can have more attributes than just offset and length, making the resulting JSON complex. Also, this would be more useful when the messages are known to be very long, otherwise in maximum cases storing both raw text and the HTML version is simple and better.
So, option 4 is the best in my opinion. We can close this in case it's not required anymore.