Export message content without markdown parsing

Open fghsgh opened this issue 2 years ago • 11 comments

Flavor

No response

Export format

No response

Details

I would like an option to export message contents in raw format. That is, by not substituting custom emote names (<:name:id>), channels (<#id>), mentions (<@id>) and so on. As it is now, the contents are converted to a human-readable format, but especially for JSON and CSV formats, it might be desirable to have the option to store the messages as they are actually sent. Discord markup is a subset of this that is exported raw in the text formats but parsed for HTML, so there is certainly a precedent for this. And it would increase the amount of information that could be gotten out of the exports. I am using the CLI, but AFAIK, the GUI doesn't have an option for this either. For the CLI, I imagine it could be implemented as another option switch.

fghsgh avatar Nov 21 '21 17:11 fghsgh

What is your usecase for this?

Tyrrrz avatar Nov 21 '21 23:11 Tyrrrz

One example would be downloading all of a server's custom emotes by sending a message containing all of them and parsing the <:name:id> format with a regex to get the download link and filename.
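A minimal sketch of that emote-scraping idea in Python, assuming the raw message content is available. The `<:name:id>` form is the one quoted above; the `a` prefix for animated emotes and the `cdn.discordapp.com/emojis/` URL pattern are assumptions about how Discord serves custom emoji, and `emote_downloads` is a hypothetical helper name.

```python
import re

# Custom emotes in raw content look like <:name:id>; the optional leading
# "a" (assumed here) marks an animated emote.
EMOTE_RE = re.compile(r"<(a?):(\w+):(\d+)>")


def emote_downloads(raw_content):
    """Return (filename, url) pairs for each custom emote in a raw message."""
    results = []
    for animated, name, emote_id in EMOTE_RE.findall(raw_content):
        ext = "gif" if animated else "png"
        # Assumed CDN URL pattern; verify against the Discord docs before relying on it.
        url = f"https://cdn.discordapp.com/emojis/{emote_id}.{ext}"
        results.append((f"{name}.{ext}", url))
    return results
```

Plain text such as `:thonk:` is ignored by this pattern, which only works because the raw content keeps the ID.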

Another would be to have things like #deleted-channel actually give you more information than just "this channel was not available at the time of the data export", plus I tend to manually enter <#id> to refer to channels on servers other than the current one (which also shows up as #deleted-channel, even though it shows up fine in both the desktop and mobile clients).

Mentions of users can also be confusing because they could potentially have the same nickname, or they could change nick regularly. Plus, parsing these programmatically is just impossible anyway because these names could contain characters such as spaces as well, meaning you wouldn't be able to know where the name ended. And considering JSON is supposed to be parsed by a program anyway, I would expect it to behave in a program-friendly manner, retaining as much information as possible of the original message.

This last point is valid for all of them, really. In the JSON format (and all non-HTML ones, for that matter), how can I know if :thonk: is an emote or if someone just happened to type that text without it being an emote? How do I know, if I see #general-chat, whether the channel name is just #general and someone added -chat after it (not being part of the link, that is)?
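To illustrate why the raw form is unambiguous, here is a rough Python sketch that pulls every reference out of a raw message together with its exact position. The `<@id>`, `<#id>` and `<:name:id>` forms are the ones mentioned in this thread; the `<@!id>`, `<@&id>` (role) and `<a:name:id>` variants are assumptions, and `find_references` is a hypothetical name.

```python
import re

# One alternative per reference kind. Each token is delimited by < and >,
# so there is no doubt about where it ends.
TOKEN_RE = re.compile(
    r"<@!?(?P<user>\d+)>"
    r"|<@&(?P<role>\d+)>"
    r"|<#(?P<channel>\d+)>"
    r"|<a?:\w+:(?P<emote>\d+)>"
)


def find_references(raw_content):
    """Return (kind, id, (start, end)) for each reference in raw content."""
    refs = []
    for m in TOKEN_RE.finditer(raw_content):
        for kind in ("user", "role", "channel", "emote"):
            if m.group(kind) is not None:
                refs.append((kind, m.group(kind), m.span()))
    return refs
```

For a message like `<#111>-chat`, the channel token ends at the `>`, so the trailing `-chat` is clearly ordinary text, and a bare `:thonk:` produces no match at all.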

Basically, as a technical Discord user myself, I am looking for a way to get the raw, plain text data of messages, for a multitude of reasons, which are certainly not limited to the ones listed above. I'd like to archive all the data, not just the data a normal user may find of importance.

Your title change to "without markdown parsing" seems to be a misunderstanding as well. I said that markdown parsing is already disabled when exporting to non-HTML formats. I was talking about how some other, non-formatting things, are still parsed. I guess it depends on how broadly you take your definition of markdown.

fghsgh avatar Nov 22 '21 01:11 fghsgh

I also have a use-case where this would be useful—rather than parsing channel ID references into channel names (potentially losing data if the program doesn't have access to the channel), it would be preferable to just save the message text raw.

omegasome avatar Feb 24 '22 05:02 omegasome

#553 would fix this, but it was closed as wontfix; I think that the litany of reasons provided, both here and in that PR, should be more than enough for Tyrrrz to reconsider.

omegasome avatar Mar 02 '22 02:03 omegasome

The reason I'm against such feature requests is that they go against the fundamental purpose of the project, which is to export and preserve conversations between people. The use case described above indicates that the goal there is instead to scrape all emoji from the server. This sounds like it would be outside the scope of the project, and I'd recommend just using the Discord API to achieve that, which would also be easier.

Tyrrrz avatar Aug 25 '22 19:08 Tyrrrz

"scraping emoji" was only mentioned in the first line of my comment - quite literally not even the first 10% of bytes in that message. Please actually read what I have to say. The rest should have made it clear that I just care about archiving all of the data, including parts the exporter currently deems not important.

fghsgh avatar Aug 25 '22 20:08 fghsgh

Ok, sorry, I didn't read the rest of it. Also, when I mentioned "markdown parsing", that included mentions and emoji too, as they fall under the same umbrella in the context of Discord.

Tyrrrz avatar Aug 25 '22 20:08 Tyrrrz

From the UX perspective, we also don't have a way of providing format-specific options. We'd have to figure out how to introduce this aspect to the design.

Tyrrrz avatar Aug 25 '22 21:08 Tyrrrz

In my opinion, JSON exports should out-of-the-box refer to all channels, emoji, users, etc. as IDs, since that's the purest form of the data. Converting all of those to text names is not really helpful for automation, and I believe actually is a hindrance since data is lost—not to mention the complications that arise when names contain spaces. As for HTML exports, if we have any missing IDs we could add them as a data attribute on the corresponding elements. This way we wouldn't need to add an extra switch.

96-LB avatar Aug 25 '22 22:08 96-LB

Makes sense. What about plain text format?

Tyrrrz avatar Aug 25 '22 22:08 Tyrrrz

Honestly, I'm really ambivalent about whatever happens to the TXT format, because I don't find it very useful (to me personally) to begin with. If I had to guess, users who export plaintext archives are more interested in it as a lightweight option for presentation rather than automation, and it would probably be better to avoid swapping in IDs for everything. Might be best to just leave that the way it is now, since it's very information-light to begin with. As for CSV, I don't know if you have any interest in updating it at all given its obsolescence, but I would consider that an automation format (i.e. one that would benefit from using raw IDs).

Happy to hear any other opinions on this though.

96-LB avatar Aug 25 '22 22:08 96-LB

I have a use case for this I'm particularly excited for:

My D&D group uses Discord text channels/threads to do roleplay/dialogue sessions in a non-live format. Several of us keep session notes in Microsoft OneNote.

It is my intent to develop a CronJob that runs at 0001 every day, exporting all messages from the previous day in raw JSON to preserve all information about the messages. These dumps will be stored in Google Drive, where all party members can access them.

These JSON files would then be imported by a OneNote plugin I'm developing. Each user would be able to use the plugin to format the messages into their OneNote notebooks to their particular preference. This is why it's important to make no assumptions about format and download everything raw from Discord.

matrumz avatar Sep 16 '22 18:09 matrumz

@matrumz why don't you just use the Discord API directly in your plugin?

Tyrrrz avatar Sep 16 '22 19:09 Tyrrrz

That's a really good point.

The reason I wasn't thinking about that is that I don't have a ton of time to work on the plugin right now, but writing the CronJob that uses this tool for my k3s cluster is a 10-minute thing, and a Python script to format it is 15 minutes, and we could use that until I got the plugin functional...

But you make a fair point: I can just do the really simple version of this in a v0.0.1 version of the plugin.

I'll go that route for my work, but I would still think the feat requested in this thread would be high-value for any number of projects.

Thanks for your time & suggestions.

matrumz avatar Sep 16 '22 20:09 matrumz

Another case for favoring raw data is when you want to import this data into another application. There's an ongoing discussion about writing a Discord -> Zulip data importer tool which could use this project as its base. That would be much easier if at least the JSON format gave all the raw data, since tools like Zulip have their own markdown parsers as well, and converting one raw syntax to another raw syntax is much easier.
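As a rough illustration of the raw-to-raw conversion, here is a Python sketch that rewrites Discord user and channel references into Zulip-style mentions. It assumes Zulip's `@**Full Name**` and `#**stream name**` syntax; the `users` and `channels` lookup tables and the function name `discord_to_zulip` are hypothetical and not part of either project.

```python
import re

MENTION_RE = re.compile(r"<@!?(\d+)>")   # raw Discord user mention
CHANNEL_RE = re.compile(r"<#(\d+)>")     # raw Discord channel reference


def discord_to_zulip(raw_content, users, channels):
    """Rewrite raw Discord references using ID -> name lookup tables.

    IDs missing from the tables are left as raw tokens, so nothing is lost.
    """
    def user_sub(m):
        name = users.get(m.group(1))
        return f"@**{name}**" if name else m.group(0)

    def channel_sub(m):
        name = channels.get(m.group(1))
        return f"#**{name}**" if name else m.group(0)

    text = MENTION_RE.sub(user_sub, raw_content)
    return CHANNEL_RE.sub(channel_sub, text)
```

Names containing spaces are not a problem here because the conversion starts from IDs, not from already-rendered display names.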

aero31aero avatar Sep 28 '22 07:09 aero31aero

This issue may also cover #415 and #408.

Tyrrrz avatar Feb 06 '23 12:02 Tyrrrz