DiscordChatExporter
Allow specifying a common "assets directory" for exports that include downloading media
Flavor
No response
Export format
No response
Details
Currently, when downloading included media, all the files are downloaded into a directory with a name matching that of the exported file. That works great for single-channel exports, but when exporting a guild or even just a few channels there's quite a bit of media that would be reused (e.g. avatars and the server icon). Coupled with the existing option to reuse existing media, by providing the option to specify a "common directory" for these files we can save on both download bandwidth and storage space, which is great for large exports.
Bonus points if we can find a way to identify attachments/embedded media that have been duplicated across multiple channels. For embeds we could perhaps start by checking the URL from the referenced message. For attachments, the only way I can see would be to download the file, generate a hash, and check it against a list of hashes of already-downloaded files stored alongside the media. I'm not really sure the storage space reclaimed would outweigh the additional time taken to compute all the file hashes, especially since we'd need to download the file to hash it in the first place, and gimmicks with Discord's reencoding would likely cause false negatives on images and videos. Avatars and the server icon should be fairly straightforward, though.
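To make the hash-list idea concrete, here is a minimal Python sketch. It is only an illustration, not DiscordChatExporter code (the project itself is written in C#). The helper name `store_media`, the `index.json` file, and the hash-prefixed file names are all assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


def store_media(data: bytes, file_name: str, assets_dir: Path) -> Path:
    """Save downloaded bytes into assets_dir unless identical content is already stored.

    Hypothetical helper: keeps a JSON index (content hash -> stored file name)
    alongside the downloaded media.
    """
    assets_dir.mkdir(parents=True, exist_ok=True)
    index_path = assets_dir / "index.json"
    index = json.loads(index_path.read_text()) if index_path.exists() else {}

    digest = hashlib.sha256(data).hexdigest()
    if digest in index:
        # The same bytes were stored earlier; reuse that file instead of writing a copy.
        return assets_dir / index[digest]

    # Prefix the name with part of the hash so that files with the same name
    # but different content do not overwrite each other.
    stored_name = f"{digest[:8]}-{file_name}"
    (assets_dir / stored_name).write_bytes(data)
    index[digest] = stored_name
    index_path.write_text(json.dumps(index, indent=2))
    return assets_dir / stored_name
```

Note that this only saves storage space, not bandwidth: as mentioned above, the file still has to be downloaded before it can be hashed, and reencoded copies of the same image would hash differently.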
Sounds like a great idea. To show an example I ran a duplicate search through my exports. The results:
Total file count: 32276
Duplicate file count: 27102
Total files sizes: 5.44 GB
Duplicate files sizes: 3.15 GB
and I'm sure there are people with way more files out there.
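The comment above does not say which tool produced these numbers. For anyone who wants to run a similar check on their own exports, a rough Python sketch could look like the following. The function name `duplicate_stats` is made up for this example, and it assumes that "duplicate" means every byte-identical copy beyond the first one.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def duplicate_stats(root: Path) -> dict:
    """Group files under root by SHA-256 and count every copy beyond the first as a duplicate."""
    groups = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            # Reads each file fully into memory; fine for a sketch, slow for huge files.
            groups[hashlib.sha256(path.read_bytes()).hexdigest()].append(path)

    stats = {"total_files": 0, "total_bytes": 0, "duplicate_files": 0, "duplicate_bytes": 0}
    for paths in groups.values():
        size = paths[0].stat().st_size  # identical content implies identical size
        stats["total_files"] += len(paths)
        stats["total_bytes"] += size * len(paths)
        stats["duplicate_files"] += len(paths) - 1
        stats["duplicate_bytes"] += size * (len(paths) - 1)
    return stats
```

Calling `duplicate_stats(Path("exports"))` on an export directory (the directory name here is just a placeholder) returns the four totals as a dictionary.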
I'm in favor of adding this to the CLI, but the GUI would require more work
It does seem like a feature that would benefit CLI users the most, especially since "download all media" is already hidden behind a settings menu in the GUI, but wouldn't it make sense to try to maintain feature parity between the two?
It would of course, but GUI changes are just way more expensive to make.
Piggybacking off of this, is there a reason export directories are tagged with export format? If I export a server both as HTML and JSON with --media, why does it have to save two copies of everything to Server - Category - channel [id].html_Files and Server - Category - channel [id].json_Files when it could just as easily keep them all in Server - Category - channel [id].Files?
Obviously a naïve implementation would download the files twice (clobbering the original set with an identical set), but that would still be preferable to unnecessary dupes.
No very specific reason, but it was done to match Chrome's behavior when you save a web page with files, which was previously what people did to create a self-contained export.
This is also a feature I would like to see implemented
I currently use the script to do a daily export of very active channels, and I end up with gigs of duplicate media.
Hope you'll have some time to work on this soon.
I don't think users would need to specify a folder that serves as their "common assets directory," I think we could just create a "Common assets" folder on each export including avatars, the group/server icon, and emojis.
This might be a stupid idea, but since it just popped into my head: what if we didn't create a media folder for each channel, but one in each directory? E.g. the export of a guild with 20 channels would still create the 20 channel export files, but instead of also creating 20 folders, it'd just create one media folder. Could that solve the problem of duplicates, because when exporting a file that had already been exported somewhere else, it'd just overwrite the old one?
But on the other hand, there might be some use cases for having separate media folders that went over my head...
So the existing approach was designed to replicate the behavior of Google Chrome when you saved a web page as Web page, complete (that saves the HTML along with all referenced assets). I found that model to be applicable to DCE because I thought that people would want to treat each individual export as an independent piece that they can move around (along with the corresponding assets directory) and put it on a USB drive or something.
Hmm, specifying an HTML <base> tag could let people move the assets directory fairly freely; only the base would need to be changed in the HTML files if one wants to place the directory somewhere else. But what happens with two files that have the same name but different checksums?
> I found that model to be applicable to DCE because I thought that people would want to treat each individual export as an independent piece that they can move around (along with the corresponding assets directory) and put it on a USB drive or something.
Good point.
I recently added a command line option to my fork which allows for a media directory path to be specified independently of the regular output path—would this be an acceptable solution to this issue? If so I can touch it up and make a PR.
Yeah, I think that's good.
@marens101 and others who commented in favor of this feature, can you please give feedback on using this heuristic to decide whether to use relative or absolute paths in HTML export?
https://github.com/Tyrrrz/DiscordChatExporter/pull/989#issuecomment-1400531102:
We can probably check if the assets path is outside the output directory (i.e. higher), then use the absolute path, and relative otherwise. But I'm not sure if it's a good heuristic, either.
That means if your --output path is c:/foo/bar/export.html and --mediaDir path is c:/foo/bar/assets/, then the export will have relative paths. If the --mediaDir path is something like c:/assets/ instead, then the export (still with the same path) will use absolute paths to reference assets.
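The heuristic described above can be sketched in a few lines of Python. This is an illustration only, not the code from the pull request. The function name `asset_reference` is hypothetical, and the sketch assumes that "inside the output directory" means inside the directory that contains the exported file.

```python
import os
from pathlib import Path


def asset_reference(output_file: str, media_dir: str, asset_name: str) -> str:
    """Return the path an HTML export would use to reference an asset.

    Relative if the media directory lies inside the export file's directory,
    absolute otherwise.
    """
    # abspath normalizes the paths without touching the file system.
    export_dir = Path(os.path.abspath(output_file)).parent
    asset_path = Path(os.path.abspath(media_dir)) / asset_name

    if asset_path.is_relative_to(export_dir):  # requires Python 3.9+
        return asset_path.relative_to(export_dir).as_posix()
    return asset_path.as_posix()
```

With the example paths from the comment (written here as POSIX paths), an export at `/foo/bar/export.html` with media in `/foo/bar/assets` gets the reference `assets/a.png`, while media in `/assets` gets `/assets/a.png`.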
Comments?
I don't know if you'd care about other feedback or not, but I'd say relative, because if the Discord downloads were moved to a different drive, or the mountpoint changed, or the folder structure changed in general, the exports would break otherwise. When I was wanting to do this before, it made the most sense to put the assets somewhere in the parent directories of where the HTML/JSON files were. The paths have changed since then, so it would have broken by now.