
Allow specifying a common "assets directory" for exports that include downloading media

Open marens101 opened this issue 3 years ago • 7 comments

Flavor

No response

Export format

No response

Details

Currently, when downloading included media, all the files are downloaded into a directory with a name matching that of the exported file. That works great for single-channel exports, but when exporting a guild or even just a few channels, quite a bit of media gets reused (e.g. avatars and the server icon). Coupled with the existing option to reuse existing media, an option to specify a "common directory" for these files would save both download bandwidth and storage space, which is great for large exports.

Bonus points if we can find a way to identify attachments/embedded media that have been duplicated across multiple channels. For embeds we could perhaps start by checking the URL from the referenced message. For attachments, the only way I can see would be to download the file, generate a hash, and check for a match against a list of hashes of already-downloaded files stored alongside the media. I'm not really sure the storage space reclaimed would justify the additional time taken to compute all the file hashes, especially since we'd need to download the file to hash it in the first place, and quirks in Discord's re-encoding would likely cause false negatives on images and videos. Avatars and the server icon should be fairly straightforward, though.
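To make the hash-based idea above concrete, here is a minimal Python sketch. It is not DiscordChatExporter code (the project itself is not written in Python); the function names, the in-memory `index` dictionary, and the hash-derived file naming are all assumptions made for illustration. It shows the trade-off mentioned above: the file has to be fully downloaded before it can be hashed and recognized as a duplicate.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def store_deduplicated(downloaded: Path, assets_dir: Path, index: dict) -> Path:
    """Move an already-downloaded file into assets_dir unless the same content is stored.

    `index` is a hypothetical in-memory map of content hash -> stored path;
    a real tool would need to persist it alongside the media.
    """
    digest = file_digest(downloaded)
    existing = index.get(digest)
    if existing is not None:
        downloaded.unlink()  # identical content already stored: drop the new copy
        return existing
    assets_dir.mkdir(parents=True, exist_ok=True)
    # Name the stored file after its hash so equal content maps to one file.
    target = assets_dir / f"{digest[:16]}{downloaded.suffix}"
    downloaded.replace(target)
    index[digest] = target
    return target
```

Note that this only catches byte-identical files, so re-encoded images or videos would still be stored twice.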

marens101 avatar Feb 18 '22 10:02 marens101

Sounds like a great idea. To show an example I ran a duplicate search through my exports. The results:

Total file count: 32276
Duplicate file count: 27102

Total files sizes: 5.44 GB
Duplicate files sizes: 3.15 GB

and I'm sure there are people with way more files out there.

Feathered-Serpent avatar Feb 21 '22 13:02 Feathered-Serpent

I'm in favor of adding this to the CLI, but the GUI would require more work

Tyrrrz avatar Feb 21 '22 14:02 Tyrrrz

> I'm in favor of adding this to the CLI, but the GUI would require more work

It does seem like a feature that would benefit CLI users the most, especially since "download all media" is already hidden behind a settings menu in the GUI, but wouldn't it make sense to try to maintain feature parity between the two?

marens101 avatar Feb 21 '22 22:02 marens101

> > I'm in favor of adding this to the CLI, but the GUI would require more work
>
> It does seem like a feature that would benefit CLI users the most, especially since "download all media" is already hidden behind a settings menu in the GUI, but wouldn't it make sense to try to maintain feature parity between the two?

It would of course, but GUI changes are just way more expensive to make.

Tyrrrz avatar Feb 22 '22 14:02 Tyrrrz

Piggybacking off of this, is there a reason export directories are tagged with export format? If I export a server both as HTML and JSON with --media, why does it have to save two copies of everything to Server - Category - channel [id].html_Files and Server - Category - channel [id].json_Files when it could just as easily keep them all in Server - Category - channel [id].Files?

Obviously a naïve implementation would download the files twice (clobbering the original set with an identical set), but that would still be preferable to unnecessary dupes.

omegasome avatar Feb 28 '22 02:02 omegasome

No very specific reason, but it was done to match Chrome's behavior when you save a web page with its files, which was previously what people did to create a self-contained export.

Tyrrrz avatar Aug 15 '22 21:08 Tyrrrz

This is also a feature I would like to see implemented.

I currently use the script to do a daily export of very active channels, and I end up with gigs of duplicate media.

Hope you’ll have some time to work on this soon.

mickaelperrin avatar Sep 14 '22 19:09 mickaelperrin

I don't think users would need to specify a folder that serves as their "common assets directory," I think we could just create a "Common assets" folder on each export including avatars, the group/server icon, and emojis.

This might be a stupid idea, but since it just popped into my head: what if we didn't create a media folder for each channel, but one per output directory? For example, the export of a guild with 20 channels would still create the 20 channel export files, but instead of also creating 20 folders, it'd just create one media folder. Could that solve the problem of duplicates, because a file that had already been exported somewhere else would just overwrite the old one?

But on the other hand, there might be some use cases for having separate media folders that went over my head...

CanePlayz avatar Jan 19 '23 13:01 CanePlayz

So the existing approach was designed to replicate the behavior of Google Chrome when you saved a web page as Web page, complete (that saves the HTML along with all referenced assets). I found that model to be applicable to DCE because I thought that people would want to treat each individual export as an independent piece that they can move around (along with the corresponding assets directory) and put it on a USB drive or something.

Tyrrrz avatar Jan 19 '23 20:01 Tyrrrz

Hmm, specifying an HTML <base> tag could let people move the assets directory fairly freely; only the base would need to be changed in the HTML files if one wants to place the directory somewhere else. But what happens with two files that have the same name but different checksums?
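One possible answer to the same-name, different-checksum question, sketched in Python purely for illustration: reuse the existing file when the content matches, and otherwise fall back to a name that includes a short content hash. The function name and the naming scheme are assumptions, not anything the tool is known to do.

```python
import hashlib
from pathlib import Path


def place_in_shared_dir(shared_dir: Path, file_name: str, content: bytes) -> Path:
    """Store content under file_name in a shared media folder without clobbering.

    Hypothetical policy: same name and same bytes -> reuse the existing file;
    same name but different bytes -> store under a hash-suffixed name.
    """
    shared_dir.mkdir(parents=True, exist_ok=True)
    target = shared_dir / file_name
    if target.exists():
        if target.read_bytes() == content:
            return target  # true duplicate, nothing to write
        # Name collision with different content: disambiguate with a short hash.
        digest = hashlib.sha256(content).hexdigest()[:8]
        name = Path(file_name)
        target = shared_dir / f"{name.stem}-{digest}{name.suffix}"
    target.write_bytes(content)
    return target
```

With this policy, a plain overwrite in a single shared folder never silently replaces a different file that happens to share a name.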

Feathered-Serpent avatar Jan 19 '23 20:01 Feathered-Serpent

> I found that model to be applicable to DCE because I thought that people would want to treat each individual export as an independent piece that they can move around (along with the corresponding assets directory) and put it on a USB drive or something.

Good point.

CanePlayz avatar Jan 20 '23 19:01 CanePlayz

I recently added a command line option to my fork which allows for a media directory path to be specified independently of the regular output path—would this be an acceptable solution to this issue? If so I can touch it up and make a PR.

96-LB avatar Jan 20 '23 21:01 96-LB

> I recently added a command line option to my fork which allows for a media directory path to be specified independently of the regular output path. Would this be an acceptable solution to this issue? If so I can touch it up and make a PR.

Yeah, I think that's good.

Tyrrrz avatar Jan 20 '23 22:01 Tyrrrz

@marens101 and others who commented in favor of this feature, can you please give feedback on using this heuristic to decide whether to use relative or absolute paths in HTML export?

https://github.com/Tyrrrz/DiscordChatExporter/pull/989#issuecomment-1400531102:

> We can probably check if the assets path is outside the output directory (i.e. higher), then use the absolute path, and relative otherwise. But I'm not sure if it's a good heuristic, either.

That means if your --output path is c:/foo/bar/export.html and --mediaDir path is c:/foo/bar/assets/, then the export will have relative paths. If the --mediaDir path is something like c:/assets/ instead, then the export (still with the same path) will use absolute paths to reference assets.
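The heuristic can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the code from the PR; the function name `asset_href` is made up, and `PureWindowsPath` is used so the `c:/...` examples behave the same on any platform.

```python
from pathlib import PureWindowsPath


def asset_href(output_file: str, media_dir: str, file_name: str) -> str:
    """Return the reference an HTML export would use for one asset.

    Relative if the media directory sits inside the export's directory,
    absolute otherwise (the heuristic described above).
    """
    out_dir = PureWindowsPath(output_file).parent
    asset = PureWindowsPath(media_dir) / file_name
    try:
        # Succeeds only when the asset lives under the output directory.
        return asset.relative_to(out_dir).as_posix()
    except ValueError:
        return asset.as_posix()


print(asset_href("c:/foo/bar/export.html", "c:/foo/bar/assets/", "a.png"))
print(asset_href("c:/foo/bar/export.html", "c:/assets/", "a.png"))
```

The first call yields a relative reference (`assets/a.png`) and the second an absolute one (`c:/assets/a.png`), matching the two cases in the example.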

Comments?

Tyrrrz avatar Jan 24 '23 22:01 Tyrrrz

I don't know if you'd care about other feedback or not, but I'd say relative: if the Discord downloads were moved to a different drive, the mount point changed, or the folder structure changed in general, the exports would otherwise break. When I wanted to do this before, it made the most sense to put the assets somewhere in the parent directories of where the HTML/JSON files were. The paths have changed since then, so absolute paths would have broken by now.

Twi-Hard avatar Jan 25 '23 01:01 Twi-Hard