Set content encoding for files stored on cloud storage
Currently we do not set the content encoding for text files stored on a cloud provider (e.g. README, LICENSE, ...). As a side effect, browsers have to guess the encoding when displaying these files, and they often guess wrong.
We should explicitly set the content encoding when uploading a file (there should be support for that by each cloud provider).
Additionally, we need to add some migration procedure for existing files.
Related:
- https://github.com/EclipseFdn/open-vsx.org/issues/1122
- #1346
@netomi can you assign this issue to me?
we don't assign tickets to outside contributors, but you are free to submit a PR, of course.
Hi @netomi, I’m exploring this issue as part of my GSoC 2026 preparation and wanted to discuss my approach before starting a PR.
My idea directly addresses the problem description and the related issue #1122:
- **Prevent future issues**
  The root cause is that text files (e.g. README, LICENSE) are uploaded with a Content-Type like `text/plain` without specifying the character encoding. My solution is to modify `StorageUtil.java` to include `; charset=utf-8` in the Content-Type header. This ensures browsers render special characters correctly.
- **Fix existing files**
  Changing the code only affects new uploads; existing files on S3/Azure/Google Cloud still have the wrong metadata. My plan is to create a small migration task (`ContentEncodingMigration.java`) that:
  - identifies affected text files in the database
  - downloads and re-uploads them to update the metadata with the correct charset
This approach follows the suggestion in the issue comments: "We should explicitly set the content encoding when uploading a file... Additionally, we need to add some migration procedure for existing files."
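To make the first part more concrete, here is a minimal sketch of the kind of change I have in mind. The class and method names are illustrative, not the actual `StorageUtil` API:

```java
import java.util.Map;

// Hypothetical sketch: map known text file extensions to a Content-Type
// that carries an explicit charset, so browsers no longer have to guess.
public class CharsetContentTypes {

    private static final Map<String, String> TEXT_TYPES = Map.of(
            ".md", "text/markdown",
            ".txt", "text/plain",
            ".json", "application/json"
    );

    /** Returns a Content-Type with charset=utf-8 for known text files, or null for binary files. */
    public static String getTextContentType(String fileName) {
        var lower = fileName.toLowerCase();
        for (var entry : TEXT_TYPES.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue() + "; charset=utf-8";
            }
        }
        return null;
    }
}
```

The returned value would then be passed as the Content-Type when uploading to the storage provider.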
Before I start, I wanted to confirm that this approach aligns with the project’s expectations. Any feedback is appreciated!
Hi @siddharthbaleja7 thanks for the interest in this ticket.
The solution to explicitly set the content encoding for uploading new files to the storage provider sounds good. To fix existing files we will need to apply a different approach. In large instances of openvsx there are more than a million files stored, so downloading them will not work imho.
Thanks for the feedback! You are absolutely right, downloading and re-uploading millions of files would be too resource-intensive.
I propose updating the migration strategy to perform an in-place metadata update instead. Most cloud storage providers allow updating the Content-Type without transferring the file content:
- **AWS S3:** use `CopyObject` with the source and destination being the same object, and `MetadataDirective.REPLACE`.
- **Azure Blob:** use `setHttpHeaders` to update the properties directly.
- **Google Cloud:** use `blob.update()` to modify the metadata.
This approach only makes lightweight API calls for the affected text files (README, LICENSE, etc.) to set the correct charset=utf-8, avoiding heavy data transfer.
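For the S3 case, a minimal sketch of the in-place update could look like this (assuming the AWS SDK for Java v2; the class and method names here are illustrative, not existing openvsx code):

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CopyObjectRequest;
import software.amazon.awssdk.services.s3.model.MetadataDirective;

// Hypothetical migration sketch: update the Content-Type of an existing S3
// object in place by copying it onto itself with MetadataDirective.REPLACE,
// so the object content is never transferred through the client.
public class ContentTypeMigration {

    /** Rewrites the object's metadata so its Content-Type carries an explicit charset. */
    public static void fixContentType(S3Client s3, String bucket, String key, String contentType) {
        var request = CopyObjectRequest.builder()
                .sourceBucket(bucket)
                .sourceKey(key)
                .destinationBucket(bucket)
                .destinationKey(key)
                .contentType(contentType)                     // e.g. "text/plain; charset=utf-8"
                .metadataDirective(MetadataDirective.REPLACE) // replace metadata instead of copying it
                .build();
        s3.copyObject(request);
    }
}
```

Note that `MetadataDirective.REPLACE` is required here: S3 rejects a copy of an object onto itself unless the metadata is being changed. The Azure and Google Cloud variants would be analogous single API calls per object.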
Does this approach sound reasonable?