
Allow splitting the data portion into multiple files.

Open Nodja opened this issue 1 year ago • 3 comments

I wanted to host a chat analytics page on GitHub Pages, since it seems ideal for hosting a static file like this, but the report for the Discord server I'm using is almost 400MB of data and there's a 100MB file size limit. I could use LFS, but I've run into bandwidth issues with it in the past.

The solution would be to split the data into separate files.

I've managed to do this manually, but it would be ideal if this was supported natively, since my method is not optimal. Here's how I achieved it:

  • Generate a report.html file like normal
  • Extract the contents of the script tag with the 'data' id to a separate report.txt file, leave the
  • Split the file into 25MB chunks. I used a Python script:
split.py
chunk_size = 25 * 1024 * 1024

prefix = "report"

with open(prefix + ".txt", "rb") as input_file:
    chunk_number = 1
    while True:
        chunk_data = input_file.read(chunk_size)
        if not chunk_data:
            break

        output_file_name = f"{prefix}-{str(chunk_number).zfill(3)}.txt"

        with open(output_file_name, "wb") as output_file:
            output_file.write(chunk_data)

        chunk_number += 1
  • Load the split files using a new script tag. The script tag needs to be in the <head> tag so it loads before anything else. I used an external file, like so: <script src="loader.js"></script>, but an inline script is fine too.
loader.js
var files = [
  "report-001.txt",
  "report-002.txt",
  "report-003.txt",
  "report-004.txt",
  "report-005.txt"
  // etc.
];

var data = "";

for (let i = 0; i < files.length; i++) {
  let file = files[i];
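  let rawFile = new XMLHttpRequest();
  // The third argument (false) makes the request synchronous,
  // so the page blocks until this chunk has been downloaded.
  rawFile.open("GET", file, false);
  rawFile.onreadystatechange = function () {
    if (rawFile.readyState === 4) {
      // Status 0 is accepted so the page also works when opened from file://
      if (rawFile.status === 200 || rawFile.status === 0) {
        data += rawFile.responseText;
      }
    }
  };
  rawFile.send(null);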
}

var data_script = document.getElementById("data");
data_script.innerHTML = data;

That's it. The only issue is that rendering is completely locked until all the chunks are downloaded. It would be best if split loading were supported natively instead of through a hack like this.
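
For comparison, here is a rough sketch of how the chunks could be fetched in parallel without locking rendering. It is not how chat-analytics loads data today: `loadChunks` and its `fetchFn` parameter are hypothetical names, and the app would still need a way to wait for the result before it starts reading the data. The bytes are joined before decoding, so a multi-byte character that was cut in half at a chunk boundary is still decoded correctly.

```javascript
// Sketch: download every chunk in parallel and return the joined text.
// `fetchFn` is injectable for testing; in a browser it defaults to the global fetch.
async function loadChunks(files, fetchFn = fetch) {
  const buffers = await Promise.all(
    files.map(async (file) => {
      const response = await fetchFn(file);
      if (!response.ok) {
        throw new Error(`Failed to load ${file}: HTTP ${response.status}`);
      }
      return new Uint8Array(await response.arrayBuffer());
    })
  );

  // Promise.all preserves the order of `files`, so the chunks can be
  // copied back to back into a single byte array.
  const total = buffers.reduce((sum, buffer) => sum + buffer.length, 0);
  const joined = new Uint8Array(total);
  let offset = 0;
  for (const buffer of buffers) {
    joined.set(buffer, offset);
    offset += buffer.length;
  }

  // Decode once, after joining, so split multi-byte characters survive.
  return new TextDecoder("utf-8").decode(joined);
}
```

In the page this might be called as `loadChunks(files).then((data) => { document.getElementById("data").textContent = data; })`, assuming the rest of the app only starts once that promise has resolved.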

Nodja avatar Jan 04 '24 01:01 Nodja

I think it would be good to use logical partitions, too. That way, only the data needed for whichever tab a user is viewing would be loaded. This would reduce bandwidth, since you can assume most viewers (at least of a publicly hosted report, which is the only use case I can think of for this) aren't going to look in every tab.

hopperelec avatar Jan 04 '24 02:01 hopperelec

I think it's a good feature. We can add an option like "split data files into [X MB] parts" or something. We can do it when I eventually get to the configuration UI, so I'll leave this open 😄
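
As a rough idea of what a "split data files into [X MB] parts" option could do internally, here is a minimal sketch. Both function names are hypothetical and nothing here exists in chat-analytics; the file naming simply follows the report-001.txt pattern from the manual workaround above.

```javascript
// Hypothetical helper: split a byte array into parts of at most `partSizeMB` megabytes.
function splitIntoParts(bytes, partSizeMB) {
  const partSize = Math.floor(partSizeMB * 1024 * 1024);
  if (!(partSize >= 1)) {
    throw new RangeError("partSizeMB must yield at least one byte per part");
  }
  const parts = [];
  for (let offset = 0; offset < bytes.length; offset += partSize) {
    // subarray returns a view on the same memory, so no bytes are copied.
    parts.push(bytes.subarray(offset, offset + partSize));
  }
  return parts;
}

// Hypothetical naming scheme: zero-based index 0 becomes "<prefix>-001.txt".
function partName(prefix, index) {
  return `${prefix}-${String(index + 1).padStart(3, "0")}.txt`;
}
```

The last part is simply whatever is left over, so a 2.5MB input split with a 1MB setting gives two full parts and one 0.5MB part.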


That way, only the data needed for whichever tab a user is viewing would be loaded

Every card analysis is generated on the fly using the full database, we can't "split it by tab"

mlomb avatar Jan 06 '24 19:01 mlomb

Every card analysis is generated on the fly using the full database, we can't "split it by tab"

I didn't mean splitting by tab, I meant splitting by data type. One file could hold all the basic info used in just about every tab, such as these: https://github.com/mlomb/chat-analytics/blob/055c68c78e12c6e3f32cb9137135e426c49a64bf/pipeline/process/Types.ts#L12-L26 https://github.com/mlomb/chat-analytics/blob/055c68c78e12c6e3f32cb9137135e426c49a64bf/pipeline/process/Types.ts#L37-L40 Another file could store words, which I believe are only used by the "Language" tab, and another could store domains, which I believe are only used by the "Links" tab. Although, I do now realise that the majority of the data is probably just going to be words lol. However, I still think it could be a good idea to store words in separate file(s), since most tabs don't require them, and only load those file(s) if the user visits a tab that does.
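
To illustrate the idea, here is a minimal sketch of loading partitions on demand. Everything in it is an assumption: the partition names, the file names and `createPartitionLoader` are made up, and it presumes the report could run with some partitions still missing, which it can't do right now.

```javascript
// Sketch: fetch each data partition at most once, the first time it is requested.
// `partitionFiles` maps a hypothetical partition name to a file URL.
// `fetchFn` is injectable for testing; in a browser it defaults to the global fetch.
function createPartitionLoader(partitionFiles, fetchFn = fetch) {
  const cache = new Map(); // partition name -> Promise<string>

  return function loadPartition(name) {
    if (!Object.prototype.hasOwnProperty.call(partitionFiles, name)) {
      return Promise.reject(new Error(`Unknown partition: ${name}`));
    }
    if (!cache.has(name)) {
      const promise = fetchFn(partitionFiles[name]).then((response) => {
        if (!response.ok) {
          throw new Error(`Failed to load ${name}: HTTP ${response.status}`);
        }
        return response.text();
      });
      // Forget failed downloads so a later call can retry.
      promise.catch(() => cache.delete(name));
      cache.set(name, promise);
    }
    return cache.get(name);
  };
}

// Example wiring (hypothetical names):
//   const loadPartition = createPartitionLoader({
//     core: "report-core.txt",
//     words: "report-words.txt",
//     domains: "report-domains.txt",
//   });
//   // When the "Language" tab is opened:
//   //   loadPartition("words").then((words) => { /* render the tab */ });
```

Caching the promise rather than the result means two tabs asking for the same partition at the same time still trigger only one download.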

hopperelec avatar Jan 07 '24 03:01 hopperelec