conflux

Parsing zip files in a streaming manner

Open · jimmywarting opened this issue 6 years ago • 2 comments

...without reading the whole content of the zip file

I think we need to read it backwards: read the last bytes, which tell how large the central directory is, then read that many bytes. (not exactly sure)
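Roughly, that backwards read could look like this. A minimal sketch assuming a Blob/File input; findEOCD is a made-up name, and the end-of-central-directory record is located by its signature 0x06054b50:

const EOCD_SIG = 0x06054b50

async function findEOCD (blob) {
  // the EOCD record is 22 bytes plus a variable-length comment
  // (max 65535 bytes), so scanning the last 65557 bytes always suffices
  const tail = blob.slice(Math.max(0, blob.size - 65557))
  const view = new DataView(await tail.arrayBuffer())
  for (let i = view.byteLength - 22; i >= 0; i--) {
    if (view.getUint32(i, true) === EOCD_SIG) {
      return {
        entryCount: view.getUint16(i + 10, true),
        centralDirSize: view.getUint32(i + 12, true),
        centralDirOffset: view.getUint32(i + 16, true)
      }
    }
  }
  throw new Error('end of central directory record not found')
}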

old stuff

A basic idea of how it could work:

class ZipEntry {
  name = ''
  offset = 0
  size = 0 // or length
  zip64 = false // true when the entry uses the zip64 format
  comment = ''
  stream () {
    // returns a new stream that reads from-to something and inflates
  }
}

// some kind of zip parser based on ReadableStream that enqueues ZipEntries?
new ReadableStream({
  async start () {
    // return a promise of something
  }, 
  async pull (ctrl) {
    // the user is ready to get the next entry
    // read/parse central directory or something
    ctrl.enqueue(new ZipEntry(...))
  }
}).pipeTo(new WritableStream({
  write (zipEntry) {
    // do something with zip entry
    if (zipEntry.name.includes('.css')) {
      // new Response(zipEntry.stream()).blob().then(URL.createObjectURL).then(appendToDOM)
    } 
  }
}))
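The stream() method on ZipEntry could boil down to slicing the underlying file and inflating on the fly. A rough sketch as a standalone function; all the field names here are assumptions of mine, not decided api:

// assumed fields:
//   entry.file           Blob-like source of the whole zip
//   entry.dataOffset     where this entry's compressed bytes start (past the local header)
//   entry.compressedSize taken from the central directory
//   entry.method         0 = stored (copied), 8 = deflated
function entryStream (entry) {
  const raw = entry.file
    .slice(entry.dataOffset, entry.dataOffset + entry.compressedSize)
    .stream()
  // inflate deflated entries with the built-in DecompressionStream,
  // pass stored entries through untouched
  return entry.method === 8
    ? raw.pipeThrough(new DecompressionStream('deflate-raw'))
    : raw
}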
  • [ ] decide on a public api
  • [x] read zip64 format (zip64)
    • [x] read from the central zip64 dir?
  • [x] read compressed data by inflating (#30)
  • [x] read from the central dir
  • [x] read copied data (non-deflated, i.e. stored data)
  • [ ] read from a readableStream (start to end)
  • [ ] prepended data (a.zip + b.zip)
  • [ ] read encrypted entries

jimmywarting · Jul 03 '19 15:07

I have 3 proposals for how we could create our read api. First I would like to give some background so you can understand some of the obstacles of reading a zip file.

The best solution is to read the end of the file and be able to seek/jump to different places multiple times. To do that we need to know the length of the file, either from a blob/file size, a content-length header, or some sort of FileLike object. (Much like what zip.js did with its TextReader, BlobReader, and HttpReader. But unlike zip.js we would read content with readable streams instead of doing multiple FileReader reads or range requests, and our reader class could be a FileLike object instead, providing us with the 3 required members: size, stream() and slice().)

Doing a slice on http would just clone a Request object and change its Range header.
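For example, a hypothetical HttpFile could look like this (none of it decided api; it assumes the server honors Range requests and that the size was learned up front, e.g. from a HEAD request or a content-length header):

class HttpFile {
  constructor (url, size, offset = 0) {
    this.url = url
    this.size = size      // size of this slice
    this.offset = offset  // absolute position in the remote file
  }
  slice (start = 0, end = this.size) {
    // like Blob#slice: relative to this slice, no bytes transferred yet
    return new HttpFile(this.url, end - start, this.offset + start)
  }
  stream () {
    const { readable, writable } = new TransformStream()
    fetch(this.url, {
      // the Range end is inclusive, hence the -1
      headers: { Range: `bytes=${this.offset}-${this.offset + this.size - 1}` }
    }).then(res => res.body.pipeTo(writable))
    return readable
  }
}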

But you don't always know the zip's content length, so sometimes you need to get the whole body before you are able to read anything. For example:

  • our zip writer produces a ReadableStream and the size is unknown.
  • a remote zip isn't always able to accept a partial request and may have no content-length information;
    getting a zip from any GitHub master repo gives you a streamable zip and the size is unknown

So if the size (in our FileLike object) is not a number (it could be NaN, null, or undefined) then we would read the content using only stream(), from start to end, and never use slice(). (It will be less practical, but it could work.) Alternatively, the reader could accept two types of objects (FileLike or ReadableStream).
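That dispatch could be as simple as this (readSeekable and readSequential are hypothetical helpers, not decided api):

function read (source) {
  // the seekable path needs a finite size plus slice()/stream()
  if (Number.isFinite(source.size) && typeof source.slice === 'function') {
    return readSeekable(source) // parse the central directory at the end
  }
  // otherwise walk the local file headers from start to end
  return readSequential(source instanceof ReadableStream ? source : source.stream())
}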

// using an async iterator (current form)
for await (const entry of read(blob || readableStream)) {
  console.log(entry)
}

// just an iterator that returns promises
for (const it of read(blob || readableStream)) {
  const { value: entry, done } = await it
}

read(blob || readableStream).pipeTo(new WritableStream({
  write (entry) {
    console.log(entry)
  }
}))

// this could be fine for a blob object but not so much for streams
const entries = await read(blob || readableStream)
const entry = entries[0]
console.log(entry)

jimmywarting · Jul 12 '19 09:07

FYI, I have figured out how to read the zip64 format now (a bit more complicated, but I can grasp it now). I have almost succeeded in reading a zip64 file correctly; I just have to get the right (size) information from the "extraFields".
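For anyone following along: the real sizes live in the zip64 extended information extra field (header id 0x0001). A rough sketch, assuming extra is a DataView over the raw extra-field bytes and entry holds the 32-bit values already read from the central directory:

function readZip64Extra (extra, entry) {
  for (let i = 0; i + 4 <= extra.byteLength;) {
    const id = extra.getUint16(i, true)
    const size = extra.getUint16(i + 2, true)
    if (id === 0x0001) { // zip64 extended information
      let p = i + 4
      // each 64-bit field is only present when the corresponding
      // 32-bit value in the central directory overflowed (0xFFFFFFFF)
      if (entry.size === 0xFFFFFFFF) {
        entry.size = Number(extra.getBigUint64(p, true)); p += 8
      }
      if (entry.compressedSize === 0xFFFFFFFF) {
        entry.compressedSize = Number(extra.getBigUint64(p, true)); p += 8
      }
      if (entry.offset === 0xFFFFFFFF) {
        entry.offset = Number(extra.getBigUint64(p, true))
      }
    }
    i += 4 + size
  }
  return entry
}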

When I manage to read it I can start working on making a zip64 file.

jimmywarting · Jul 24 '19 21:07