conflux
Parsing zip files in a streaming manner
...without reading the whole content of the zip file
I think we need to read it backwards: read the last bytes, which tell how large the central directory is, then read that many bytes. (Not exactly sure.)
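Roughly, that is how the format works: every zip ends with an End of Central Directory (EOCD) record holding the size and offset of the central directory. A minimal sketch, assuming the file has no trailing zip comment (so the EOCD record is exactly the last 22 bytes):

```js
async function findCentralDirectory (blob) {
  // read just the tail of the file; a real reader would scan backwards for
  // the signature to cope with trailing comments (up to 64 KiB)
  const eocd = new DataView(await blob.slice(blob.size - 22).arrayBuffer())
  if (eocd.getUint32(0, true) !== 0x06054b50) { // EOCD signature 'PK\x05\x06'
    throw new Error('EOCD not found (the file probably has a comment)')
  }
  return {
    size: eocd.getUint32(12, true), // byte size of the central directory
    offset: eocd.getUint32(16, true) // where the central directory starts
  }
}
```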
old stuff
A basic idea of how it could work:
```js
class ZipEntry {
  name = ''
  offset = 0
  size = 0 // or length
  zip64 = false // whether the entry uses the zip64 format
  comment = ''
  stream () {
    // returns a new stream that reads from offset to offset + size and inflates
  }
}
```
```js
// some kind of zip parser based on ReadableStream that enqueues ZipEntries?
new ReadableStream({
  async start () {
    // return a promise of some setup work
  },
  async pull (ctrl) {
    // the consumer is ready for the next entry:
    // read/parse the next central directory record or something
    ctrl.enqueue(new ZipEntry(...))
  }
}).pipeTo(new WritableStream({
  write (zipEntry) {
    // do something with the zip entry
    if (zipEntry.name.includes('.css')) {
      // new Response(zipEntry.stream()).blob().then(URL.createObjectURL).then(appendToDOM)
    }
  }
}))
```
- [ ] decide upon a public API
- [x] read the zip64 format
- [x] read from the zip64 central directory?
- [x] read compressed data by inflating (#30) (see the sketch after this list)
- [x] use pako?
- [x] read from the central directory
- [x] read copied data (non-deflated data)
- [ ] read from a ReadableStream (start to end)
- [ ] prepended data (a.zip + b.zip)
- [ ] read encrypted entries
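For the inflate item above, a hedged sketch of what decompressing a single entry could look like with the native DecompressionStream (pako being the fallback where it's unavailable); `rawBlob` is assumed to be a slice of the zip holding just that entry's compressed bytes:

```js
function inflate (rawBlob) {
  // zip entries use raw deflate (no zlib header), hence 'deflate-raw';
  // copied (stored) entries would skip this transform entirely
  return rawBlob.stream().pipeThrough(new DecompressionStream('deflate-raw'))
}

// e.g. new Response(inflate(rawBlob)).text().then(console.log)
```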
I have 3 proposals for how we could create our read API, but first I would like to give some background so you can understand some of the obstacles of reading a zip file.
The best solution is to read the end of the file and be able to seek/jump to different places multiple times. To do that we need to know the length of the file, either from a blob/file size, from a content-length header, or from some sort of FileLike object.
(Doing something like zip.js did with its TextReader, BlobReader, and HttpReader. But unlike zip.js we would read content with a readable stream instead of doing multiple FileReads or range requests, and our reader class could instead be a FileLike object providing the 3 required members: size, stream() and slice().)
Doing a slice over HTTP would just clone a Request object and change the Range header.
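A hedged sketch of that idea (the names are illustrative, not a settled API): an HTTP-backed FileLike where slice() only narrows the byte window and stream() issues a single Range request:

```js
class HttpFileLike {
  constructor (url, start, end) {
    this.url = url
    this.start = start
    this.end = end
  }

  get size () {
    return this.end - this.start
  }

  static async from (url) {
    // one HEAD request up front to learn the total length
    const res = await fetch(url, { method: 'HEAD' })
    return new HttpFileLike(url, 0, Number(res.headers.get('content-length')))
  }

  slice (start, end) {
    // like Blob#slice, relative to the current window; no request is made yet
    return new HttpFileLike(this.url, this.start + start, this.start + end)
  }

  stream () {
    // one Range request for exactly this window (byte ranges are inclusive)
    const { readable, writable } = new TransformStream()
    fetch(this.url, { headers: { Range: `bytes=${this.start}-${this.end - 1}` } })
      .then(res => res.body.pipeTo(writable))
    return readable
  }
}
```

Since slice() stays lazy, seeking around the central directory costs one range request per actual read.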
But you don't always know the zip's content length, so you have to get the whole body before you can read anything. For example:

- our zip writer produces a ReadableStream and the size is unknown
- a remote server isn't always able to accept a partial request and may provide no content-length information
- getting a zip from any GitHub master repo gives you a streamable zip whose size is unknown

So if the size (in our FileLike object) is not a number (could be NaN, null, or undefined), then we would read the content using only stream(), from start to end, and never use slice(). (It will be less practical, but it could work.) Or the reader could accept two types of objects (FileLike or ReadableStream), as sketched below.
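A small sketch of that dispatch (readSequentially and readFromCentralDirectory are hypothetical helpers, not existing functions):

```js
function read (input) {
  if (input instanceof ReadableStream || !Number.isFinite(input.size)) {
    // no known size: walk the local file headers from start to end
    return readSequentially(input) // hypothetical
  }
  // known size: jump to the EOCD record, then read the central directory
  return readFromCentralDirectory(input) // hypothetical
}
```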
```js
// using an async iterator (current form)
for await (const entry of read(blob || readableStream)) {
  console.log(entry)
}
```

```js
// just an iterator that returns promises
for (const it of read(blob || readableStream)) {
  const { value: entry, done } = await it
}
```

```js
// a ReadableStream piped to a WritableStream
read(blob || readableStream).pipeTo(new WritableStream({
  write (entry) {
    console.log(entry)
  }
}))
```

```js
// this could be fine for a blob but not so much for streams
const entries = await read(blob || readableStream)
const entry = entries[0]
console.log(entry)
```
FYI, I have figured out how to read the zip64 format now (it's a bit more complicated, but I can grasp it). I have almost succeeded in reading a zip64 file correctly; I just have to get the right (size) information from the "extraFields".
When I have managed to read it, I can start working on making a zip64 file.
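For reference, a hedged sketch of pulling those sizes out of the zip64 extended information extra field (header id 0x0001); `extra` is assumed to be a DataView over the entry's extra-field bytes, and the property names on `entry` are illustrative:

```js
function readZip64ExtraField (extra, entry) {
  for (let i = 0; i + 4 <= extra.byteLength;) {
    const id = extra.getUint16(i, true)
    const len = extra.getUint16(i + 2, true)
    if (id === 0x0001) { // zip64 extended information field
      let j = i + 4
      // 64-bit values appear in this fixed order, but only for fields whose
      // 32-bit counterparts in the regular header are 0xffffffff
      if (entry.size === 0xffffffff) { entry.size = Number(extra.getBigUint64(j, true)); j += 8 }
      if (entry.compressedSize === 0xffffffff) { entry.compressedSize = Number(extra.getBigUint64(j, true)); j += 8 }
      if (entry.offset === 0xffffffff) { entry.offset = Number(extra.getBigUint64(j, true)); j += 8 }
    }
    i += 4 + len
  }
  return entry
}
```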