
Add ExternalReader that delegates all reading to an external function

Open mbklein opened this issue 3 years ago ā€¢ 5 comments

Description

This PR adds an option, externalReader, which takes over the task of reading files/chunks from the built-in readers.

Example

In the following example, the input to exifr is an object pointing to a file stored in an Amazon S3 bucket. The custom reader function uses the AWS SDK to grab the chunks exifr needs. With this functionality, the input given to exifr can be literally anything, as long as the supplied externalReader function knows what to do with it.

const AWS = require("aws-sdk");
const exifr = require("exifr");
const s3 = new AWS.S3();

// Reads a chunk (or the whole object) from S3. When exifr asks for a
// byte range, translate it into an HTTP Range header; byte ranges are
// inclusive, so the last byte is offset + length - 1.
function chunkReader(input, offset, length) {
  return new Promise((resolve, _reject) => {
    let params = {...input};

    if (typeof offset === 'number') {
      let end = length ? offset + length - 1 : undefined;
      params.Range = `bytes=${[offset, end].join('-')}`;
    }

    s3.getObject(params, (err, data) => {
      if (err) {
        // Resolve with undefined so exifr can fail gracefully.
        console.error(err);
        resolve(undefined);
      } else {
        resolve(data.Body);
      }
    });
  });
}

exifr.parse({ Bucket: "my-aws-bucket", Key: "path/to/some.tif" }, { externalReader: chunkReader })
  .then((exif) => console.log(exif))
  .catch((err) => console.error(err));
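As a side note on the Range computation above: HTTP byte ranges are inclusive, and leaving the end empty requests everything from the offset onward. The same logic, pulled out on its own (rangeHeader is just an illustrative name, not part of the PR):

```javascript
// Build an HTTP Range header value for an inclusive byte range.
// A missing length means "from offset to the end of the object".
function rangeHeader(offset, length) {
  const end = length ? offset + length - 1 : undefined;
  return `bytes=${[offset, end].join('-')}`;
}

console.log(rangeHeader(0, 512));  // the first 512 bytes
console.log(rangeHeader(1024));    // everything from byte 1024 onward
```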

mbklein avatar Mar 09 '21 18:03 mbklein

@MikeKovarik Just popping in to say that we've been using this code in production for a couple weeks now, and have used it to process more than 200,000 TIFF images of various sizes from 100MB up to almost 4GB. It's been rock solid and remarkably fast.

mbklein avatar Mar 21 '21 14:03 mbklein

Hello @mbklein. I'm so sorry for responding this late. Glad to hear it's working for you. It's impressive!

I actually saw the PR on the day you submitted it and was about to merge it right away. You've hit on a great idea with the externalReader, and I really like the direct use of AWS instead of just making UrlFetcher work in nodejs. But because of this shift I started thinking about improving the idea further, since the user-facing API could be better aligned with how exifr does modularity. So I got to work, changing a few things here and there, thinking "I'll wait with responding until there's something to bring to the table" and "I'll only do this one more change". And then the weekend came. And the next week. And here we are.

Anyway, back to the externalReader. I really like the idea but don't think that passing it through the options object is the ideal way. Exifr already has groundwork for modularity in plugins.mjs, in the form of Maps to which all the loaded parsers/readers are added (like this: fileReaders.set('url', UrlFetcher)).

My idea is to implement a custom reader class and add it to exifr's plugin mechanism. This way, I could also just ship the aws parser. Maybe I could finally turn exifr into a monorepo and break it up into separate packages like exifr-parser-jpeg and exifr-reader-aws. All reader classes, such as UrlFetcher, inherit from ChunkedReader. The canHandle method is used to verify that the input is parseable by the reader. When exifr.parse(input) is called, exifr goes through all loaded readers to pick the one best suited for the input format. If that's an object with Key and Bucket, then the AwsReader will be used without the need to pass an options object.

So this is the API I came up with but am open to suggestions.

import exifr from '../src/bundles/full.mjs'

export class ExternalReader extends exifr.ChunkedReader {

  // Read the entire file in one call and swap it in as the backing buffer.
  async readWhole() {
    this.chunked = false
    let arrayBuffer = await this._read()
    this._swapArrayBuffer(arrayBuffer)
  }

  // Read a single chunk; a short read means we hit the end of the file,
  // so the total size is now known.
  async _readChunk(offset, length) {
    let abChunk = await this._read(offset, length)
    let bytesRead = abChunk.byteLength
    if (bytesRead !== length) this.size = offset + bytesRead
    return this.set(abChunk, offset, true)
  }

}

class AwsReader extends ExternalReader {

  static canHandle(input) {
    return input !== undefined
      && typeof input.Bucket === 'string'
      && typeof input.Key === 'string'
  }

  _read(offset, length) {
    // the aws code here
  }

}

exifr.use(AwsReader)
// or
fileReaders.set('aws', AwsReader)

exifr.parse({ Bucket: "my-aws-bucket", Key: "path/to/some.tif" })
  .then(console.log)
  .catch(console.error)

For now it's still open to changes, since I'm still changing the logic of ChunkedReader, specifically the readWhole, readChunked and _readChunk methods. These were internal methods until now, but they would need to make more sense if they are to be implemented by the user.
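To make the _read contract in the sketch above concrete: called with no arguments it should resolve with the whole payload, and called with (offset, length) it should resolve with that byte range, possibly shortened at end of file. A self-contained illustration backed by an in-memory Buffer (makeBufferRead is hypothetical, not part of exifr):

```javascript
// Hypothetical _read implementation over an in-memory Buffer,
// illustrating the contract assumed by ExternalReader:
//  - no arguments      -> resolves with the whole payload
//  - (offset, length)  -> resolves with that slice, clamped at end-of-file
function makeBufferRead(buffer) {
  return function _read(offset, length) {
    if (typeof offset !== 'number') return Promise.resolve(buffer);
    const end = length ? Math.min(offset + length, buffer.length) : buffer.length;
    return Promise.resolve(buffer.subarray(offset, end));
  };
}

const read = makeBufferRead(Buffer.from('hello world'));
read(6, 5).then((chunk) => console.log(chunk.toString()));
```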


PS: This is off topic now, but another thing that delayed me was exploring other ways of making UrlFetcher work in nodejs. I ended up with a simple shim of fetch implemented with node's built-in http and https modules. Here's a snippet:

// dynamicImport, platform.g and concatBuffers are helpers from exifr's internals
let httpPromise  = dynamicImport('http',  http => http)
let httpsPromise = dynamicImport('https', https => https)

let fetch = platform.g.fetch || function(url, {headers}) {
  let {port, hostname, pathname, protocol} = new URL(url)
  const options = {
    method: 'GET',
    hostname,
    path: encodeURI(pathname),
    headers
  }
  if (port !== '') options.port = Number(port)
  return new Promise(async (resolve, reject) => {
    let lib = protocol === 'https:' ? await httpsPromise : await httpPromise
    const req = lib.request(options, res => {
      resolve({
        status: res.statusCode,
        arrayBuffer: () => new Promise(resolveAb => {
          let buffers = []
          res.on('data', buffer => buffers.push(buffer))
          res.on('end', () => resolveAb(concatBuffers(buffers)))
        })
      })
    })
    req.on('error', reject)
    req.end()
  })
}
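The snippet above leans on a few helpers from exifr's internals (dynamicImport, platform.g, concatBuffers). For reference, a concatBuffers along these lines would fit the arrayBuffer() usage shown; this is a guess at the helper's shape, not exifr's actual code:

```javascript
// Joins the Buffer chunks collected from the response stream into a
// single Uint8Array, as the shim's arrayBuffer() expects.
function concatBuffers(buffers) {
  const size = buffers.reduce((sum, b) => sum + b.length, 0);
  const out = new Uint8Array(size);
  let offset = 0;
  for (const b of buffers) {
    out.set(b, offset);
    offset += b.length;
  }
  return out;
}
```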

MikeKovarik avatar Mar 24 '21 06:03 MikeKovarik

I really like where you've gone with this, especially the idea of breaking up the repo and allowing people to require only the bits they need. It would certainly make it way easier to extend – just write and release a module! – over the current all-in-one.

One small suggestion I have would be to allow the caller to pass the name of a registered reader (or something) when they call parse() (e.g., exifr.parse(input, { reader: 'aws' })), allowing them to bypass all the decision logic if they know they want to use a specific implementation. I like "just do the right thing" code, but I also appreciate when there's a way to impose my will on the process. 😄
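One way that override might look internally: an explicit option wins, otherwise fall back to scanning canHandle. This is a sketch only; pickReader, the fileReaders Map argument, and the reader option name are assumptions, not exifr's actual code:

```javascript
// Hypothetical reader selection: an explicit options.reader key bypasses
// detection, otherwise each registered reader is asked via canHandle().
function pickReader(fileReaders, input, options = {}) {
  if (options.reader) {
    const Reader = fileReaders.get(options.reader);
    if (!Reader) throw new Error(`Unknown reader: ${options.reader}`);
    return Reader;
  }
  for (const Reader of fileReaders.values()) {
    if (Reader.canHandle(input)) return Reader;
  }
  throw new Error('No reader can handle this input');
}
```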

mbklein avatar Mar 24 '21 14:03 mbklein

Hi @MikeKovarik – just thought I'd check in and see if there's anything I can do to help with where you'd like this PR/issue/refactor to go. Our application is chugging along nicely here using our fork, but it'd be great to get back to the main release at some point.

mbklein avatar May 06 '21 14:05 mbklein

Any news on this? 🤓

schonert avatar Sep 21 '22 11:09 schonert