smqtk-check-images doesn't check against content types

Open danlamanna opened this issue 8 years ago • 8 comments

While this is possible, we have to think of a way the user can pass this in.

Right now it just checks that PIL can load the image. However, those using Caffe to generate descriptors might want to ensure the images are loadable and also have a type listed in CaffeDescriptorGenerator.valid_content_types.

@Purg Do you think this should be passed in just as a comma-separated list of mime types, or as something more integrated with descriptor generators?

danlamanna avatar Mar 17 '17 18:03 danlamanna

They could technically both be options, but the practical purpose of the script is to sanitize input before going to a DescriptorGenerator instance, so I think it makes more sense to specify a DescriptorGenerator implementation to sanitize for. Of course, some descriptor generators may not work on images, so that's a point against that idea.

Purg avatar Mar 17 '17 18:03 Purg

Are we using libmagic for just the three file types JPG, PNG, and TIFF? If so, maybe we could just check for those directly by reading the byte header?

fishcorn avatar Mar 17 '17 21:03 fishcorn

The reason that I bring this up is that the last file-magic I tried made things crash because Python 2.7.6 doesn't like str('r',blah) or whatever (I assume it wants str(blah) only). If we're only checking a few bytes, why not just do this directly instead of relying on a pip install that might break?
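
A minimal sketch of the direct header check suggested here, assuming only the three formats named above matter. The function name sniff_image_type is hypothetical and not part of SMQTK; the byte prefixes are the standard JPEG, PNG, and TIFF signatures.

```python
# Magic-number prefixes for the three formats discussed in this thread.
_SIGNATURES = (
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
)


def sniff_image_type(data):
    """Return a MIME type guessed from the leading bytes, or None.

    Hypothetical helper: it inspects only the header, so a truncated or
    corrupt file with a valid prefix would still be reported as that type.
    """
    for prefix, mime in _SIGNATURES:
        if data.startswith(prefix):
            return mime
    return None
```

Reading the first 8 bytes of a file is enough for these signatures, so the check could sit alongside the existing PIL load test rather than replace it.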

fishcorn avatar Mar 17 '17 21:03 fishcorn

I'm not sure what error you're describing there with pip-installing file-magic. Maybe in some cases we can use some specific file tricks, but in general SMQTK is a data-agnostic framework. The png/jpeg/tiff specificity is only from the caffe DescriptorGenerator implementation (it can really support more types, I just never enumerated all of them out). Primarily what we have been working with recently has been images, but we have worked more with videos in the past and have entertained using text and sound files.

In regards to issues with file-magic and mime-type discovery, that itself can be an algorithm interface with multiple implementations (file-magic, py-tika, etc.) to be configured to use with a more general tool that can check files as needed by a certain descriptor generator.

I guess my point here is that from this issue we could craft a more general tool that can check more than just images with respect to a configured DescriptorGenerator implementation's needs.

Purg avatar Mar 17 '17 22:03 Purg

> In regards to issues with file-magic and mime-type discovery, that itself can be an algorithm interface with multiple implementations (file-magic, py-tika, etc.) to be configured to use with a more general tool that can check files as needed by a certain descriptor generator.
>
> I guess my point here is that from this issue we could craft a more general tool that can check more than just images with respect to a configured DescriptorGenerator implementation's needs.

I think this is a good idea if I catch your drift. It seems like you're suggesting exposing an algo interface that would expose a predicate (or something similar) that would approve an input for descriptor generation. The whole plugin (caffe in this case) would be in charge of implementing a CaffeInputValidator (say) for a CaffeDescriptorGenerator.

If this is the case, I think that's wise, simply because it gives the responsibility for validating inputs to the entity that wants them validated.

fishcorn avatar Mar 18 '17 15:03 fishcorn

Basically, but I meant a little more general. The DescriptorGenerator interface already has a method to get its "valid" data mimetypes, so a script can use that as input to validate a number of files (or other DataElements) via a mimetype detector (a new algorithm interface defining a, say, detect(bytes) method), whose implementations could include file-magic and tika to start.
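
A rough sketch of what such an interface might look like, under stated assumptions: MimetypeDetector, HeaderDetector, and filter_valid are hypothetical names, not existing SMQTK classes, and the valid_types argument stands in for whatever a descriptor generator reports as its valid content types. A file-magic or tika backed implementation would subclass the same interface.

```python
import abc
from typing import Iterable, List, Optional, Set


class MimetypeDetector(abc.ABC):
    """Hypothetical algorithm interface: guess a MIME type from raw bytes."""

    @abc.abstractmethod
    def detect(self, data: bytes) -> Optional[str]:
        """Return a MIME type string, or None if the type is unknown."""


class HeaderDetector(MimetypeDetector):
    """Toy implementation that only knows the PNG and JPEG signatures."""

    def detect(self, data: bytes) -> Optional[str]:
        if data.startswith(b"\x89PNG\r\n\x1a\n"):
            return "image/png"
        if data.startswith(b"\xff\xd8\xff"):
            return "image/jpeg"
        return None


def filter_valid(blobs: Iterable[bytes], detector: MimetypeDetector,
                 valid_types: Set[str]) -> List[bytes]:
    """Keep only the blobs whose detected type is in valid_types."""
    return [b for b in blobs if detector.detect(b) in valid_types]
```

The script then needs only two inputs: a configured detector and the set of types the target generator accepts.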

Purg avatar Mar 20 '17 14:03 Purg

I think that the problem with that approach (if you rely only on the mimetype detector) is that the mimetype detector is then responsible for detecting inputs that aren't covered by mimetypes (such as any brand-new file type or specialty type).

What might be better is to go ahead and keep the mime-type detector around as a convenience, but let the plugin fall back to its own detection.
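
Something like the following, as a sketch only. The names is_acceptable, toy_detect, and plugin_check are all hypothetical, and the "MYFMT" signature is a made-up specialty format used purely for illustration.

```python
def is_acceptable(data, valid_types, detect, plugin_check=None):
    """Approve data by MIME type first, then by the plugin's own check.

    detect: callable mapping bytes to a MIME type string or None.
    plugin_check: optional callable mapping bytes to bool, supplied by
        the plugin for inputs that MIME types do not cover.
    """
    mime = detect(data)
    if mime is not None and mime in valid_types:
        return True
    # The generic detector did not approve the input; defer to the plugin.
    if plugin_check is not None:
        return bool(plugin_check(data))
    return False


def toy_detect(data):
    """Stand-in detector that only recognizes PNG."""
    return "image/png" if data.startswith(b"\x89PNG\r\n\x1a\n") else None


def looks_like_myfmt(data):
    """Plugin-side check for a made-up specialty format."""
    return data.startswith(b"MYFMT")
```

With this shape, the mime-type detector handles the common cases and the plugin is consulted only when the generic check does not approve the input.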

fishcorn avatar Mar 20 '17 17:03 fishcorn

I think we're talking about separate things. There is input validation, which is the responsibility of the thing consuming the data (e.g. the CaffeDescriptorGenerator), but there is also mimetype detection, which was the point of this thread and the script in question. A .caffemodel is a special form of an application/octet-stream, but it's still an application/octet-stream. I also don't think that special file types (e.g. ".caffemodel" or ".npy") are used as data inputs in the same way that jpeg or mp4 files would be.

Purg avatar Mar 20 '17 17:03 Purg