Figure out artifact extensions using Apache Tika

Open carlspring opened this issue 8 years ago • 4 comments

We need to be able to figure out the extensions of artifact files. It seems that Apache Tika is one way to do this. Doing a simple path.substring(path.lastIndexOf('.') + 1) is not good enough, because some extensions are compound, for example tar.gz, tar.bz2 and the like.
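To see why the last-dot approach falls short, here is a minimal sketch. The class and method names are made up for illustration and are not part of strongbox.

```java
// Hypothetical helper, for illustration only: takes whatever follows the last dot.
final class NaiveExtension {

    // Returns the text after the last '.', or "" when the name has no dot.
    static String lastDotExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(lastDotExtension("foo-1.2.3.jar"));    // jar
        System.out.println(lastDotExtension("foo-1.2.3.tar.gz")); // gz (we wanted tar.gz)
    }
}
```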

This will probably have to be implemented in the LayoutProvider implementations.

carlspring avatar Oct 25 '17 00:10 carlspring

Assuming I create a method findExtension(Argument: argument):

  1. What should the return type be? A String with the name of the extension, e.g. "xlsx" or "tar.gz"?
  2. What should the input argument be: a single file, or a list of files (in which case it would return a list or a HashMap<fileName, extension>)?

dinesh19aug avatar Dec 09 '17 04:12 dinesh19aug

I think it should be a method in 'RepositoryPath' that takes no parameters and returns the extension as a string.

sbespalov avatar Dec 09 '17 04:12 sbespalov

Assuming I create a method findExtension(Argument: argument):

  1. What should the return type be? A String with the name of the extension, e.g. "xlsx" or "tar.gz"?

Yes. Examples:

  • foo-1.2.3.tar.gz --> tar.gz
  • foo-1.2.3.gz --> gz
  • foo-1.2.3.tar.bz2 --> tar.bz2
  • foo-1.2.3.jar --> jar
  • foo-1.2.3.zip --> zip
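For reference, the mapping in these examples could be produced purely from the file name with something like the sketch below. This is name-based only (no Tika), the class name and the list of compound extensions are assumptions for illustration, and the list would need to grow to cover other formats.

```java
import java.util.Locale;

// Hypothetical name-based guesser; not an existing strongbox class.
final class ExtensionGuesser {

    // Assumed set of compound extensions; checked before the last-dot fallback.
    private static final String[] COMPOUND_EXTENSIONS = { "tar.gz", "tar.bz2" };

    static String guess(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String ext : COMPOUND_EXTENSIONS) {
            if (lower.endsWith("." + ext)) {
                return ext;
            }
        }
        // Fallback: whatever follows the last dot, or "" if there is nothing usable.
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1);
    }
}
```

Note that a name with no real extension, such as foo-1.2.3, would come back as "3" here, which is one reason content-based detection is attractive.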

  2. What should the input argument be: a single file, or a list of files (in which case it would return a list or a HashMap<fileName, extension>)?

As far as I'm aware, Apache Tika works with InputStreams. What I would like is this:

  • We currently have ArtifactInputStream (this is what is used to read the artifacts; it also calculates the artifact's checksums in real time).
  • It would be great if we could somehow use Tika from here while the artifact is being read (when it's being deployed to strongbox), instead of having to re-read it once it's written to the underlying file system. (My understanding is that Tika normally needs to read the InputStream once it's already been written to the underlying file system.)
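A rough sketch of the "detect while reading" idea, using only the JDK. To be clear about the assumptions: this does not use Tika and is not the real ArtifactInputStream. It swaps in simple magic-byte sniffing, and HeaderSniffingInputStream, detectedType and sniff are hypothetical names. The wrapper remembers the first few bytes as they pass through, so no second read of the stored file is needed.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical wrapper: captures the leading bytes of a stream while it is consumed.
final class HeaderSniffingInputStream extends FilterInputStream {

    private static final int HEADER_SIZE = 8;
    private final byte[] header = new byte[HEADER_SIZE];
    private int captured = 0;

    HeaderSniffingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0 && captured < HEADER_SIZE) {
            header[captured++] = (byte) b;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0 && captured < HEADER_SIZE) {
            int toCopy = Math.min(n, HEADER_SIZE - captured);
            System.arraycopy(buf, off, header, captured, toCopy);
            captured += toCopy;
        }
        return n;
    }

    // mark/reset would replay bytes and confuse the capture, so it is disabled here.
    @Override
    public boolean markSupported() {
        return false;
    }

    // Compares the captured header against a few well-known signatures.
    String detectedType() {
        if (startsWith(0x1F, 0x8B)) return "gzip";              // gzip magic
        if (startsWith(0x42, 0x5A, 0x68)) return "bzip2";       // "BZh"
        if (startsWith(0x50, 0x4B, 0x03, 0x04)) return "zip";   // "PK\3\4" (also jar)
        return "unknown";
    }

    private boolean startsWith(int... magic) {
        if (captured < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((header[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    // Convenience for trying it out: consumes the data the way a deploy would.
    static String sniff(byte[] data) {
        try (HeaderSniffingInputStream in =
                 new HeaderSniffingInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // A real caller would write buf to storage here.
            }
            return in.detectedType();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The leading bytes alone cannot tell tar.gz from gz, or jar from zip, so a sketch like this would still have to be combined with the file name or a deeper content check.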

We will need this when storing information about the artifact in the OrientDB database and the Maven Indexer. When storing an artifact, the extension in the respective ArtifactCoordinates (if this field is supported) will have to be updated. (At the moment, MavenArtifactCoordinates and NugetArtifactCoordinates have such a coordinate, which will have to be updated if the extension is guessed.)

carlspring avatar Dec 09 '17 11:12 carlspring

Got it. I can take this up and start working on it.

dinesh19aug avatar Dec 12 '17 04:12 dinesh19aug