Figure out artifact extensions using Apache Tika
We need to be able to figure out the extensions of artifact files, and Apache Tika seems to be one way to do this. A simple path.substring(path.lastIndexOf('.') + 1) is not good enough, because some extensions are compound, for example tar.gz, tar.bz2 and the like.
This will probably have to be implemented in the LayoutProvider implementations.
Assuming I create a method findExtension(Argument argument):
- What should the return type be? A String with the name of the extension, e.g. "xlsx" or "tar.gz"?
- What should the input argument be: a single file, or a list of files (in which case it would return a list or a HashMap<fileName, extension>)?
I think it should be a method in 'RepositoryPath' that takes no parameters and returns the extension as a string.
Assuming I create a method findExtension(Argument argument):
- What should the return type be? A String with the name of the extension, e.g. "xlsx" or "tar.gz"?
Yes. Examples:
foo-1.2.3.tar.gz --> tar.gz
foo-1.2.3.gz --> gz
foo-1.2.3.tar.bz2 --> tar.bz2
foo-1.2.3.jar --> jar
foo-1.2.3.zip --> zip
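To make the expected results concrete, here is a minimal sketch of a name-based fallback that produces the mappings above. It is not Tika and not existing strongbox code: the class name ExtensionGuesser, the method guess and the hand-maintained list of compound extensions are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper; the name and the extension list are illustrative only.
final class ExtensionGuesser {

    // Assumed set of compound extensions; a real list would be longer.
    private static final List<String> COMPOUND_EXTENSIONS =
            List.of("tar.gz", "tar.bz2", "tar.xz");

    private ExtensionGuesser() {
    }

    // Returns the lower-cased extension without a leading dot, or "" if none is found.
    static String guess(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);

        // Check compound extensions first, so "tar.gz" wins over "gz".
        for (String ext : COMPOUND_EXTENSIONS) {
            if (lower.endsWith("." + ext)) {
                return ext;
            }
        }

        // Otherwise fall back to whatever follows the last dot.
        int dot = lower.lastIndexOf('.');
        if (dot <= 0 || dot == lower.length() - 1) {
            return "";
        }
        return lower.substring(dot + 1);
    }
}
```

Note that a purely name-based guess has obvious limits: a file named foo-1.2.3 with no real extension would be reported as having the extension "3", which is one reason to look at the content as well.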
- What should the input argument be: a single file, or a list of files (in which case it would return a list or a HashMap<fileName, extension>)?
As far as I'm aware, Apache Tika works with InputStreams. What I would like is this:
- We currently have ArtifactInputStream (this is what is used to read the artifacts; it also calculates the artifact's checksums in real time). It would be great if we could somehow use Tika from here while the artifact is being read (when it's being deployed to strongbox), instead of having to re-read it once it's written to the underlying file system. (My understanding is that Tika normally needs to read the InputStream once it's already written to the underlying system.)
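The idea of detecting the type while the stream is being read, without a second pass, can be sketched with plain java.io mark/reset. This is only a stand-in for what Tika would do: the class MagicPeek, the method peekType and the three hard-coded signatures are assumptions for illustration, and nothing here is wired into ArtifactInputStream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper; a real implementation would delegate detection to Tika.
final class MagicPeek {

    private MagicPeek() {
    }

    // Peeks at the first bytes and rewinds, so the caller can still read the whole stream.
    static String peekType(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[4];
        int n;
        try {
            in.mark(head.length);
            n = in.readNBytes(head, 0, head.length);
            in.reset();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // gzip streams start with 0x1F 0x8B.
        if (n >= 2 && (head[0] & 0xFF) == 0x1F && (head[1] & 0xFF) == 0x8B) {
            return "gz";
        }
        // bzip2 streams start with the ASCII bytes "BZh".
        if (n >= 3 && head[0] == 'B' && head[1] == 'Z' && head[2] == 'h') {
            return "bz2";
        }
        // zip archives (and therefore jar files) start with "PK" 0x03 0x04.
        if (n >= 4 && head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4) {
            return "zip";
        }
        return "";
    }
}
```

The stream has to support mark/reset, for example by wrapping it in a BufferedInputStream. Two limits are worth keeping in mind: jar and zip share the same signature, and the first bytes of a gzip stream do not reveal whether it wraps a tar archive, so content detection alone will not distinguish tar.gz from gz.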
We will need this when storing information about the artifact in the OrientDB database and the Maven Indexer. When storing an artifact, the extension of the respective ArtifactCoordinates (if this field is supported) will have to be updated. (At the moment, MavenArtifactCoordinates and NugetArtifactCoordinates have such a coordinate, which will have to be updated if the extension is guessed.)
Got it. I can take this up and start working on it.