Writing CASes to a zip archive

Open daxenberger opened this issue 10 years ago • 20 comments

Originally reported on Google Code with ID 135

DKPro Core 1.6.1 will support writing to ZIP archives, e.g. with BinaryCasWriter. We
should make use of this feature:

[PreprocessingTask]

AnalysisEngineDescription writer = createEngineDescription(BinaryCasWriter.class,
BinaryCasWriter.PARAM_TARGET_LOCATION, "jar:file:" + root + "/archive.zip", 
BinaryCasWriter.PARAM_TYPE_SYSTEM_LOCATION, root + "/typesystem.bin",
BinaryCasWriter.PARAM_FORMAT, "6");

and likewise for the Meta- and FeatureExtractionTasks.

One problem remains: I am not sure whether this makes sense for BatchTaskCrossValidation,
where we (currently) need to split the overall set of files into folds (file sets)
that are then retrieved individually for each fold.

Reported by daxenberger.j on 2014-05-28 12:41:02

daxenberger avatar Jun 09 '15 14:06 daxenberger

"root" points to the path on the file system. Unless you have a strong reason to store
the type system outside the ZIP, I suggest you remove the "root" from PARAM_TYPE_SYSTEM_LOCATION
and just set it to "typesystem.bin" (no slash). Relative type system locations are
placed inside the ZIP - absolute locations are placed directly on the file system.

Reported by richard.eckart on 2014-05-28 12:42:45

daxenberger avatar Jun 09 '15 14:06 daxenberger

Thanks for the hint. I don't see a reason to store the typesystem outside the ZIP, so
the location should be relative.

Reported by daxenberger.j on 2014-05-28 12:47:58

daxenberger avatar Jun 09 '15 14:06 daxenberger

Reported by daxenberger.j on 2014-06-04 16:09:40

  • Labels added: Milestone-Release0.7.0

daxenberger avatar Jun 09 '15 14:06 daxenberger

I wonder, didn't we plan to do this in 0.6.0? 

Reported by richard.eckart on 2014-06-25 15:04:57

daxenberger avatar Jun 09 '15 14:06 daxenberger

Because of the problem mentioned in the first post: I'm not sure how to integrate this
with the current Crossvalidation BatchTask.

Reported by daxenberger.j on 2014-06-25 15:09:46

daxenberger avatar Jun 09 '15 14:06 daxenberger

Ah, I see. It shouldn't be a big problem but it is probably too much for the 0.6.0 release.


The basic principle should remain the same. We'd just need some extra code to extract
the file names for the folds from the ZIP instead of scanning them from the file system.
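
A minimal sketch of that idea, using only java.util.zip: list the entry names in the archive and distribute them over folds. The class name ZipFoldSplitter, the ".bin" suffix filter, and the round-robin assignment are assumptions for illustration, not how DKPro TC actually builds its folds.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of DKPro TC or DKPro Core.
final class ZipFoldSplitter {

    /** Returns the sorted names of all non-directory entries that end with the given suffix. */
    static List<String> listEntries(Path zip, String suffix) throws IOException {
        try (ZipFile zipFile = new ZipFile(zip.toFile())) {
            return zipFile.stream()
                    .filter(entry -> !entry.isDirectory())
                    .map(ZipEntry::getName)
                    .filter(name -> name.endsWith(suffix))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Distributes the names round-robin over numFolds folds (an assumed strategy). */
    static List<List<String>> splitIntoFolds(List<String> names, int numFolds) {
        if (numFolds < 1) {
            throw new IllegalArgumentException("numFolds must be at least 1");
        }
        List<List<String>> folds = new ArrayList<>();
        for (int i = 0; i < numFolds; i++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < names.size(); i++) {
            folds.get(i % numFolds).add(names.get(i));
        }
        return folds;
    }
}
```

Each fold's name list could then be turned into reader patterns for that fold; how those patterns are passed to the reader is left open here.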

Reported by richard.eckart on 2014-06-25 15:11:57

daxenberger avatar Jun 09 '15 14:06 daxenberger

Reported by daxenberger.j on 2015-01-06 11:40:17

  • Labels added: Milestone-Release0.8.0
  • Labels removed: Milestone-Release0.7.0

daxenberger avatar Jun 09 '15 14:06 daxenberger

@daxenberger this one can be closed as won't fix now, right?

Horsmann avatar Apr 30 '16 18:04 Horsmann

This is independent of the latest changes to CV mode. The idea here was to write all CASes into a zip archive rather than individual files.

Or why did you think it is obsolete?

daxenberger avatar May 02 '16 09:05 daxenberger

Oh, OK, I misunderstood it then. Sorry.

Horsmann avatar May 05 '16 13:05 Horsmann

@reckart Is this feature available now? What exactly is the benefit of writing a single .zip instead of N binary CAS files? Neither is human-readable, but naming the binary CAS files by document name allows some visual confirmation that the reader read what it was supposed to read. It helps to understand at least a little of what TC is doing. Unless this makes processing a lot faster, I would rather not have ZIPs.

Horsmann avatar Feb 09 '18 22:02 Horsmann

Should be available.

reckart avatar Feb 09 '18 22:02 reckart

I don't remember the rationale. Might be to avoid using subfolders in an execution context... or to reduce the number of files which can at times become very large... maybe @daxenberger remembers more.

reckart avatar Feb 09 '18 22:02 reckart

This was certainly to reduce the number of files produced by TC, which can become quite large for bigger datasets. The "visual confirmation" issue could be avoided by writing some sort of log(?) file that records the names of the files written to the archive.
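
A rough sketch of such a log file, assuming plain java.util.zip rather than the DKPro Core writers: every entry name written to the archive is also written, one per line, to a text file next to it. The class name LoggedZipWriter and the in-memory map of documents are hypothetical and only serve the illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of DKPro TC or DKPro Core.
final class LoggedZipWriter {

    /**
     * Writes each (entry name, content) pair into the archive and records the
     * entry names, one per line, in logFile. Returns the names in write order.
     */
    static List<String> write(Path zip, Path logFile, Map<String, byte[]> documents)
            throws IOException {
        List<String> written = new ArrayList<>();
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            for (Map.Entry<String, byte[]> doc : documents.entrySet()) {
                out.putNextEntry(new ZipEntry(doc.getKey()));
                out.write(doc.getValue());
                out.closeEntry();
                written.add(doc.getKey());
            }
        }
        // The log is written only after the archive has been closed successfully.
        Files.write(logFile, written, StandardCharsets.UTF_8);
        return written;
    }
}
```

With a log like this, checking which documents went into the archive is a matter of opening one small text file instead of unpacking the ZIP.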

daxenberger avatar Feb 13 '18 06:02 daxenberger

@reckart Do you have a code example that writes to .zip?

Horsmann avatar Feb 16 '18 10:02 Horsmann

There are examples in these unit tests: https://github.com/dkpro/dkpro-core/blob/57dc82892d1bb419158eff37119dfaaca0763d8b/dkpro-core-api-io-asl/src/test/java/de/tudarmstadt/ukp/dkpro/core/api/io/JCasFileWriter_ImplBaseTest.java

reckart avatar Feb 16 '18 15:02 reckart

Actually, it's even in the documentation: https://dkpro.github.io/dkpro-core/releases/1.9.0/docs/user-guide.html#_working_with_zip_archives

reckart avatar Feb 16 '18 15:02 reckart

Hm, when adapting this for the BinaryCasWriter and BinaryCasReader, I get a "Not in GZIP format" exception.

writing:
        AnalysisEngineDescription xmiWriter = createEngineDescription(BinaryCasWriter.class,
                BinaryCasWriter.PARAM_TARGET_LOCATION,
                "jar:file:" + aContext.getFolder(output, AccessMode.READWRITE).getPath() + "/data.gz",
                BinaryCasWriter.PARAM_FORMAT, "6+"
                );

reading:
        createReaderDescription(BinaryCasReader.class,
                BinaryCasReader.PARAM_SOURCE_LOCATION,
                root.getAbsolutePath() + "/data.gz!*.bin");

Horsmann avatar Feb 17 '18 19:02 Horsmann

Looks like during reading, you are missing the jar:file: prefix.

reckart avatar Feb 19 '18 10:02 reckart

... and mind that these are "zip" files, not "gz" files.
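
Putting both hints together, the two location strings could be built along these lines. This is only a sketch: ZipLocations is a hypothetical helper, not part of DKPro Core, and it assumes the resulting strings are passed unchanged to BinaryCasWriter.PARAM_TARGET_LOCATION and BinaryCasReader.PARAM_SOURCE_LOCATION as in the snippets above.

```java
// Hypothetical helper, not part of DKPro Core; it only assembles location strings.
final class ZipLocations {

    private static final String SCHEME = "jar:file:";

    /** Location for the writer, e.g. jar:file:/some/folder/data.zip */
    static String targetLocation(String folder, String archiveName) {
        requireZip(archiveName);
        return SCHEME + folder + "/" + archiveName;
    }

    /** Location for the reader, e.g. jar:file:/some/folder/data.zip!*.bin */
    static String sourceLocation(String folder, String archiveName, String pattern) {
        requireZip(archiveName);
        return SCHEME + folder + "/" + archiveName + "!" + pattern;
    }

    /** Rejects archive names such as data.gz, since the archive is a ZIP file. */
    private static void requireZip(String archiveName) {
        if (!archiveName.endsWith(".zip")) {
            throw new IllegalArgumentException(
                    "Expected a .zip archive name but got: " + archiveName);
        }
    }
}
```

The same prefix and the same archive name are used on both sides, so the reader looks inside the archive that the writer produced.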

reckart avatar Feb 19 '18 10:02 reckart