Interfaces to write warc.gz / CDX files

Open riking opened this issue 7 years ago • 3 comments

The package should provide facilities to write warc.gz and CDX file pairs, and to append to already existing WARC/CDX pairs (see wpull --warc-append). Should also support uncompressed WARC files with uncompressed CDX size/offsets.

This issue is to discuss interface requirements.

Identified requirements:

  • Either the caller provides the CDX headers, or an existing file is opened as a ReadWriteSeeker (ugly; option 1 is preferred).
  • Output needs to be a WriteSeeker so file offsets can be captured (the S/V fields in CDX, named CDXCompressedSize / CDXCompressedOffset in PR #12).

WriteRecord() would go something like: write the record to the *writer, Flush the *writer, grab the file offsets, and save them into the CDX.
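The write, flush, and grab-offsets sequence can be sketched for the uncompressed case as follows. This is only an illustration under stated assumptions: `cdxEntry`, `writeRecord`, and `writeAll` are hypothetical names, records are treated as already-serialized bytes, and a temporary file stands in for the WARC output.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// cdxEntry is a hypothetical stand-in for the offset/size pair that would
// end up in a CDX line.
type cdxEntry struct {
	Offset int64
	Size   int64
}

// writeRecord writes one serialized record through bw (assumed to wrap out),
// flushes it, and derives the record's offset and size from out's position.
func writeRecord(out io.WriteSeeker, bw *bufio.Writer, record []byte) (cdxEntry, error) {
	start, err := out.Seek(0, io.SeekCurrent)
	if err != nil {
		return cdxEntry{}, err
	}
	if _, err := bw.Write(record); err != nil {
		return cdxEntry{}, err
	}
	// Without the flush, buffered bytes would not show up in the end offset.
	if err := bw.Flush(); err != nil {
		return cdxEntry{}, err
	}
	end, err := out.Seek(0, io.SeekCurrent)
	if err != nil {
		return cdxEntry{}, err
	}
	return cdxEntry{Offset: start, Size: end - start}, nil
}

// writeAll writes the records to a temporary file and returns one entry
// per record, in order.
func writeAll(records []string) ([]cdxEntry, error) {
	f, err := os.CreateTemp("", "sketch-*.warc")
	if err != nil {
		return nil, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	bw := bufio.NewWriter(f)
	entries := make([]cdxEntry, 0, len(records))
	for _, r := range records {
		e, err := writeRecord(f, bw, []byte(r))
		if err != nil {
			return nil, err
		}
		entries = append(entries, e)
	}
	return entries, nil
}

func main() {
	entries, err := writeAll([]string{"first record\r\n\r\n", "second\r\n\r\n"})
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("offset=%d size=%d\n", e.Offset, e.Size)
	}
}
```

Only `Seek(0, io.SeekCurrent)` is used, so any WriteSeeker that can report its current position would be enough here.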

riking, Feb 26 '18 23:02

I think that CDX functionality doesn't really fit well in this package, so I designed a different interface. How does this look?

type flusher interface {
	Flush() error
}

// Writer provides functionality for writing WARC files in compressed and
// uncompressed formats.
//
// To construct a Writer, call NewWriterCompressed or NewWriterRaw.
type Writer struct {
	seekW io.WriteSeeker
	w     io.Writer

	// RecordCallback will be called after each record is written to the file.
	// If a WriteSeeker was not provided, the provided positions will be
	// invalid.
	RecordCallback func(r *Record, startPos, endPos int64)
}

// NewWriterCompressed initializes a WARC Writer writing to a compressed
// stream.  The first parameter should be the "backing stream" of the
// compression.  The second parameter must implement interface{Flush() error},
// which should establish a "checkpoint" in the compressed stream - a place
// where decompression can be resumed partway through, so individual records
// can be retrieved from the compressed file.
//
// Seek will only be called with whence == io.SeekCurrent and offset == 0.
//
// See also CountWriter() if you need a "fake" Seek implementation.
func NewWriterCompressed(rawFile io.WriteSeeker, cmprsWriter io.Writer) (*Writer, error) {}

// NewWriterRaw initializes a WARC Writer writing to an uncompressed stream.
// If the provided Writer implements io.Seeker, the RecordCallback will be
// available.  If the provided Writer implements interface{Flush() error}, it
// will be flushed after every written Record.
func NewWriterRaw(w io.Writer) (*Writer, error) {}
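The NewWriterRaw comment describes detecting optional capabilities (io.Seeker, Flush) on the provided Writer. In Go this is typically done with type assertions. A minimal sketch, in which `probe` and `nopFlusher` are hypothetical names that are not part of the proposal:

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"os"
)

// flusher mirrors the interface from the proposal.
type flusher interface {
	Flush() error
}

// nopFlusher is a hypothetical wrapper that adds a no-op Flush to any Writer.
type nopFlusher struct {
	io.Writer
}

func (nopFlusher) Flush() error { return nil }

// probe reports which optional interfaces w satisfies. A constructor could
// use the same assertions to decide whether positions are available and
// whether to flush after each record.
func probe(w io.Writer) (canSeek, canFlush bool) {
	_, canSeek = w.(io.Seeker)
	_, canFlush = w.(flusher)
	return canSeek, canFlush
}

func main() {
	samples := []struct {
		name string
		w    io.Writer
	}{
		{"bytes.Buffer", &bytes.Buffer{}},
		{"bufio.Writer", bufio.NewWriter(io.Discard)},
		{"os.File", os.Stdout},
		{"nopFlusher", nopFlusher{Writer: io.Discard}},
	}
	for _, s := range samples {
		canSeek, canFlush := probe(s.w)
		fmt.Printf("%-12s seek=%v flush=%v\n", s.name, canSeek, canFlush)
	}
}
```

Note that a successful io.Seeker assertion only shows the method exists; a *os.File backed by a pipe, for example, still returns an error from Seek at run time.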

And a CountWriter utility for e.g. writing to a net.Conn:

type countWriter struct {
	count int64
	w     io.Writer
}

// CountWriter implements a limited version of io.Seeker around the provided
// Writer.  It only supports offset == 0 and whence == io.SeekCurrent or
// io.SeekEnd, and returns the current number of written bytes in both cases.
func CountWriter(w io.Writer) io.WriteSeeker {
	return &countWriter{count: 0, w: w}
}

// implements io.Writer
func (c *countWriter) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	if n >= 0 {
		c.count += int64(n)
	}
	return n, err
}

var errCountWriterNotImplemented = stdErrors.New("unsupported seek operation")

// implements io.Seeker
func (c *countWriter) Seek(offset int64, whence int) (int64, error) {
	if offset != 0 || !(whence == io.SeekCurrent || whence == io.SeekEnd) {
		return 0, errCountWriterNotImplemented
	}
	return c.count, nil
}
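For reference, a self-contained variant of the CountWriter idea that compiles on its own, together with a usage example on a sink that cannot seek. The helper `countBytes` and the error variable name are hypothetical, and `io.Discard` stands in for something like a net.Conn:

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

type countWriter struct {
	count int64
	w     io.Writer
}

// CountWriter wraps w and reports the number of bytes written so far
// through a restricted Seek.
func CountWriter(w io.Writer) io.WriteSeeker {
	// The methods use pointer receivers, so a pointer is returned.
	return &countWriter{w: w}
}

func (c *countWriter) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	if n > 0 {
		c.count += int64(n) // Write returns int; the counter is int64.
	}
	return n, err
}

var errUnsupportedSeek = errors.New("unsupported seek operation")

func (c *countWriter) Seek(offset int64, whence int) (int64, error) {
	if offset != 0 || !(whence == io.SeekCurrent || whence == io.SeekEnd) {
		return 0, errUnsupportedSeek
	}
	return c.count, nil
}

// countBytes writes the chunks to a discarding sink through a CountWriter
// and returns the position reported afterwards.
func countBytes(chunks ...string) (int64, error) {
	ws := CountWriter(io.Discard)
	for _, c := range chunks {
		if _, err := io.WriteString(ws, c); err != nil {
			return 0, err
		}
	}
	return ws.Seek(0, io.SeekCurrent)
}

func main() {
	pos, err := countBytes("WARC/1.0\r\n", "\r\n")
	if err != nil {
		panic(err)
	}
	fmt.Println("bytes written:", pos)

	// Any seek that would actually move the position is rejected.
	_, err = CountWriter(io.Discard).Seek(0, io.SeekStart)
	fmt.Println("seek to start:", err)
}
```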

riking, Feb 27 '18 18:02

Update: after reading more of the gzip code, I think Flush is not sufficient; it needs a Close / Reset.
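In Go's compress/gzip, Flush pushes out pending compressed data but leaves the current gzip member open, while Close writes the footer that ends the member and Reset lets the same Writer start a fresh one. A minimal sketch of the Close / Reset approach, assuming the goal is one independently readable gzip member per record; `span`, `writeMembers`, `readMember`, and `roundTrip` are hypothetical names, and a bytes.Buffer stands in for the output file:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// span records where one gzip member lives in the output.
type span struct {
	Offset int64
	Size   int64
}

// writeMembers compresses each record as its own gzip member by closing
// the gzip stream after every record and resetting it before the next.
func writeMembers(out *bytes.Buffer, records []string) ([]span, error) {
	zw := gzip.NewWriter(out)
	spans := make([]span, 0, len(records))
	for i, r := range records {
		if i > 0 {
			zw.Reset(out) // start a new member on the same output
		}
		start := int64(out.Len())
		if _, err := zw.Write([]byte(r)); err != nil {
			return nil, err
		}
		// Close writes the gzip footer, which ends this member.
		if err := zw.Close(); err != nil {
			return nil, err
		}
		spans = append(spans, span{Offset: start, Size: int64(out.Len()) - start})
	}
	return spans, nil
}

// readMember decompresses a single member using only its offset and size.
func readMember(data []byte, s span) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data[s.Offset : s.Offset+s.Size]))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	b, err := io.ReadAll(zr)
	return string(b), err
}

// roundTrip writes all records, then reads each one back in isolation.
func roundTrip(records []string) ([]string, error) {
	var out bytes.Buffer
	spans, err := writeMembers(&out, records)
	if err != nil {
		return nil, err
	}
	got := make([]string, 0, len(spans))
	for _, s := range spans {
		r, err := readMember(out.Bytes(), s)
		if err != nil {
			return nil, err
		}
		got = append(got, r)
	}
	return got, nil
}

func main() {
	got, err := roundTrip([]string{"record one", "record two"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", got)
}
```

Because each member is complete, a reader can seek straight to a recorded offset and decompress just that record.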

riking, Feb 28 '18 18:02

Thanks for the update, @riking. I'm hoping to take some time this weekend to sit down with your proposed interface change and understand your use case. Hopefully I'll be able to add constructive input, as this sounds like another exciting update!

b5, Mar 01 '18 23:03