godap
Add warc support to godap?
Currently, if someone were to download this warc file from dap:
https://github.com/rapid7/dap/blob/master/samples/iawide.warc.bz2
They could parse it with dap:
$ bzcat iawide.warc.bz2 | dap warc + json | head -n 1 | jq 'keys'
[
"content",
"content_length",
"content_type",
"warc_date",
"warc_ip_address",
"warc_payload_digest",
"warc_record_id",
"warc_target_uri",
"warc_type"
]
However, if they were to try this with godap:
$ bzcat iawide.warc.bz2 | ./dappy warc + json | head -n 1 | jq 'keys'
bzcat: Can't open input file iawide.warc.bz2: No such file or directory.
Error: Invalid input plugin: warc
Usage: ./dappy [input] + [filter] + [output]
--inputs
--outputs
--filters
Example: echo world | ./dappy lines stdin + rename line=hello + json stdout
Looking at supported types:
$ ./dappy --inputs
Inputs:
* json
* lines
The warc input is not there.
We could use this Go library, which also supports reading compressed warc files: https://github.com/slyrz/warc
A sample script:
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"

	"github.com/slyrz/warc"
)

// godapWarc mirrors the keys that dap emits for a warc record.
type godapWarc struct {
	Type          string `json:"warc_type"`
	TargetURI     string `json:"warc_target_uri"`
	ID            string `json:"warc_record_id"`
	ContentLength string `json:"content_length"`
	Date          string `json:"warc_date"`
	ContentType   string `json:"content_type"`
	PayloadDigest string `json:"warc_payload_digest"`
	IPAddress     string `json:"warc_ip_address"`
	Content       string `json:"content"`
}

func main() {
	// stdin may be a compressed warc file, which the library supports.
	reader, err := warc.NewReader(os.Stdin)
	if err != nil {
		panic(err)
	}
	defer reader.Close()

	for {
		record, err := reader.ReadRecord()
		if err != nil {
			// Stop at end of input; report anything else instead of hiding it.
			if err != io.EOF {
				fmt.Fprintln(os.Stderr, "read error:", err)
			}
			break
		}

		// Read the record body into memory.
		buf := new(bytes.Buffer)
		if _, err := buf.ReadFrom(record.Content); err != nil {
			panic(err)
		}

		rec := &godapWarc{
			Type:          record.Header["warc-type"],
			TargetURI:     record.Header["warc-target-uri"],
			ID:            record.Header["warc-record-id"],
			ContentLength: record.Header["content-length"],
			ContentType:   record.Header["content-type"],
			Date:          record.Header["warc-date"],
			PayloadDigest: record.Header["warc-payload-digest"],
			IPAddress:     record.Header["warc-ip-address"],
			Content:       buf.String(),
		}

		// Emit one JSON document per line.
		out, err := json.Marshal(rec)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
When run:
$ cat ~/Downloads/iawide.warc.bz2 | go run warc_reader2.go | head -n 1 | jq 'keys'
[
"content",
"content_length",
"content_type",
"warc_date",
"warc_ip_address",
"warc_payload_digest",
"warc_record_id",
"warc_target_uri",
"warc_type"
]
$ cat ~/Downloads/iawide.warc.bz2 | go run warc_reader2.go | head -n 1
{"warc_type":"warcinfo","warc_target_uri":"","warc_record_id":"\u003curn:uuid:88fbcbee-f24e-47c1-b0c4-f7a9530ceb74\u003e","content_length":"442","warc_date":"2011-02-25T18:32:19Z","content_type":"application/warc-fields","warc_payload_digest":"","warc_ip_address":"","content":"software: Heritrix/3.0.1-SNAPSHOT-20110127.213729 http://crawler.archive.org\r\nip: 207.241.232.79\r\nhostname: crawl301.us.archive.org\r\nformat: WARC File Format 1.0\r\nconformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf\r\noperator: [email protected]\r\nisPartOf: wide\r\ndescription: seeds.txt\r\nrobots: obey\r\nhttp-header-user-agent: Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)\r\n\r\n"}
So it can produce output similar to dap's.
Maybe we could add this to the inputs part of the factory, since warc is used as an input plugin?
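As a possible alternative to the fixed struct, here is a minimal sketch of deriving the dap-style keys directly from the header names (lowercase, with "-" replaced by "_"). The helper names dapKey and toDoc are hypothetical and not part of godap or the warc library, and the headers map in main is sample data standing in for record.Header:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// dapKey converts a WARC header name such as "WARC-Target-URI" into the
// snake_case key style seen in the dap output ("warc_target_uri").
func dapKey(header string) string {
	return strings.ReplaceAll(strings.ToLower(header), "-", "_")
}

// toDoc is a hypothetical helper: it copies every header into a flat map
// under its dap-style key and stores the record body under "content".
func toDoc(headers map[string]string, content string) map[string]string {
	doc := make(map[string]string, len(headers)+1)
	for name, value := range headers {
		doc[dapKey(name)] = value
	}
	doc["content"] = content
	return doc
}

func main() {
	// Sample headers standing in for record.Header from the warc reader.
	headers := map[string]string{
		"warc-type":      "warcinfo",
		"content-length": "442",
		"content-type":   "application/warc-fields",
	}
	out, err := json.Marshal(toDoc(headers, "robots: obey\r\n"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

One difference from the struct version: only headers that are present in the record end up in the output, and any extra headers are passed through as well. Whether that matches dap closely enough would need checking against its output.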