common-scripts

Common scripts, mainly for text processing and experimental control

local/ - Files of local interest, e.g. with fixed hostnames

README.txt - This file

all-xml-to-json.sh - For every XML file on the command line, convert it to JSON.

boilerpipe-stdin-urls-to-mongo.py - Run every sys.stdin URL through Boilerpipe (or diffbot), and store in a MongoDB.

citeseer-get.pl - Fetch PDFs from citeseer.

cumulative.py - Output a cumulative sum for each line in the input file.

delexicalize-low-frequency-words.py - Delexicalize all words with frequency less than minfreq to UNKNOWN.

dumpdb.py - Dump the MongoDB

enscript-landscape-all.pl - Enscript all files listed in @ARGV in landscape mode.

filter-json.py - Filter JSON in sys.stdin to find only docs that match each regex with at least one field value.

from-one-line-per-word-to-one-line-per-sentence.py - Read one-line-per-word and convert to one-line-per-sentence.

grep-json.py - Filter JSON in sys.stdin to find only docs that match each regex against raw JSON.

grep-json-by-field.py - Filter JSON in sys.stdin to find only docs that match each regex with at least one field value.

join-json.py - For each JSON file in sys.argv, join them and output to stdout.

lines-with-funny-characters.pl - Print lines with funny characters

lines-with-no-funny-characters.pl - Print lines without funny characters

load-directory-of-textfiles-into-mongodb.py - For all files recursively in a subdir, load them into a MongoDB with a certain field name.

load-json-into-mongodb.py - Load JSON from stdin into a MongoDB

htmldecode.pl - Decode HTML entities, e.g. &lt; becomes <

htmlencode.pl - Encode HTML entities, e.g. < becomes &lt;

html2text - Convert HTML to text

mongodb-count.py - Count the number of entries in a MongoDB collection.

mongodb-field-lengths.py - Print MongoDB field length and field, for every row.

mongodb-remove-field.py - Remove every occurrence of some field, for every row, in MongoDB.

mongodb-remove-short-fields.py - Remove every occurrence of some field if it is shorter than some length, for every row, in MongoDB.

mongodb-to-lucene.py - Read all mongo docs, and insert them into Lucene.

one-sentence-per-line-to-json.py - For each line in stdin, convert it to a JSON dict with key: "content" and value: line.

page-count.pl - For each file (usually .ps or .pdf) specified in stdin, count the number of pages in the file

print-all.pl - For each file (.ps or .pdf) specified as a command-line argument, print the file to a random printer.

ptb/one-sentence-per-line.pl - Output one PTB sentence per line, using PTB tagged/ files.

read-xml-mysqldump.py - Read in the XML mysqldump from sys.stdin.

remove-funny-characters.pl - Remove any funny character

remove-nonascii-characters.pl - Remove non-ASCII characters

remove-non-utf10-characters.pl - Remove non-UTF 1.0 characters

remove-non-utf11-characters.pl - Remove non-UTF 1.1 characters

sample.pl - Sample and print only a certain percentage of input lines.

shuffle/shuffle.sh - Shuffle lines of stdin

sort-curves.py - Sort gnuplot curves

tokenizer.sed - Penn Treebank tokenizer.

tokenize-English.pl - Word Tokenizer for English by Al-Onaizan and Melamed.

tsv-to-json.py - Read TSV from stdin and output as JSON.

unichars - List characters for one or more properties (by Tom Christiansen)

untokenize - Detokenize Penn Treebank formatted text.

vowpal-to-libsvm.py - Convert a vowpal-wabbit file in stdin to libsvm.

words-integers-mapfile.py - Create an integer mapfile for the words in textfile.

words-to-integers.py - Convert words to integers, according to the mapping in mapfile.

xmlmysqldump.py - Read in the XML mysqldump from sys.stdin.
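
Example sketch

Several of the scripts listed above are small line-oriented filters. As a rough illustration of that pattern only, the sketch below mimics the behaviour described for one-sentence-per-line-to-json.py and cumulative.py. It is not the code shipped in this repository: the function names lines_to_json and cumulative are hypothetical, and stripping the trailing newline, skipping blank lines, and parsing values as floats are all assumptions.

```python
import itertools
import json


def lines_to_json(lines):
    """Yield one JSON object per input line, stored under the key "content".

    Assumption: the trailing newline is stripped before encoding.
    """
    for line in lines:
        yield json.dumps({"content": line.rstrip("\n")})


def cumulative(lines):
    """Yield the running sum of the numbers found one per line.

    Assumptions: each non-blank line holds exactly one number, blank
    lines are skipped, and values are parsed as floats.
    """
    values = (float(line) for line in lines if line.strip())
    return itertools.accumulate(values)
```

Both functions accept any iterable of strings, so they can be fed sys.stdin directly (for example, "for doc in lines_to_json(sys.stdin): print(doc)") or a plain list when testing.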