common-scripts
Common scripts, mainly for text processing and experimental control
local/ - Files of local interest, e.g. with fixed hostnames
README.txt - This file
all-xml-to-json.sh - Convert every XML file given on the command line to JSON.
boilerpipe-stdin-urls-to-mongo.py - Run every sys.stdin URL through Boilerpipe (or diffbot), and store in a MongoDB.
citeseer-get.pl - Fetch PDFs from citeseer.
cumulative.py - Output a cumulative sum for each line in the input file.
delexicalize-low-frequency-words.py - Delexicalize all words with frequency less than minfreq, replacing them with UNKNOWN.
dumpdb.py - Dump the MongoDB
enscript-landscape-all.pl - Enscript all files listed in @ARGV in landscape mode.
filter-json.py - Filter JSON in sys.stdin to find only docs that match each regex with at least one field value.
from-one-line-per-word-to-one-line-per-sentence.py - Read one-line-per-word and convert to one-line-per-sentence.
grep-json.py - Filter JSON in sys.stdin to find only docs that match each regex against raw JSON.
grep-json-by-field.py - Filter JSON in sys.stdin to find only docs that match each regex with at least one field value.
join-json.py - For each JSON file in sys.argv, join them and output to stdout.
lines-with-funny-characters.pl - Print lines with funny characters
lines-with-no-funny-characters.pl - Print lines without funny characters
load-directory-of-textfiles-into-mongodb.py - For all files recursively in a subdir, load them into a MongoDB with a certain field name.
load-json-into-mongodb.py - Load JSON from stdin into a MongoDB
htmldecode.pl - Decode HTML entities, e.g. &lt; becomes <
htmlencode.pl - Encode HTML entities, e.g. < becomes &lt;
html2text - Convert HTML to text
mongodb-count.py - Count the number of entries in a mongodb collection.
mongodb-field-lengths.py - Print MongoDB field length and field, for every row.
mongodb-remove-field.py - Remove every occurrence of some field, for every row, in MongoDB.
mongodb-remove-short-fields.py - Remove every occurrence of some field if it is shorter than some length, for every row, in MongoDB.
mongodb-to-lucene.py - Read all mongo docs, and insert them into Lucene.
one-sentence-per-line-to-json.py - For line in stdin, convert it to a JSON dict with key: "content" and value: line.
page-count.pl - For each file (usually .ps or .pdf) specified in stdin, count the number of pages in the file
print-all.pl - For each file (.ps or .pdf) specified as a command-line argument, print the file to a random printer.
ptb/one-sentence-per-line.pl - Output one PTB sentence per line, using PTB tagged/ files.
read-xml-mysqldump.py - Read in the XML mysqldump from sys.stdin.
remove-funny-characters.pl - Remove any funny character
remove-nonascii-characters.pl - Remove non-ASCII characters
remove-non-utf10-characters.pl - Remove non-UTF 1.0 characters
remove-non-utf11-characters.pl - Remove non-UTF 1.1 characters
sample.pl - Sample and print only a certain percentage of input lines.
shuffle/shuffle.sh - Shuffle lines of stdin
sort-curves.py - Sort gnuplot curves
tokenizer.sed - Penn Treebank tokenizer.
tokenize-English.pl - Word Tokenizer for English by Al-Onaizan and Melamed.
tsv-to-json.py - Read TSV from stdin and output as JSON.
unichars - List characters for one or more properties (by Tom Christiansen)
untokenize - Detokenize Penn Treebank formatted text.
vowpal-to-libsvm.py - Convert a vowpal-wabbit file in stdin to libsvm.
words-integers-mapfile.py - Create an integer mapfile for the words in textfile.
words-to-integers.py - Convert words to integers, according to the mapping in mapfile.
xmlmysqldump.py - Read in the XML mysqldump from sys.stdin.
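
Example: to illustrate the one-record-per-line convention that several of the JSON scripts above appear to share, here is a minimal sketch of the behavior described for one-sentence-per-line-to-json.py. It is not the script's actual source. The function name lines_to_json and the choice to skip blank lines are assumptions made for illustration.

```python
import json
import sys


def lines_to_json(lines):
    """Yield one JSON object string per non-empty input line.

    Hypothetical helper: each line becomes {"content": <line>}.
    """
    for line in lines:
        text = line.rstrip("\n")
        if not text:
            continue  # assumption: blank lines are skipped
        yield json.dumps({"content": text})


if __name__ == "__main__":
    # Read sentences from stdin, write one JSON dict per line to stdout.
    for record in lines_to_json(sys.stdin):
        print(record)
```

Saved under a name of your choosing (sketch.py here is hypothetical), it could be run as: python sketch.py < sentences.txt > sentences.json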