topic-modeling-tool
Make the TMT speak dfr-browser
Andrew Goldstone's dfr-browser produces lovely visualizations, and it appears to require only some .json input. It would be nice if the TMT could generate that input.
After looking closely at Goldstone's prepare-data script, I think this should be doable. That script requires just three things to start: the raw MALLET state, gzipped (output-state.gz in the current TMT naming scheme); the ID field from MALLET's standard doc-topics output (doc-topic.txt in the current TMT naming scheme); and a metadata file.
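To get a feel for the first of those inputs, here is a minimal sketch of streaming a gzipped state file from Java. It assumes the state layout is a few header lines starting with `#` followed by whitespace-separated rows of the form `doc source pos typeindex type topic`; that should be checked against real TMT output. The class name `StateReader` and the method `tokensPerTopic` are made up for illustration, and counting tokens per topic is only a stand-in for whatever a port of prepare-data would actually need to compute.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the TMT or MALLET.
public class StateReader {

    // Counts how many token rows are assigned to each topic.
    // Assumption: the topic is the last whitespace-separated field on each row.
    static Map<Integer, Integer> tokensPerTopic(InputStream gzipped) throws IOException {
        Map<Integer, Integer> counts = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Skip blank lines and the '#' header lines.
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                String[] fields = line.trim().split("\\s+");
                if (fields.length < 6) {
                    continue; // not a token row in the assumed layout
                }
                // Reading the last field tolerates spaces in the source path.
                int topic = Integer.parseInt(fields[fields.length - 1]);
                counts.merge(topic, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

In practice the stream would come from something like `Files.newInputStream(Paths.get("output-state.gz"))`. Reading line by line keeps memory use flat, which matters because state files get large.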
There are specific requirements for the metadata, and that's the only major complication, since we can't expect any particular kind of metadata from any particular project. However, I think we can work around most of that as long as we enforce this one guarantee: the first column of the metadata file must be matchable to the list of file IDs that you get from running cut -f 2 doc-topic.txt > ids.txt. That's not so different from the system we're already using; we reinterpret "ID" as "filename," but the two approaches are, I believe, functionally indistinguishable. We need to verify that, but if that's correct, then this will be pretty easy!
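The guarantee could be checked with something like the sketch below. It mirrors the cut -f 2 step by taking the second tab-separated field of each doc-topics line, then reports any ID that has no matching first column in the metadata. The names `MetadataIdCheck`, `docIds`, and `unmatchedIds` are hypothetical. Two assumptions to flag: header lines in the doc-topics file start with `#` and are skipped (plain cut would keep them), and the metadata is split naively on a delimiter with no handling of quoted fields.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the TMT.
public class MetadataIdCheck {

    // Equivalent of `cut -f 2`: the second tab-separated field of each line.
    // Assumption: header lines begin with '#' and should be ignored.
    static List<String> docIds(List<String> docTopicLines) {
        List<String> ids = new ArrayList<>();
        for (String line : docTopicLines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split("\t");
            if (fields.length >= 2) {
                ids.add(fields[1]);
            }
        }
        return ids;
    }

    // Returns the document IDs that do not appear in the first metadata column,
    // in the order they occur in the doc-topics file.
    // Assumption: naive delimiter split; quoted CSV fields are not handled.
    static Set<String> unmatchedIds(List<String> docIds,
                                    List<String> metadataLines,
                                    String delimiter) {
        Set<String> known = new HashSet<>();
        for (String line : metadataLines) {
            if (line.isEmpty()) {
                continue;
            }
            known.add(line.split(Pattern.quote(delimiter), 2)[0]);
        }
        Set<String> missing = new LinkedHashSet<>(docIds);
        missing.removeAll(known);
        return missing;
    }
}
```

An empty result from `unmatchedIds` means every document has a metadata row. This is an exact string match, so if the doc-topics IDs turn out to be full paths or file: URIs while the metadata uses bare filenames, a normalization step would be needed first.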
Though perhaps tedious... since it will involve translating prepare-data into one or more Java classes... wah wah.