topic-modeling-tool

Make the TMT speak dfr-browser

Open • senderle opened this issue 8 years ago • 1 comment

Andrew Goldstone's dfr-browser produces lovely visualizations, and it appears to require only some .json input. It would be nice if the TMT could generate that input.

senderle, Jan 22 '17 19:01

After looking closely at Goldstone's prepare-data script, I think this should be doable. That script requires just three things to start: the raw MALLET state, gzipped (output-state.gz in the current TMT naming scheme); the ID field from MALLET's standard doc-topics output (doc-topic.txt in the current TMT naming scheme); and a metadata file.

There are specific requirements for the metadata, and that's the only major complication, since we can't expect any particular kind of metadata from any particular project. However, I think we can work around most of that as long as we enforce one guarantee: every value in the first column of the metadata file must be matchable to an entry in the list of file IDs that you get from cut -f 2 doc-topic.txt > ids.txt. That's not so different from the system we're already using; we would be reinterpreting "ID" as "filename," but the two approaches are, I believe, functionally indistinguishable. We need to verify that, but if it's correct, then this will be pretty easy!
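To make the guarantee concrete, here is a rough sketch of the check in Python (not the Java we'd actually need, and not taken from prepare-data). It assumes doc-topic.txt is tab-separated with the ID in the second field and an optional header line starting with "#", and that the metadata is a plain CSV whose first column holds the ID. The function names and the sample paths are made up for illustration.

```python
import csv
import io


def read_doc_ids(doc_topics_lines):
    """Collect the second tab-separated field of each data line.

    Roughly what `cut -f 2 doc-topic.txt` gives, except that blank
    lines and lines starting with '#' (assumed to be a header) are skipped.
    """
    ids = []
    for line in doc_topics_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a data row we can read an ID from
        ids.append(fields[1])
    return ids


def unmatched_ids(doc_ids, metadata_file):
    """Return the doc IDs that have no row in the metadata CSV.

    Assumes the first column of every metadata row is the ID. A header
    row, if present, is simply treated as one more candidate ID.
    """
    known = {row[0] for row in csv.reader(metadata_file) if row}
    return [doc_id for doc_id in doc_ids if doc_id not in known]


# Tiny in-memory stand-ins for doc-topic.txt and the metadata file.
doc_topics = io.StringIO(
    "#doc name topic proportion ...\n"
    "0\tfile:/corpus/a.txt\t0.7\t0.3\n"
    "1\tfile:/corpus/b.txt\t0.2\t0.8\n"
)
metadata = io.StringIO(
    "file:/corpus/a.txt,Title A,1901\n"
    "file:/corpus/c.txt,Title C,1903\n"
)

ids = read_doc_ids(doc_topics)
missing = unmatched_ids(ids, metadata)
print(missing)
```

If the list of unmatched IDs is empty, the metadata satisfies the guarantee; anything left over is a document the browser would have no metadata for.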

Though perhaps tedious... since it will involve translating prepare-data into one or more Java classes... wah wah.

senderle, Feb 04 '17 18:02