SentenTree
SentenTree copied to clipboard
How to use this with non-English text?
trafficstars
Especially languages without space as word-delimiter, such as Thai Reported by @kidjykit
I just publish version 1.0.0. Please give it a try. The key is to produce data files that is already tokenized into words delimited by space then enforces tokenization by space when calling SentenTree. The default tokenizer try to be more smart for Tweets but it does not support non-English characters.
import { tsv } from 'd3-request';
import { SentenTreeBuilder, SentenTreeVis, tokenizer } from '../../src/main.js';
const container = document.querySelector('#vis');
container.innerHTML = 'Loading ...';
tsv('data/test_thai.tsv', (error, rawdata) => {
const data = rawdata.map(d => Object.assign({}, d, { count: +d.count }));
console.time('Build model');
const model = new SentenTreeBuilder()
// enforce tokenize by space
.tokenize(tokenizer.tokenizeBySpace)
.transformToken(token => (/score(d|s)?/.test(token) ? 'score' : token))
// you can adjust the maxSupportRatio (0-1)
// higher maxsupport will tend to group the graph together in one piece
// lower maxsupport will break it into multiple graphs
.buildModel(data, { maxSupportRatio: 1 });
console.timeEnd('Build model');
container.innerHTML = '';
new SentenTreeVis(container)
.data(model.getRenderedGraphs(5))
.on('nodeClick', node => {
console.log('node', node);
})
// .on('nodeMouseenter', node => {
// // Do something
// })
// .on('nodeMousemove', node => {
// // Do something
// })
// .on('nodeMouseleave', () => {
// // Do something
// });
});

This is very useful tool. Thank you very much for your guidance and support.