SentenTree How to use this with non-English text?

trafficstars

Especially languages without space as word-delimiter, such as Thai Reported by @kidjykit

Oct 18 '17 07:10 kristw

I just publish version 1.0.0. Please give it a try. The key is to produce data files that is already tokenized into words delimited by space then enforces tokenization by space when calling SentenTree. The default tokenizer try to be more smart for Tweets but it does not support non-English characters.

import { tsv } from 'd3-request';
import { SentenTreeBuilder, SentenTreeVis, tokenizer } from '../../src/main.js';

const container = document.querySelector('#vis');
container.innerHTML = 'Loading ...';

tsv('data/test_thai.tsv', (error, rawdata) => {
  const data = rawdata.map(d => Object.assign({}, d, { count: +d.count }));
  console.time('Build model');
  const model = new SentenTreeBuilder()
    // enforce tokenize by space
    .tokenize(tokenizer.tokenizeBySpace) 
    .transformToken(token => (/score(d|s)?/.test(token) ? 'score' : token))
    // you can adjust the maxSupportRatio (0-1)
    // higher maxsupport will tend to group the graph together in one piece
    // lower maxsupport will break it into multiple graphs
    .buildModel(data, { maxSupportRatio: 1 });
  console.timeEnd('Build model');

  container.innerHTML = '';

  new SentenTreeVis(container)
    .data(model.getRenderedGraphs(5))
    .on('nodeClick', node => {
      console.log('node', node);
    })
    // .on('nodeMouseenter', node => {
    //   // Do something
    // })
    // .on('nodeMousemove', node => {
    //   // Do something
    // })
    // .on('nodeMouseleave', () => {
    //   // Do something
    // });
});

Oct 18 '17 07:10 kristw

This is very useful tool. Thank you very much for your guidance and support.

Oct 18 '17 08:10 kidjykit

SentenTree SentenTree copied to clipboard

How to use this with non-English text?

SentenTree
SentenTree copied to clipboard