pinyin: support associating characters/words with their pinyin when segmentation is enabled

With the API below, segmentation groups the pinyin, but the result does not include the corresponding Chinese words (or their indices).

console.log(pinyin("我喜欢你", {
  segment: "segmentit",         // enable word segmentation
  group: true,                  // enable phrase grouping
}));                            // [ [ 'wǒ' ], [ 'xǐhuān' ], [ 'nǐ' ] ]

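As a result, I cannot map the returned pinyin back to the characters, so I cannot build ruby annotations like the following for each character/word.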

<ruby>我<rt>wǒ</rt></ruby>
<ruby>喜欢<rt>xǐhuān</rt></ruby>
<ruby>你<rt>nǐ</rt></ruby>

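Solving this requires changing the format of the API's return value; otherwise I have to segment the text first and then call pinyin() once for each character/word.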
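The "segment first, then convert each word" fallback can be sketched as follows. This is only an illustration: annotateBySegment is a hypothetical helper, and segmentFn and pinyinFn are injected stand-ins for a real segmenter and a real pinyin lookup, not functions provided by the pinyin package.

```javascript
// Hypothetical helper: segment the text first, then look up pinyin per segment.
// segmentFn(text) must return an array of words; pinyinFn(word) must return
// an array of candidate readings. Both are supplied by the caller.
function annotateBySegment(text, segmentFn, pinyinFn) {
  return segmentFn(text).map((segment) => ({
    segment,
    candidates: pinyinFn(segment),
  }));
}

// Stub implementations, for illustration only.
const stubDictionary = { "我": ["wǒ"], "喜欢": ["xǐhuān"], "你": ["nǐ"] };
const stubSegment = () => ["我", "喜欢", "你"];
const stubPinyin = (word) => stubDictionary[word] || [];

const annotated = annotateBySegment("我喜欢你", stubSegment, stubPinyin);
console.log(annotated);
```

The downside is the one described above: the caller has to run segmentation itself and make one conversion call per word.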

If changing the return format of the existing pinyin() method is undesirable (to avoid a breaking change), I propose adding a method pinyin.segment(), used as follows:

pinyin.segment("我喜欢你", {
  method: "segmentit",          // choose the segmentation implementation
  group: true,                  // enable phrase grouping
  // ...other options stay the same as the options of pinyin()
})

// Return value format
[
  {segment: '我', index: 0, candidates: ['wǒ']},
  {segment: '喜欢', index: 1, candidates: ['xǐhuān']},
  {segment: '你', index: 3, candidates: ['nǐ']}
]

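Note: the original array of pinyin candidates moves into an object as the value of its candidates property, and the object must additionally include at least one of the segment and index properties. In the example above, index is the segment's starting offset in the input string.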
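Given entries in the proposed shape, building the ruby annotations becomes straightforward. A minimal sketch, assuming the first candidate is the reading to display; toRubyHtml and escapeHtml are hypothetical helpers, not part of the pinyin package:

```javascript
// Escape characters that are significant in HTML text content.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical helper: render one <ruby> element per entry, using the first
// pinyin candidate (or an empty annotation when there is none).
function toRubyHtml(entries) {
  return entries
    .map(({ segment, candidates }) => {
      const reading = candidates[0] ?? "";
      return `<ruby>${escapeHtml(segment)}<rt>${escapeHtml(reading)}</rt></ruby>`;
    })
    .join("\n");
}

const proposedResult = [
  { segment: "我", index: 0, candidates: ["wǒ"] },
  { segment: "喜欢", index: 1, candidates: ["xǐhuān"] },
  { segment: "你", index: 3, candidates: ["nǐ"] },
];
console.log(toRubyHtml(proposedResult));
```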

For reference, similar API designs:

https://wicg.github.io/handwriting-recognition/#get-predictions-of-a-drawing
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/segment

May 04 '24 23:05 fuweichin

A few additional notes:

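The code import PinyinBase, { getPinyinInstance } from "./PinyinBase"; lacks a file extension, which causes a loading error in browsers. For the ESM build to be usable directly in a browser, every import specifier must carry a .js extension. One way to guarantee this is to compile the ESM output with the tsc-esm command instead of tsc.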

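The code import { Segment, useDefault } from "segmentit"; couples the library to that dependency (segmentit.js is fairly large, at 3.65M). Consider switching to a plugin architecture that lets callers dynamically load the segmentation implementation and the dictionary data on demand.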
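One possible shape for such a plugin architecture is a small registry of lazy loaders. This is only a sketch under assumptions: registerSegmenter and getSegmenter are hypothetical names, not existing pinyin APIs, and the stub loader stands in for a real dynamic import.

```javascript
// Hypothetical registry: segmenter name -> async loader function.
const segmenterLoaders = new Map();
// Cache of in-flight or finished loads, so each loader runs at most once.
const loadedSegmenters = new Map();

function registerSegmenter(name, loader) {
  segmenterLoaders.set(name, loader);
}

// Returns a promise for a function (text) => string[].
function getSegmenter(name) {
  if (!loadedSegmenters.has(name)) {
    const loader = segmenterLoaders.get(name);
    if (!loader) {
      return Promise.reject(new Error(`Unknown segmenter: ${name}`));
    }
    loadedSegmenters.set(name, loader());
  }
  return loadedSegmenters.get(name);
}

// Stub plugin that splits per character. In real use the loader could wrap
// a dynamic import, for example:
//   () => import("segmentit").then(({ Segment, useDefault }) => {
//     const s = useDefault(new Segment());
//     return (text) => s.doSegment(text, { simple: true });
//   })
registerSegmenter("per-character", async () => (text) => Array.from(text));

getSegmenter("per-character").then((segment) => {
  console.log(segment("我喜欢你"));
});
```

With this layout, the large dictionary is fetched only when a caller actually asks for the segmenter that needs it.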

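If segmentation is not needed and only the 2,500 most common characters are required, loading a 7.4MB script hardly seems worthwhile. I look forward to a lightweight build aimed at online use.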

May 05 '24 05:05 fuweichin

I needed this feature as well. As a workaround, I segmented the text using the same command internally used by the pinyin library, and then combined the resulting arrays:

const { Segment, useDefault } = require('segmentit')
const segmentit = useDefault(new Segment())
const text = "我喜欢你"
const segments = segmentit.doSegment(text, { simple: true })
// [ '我', '喜欢', '你' ]
// Assumes pinyin has already been imported from the pinyin package.
const candidates = pinyin(text, { segment: "segmentit", group: true })
// [ [ 'wǒ' ], [ 'xǐhuān' ], [ 'nǐ' ] ]
const words = segments.map((segment, index) => ({
  segment,
  index,
  candidate: candidates[index]
}))
// [
//   { segment: '我', index: 0, candidate: [ 'wǒ' ] },
//   { segment: '喜欢', index: 1, candidate: [ 'xǐhuān' ] },
//   { segment: '你', index: 2, candidate: [ 'nǐ' ] }
// ]
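Note that index in this workaround is the position in the segment array, whereas the proposal above uses the character offset in the input (0, 1, 3). A small variation can produce offsets instead. It is a sketch that assumes, as the workaround does, that the two arrays line up one-to-one; zipWithOffsets is a hypothetical helper name.

```javascript
// Hypothetical helper: pair each segment with its pinyin candidates and its
// starting offset in the original string (counted in UTF-16 code units,
// the same units String.prototype.slice uses).
function zipWithOffsets(segments, candidates) {
  if (segments.length !== candidates.length) {
    throw new Error("segments and candidates do not line up");
  }
  let offset = 0;
  return segments.map((segment, i) => {
    const entry = { segment, index: offset, candidates: candidates[i] };
    offset += segment.length;
    return entry;
  });
}

const sampleSegments = ["我", "喜欢", "你"];
const sampleCandidates = [["wǒ"], ["xǐhuān"], ["nǐ"]];
console.log(zipWithOffsets(sampleSegments, sampleCandidates));
```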

Aug 24 '24 01:08 etuardu
