Python module that identifies Chinese text as being Simplified or Traditional

================
Hanzi Identifier
================

.. image:: https://badge.fury.io/py/hanzidentifier.png
    :target: http://badge.fury.io/py/hanzidentifier

.. image:: https://travis-ci.org/tsroten/hanzidentifier.png?branch=develop
    :target: https://travis-ci.org/tsroten/hanzidentifier

Hanzi Identifier is a simple Python module that identifies a string of text as having Simplified or Traditional characters.

  • GitHub: https://github.com/tsroten/hanzidentifier
  • Free software: MIT license

About
-----

Easy-to-use helper functions for identifying strings:

.. code:: python

    >>> import hanzidentifier
    >>> hanzidentifier.has_chinese('Hello my name is John.')
    False
    >>> hanzidentifier.is_simplified('John说:你好!')
    True
    >>> hanzidentifier.is_traditional('John說:你好!')
    True
    >>> hanzidentifier.has_chinese('Country in Simplified: 国家. Country in Traditional: 國家.')
    True

Here it is without the helper functions:

.. code:: python

    >>> import hanzidentifier
    >>> hanzidentifier.identify('Hello my name is Thomas.') is hanzidentifier.UNKNOWN
    True
    >>> hanzidentifier.identify('Thomas 说:你好!') is hanzidentifier.SIMPLIFIED
    True
    >>> hanzidentifier.identify('Thomas 說:你好!') is hanzidentifier.TRADITIONAL
    True
    >>> hanzidentifier.identify('你好!') is hanzidentifier.BOTH
    True
    >>> hanzidentifier.identify('Country in Simplified: 国家. Country in Traditional: 國家.') is hanzidentifier.MIXED
    True

hanzidentifier.identify has five possible return values:

  • hanzidentifier.UNKNOWN: there are no recognized Chinese characters in the string.
  • hanzidentifier.BOTH: the string is compatible with both Simplified and Traditional character systems.
  • hanzidentifier.TRADITIONAL: the string consists of Traditional characters.
  • hanzidentifier.SIMPLIFIED: the string consists of Simplified characters.
  • hanzidentifier.MIXED: the string contains both characters recognized only as Traditional and characters recognized only as Simplified.

Hanzi Identifier uses the CC-CEDICT data provided by `Zhon <https://github.com/tsroten/zhon>`_ to identify Chinese characters. Characters that aren't found in CC-CEDICT are ignored when determining a string's identity.

Because the Traditional and Simplified Chinese character systems overlap, a string containing Simplified characters could identify as hanzidentifier.SIMPLIFIED or hanzidentifier.BOTH, depending on whether its characters are also Traditional characters.
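The five outcomes and the effect of the overlap can be illustrated with a small set-based sketch. This is not Hanzi Identifier's actual implementation: the ``Script`` enum, the ``classify`` function, and the three tiny character tables are hypothetical stand-ins for the CC-CEDICT-derived data, chosen only to mirror the examples above.

```python
from enum import Enum


class Script(Enum):
    """Hypothetical labels mirroring the five documented return values."""
    UNKNOWN = 0
    TRADITIONAL = 1
    SIMPLIFIED = 2
    BOTH = 3
    MIXED = 4


# Assumption: miniature hand-written tables for illustration only.
# The real module relies on CC-CEDICT data provided by Zhon.
TRADITIONAL_ONLY = set("說國")
SIMPLIFIED_ONLY = set("说国")
SHARED = set("你好家")
KNOWN = TRADITIONAL_ONLY | SIMPLIFIED_ONLY | SHARED


def classify(text):
    """Classify text by which character tables its known characters hit."""
    # Characters outside the tables (Latin letters, punctuation) are ignored.
    chars = set(text) & KNOWN
    if not chars:
        return Script.UNKNOWN
    has_traditional_only = bool(chars & TRADITIONAL_ONLY)
    has_simplified_only = bool(chars & SIMPLIFIED_ONLY)
    if has_traditional_only and has_simplified_only:
        return Script.MIXED
    if has_traditional_only:
        return Script.TRADITIONAL
    if has_simplified_only:
        return Script.SIMPLIFIED
    # Every known character belongs to both systems.
    return Script.BOTH


print(classify("你好!"))          # only shared characters
print(classify("Thomas 说:你好!"))  # one Simplified-only character
```

In this sketch a single Simplified-only character is enough to move a string from ``BOTH`` to ``SIMPLIFIED``, which is the distinction the paragraph above describes.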

Hanzi Identifier's functions accept and return unicode.

Getting Started
---------------

  • Install Hanzi Identifier: ``$ pip install hanzidentifier``
  • Report bugs and ask questions via `GitHub Issues <https://github.com/tsroten/hanzidentifier/issues>`_
  • `Contribute features or bug fixes <https://github.com/tsroten/hanzidentifier/pulls>`_

Change Log
----------

v1.0.2 (2015-08-06)


* New README format
* Adds Travis CI support
* Uses ``io.open()`` in ``setup.py``. Fixes #1.

v1.0.1 (2014-04-14)
~~~~~~~~~~~~~~~~~~~

* Fixes URL typo.

v1.0 (2014-04-12)
~~~~~~~~~~~~~~~~~

Version 1.0 merges some changes from Dragon Mapper. It is not backwards compatible with
the previous versions of Hanzi Identifier (e.g. some of the constants are named differently).

* Merges code from `Dragon Mapper <http://github.com/tsroten/dragonmapper>`_ project.
* Adds tox support.

v0.1 (2013-04-24)
~~~~~~~~~~~~~~~~~

* Initial release.