
Test non-Western text in various forms and pages

Open StephenChan opened this issue 7 years ago • 0 comments

(This used to be a bug issue which needed more information. I'm repurposing it into a unit-test issue.)

As CoralNet gets bigger, it becomes more important to handle non-ASCII and non-Western text (in general, Unicode) properly in any place the site supports text input. Even if the site's de facto language remains solely English, the data are coming in from institutions around the world. Any CharField, TextField, or uploaded .txt/.csv/other file is fair game.

By "handle", I mean do the following for each instance of text input:

  • If we decide to accept any Unicode character, make sure the characters are properly saved to the database and properly displayed on the site.
  • If we decide NOT to accept non-ASCII characters, or impose some other kind of restriction, then make sure we impose that restriction.

By "make sure", I mean write and run unit tests involving non-Western text input.
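
As a rough illustration of the two policies above, here is a minimal, framework-free sketch of what such tests could look like. It uses only the standard library rather than CoralNet's actual Django test setup; `validate_ascii_only`, `store_and_load`, and the sample strings are hypothetical stand-ins, not existing project code.

```python
import unittest

# Hypothetical sample inputs: Hawaiian (with the ʻokina), Chinese, Russian, Arabic.
SAMPLE_TEXTS = ["ʻĀina", "珊瑚礁", "коралл", "مرجان"]


def validate_ascii_only(value):
    """Hypothetical restriction: reject any value containing non-ASCII characters."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError("Only ASCII characters are allowed.")
    return value


def store_and_load(value):
    """Stand-in for a database round trip: encode to UTF-8 bytes and decode back."""
    return value.encode("utf-8").decode("utf-8")


class NonWesternTextTest(unittest.TestCase):

    def test_accepting_field_round_trips(self):
        # Policy 1: any Unicode character is accepted and comes back unchanged.
        for text in SAMPLE_TEXTS:
            with self.subTest(text=text):
                self.assertEqual(store_and_load(text), text)

    def test_restricted_field_rejects_non_ascii(self):
        # Policy 2: the restriction is actually enforced.
        for text in SAMPLE_TEXTS:
            with self.subTest(text=text):
                with self.assertRaises(ValueError):
                    validate_ascii_only(text)

    def test_restricted_field_accepts_ascii(self):
        self.assertEqual(validate_ascii_only("coral_fan_01"), "coral_fan_01")
```

In the real test suite, the round trip would go through the actual model or form and the database instead of `store_and_load`, and the restriction would be checked through the form's validation errors.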

If we can be comprehensive about writing these kinds of tests, we'll also cover what is possibly one of the biggest tasks toward supporting Python 3 (#59).

Existing and to-do tests:

  • accounts
    • [x] User model - username: Non-ASCII not allowed, so usernames are simple and easy to read/type/differentiate for everyone.
    • [x] User model - email address: Non-ASCII not allowed, since such addresses are not common, and Django's built-in EmailValidator doesn't support non-ASCII yet (https://code.djangoproject.com/ticket/27029).
  • export
    • [x] Export annotations CSV: Non-ASCII image names and label codes OK.
    • [x] Export annotations CPC: Non-ASCII label codes OK.
    • [ ] Export annotations CPC: Non-ASCII image names should be OK. This doesn't work yet because the local image filepath gets path-manipulated at some point, and in Python 2.x, pathlib2 doesn't support Unicode.
    • [x] Export covers: Non-ASCII image names and label codes OK.
    • [x] Export labelset: Non-ASCII label codes OK.
    • [x] Export metadata: Non-ASCII image names and aux meta values OK.
  • labels
    • [x] Import labelset: Non-ASCII label codes OK.
  • lib
    • [x] Get-form-error functions in lib/forms.py: Non-ASCII error messages OK.
  • upload
    • [x] Upload metadata: Non-ASCII aux meta values OK.
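
For the CSV export and upload items above, a test could follow a shape like the one below. This is only a sketch under assumptions: the helper names and the `Name`/`Label` columns are made up for illustration, and it assumes files are encoded as UTF-8 (optionally with a byte order mark), which the issue does not specify.

```python
import csv
import io


def export_rows_to_csv_bytes(rows, fieldnames):
    """Hypothetical export helper: serialize dict rows to UTF-8 encoded CSV bytes."""
    text_buffer = io.StringIO(newline="")
    writer = csv.DictWriter(text_buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return text_buffer.getvalue().encode("utf-8")


def read_uploaded_csv_bytes(data):
    """Hypothetical upload helper: decode UTF-8 (with or without BOM) and parse rows."""
    text = data.decode("utf-8-sig")
    return [dict(row) for row in csv.DictReader(io.StringIO(text, newline=""))]


# Non-ASCII image names and label codes, as in the checklist items.
rows = [
    {"Name": "サンゴ_001.jpg", "Label": "Póc"},
    {"Name": "риф 2.png", "Label": "ʻĀ"},
]
csv_bytes = export_rows_to_csv_bytes(rows, ["Name", "Label"])

# The exported file should read back to exactly the same values.
assert read_uploaded_csv_bytes(csv_bytes) == rows
```

Decoding with `utf-8-sig` is a design choice for the sketch: it tolerates the byte order mark that some spreadsheet programs prepend when saving CSV files, while still reading plain UTF-8.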

Known related bugs; these will be tracked in other issues, not here:

  • ~~#125~~
  • ~~#126~~
  • #202
  • OperationalError: (1267, "Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='") in the new-source page: This problem no longer exists. It happened in August 2016 with a location key 3 containing the Hawaiian punctuation mark ʻ (the ʻokina). Relevant changes since then include moving from MySQL to PostgreSQL, and moving from "location keys" to aux. metadata fields.

StephenChan, Aug 21 '17 01:08