
"safe" codepoint utility

Open DonaldTsang opened this issue 6 years ago • 5 comments

See https://github.com/qntm/safe-code-point and https://github.com/qntm/base65536gen, but I would like something similar in Python.

  • [ ] Version 7
  • [ ] Version 8
  • [ ] Version 9
  • [ ] Version 10
  • [ ] Version 11
  • [ ] Version 12

DonaldTsang avatar Aug 24 '19 13:08 DonaldTsang

https://github.com/Nightbug/go-base65536/issues/1

DonaldTsang avatar Aug 24 '19 13:08 DonaldTsang

First concepts

import sys
import unicodedata as ucd

FORMS = ('NFC', 'NFKC', 'NFD', 'NFKD')

def table(bits, label):
  # label ('byte', 'b64', ...) only tags the call; it is not used yet
  size = 2**bits
  temp = {}
  for i in range(sys.maxunicode + 1): # each code point
    u = chr(i)
    try:
      name = ucd.name(u)
    except ValueError: # unnamed code point: skip it
      continue
    if (ucd.combining(u) == 0 and ucd.bidirectional(u) not in ['R','AL'] and
        ucd.category(u) not in ['Zs','Zl','Zp','Cc','Cf','Cs','Co','Cn']):
      # disallow diacritics and right-to-left characters
      # disallow spaces + controls, formatters, surrogates, PUAs and non-characters
      # setdefault creates the block's dict on first use
      temp.setdefault(i // size, {})[i % size] = [name] + [
        ucd.normalize(form, u) == u for form in FORMS]
  answer = []
  for block, chars in temp.items(): # each block
    if len(chars) == size: # if the block is complete
      answer.append([block] + [ # the block index itself, then one flag per form
        all(entry[k] for entry in chars.values()) for k in range(1, 5)])
  return answer

a = table(8,'byte')
b = table(6,'b64')
c = table(5,'b32')
d = table(4,'balf')
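
For context on the four normalization flags that `table` records per character, here is a small sketch of what "unchanged under a form" means for single characters. The helper name `stable_forms` is made up for illustration and is not part of the code above.

```python
import unicodedata as ucd

def stable_forms(ch):
    """Return the normalization forms under which `ch` is left unchanged."""
    return [form for form in ('NFC', 'NFKC', 'NFD', 'NFKD')
            if ucd.normalize(form, ch) == ch]

print(stable_forms('A'))       # plain ASCII letter: stable under all four forms
print(stable_forms('\u00e9'))  # é: decomposes to e + combining acute under NFD/NFKD
print(stable_forms('\ufb01'))  # fi ligature: compatibility forms map it to "fi"
```

A block is only useful for a text-safe encoding if every character in it survives whichever normalization the transport may apply, which is why the flags are combined per block.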

DonaldTsang avatar Aug 27 '19 10:08 DonaldTsang

I think this feature should be optional, so the programmer can choose. Also, the table should be compiled only once, or be dumped (precompiled). For a similar concept, see https://github.com/dahlia/iso4217/blob/master/iso4217/init.py#L18-L64
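
A rough sketch of the "compile once or dump" idea, under some assumptions: the names `is_safe`, `safe_blocks` and `load_or_build`, the JSON layout, and the `limit` parameter (which only keeps the scan short) are all hypothetical and not taken from iso4217 or base65536. The dump records `unicodedata.unidata_version`, because the result depends on the Unicode version the Python build ships with.

```python
import json
import os
import unicodedata as ucd
from functools import lru_cache

# General categories treated as unsafe (assumption: same list as in the issue)
EXCLUDED = {'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Cs', 'Co', 'Cn'}

def is_safe(cp):
    """True if the code point is not excluded, combining, or right-to-left."""
    u = chr(cp)
    return (ucd.category(u) not in EXCLUDED
            and ucd.combining(u) == 0
            and ucd.bidirectional(u) not in ('R', 'AL'))

@lru_cache(maxsize=None)
def safe_blocks(bits, limit=0x3000):
    """Indices of fully safe blocks of 2**bits code points below `limit`.

    lru_cache makes repeated calls with the same arguments free.
    """
    size = 2 ** bits
    return tuple(b for b in range(limit // size)
                 if all(is_safe(cp) for cp in range(b * size, (b + 1) * size)))

def load_or_build(path, bits, limit=0x3000):
    """Return the safe blocks, reusing a JSON dump at `path` when it matches."""
    key = {'unicode': ucd.unidata_version, 'bits': bits, 'limit': limit}
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            dumped = json.load(f)
        if dumped.get('key') == key:
            return tuple(dumped['blocks'])
    blocks = safe_blocks(bits, limit)
    with open(path, 'w', encoding='utf-8') as f:
        json.dump({'key': key, 'blocks': blocks}, f)
    return blocks
```

Calling `load_or_build('blocks.json', 8, 0x110000)` once would scan all of Unicode and write the dump; later calls just read the file, and a dump made under a different Unicode version is rebuilt instead of reused.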

Parkayun avatar Aug 27 '19 11:08 Parkayun

@Parkayun some ideas:

  • Searching for code points based on certain properties
    • Removing RTL characters, diacritics, space/control characters, formatters and non-characters
    • Punctuation and numbers are a grey area that we could accept, since qntm's choice there is arbitrary
  • Finding code point blocks in units of 256
    • but the unit can also be 16/32/64/128, similar to base32768 or base2048
  • Creating a character set similar to the original base65536
    • but it can be base32768 or 16384 or 8192 or 4096 or 2048
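
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how qntm's safe-code-point works: `is_safe`, `safe_blocks` and `build_alphabet` are hypothetical names, and punctuation and numbers are accepted here.

```python
import sys
import unicodedata as ucd

# General categories treated as unsafe (assumption: same list as earlier in this issue)
EXCLUDED = {'Zs', 'Zl', 'Zp', 'Cc', 'Cf', 'Cs', 'Co', 'Cn'}

def is_safe(cp):
    """Step 1: filter a code point by its Unicode properties."""
    u = chr(cp)
    return (ucd.category(u) not in EXCLUDED
            and ucd.combining(u) == 0
            and ucd.bidirectional(u) not in ('R', 'AL'))

def safe_blocks(block_size):
    """Step 2: yield the first code point of each fully safe aligned block."""
    for start in range(0, sys.maxunicode + 1, block_size):
        if all(is_safe(cp) for cp in range(start, start + block_size)):
            yield start

def build_alphabet(total, block_size):
    """Step 3: collect `total` characters from complete safe blocks, in order."""
    chars = []
    for start in safe_blocks(block_size):
        chars.extend(chr(cp) for cp in range(start, start + block_size))
        if len(chars) >= total:
            return chars[:total]
    raise ValueError('not enough safe blocks for the requested alphabet size')

# e.g. a base2048-sized alphabet built from blocks of 32 code points
alphabet = build_alphabet(2048, 32)
print(len(alphabet), repr(alphabet[0]))
```

Because the generator is lazy, small alphabets stop scanning early; a base65536-sized set with blocks of 256 walks much more of the code space and would benefit from being precomputed.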

See: https://github.com/qntm/base32768 and https://github.com/qntm/base2048

DonaldTsang avatar Aug 27 '19 13:08 DonaldTsang