sdxpy
file compressor
I wonder whether byte-pair encoding (BPE) would be an interesting algorithm to implement in this chapter. I suspect it's about the right size for a book chapter. It isn't a state-of-the-art compressor today, but it is widely used for tokenization in NLP, including in LLMs like the GPT models. That offers an opportunity to talk about some relevant topics in software engineering ethics, using the implemented compressor as a demonstration.
For example, a compression dictionary pretrained on the English version of SDXJS would probably handle the English SDXPY reasonably well. It would probably do less well, but still okay, on Shakespeare, and terribly on Akutagawa Ryūnosuke. As we engineer our tools to be more data-driven, availability biases in how we obtain the data used to build those tools have consequences that we need to think about.
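To make that concrete, here is a rough sketch of what I have in mind. It is not a proposed implementation for the chapter: it uses the tokenizer-style variant of BPE that merges pairs of characters rather than raw bytes, the function names (`train_bpe`, `apply_merge`, `encode`, `decode`) are made up for illustration, and the tiny training and test strings are stand-ins for SDXJS, SDXPY, and Japanese prose.

```python
from collections import Counter


def apply_merge(tokens, left, right):
    """Replace each adjacent (left, right) pair with one merged token."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to num_merges merge rules, starting from single characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges cannot help
        merges.append((left, right))
        tokens = apply_merge(tokens, left, right)
    return merges


def encode(text, merges):
    """Apply the learned merges, in training order, to new text."""
    tokens = list(text)
    for left, right in merges:
        tokens = apply_merge(tokens, left, right)
    return tokens


def decode(tokens):
    """Merged tokens are plain concatenations, so decoding is a join."""
    return "".join(tokens)


# Stand-in corpora (assumptions, not real book text).
training = "the cat sat on the mat. the cat ate the rat. " * 20
in_domain = "the rat sat on the cat."
out_of_domain = "下人は雨やみを待っていた。"

merges = train_bpe(training, 30)
for label, sample in [("in-domain", in_domain), ("out-of-domain", out_of_domain)]:
    encoded = encode(sample, merges)
    assert decode(encoded) == sample  # the encoding is lossless
    print(label, len(sample), "chars ->", len(encoded), "tokens")
```

The in-domain sample shrinks because it reuses pairs the dictionary has already seen, while the Japanese sample stays one token per character because none of the learned merges apply to it. That gap is the availability bias in miniature.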