libxo icon indicating copy to clipboard operation
libxo copied to clipboard

The trim modifier should remove whitespace from isspace(3) not just the space character

Open jlduran opened this issue 9 months ago • 4 comments

The function xo_trim_ws should remove white space according to isspace(3), not just space characters (the fix is not straightforward, so reporting it as a bug).

For example, if passing tab characters (the white space is just a literal tab):

$ xo "{t:description}" "	squash this!	" --libxo json
"description":"	squash this!	"

Related to #101.

jlduran avatar Mar 16 '25 22:03 jlduran

Trimming space and tab makes sense, but I'm not sure isspace() is suitable. Does trimming '\r' and '\n' give a suitable behavior?

Thanks, Phil

philshafer avatar Apr 02 '25 03:04 philshafer

Trimming space and tab makes sense, but I'm not sure isspace() is suitable. Does trimming '\r' and '\n' give a suitable behavior?

Ideally, yes.

My original thinking was when trimming user-input data (e.g., a description field). Focusing solely on JSON, for example, I would expect it to behave just like ES5's String.prototype.trim(), where isspace(3) was a good first start. However, after checking out the development branch (I'll close #101 by the way), I would even consider whitespace all these:

https://github.com/v8/v8/blob/f8b4967b71b8d74f692ff41fbfd8f3edff5a5b23/test/webkit/string-trim.js#L33-L56

const whitespace = [
  {s: '\u0009', t: 'HORIZONTAL TAB'},
  {s: '\u000A', t: 'LINE FEED OR NEW LINE'},
  {s: '\u000B', t: 'VERTICAL TAB'},
  {s: '\u000C', t: 'FORMFEED'},
  {s: '\u000D', t: 'CARRIAGE RETURN'},
  {s: '\u0020', t: 'SPACE'},
  {s: '\u00A0', t: 'NO-BREAK SPACE'},
  {s: '\u2000', t: 'EN QUAD'},
  {s: '\u2001', t: 'EM QUAD'},
  {s: '\u2002', t: 'EN SPACE'},
  {s: '\u2003', t: 'EM SPACE'},
  {s: '\u2004', t: 'THREE-PER-EM SPACE'},
  {s: '\u2005', t: 'FOUR-PER-EM SPACE'},
  {s: '\u2006', t: 'SIX-PER-EM SPACE'},
  {s: '\u2007', t: 'FIGURE SPACE'},
  {s: '\u2008', t: 'PUNCTUATION SPACE'},
  {s: '\u2009', t: 'THIN SPACE'},
  {s: '\u200A', t: 'HAIR SPACE'},
  {s: '\u3000', t: 'IDEOGRAPHIC SPACE'},
  {s: '\u2028', t: 'LINE SEPARATOR'},
  {s: '\u2029', t: 'PARAGRAPH SEPARATOR'},
  {s: '\u200B', t: 'ZERO WIDTH SPACE (category Cf)'},
];

jlduran avatar Apr 02 '25 05:04 jlduran

That would be locale-based assumably, and isspace() handles that:

     In the "C" locale, isspace() returns non-zero for these characters only.  The value
     of the argument must be representable as an unsigned char or the value of EOF.

     The isspace_l() function takes an explicit locale argument, whereas the isspace()
     function uses the current global or per-thread locale.

But even this would require that I build codepoints to pass to isspace(), rather than the current byte-by-byte means.

Even so, this should be doable.

Thanks, Phil

philshafer avatar Apr 02 '25 13:04 philshafer