xsv

Feature request: random partition

Open plainas opened this issue 5 years ago • 6 comments

For those of us working in machine learning, a feature to quickly divide a data set into training data and test data would be really nice to have. Is there a way to do this already?

I am tempted to use other command line tools to achieve this by partitioning lines rather than CSV rows. Is there a way to escape newlines inside values so that I can ensure each line of output is exactly one CSV row?

plainas avatar Jun 02 '19 23:06 plainas

> Is there a way to do this already?

I can't think of any simple way. But if xsv sort grew a flag to shuffle the rows (analogous to sort's -R/--random-sort flag), then it would be a simple matter of a shuffle followed by xsv slice.
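As a rough illustration of the shuffle-then-slice idea outside of xsv, here is a minimal Python sketch using only the standard library. It is not an xsv feature; the function names `shuffle_and_split` and `to_csv`, the `test_fraction` and `seed` parameters, and the sample data are all made up for this example. It shuffles parsed CSV records (not physical lines), so quoted fields containing newlines stay intact.

```python
import csv
import io
import random


def shuffle_and_split(csv_text, test_fraction=0.2, seed=0):
    """Shuffle CSV records (not lines) and slice them into train/test sets.

    Hypothetical helper for illustration only; assumes the first record
    is a header row.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader)
    rows = list(reader)
    # Seeded generator so the split is reproducible.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    # The first n_test shuffled rows become the test set, the rest the training set.
    return header, rows[n_test:], rows[:n_test]


def to_csv(header, rows):
    """Serialize a header plus rows back to CSV text."""
    out = io.StringIO(newline="")
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Sample data (invented): record 1 has a newline inside a quoted field.
data = 'id,text\n1,"first\nsecond line"\n2,b\n3,c\n4,d\n5,e\n'
header, train, test = shuffle_and_split(data)
train_csv = to_csv(header, train)
test_csv = to_csv(header, test)
```

Loading everything into memory is a simplification that only suits data sets that fit in RAM.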

> I am tempted to use other command line tools to achieve this by partitioning lines rather than CSV rows. Is there a way to escape newlines inside values so that I can ensure each line of output is exactly one CSV row?

No. Not without layering your own encoding on top of CSV. If you need to handle arbitrary CSV data, then using other command line tools won't work. If you can guarantee that all CSV records occupy a single line, then other line oriented tools would work okay.
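To make "layering your own encoding on top of CSV" concrete, here is one possible sketch in Python. The backslash escaping scheme and the names `escape_field`, `unescape_field`, and `to_one_line_per_record` are assumptions invented for this example, not anything xsv provides. The idea is to rewrite each field so that no raw newline remains, which makes every record occupy exactly one physical line; any consumer then has to undo the escaping.

```python
import csv
import io


def escape_field(value):
    """Replace raw newlines with backslash escapes (ad hoc scheme, not a standard)."""
    # Escape backslashes first so the encoding stays reversible.
    return value.replace("\\", "\\\\").replace("\r", "\\r").replace("\n", "\\n")


def unescape_field(value):
    """Invert escape_field."""
    mapping = {"n": "\n", "r": "\r", "\\": "\\"}
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            nxt = value[i + 1]
            # Unknown escapes are passed through unchanged.
            out.append(mapping.get(nxt, "\\" + nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def to_one_line_per_record(csv_text):
    """Re-emit CSV so that each record occupies exactly one physical line."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([escape_field(field) for field in row])
    return out.getvalue()


# Sample data (invented): the second record spans two physical lines.
data = 'id,text\n1,"a\nb"\n'
encoded = to_one_line_per_record(data)
```

After this transformation, line-oriented tools can shuffle or split the output safely, at the cost of the result no longer being plain CSV as other programs expect it.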

BurntSushi avatar Jun 03 '19 10:06 BurntSushi

@plainas This may or may not help but a while ago I wrote a separate tool for doing this: https://github.com/sd2k/ttv

You can compose it with xsv if desired, e.g. if you need to select columns etc.

sd2k avatar Jun 11 '19 14:06 sd2k

@sd2k Neat tool, although it doesn't look like it correctly supports CSV data? I don't see any CSV parsing happening in that tool. (A single CSV record can span an arbitrary number of lines.)
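A small Python sketch of why that matters, using invented sample data: counting physical lines and counting parsed CSV records give different answers as soon as a quoted field contains a newline, so a line-based splitter can cut a record in half.

```python
import csv
import io

# Sample data (invented): record 1 has a newline inside a quoted field.
data = 'id,comment\n1,"line one\nline two"\n2,plain\n'

# What a line-oriented tool sees: physical lines.
physical_lines = data.splitlines()

# What a CSV parser sees: logical records.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(physical_lines), "physical lines")
print(len(records), "CSV records (including the header)")
```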

BurntSushi avatar Jun 11 '19 14:06 BurntSushi

Ah, I misread the initial description. You're right, that tool is completely naive when it comes to nested newlines. It could potentially be 'upgraded' if there's a need for it!

sd2k avatar Jun 11 '19 14:06 sd2k

There definitely is :)

plainas avatar Jun 11 '19 14:06 plainas

Y'all might consider my suggested implementation strategy. There's really no need for a separate tool for the stated use case. That is, all you need to do is add random sorting to xsv sort. Once you have that, you can dice it up any way you want. It should be fairly easy to implement using rand's shuffle routine. PRs are welcome.

BurntSushi avatar Jun 11 '19 14:06 BurntSushi