Split data

Open leakyMirror opened this issue 11 years ago • 8 comments

Is it possible to split one file into two at random, so that one gets a fraction p of the entries and the other gets 1-p?

leakyMirror avatar Dec 21 '14 12:12 leakyMirror

The implementation doesn't currently do that. I like this idea, though! I'm going to add it, but am thinking about how to cleanly extend it to more than two files. CLI options could look something like this:

# Randomly deal data out to 4 output streams, out[1-4], with
# equal probability. All input ends up in one of them.
cat data | sample -d out1,out2,out3,out4

# Randomly deal data out to 4 output streams, out[1-4].
# out1 gets 25% of input, out2 gets 25% of remaining, ...
# Any lines not chosen for out[1-4] are discarded.
cat data | sample -d out1,out2,out3,out4 -p 0.25

# Same, but all remaining after out4 end up in the last file,
# which is called "rest" here.
cat data | sample -d out1,out2,out3,out4,rest -p 0.25 -D

Thoughts?

The other issue is that creating pipelines involving multiple output files can get awkward. In this case, it should probably open the files given to -d (creating them, if necessary), and append to them. If the user wants to divide the input to several pipe streams, mkfifo(1) could be used.

silentbicycle avatar Dec 21 '14 16:12 silentbicycle

Well, first of all, I haven't seen a Unix tool that outputs two or more streams, so I can't make any suggestions there. However, the second example is strange. How could any lines go unchosen if each of the four streams has a 25% probability?

# Randomly deal data out to 4 output streams, out[1-4].
# out1 gets 25% of input, out2 gets 25% of remaining, ...
# Any lines not chosen for out[1-4] are discarded.
cat data | sample -d out1,out2,out3,out4 -p 0.25

leakyMirror avatar Dec 21 '14 16:12 leakyMirror

A probability of 25% could mean one of two things here: each stream has a 25% chance of getting the line, as you describe, or the first stream has a 25% chance, then the next has a 25% chance of getting each line that remains, and so on. Under the second reading, if each stream has a probability of 50%, the shares actually end up as 50%, 25%, 12.5%, ... of the input. In that case it would be possible for some lines not to end up anywhere, which may be useful. It's probably not worth the additional complexity, though.
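
To make the sequential reading concrete, here is a minimal sketch of the arithmetic. `sequential_shares` is a hypothetical helper written for illustration, not code from sample: each stream keeps a fraction p of whatever the earlier streams passed on, and whatever is left after the last stream is discarded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of sample. Under the "sequential" reading,
 * stream i only sees the lines that streams 0..i-1 passed on, and keeps
 * each of them with probability p. Fills shares[i] with the expected
 * fraction of ALL input that stream i receives and returns the expected
 * fraction that no stream takes. */
double sequential_shares(double p, size_t n, double *shares)
{
    assert(p >= 0.0 && p <= 1.0);
    double remaining = 1.0;              /* fraction of input not yet taken */
    for (size_t i = 0; i < n; i++) {
        shares[i] = remaining * p;       /* stream i takes p of what is left */
        remaining -= shares[i];
    }
    return remaining;
}
```

For -p 0.25 across four streams this works out to shares of 25%, 18.75%, about 14.06% and about 10.55%, with roughly 31.64% of lines going nowhere.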

Instead:

# Each stream has a 25% chance of getting the input
# (one probability given applies to all)
cat data | sample -d out1,out2,out3,out4 -p 0.25

# Unequal probabilities
cat data | sample -d out1,out2,out3,out4 -p 0.2,0.5,0.1,0.2

# Unequal probabilities, trailing ',' means "remaining"
cat data | sample -d out1,out2,out3,out4 -p 0.2,0.5,0.1,

# Error: wrong number of probabilities
cat data | sample -d out1,out2,out3,out4 -p 0.2,0.5,0.1

# Error: probabilities > 100%
cat data | sample -d out1,out2,out3,out4 -p 0.2,0.5,0.1,0.3
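
The rules in these examples could be checked roughly as follows. This is only a sketch under assumptions: `parse_probs` is a hypothetical name, the small tolerance on the sum is my own choice, and it uses only standard C (`strtod`) rather than whatever the final implementation ends up using.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch of the -p rules above, not the actual implementation.
 * Parses "0.25", "0.2,0.5,0.1,0.2" or "0.2,0.5,0.1," into
 * probs[0..n_files-1]. Returns 0 on success, -1 on any error. */
int parse_probs(const char *arg, size_t n_files, double *probs)
{
    const double eps = 1e-9;   /* assumed tolerance for rounding error */
    size_t count = 0;
    int trailing_comma = 0;
    double sum = 0.0;

    if (n_files == 0) {
        return -1;
    }
    for (const char *s = arg; *s != '\0';) {
        char *end;
        double p = strtod(s, &end);
        /* Reject non-numbers, values outside [0, 1] (including NaN),
         * and more values than there are files. */
        if (end == s || !(p >= 0.0 && p <= 1.0) || count == n_files) {
            return -1;
        }
        probs[count++] = p;
        sum += p;
        s = end;
        if (*s == ',') {
            s++;
            trailing_comma = (*s == '\0');
        } else if (*s != '\0') {
            return -1;         /* junk after a number */
        }
    }
    if (count == 1 && !trailing_comma) {
        /* One probability given: it applies to every stream. */
        for (size_t i = 1; i < n_files; i++) {
            probs[i] = probs[0];
        }
        sum = probs[0] * (double)n_files;
    } else if (trailing_comma && count + 1 == n_files) {
        /* Trailing ',' means the last stream gets the remainder. */
        if (sum > 1.0 + eps) {
            return -1;
        }
        probs[count] = sum < 1.0 ? 1.0 - sum : 0.0;
        sum = 1.0;
    } else if (trailing_comma || count != n_files) {
        return -1;             /* wrong number of probabilities */
    }
    return sum > 1.0 + eps ? -1 : 0;   /* probabilities > 100% */
}
```

With four files, the first three examples above are accepted by this sketch and the last two are rejected.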

silentbicycle avatar Dec 21 '14 18:12 silentbicycle

Looks good.

leakyMirror avatar Dec 21 '14 18:12 leakyMirror

I pushed a branch, "deal", that adds this functionality. The multi-file, multi-probability command handling is a bit complicated, but does something reasonable in the various cases: -d a,b,c with no probabilities deals to each with 1/3 chance. -p 0.5,0.25, with 3 files puts half in the first, 1/4 in the second, and the remainder in the third. An empty filename in a list of files is treated as /dev/null. Writing out to fifos works nicely. The documentation still needs some work before I merge it to master, though.
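
For illustration, the per-line choice can be thought of as walking a cumulative distribution. The following is a sketch of that idea, not the code on the branch; `pick_stream` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the code on the "deal" branch. Given a uniform
 * random number u in [0, 1) and per-stream probabilities, returns the index
 * of the stream that should receive the current line, or -1 if the line
 * falls past the last stream and is discarded. If the probabilities are
 * meant to sum to exactly 1, floating-point rounding could in principle
 * leave a tiny gap at the top, so a caller may prefer to send that case to
 * the last stream. */
int pick_stream(double u, const double *probs, size_t n)
{
    assert(u >= 0.0 && u < 1.0);
    double upper = 0.0;                  /* running cumulative probability */
    for (size_t i = 0; i < n; i++) {
        upper += probs[i];
        if (u < upper) {
            return (int)i;
        }
    }
    return -1;
}
```

With probabilities {0.5, 0.25, 0.25}, a draw of 0.1 goes to the first stream, 0.6 to the second and 0.9 to the third. One way to obtain u would be random() / 2147483648.0, since POSIX specifies that random() returns values from 0 to 2^31 - 1.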

Thoughts?

silentbicycle avatar Dec 22 '14 02:12 silentbicycle

I can't compile it :( It fails with:

install: cannot stat ‘sample’: No such file or directory

leakyMirror avatar Dec 22 '14 21:12 leakyMirror

Does calling make and make install separately work? In the current makefile, the phony install target doesn't depend on sample being built. If the cause is something else, please let me know, along with OS and other details.

silentbicycle avatar Dec 22 '14 23:12 silentbicycle

make gives this:

cc -std=c99  -g -Wall -pedantic    -O3   -c -o main.o main.c
main.c: In function ‘handle_args’:
main.c:56:5: warning: implicit declaration of function ‘getopt’ [-Wimplicit-function-declaration]
     while ((fl = getopt(argc, argv, "hd:n:p:s:")) != -1) {
     ^
main.c:64:24: error: ‘optarg’ undeclared (first use in this function)
             deal_arg = optarg;
                        ^
main.c:64:24: note: each undeclared identifier is reported only once for each function it appears in
main.c:94:14: error: ‘optind’ undeclared (first use in this function)
     argc -= (optind-1);
              ^
main.c: In function ‘parse_percent_settings’:
main.c:124:9: warning: implicit declaration of function ‘strsep’ [-Wimplicit-function-declaration]
         for (char *fn = strsep(&deal_arg, ",");
         ^
main.c:124:25: warning: initialization makes pointer from integer without a cast [enabled by default]
         for (char *fn = strsep(&deal_arg, ",");
                         ^
main.c:124:9: error: declaration of non-variable ‘strsep’ in ‘for’ loop initial declaration
         for (char *fn = strsep(&deal_arg, ",");
         ^
main.c:125:21: warning: assignment makes pointer from integer without a cast [enabled by default]
              fn; fn = strsep(&deal_arg, ",")) {
                     ^
main.c:149:31: warning: initialization makes pointer from integer without a cast [enabled by default]
             for (char *perc = strsep(&perc_arg, ",");
                               ^
main.c:149:13: error: declaration of non-variable ‘strsep’ in ‘for’ loop initial declaration
             for (char *perc = strsep(&perc_arg, ",");
             ^
main.c:150:29: warning: assignment makes pointer from integer without a cast [enabled by default]
                  perc; perc = strsep(&perc_arg, ",")) {
                             ^
main.c: In function ‘line_iter’:
main.c:232:9: warning: implicit declaration of function ‘fgetln’ [-Wimplicit-function-declaration]
         char *line = fgetln(cfg->cur_file, &len);
         ^
main.c:232:22: warning: initialization makes pointer from integer without a cast [enabled by default]
         char *line = fgetln(cfg->cur_file, &len);
                      ^
main.c: In function ‘main’:
main.c:261:5: warning: implicit declaration of function ‘srandom’ [-Wimplicit-function-declaration]
     srandom(cfg.seed);
     ^
make: *** [main.o] Error 1

make install gives

install -c sample /usr/local/bin
install: cannot stat ‘sample’: No such file or directory
make: *** [install] Error 1

I think that moving it to /usr/local/bin is not necessary, because not everybody keeps their tools there :) For example, I like to keep everything custom in ~/bin.

I am using Elementary Isis, which is technically Ubuntu 14.04.

leakyMirror avatar Dec 23 '14 00:12 leakyMirror
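
The errors in the make output are consistent with how glibc behaves under -std=c99: in strict ISO C mode it hides POSIX and BSD declarations such as getopt, optarg, optind, strsep and srandom unless a feature test macro is defined before the first #include, and fgetln is a BSD function that glibc does not provide at all. As a hedged sketch, assuming glibc on Linux, the line-reading loop could use POSIX getline() instead of fgetln. `count_lines` is a hypothetical helper, not a function from main.c.

```c
/* Assumption: glibc on Linux. The feature test macro must be defined
 * before any #include so that POSIX.1-2008 declarations such as getline()
 * stay visible under -std=c99. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: counts the lines in f using getline(), which is
 * available where BSD's fgetln() is not. */
long count_lines(FILE *f)
{
    assert(f != NULL);
    char *line = NULL;   /* getline() allocates and grows this buffer */
    size_t cap = 0;
    long n = 0;
    while (getline(&line, &cap, f) != -1) {
        n++;
    }
    free(line);
    return n;
}
```

strsep and srandom are not covered by this macro alone; on glibc they would additionally need a BSD-style macro such as _DEFAULT_SOURCE (or _BSD_SOURCE on older releases).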