algorithm-archive icon indicating copy to clipboard operation
algorithm-archive copied to clipboard

Algorithms that require plotting

Open leios opened this issue 7 years ago • 21 comments

Hey guys,

There are a few algorithms in the archive that require plotting and a few more are on the way. Obviously, not all languages come with a plotter, so I have been outputting the data into a file format that can be plotted with gnuplot. I don't know if this is good practice or not. On the one hand, it provides a stable output format that allows all code from the AAA to look the same in the end and everything can be plotted with a simple gnuplot plot.gp - command.

On the other hand, some languages already have in-built plotters.

I feel that the languages that do not have plotters, should have a consistent output format. Gnuplot is good for this because it's cross-platform and kinda the standard for HPC plotting.

Should we require everyone to output to gnuplot or should we allow them to use their in-built plotters?

As a note: I think it is best to put the gnuplot script in a separate directory in the code/ directory for each algorithm

Anyway, I wanted to hear your thoughts on this.

leios avatar Sep 26 '18 21:09 leios

I think it's a great idea, as you know I've started doing this with split-op.

Although since a single script can only work with a specific format for the output data, we probably need you to take the lead into specifying that format with your Julia code (or whichever language you want to use). It's also probably best if we have the gnuplot script early on so that people can test it with their implementation.

jiegillet avatar Sep 26 '18 23:09 jiegillet

Yeah, this is what I was thinking. I mainly didn't want to limit the people who might be using julia / python / matlab / etc that have their own in-built plotters. For consistency, though, I think it's better to output everything in the same format. That way, we can check to make sure their code works with a simple diff command instead of looking at two separate images.

Also: I like having the code output raw data

leios avatar Sep 27 '18 00:09 leios

+1 for being able to do git diff --color-words --ignore-all-space --no-index on two data files.

berquist avatar Sep 27 '18 00:09 berquist

I don't think you can use git diff unless we decide to add the output files to the repo.

jiegillet avatar Sep 27 '18 00:09 jiegillet

But maybe we should? I mean, a single data file to compare against might be a good idea in some cases.

It's not like the data will be gigabytes in size, either.

leios avatar Sep 27 '18 00:09 leios

Yeah, why not. It could also be a very natural way to require code testing like we've been talking about. For some algorithms anyway.

jiegillet avatar Sep 27 '18 00:09 jiegillet

Text will compress well under version control, and only one copy of the output would be needed (the "reference").

berquist avatar Sep 27 '18 00:09 berquist

Yeah, alright. Let's put that on the list of things for me to do during Hacktoberfest (next month).

leios avatar Sep 27 '18 00:09 leios

Though, you don't need to have any under VC if you don't want. git diff --no-index actually means "run Git's diff algorithm on any arbitrary files, ignoring a Git index if one is present".

So it's enough to have a reference implementation that you can run (say Julia), but you may not want to require people who are writing their algorithm in Python, Java, ... to run the Julia code, etc. A minor choice.

berquist avatar Sep 27 '18 00:09 berquist

(Sorry, a bit unrelated) Does GitHub compress files?

jiegillet avatar Sep 27 '18 00:09 jiegillet

No. But when Git compresses non-binary files into objects under .git (whenever you commit), IME it does a pretty good job.

berquist avatar Sep 27 '18 00:09 berquist

I had an output file discussion before and the result of the strawpoll, I created, was: no output file. (https://github.com/algorithm-archivists/algorithm-archive/pull/55#issuecomment-367544339) So I think we shouldn't include the output file in the repo to keep in-line with previous decisions. It's also not adding much benefit, since the output file is generated by the code.

june128 avatar Sep 27 '18 00:09 june128

Ah, yeah. I remember this discussion now. Gale--Shapley was a bit of a different case, though. It wasn't meant to output actual data and there was nothing to plot.

Still, the discussion was already had and a result was made after about 8 votes. I feel like we should respect that. Should we have another vote here to see if the community stance has changed?

leios avatar Sep 27 '18 00:09 leios

While I still agree with the result of that poll, I feel that it's different for plotting algorithms since we want to use gnuplot scripts.

jiegillet avatar Sep 27 '18 00:09 jiegillet

Both are still output files. The job of every algorithm is to output some kind of data, whether it gets displayed or not.

I'm fine with another poll; I think it would be interesting to see, what the standpoint of the community is now. But I'm not sure, whether or not we should treat both output files the same. In the end, they are the same file logically, while the use-case differs.

june128 avatar Sep 27 '18 00:09 june128

I think I might have been against that case a while ago because we already had a few implementations for that method and I didn't want to go through and change the implementations we already had; however, I now agree that it would be nice to have some data file to check against even in that case.

For the cases we need to plot data, I also feel this data should be available.

Here is a poll for this discussion: https://www.strawpoll.me/16535759

I'll advertise it on stream tomorrow and Saturday when prepping for Hacktoberfest and decide from there.

leios avatar Sep 27 '18 06:09 leios

There is also just plain diff FILE1 FILE2 if you want to compare data files. You don't need git for that.

Butt4cak3 avatar Sep 27 '18 06:09 Butt4cak3

After thinking about the topic, I found several problem points of an output file, like we imagined:

Consistency

While we can provide an output file for the fractal algorithms and others, we can't for algorithms with random input. We would therefore set different standards for different algorithms, which hurts consistency of the AAA implementations.

Integrity

While at the first glance an output file might look like a convenient way to compare the output of different implementations, we can't be sure, that the output file, provided with the algorithm implementation, really represents the output. An update might come in and the author forgets to update the file. The only way, we can really be sure, what the output of a program is, is by running it. So all the output file does, is introducing another potential error. The output file therefore provides the delusion of integrity, while just causing potential harm.

Testing

Another thought that got mentioned is testing. But even that isn't really possible with the output file. Continuing the arguments of the Integrity point, we can't be fully sure, that the output file really provides the output of the current implementation. Testing the output file is therefore the same as asking the author, if they think, that the provided implementation works right. Since we already do that, moving on to testing an output file doesn't provide any benefit.

The only way, we can enhance testing of the implementations output, is by automatically running the code and comparing the actual output of the program - not the output file - to a pre-defined correct value.

Conclusion

I think, we should be really sure, before we introduce the little convenience of an unreliable, potentially harmful output file and the costs and potential error associated with it.

june128 avatar Sep 27 '18 18:09 june128

I didn't realize, that we were talking about one master output file. In this case, I think it's a valid option, since the above points don't apply.

june128 avatar Sep 27 '18 20:09 june128

I've considered the question more and I'm not so sure anymore. A problem that wasn't brought up is floating point values. Depending on the language, numerical errors might be handled differently. If you run a diff you might get hundreds of differences at the 16th decimal point or something. Then what?

jiegillet avatar Sep 28 '18 04:09 jiegillet

Split-op isn't done this way, but a width for printing should be set, something like 8 decimal places, which should avoid that problem. Really, a width that's sufficient to not suffer from floating point error, but indicates a problem with your algorithm.

berquist avatar Sep 28 '18 13:09 berquist