algorithm-archive
Algorithms that require plotting
Hey guys,
There are a few algorithms in the archive that require plotting, and a few more are on the way. Obviously, not all languages come with a plotter, so I have been outputting the data into a file format that can be plotted with gnuplot. I don't know if this is good practice or not. On the one hand, it provides a stable output format that makes all code from the AAA look the same in the end, and everything can be plotted with a simple gnuplot plot.gp command.
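As a rough sketch of what I mean (the file names output.dat / plot.gp and the sample data here are made up for illustration):

```shell
# Each implementation writes its data in a shared plain-text format,
# e.g. two whitespace-separated columns. Here we fake some sample data.
for i in $(seq 0 10); do
  echo "$i $((i * i))"
done > output.dat

# A minimal gnuplot script (plot.gp) that plots that shared format.
cat > plot.gp << 'EOF'
set terminal png
set output "plot.png"
plot "output.dat" using 1:2 with lines title "f(x)"
EOF

# Anyone can then render the plot with:
#   gnuplot plot.gp
```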
On the other hand, some languages already have in-built plotters.
I feel that the languages that do not have plotters should have a consistent output format. Gnuplot is good for this because it's cross-platform and kinda the standard for HPC plotting.
Should we require everyone to output to gnuplot or should we allow them to use their in-built plotters?
As a note: I think it is best to put the gnuplot script in a separate directory in the code/ directory for each algorithm.
Anyway, I wanted to hear your thoughts on this.
I think it's a great idea, as you know I've started doing this with split-op.
Although, since a single script can only work with a specific format for the output data, we probably need you to take the lead in specifying that format with your Julia code (or whichever language you want to use). It's also probably best if we have the gnuplot script early on so that people can test it with their implementations.
Yeah, this is what I was thinking. I mainly didn't want to limit the people who might be using julia / python / matlab / etc that have their own in-built plotters. For consistency, though, I think it's better to output everything in the same format. That way, we can check to make sure their code works with a simple diff command instead of looking at two separate images.
Also: I like having the code output raw data
+1 for being able to do git diff --color-words --ignore-all-space --no-index on two data files.
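For instance (file names made up; the trailing || true is there because git diff exits non-zero when the files differ):

```shell
# Two small data files that differ in whitespace and in one value.
printf '1 1.0\n2 4.0\n3 9.0\n' > reference.dat
printf '1  1.0\n2  4.0\n3  9.5\n' > candidate.dat

# --no-index compares files outside of any repository, and
# --ignore-all-space hides the whitespace-only differences, so only
# the real change (9.0 vs 9.5) is highlighted. git diff exits with 1
# when the files differ, hence the || true.
git diff --color-words --ignore-all-space --no-index reference.dat candidate.dat || true
```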
I don't think you can use git diff unless we decide to add the output files to the repo.
But maybe we should? I mean, a single data file to compare against might be a good idea in some cases.
It's not like the data will be gigabytes in size, either.
Yeah, why not. It could also be a very natural way to require code testing like we've been talking about. For some algorithms anyway.
Text will compress well under version control, and only one copy of the output would be needed (the "reference").
Yeah, alright. Let's put that on the list of things for me to do during Hacktoberfest (next month).
Though, you don't need to have any under VC if you don't want. git diff --no-index actually means "run Git's diff algorithm on any arbitrary files, ignoring a Git index if one is present".
So it's enough to have a reference implementation that you can run (say Julia), but you may not want to require people who are writing their algorithm in Python, Java, ... to run the Julia code, etc. A minor choice.
(Sorry, a bit unrelated) Does GitHub compress files?
No. But when Git compresses non-binary files into objects under .git (whenever you commit), IME it does a pretty good job.
I had an output file discussion before, and the result of the straw poll I created was: no output file. (https://github.com/algorithm-archivists/algorithm-archive/pull/55#issuecomment-367544339) So I think we shouldn't include the output file in the repo, to keep in line with previous decisions. It's also not adding much benefit, since the output file is generated by the code.
Ah, yeah. I remember this discussion now. Gale-Shapley was a bit of a different case, though. It wasn't meant to output actual data and there was nothing to plot.
Still, the discussion was already had and a decision was reached after about 8 votes. I feel like we should respect that. Should we have another vote here to see if the community's stance has changed?
While I still agree with the result of that poll, I feel that it's different for plotting algorithms since we want to use gnuplot scripts.
Both are still output files. The job of every algorithm is to output some kind of data, whether it gets displayed or not.
I'm fine with another poll; I think it would be interesting to see what the standpoint of the community is now. But I'm not sure whether we should treat both output files the same. In the end, they are logically the same kind of file, while the use case differs.
I think I might have been against that case a while ago because we already had a few implementations for that method and I didn't want to go through and change the implementations we already had; however, I now agree that it would be nice to have some data file to check against even in that case.
For the cases we need to plot data, I also feel this data should be available.
Here is a poll for this discussion: https://www.strawpoll.me/16535759
I'll advertise it on stream tomorrow and Saturday when prepping for Hacktoberfest and decide from there.
There is also just plain diff FILE1 FILE2 if you want to compare data files. You don't need git for that.
After thinking about the topic, I found several problems with an output file as we imagined it:
Consistency
While we can provide an output file for the fractal algorithms and others, we can't for algorithms with random input. We would therefore be setting different standards for different algorithms, which hurts the consistency of the AAA implementations.
Integrity
While at first glance an output file might look like a convenient way to compare the output of different implementations, we can't be sure that the output file provided with the algorithm implementation really represents that output. An update might come in and the author might forget to update the file. The only way we can really be sure what the output of a program is, is by running it. So all the output file does is introduce another potential error. The output file therefore provides the illusion of integrity while just causing potential harm.
Testing
Another thought that got mentioned is testing. But even that isn't really possible with the output file. Continuing the argument from the Integrity point, we can't be fully sure that the output file really contains the output of the current implementation. Testing against the output file is therefore the same as asking the author if they think the provided implementation works correctly. Since we already do that, moving on to testing an output file doesn't provide any benefit.
The only way we can enhance testing of the implementations' output is by automatically running the code and comparing the actual output of the program - not the output file - to a pre-defined correct value.
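A sketch of what such a check could look like (the reference values, file names, and the command standing in for actually running an implementation are all made up):

```shell
# Hypothetical reference output, pre-defined and known to be correct.
printf '1 1\n2 4\n3 9\n' > expected.dat

# Stand-in for actually running an implementation; in a real setup this
# would be something like `julia code/julia/algorithm.jl > actual.dat`.
printf '1 1\n2 4\n3 9\n' > actual.dat

# Compare the freshly generated output, not a committed output file.
if diff -q expected.dat actual.dat > /dev/null; then
  echo "PASS"
else
  echo "FAIL"
fi
```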
Conclusion
I think we should be really sure before we introduce an unreliable, potentially harmful output file: the little convenience it offers comes with the costs and potential errors described above.
I didn't realize that we were talking about one master output file. In that case, I think it's a valid option, since the above points don't apply.
I've considered the question more, and I'm not so sure anymore. A problem that wasn't brought up is floating-point values. Depending on the language, numerical errors might be handled differently. If you run a diff, you might get hundreds of differences at the 16th decimal place or something. Then what?
Split-op isn't done this way yet, but a fixed width for printing should be set, something like 8 decimal places, which should avoid that problem. Really, we want a width that's coarse enough not to suffer from floating-point error, but fine enough to still reveal a genuine problem with your algorithm.
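For example (the choice of 8 decimal places is arbitrary):

```shell
# Printing with a fixed number of decimal places keeps diffs between
# implementations from drowning in last-bit floating-point noise.
# The width (8 places) is an arbitrary choice for illustration.
printf '%.8f\n' 0.1234567891234   # -> 0.12345679
printf '%.8f\n' 3.14159265358979  # -> 3.14159265
```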