ReScience-submission Re-implementation of open source methods in another language

Dear ReScience editors, in the course of our work, we have created a Python implementation of a method that was previously available as open-source R code. Is this implementation within the scope of ReScience? Thanks! cc: @kpolimis, @bhazelton

Apr 18 '16 21:04 arokem

Do you have a reference paper to target for the replication?

Apr 19 '16 06:04 rougier

Yes. This is the paper: http://jmlr.csail.mit.edu/papers/volume15/wager14a/wager14a.pdf, and the previous implementation: https://github.com/swager/randomForestCI

Apr 19 '16 14:04 arokem

It is okay as long as you do not make a simple "translation" of the R code. The idea of the replication is really to check whether the original article is self-sufficient when describing the method or model (i.e. without the accompanying code) or whether some information is incorrect or missing. In the end, the original article + your article should be sufficient for future replications.

@khinsen What do you think?

Apr 19 '16 19:04 rougier

This seems like a better fit for a project like http://contrib.scikit-learn.org/ to me.

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/ReScience/ReScience-submission/issues/16#issuecomment-212077100

Apr 19 '16 19:04 FedericoV

Didn't know this. But anyway, you can do both actually (publication and contribution).

Apr 19 '16 19:04 rougier

@rougier I agree: the main point of ReScience is doing replication in the sense of writing a new implementation that should produce results identical to the published ones. If a published implementation already exists, we should ask for a "clean-room reimplementation", although this of course cannot be verified.

In my personal experience, a second independent implementation is a great way to find mistakes (in both implementations), so I am tempted to suggest that we even encourage that kind of submission for ReScience.

Apr 20 '16 15:04 khinsen

There's an additional issue as well: most code in R is licensed under the GPL, while most Python code is licensed under MIT. If the code is a clean-room implementation, you can use MIT/BSD as a license, while if it is a derivative work of the R code, you have to release it under the GPL, which limits its use within Python.

Apr 20 '16 16:04 FedericoV

Replication is not a derivative work for me.

Apr 20 '16 16:04 rougier

I am not a lawyer, but I believe that if you look at GPL code while you implement the Python code, it counts as derivative.

Apr 20 '16 16:04 FedericoV

Might be generally of relevance, but not for this particular case. We did not do a "clean room" implementation (IIUC what that would entail). Instead, we looked at the original code, but in this case, it is under an MIT license.

Apr 20 '16 16:04 arokem

You do of course have the right to look at the code, but the idea is to start from the paper and to look at the code only if a piece of information is missing from the paper or something remains obscure. Otherwise, if the original author made a mistake, you could end up just translating that mistake into your code.

Apr 20 '16 16:04 rougier

When you say mistake you mean a bug (of whatever type) as opposed to a mistake in the journal article, right?

Apr 20 '16 17:04 oliviaguest

No, I mean a mistake in the code in the sense that the code does not implement what is advertised in the paper. For example, you can write that you're integrating an equation using the Runge-Kutta numerical method while the code actually uses the explicit Euler method. In some cases this won't make a difference, but in other cases it could lead to different results, and hence it must be reported in the new article.
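To make that kind of mismatch concrete, here is a minimal sketch. It is purely illustrative: the test equation dy/dt = -2y, the step count, and every function name are assumptions chosen for the example, not taken from any paper or code discussed in this thread. It integrates the same equation with the explicit Euler method and with the classical fourth-order Runge-Kutta method and compares both to the exact solution.

```python
import math


def euler_step(f, t, y, h):
    """One explicit Euler step for dy/dt = f(t, y)."""
    return y + h * f(t, y)


def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for dy/dt = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate(step, f, y0, t_end, n_steps):
    """Integrate from t = 0 to t_end using n_steps equal steps of `step`."""
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y = step(f, t, y, h)
        t += h
    return y


def decay(t, y):
    """Hypothetical test problem: dy/dt = -2 y, exact solution exp(-2 t)."""
    return -2.0 * y


exact = math.exp(-2.0)
y_euler = integrate(euler_step, decay, 1.0, 1.0, 4)
y_rk4 = integrate(rk4_step, decay, 1.0, 1.0, 4)

print(f"exact = {exact:.6f}")
print(f"Euler = {y_euler:.6f}  (error {abs(y_euler - exact):.2e})")
print(f"RK4   = {y_rk4:.6f}  (error {abs(y_rk4 - exact):.2e})")
```

With a coarse step the two methods give visibly different answers, so a paper that names one method while the code uses the other could be hard to replicate from the text alone. With a much finer step the gap shrinks, which matches the remark that in some cases the difference will not matter.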

Apr 20 '16 17:04 rougier

I think we disagree on terminology, but not on the solution. If the implementation (code) doesn't match the specification (journal article), I would class that both as a bug (a mistake in the code, specifically one that can be seen as a logic error) and as a mistake in the journal article.

Apr 20 '16 18:04 oliviaguest

To clarify, in case it was not clear from the above, I do not mean that the presence of logic errors means there is a mistake in the journal article. There might be many logic errors without any mistakes in the article, merely because it does not matter in those specific cases that logic errors exist. But all mismatches between the reported specification and the implementation directly imply a mistake in the journal article in the case where the journal article serves as the only spec.

Apr 20 '16 19:04 oliviaguest

I agree. This is precisely the goal of replication in ReScience: to spot such mistakes (and also missing information) and to report them, so that the two articles (original + replication) now constitute a complete spec. For me the added value of replications in ReScience is more the article than the code.

For me, bugs (or errors) are something different (and worse) because they can invalidate results. For example, let's imagine you're using a fixed seed in your random generator (for debugging) and you forget to remove it before computing statistics over several runs of your model. This may very well invalidate all the results.
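A minimal sketch of that failure mode, under stated assumptions: `run_model` is a hypothetical stand-in for a stochastic model (here just the mean of 100 Gaussian draws), and the seed values are arbitrary. When the debugging seed is left in place, every "independent" run returns the same number, so any statistic computed across runs is meaningless.

```python
import random
import statistics


def run_model(seed=None):
    """Hypothetical stochastic model: mean of 100 standard normal draws."""
    rng = random.Random(seed)
    return statistics.mean(rng.gauss(0.0, 1.0) for _ in range(100))


# Bug: the debugging seed was never removed, so all runs are identical.
buggy_runs = [run_model(seed=42) for _ in range(10)]

# Intended behaviour: each run gets its own seed (still reproducible).
independent_runs = [run_model(seed=s) for s in range(10)]

print("distinct results with the fixed seed:", len(set(buggy_runs)))
print("distinct results with per-run seeds: ", len(set(independent_runs)))
print("spread with the fixed seed:", statistics.pstdev(buggy_runs))
print("spread with per-run seeds: ", statistics.pstdev(independent_runs))
```

Giving each run a distinct, recorded seed is one way to keep the experiment reproducible without collapsing the runs into a single sample.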

Apr 20 '16 20:04 rougier

I think it's just terminology/jargon that we disagree on. Basically 100% agreed. :smile:

Apr 20 '16 20:04 oliviaguest