
Example datasets should be chosen carefully

Open maelle opened this issue 11 months ago • 6 comments

  • Easy to install of course
  • But also inclusive + not offensive

maelle avatar Feb 06 '25 07:02 maelle

Since this issue grew out of the use of the iris dataset, and I consider the editor's remarks on its suitability appropriate, I have come to the following conclusion.

Millions of R users, and countless students of statistics who do not use R are familiar with the iris dataset, which is about flowers; however, it is entirely appropriate to criticise the use of this dataset due to the fact that for many years Ronald A. Fisher, who introduced iris into statistical education, promoted his eugenic ideas through his activity as a member of the Eugenics Society and its council.

The dataset that is included in base R and used in our tutorial was created by Edgar Anderson. What casts some doubt on the ethical use of this dataset is that it was first brought into statistical education by Ronald A. Fisher (along with many other elements of modern statistics). Fisher developed a linear discriminant model to distinguish the species in the dataset based on their features, which was published in the Annals of Eugenics (today the Annals of Human Genetics). The journal obtained its current name in 1954 to reflect changing perceptions of eugenics.

In their R Journal article "Palmer Archipelago Penguins Data in the palmerpenguins R Package - An Alternative to Anderson’s Irises", Allison M. Horst, Alison Presmanes Hill, and Kristen B. Gorman criticise the dataset for lacking data documentation and metadata, among other shortcomings, including that, in their view, the dataset "is burdened by a history in eugenics research."

Obviously, I think that it would be outrageous if we used a eugenics example in vignettes, but I think it is mistaken to say that the iris dataset is burdened by a history of eugenics. It is attributed to the botanist Edgar Anderson; in science, nobody is liable if their work is quoted by bad science or by racists. In science, we must quote sources, and the fact that problematic sources refer to a dataset does not cast a shadow on the original work. Furthermore, the work that brought the iris dataset into statistical education also does not seem to be associated with eugenics. It is about a linear discriminant model that is widely taught without any reference to eugenics.

In "The outstanding scientist, R.A. Fisher: his views on eugenics and race", Walter Bodmer, R. A. Bailey, Brian Charlesworth, Adam Eyre-Walker, Vernon Farewell, Andrew Mead & Stephen Senn analyse in depth the understandable criticism of Fisher's association with eugenics. The purpose of their article "is neither to defend nor attack Fisher’s work in eugenics and views on race, but to present a careful account of their substance and nature." The authors conclude that recent criticism of R. A. Fisher concentrates "on very limited aspects of his work and focusses attention on some of his views, both in terms of science and advocacy. This is entirely appropriate, but in re-assessing his many contributions to society, it is important to consider all aspects, and to respond in a responsible way—we should not forget any negative aspects, but equally not allow the negatives to completely overshadow the substantial benefits to modern scientific research."

I do not think that the palmerpenguins dataset is a good candidate to replace iris in introductory to intermediate statistics, because it has missing values and is imbalanced, and therefore it cannot be used in examples without the intermediate to advanced topic of imputation. Furthermore, R packages extend the R statistical ecosystem, and if the iris dataset is not suitable for its purpose, the criticism should be addressed to the R core team.

antaldaniel avatar Feb 17 '25 11:02 antaldaniel

@maelle I don't think that "easy to install" is too important here, as most package authors can and do choose from {datasets}. I think the only real issue is that we should discourage anybody from using iris because of this: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html. I would also trust that any cases that do arise can be handled by simple requests to use alternative datasets. Any explicit mention by us would only unnecessarily draw attention to that documentation. I think we can just close this without action. Feel free to do so if you agree.

mpadge avatar Feb 18 '25 10:02 mpadge

@mpadge I think that this controversy arises from the R Journal article that introduced palmerpenguins, which made the vague and refutable claim that the "iris dataset is burdened by eugenics", which it is not. The palmerpenguins dataset is still valuable for cases where more metadata and missing data are needed, which is not the case in my package under review. I do believe that the dataset package, which is in its second round of review here, is a very valuable R package, and it was intended to work as closely with base R as possible. Eventually I can remove iris from it, if that is the final verdict, but I think that the reason behind the advice to boycott the iris dataset is not really valid.

I have spent weeks researching this topic, and I did not find any criticism suggesting that Edgar Anderson, the author of this dataset, had anything to do with eugenics. It is also questionable whether R.A. Fisher, who cited him, had. Even if R.A. Fisher had conducted himself questionably, that would not discredit the original author. Removing a dataset that has been a core example in statistical education for almost 90 years because of a questionable quotation of it would, in my opinion, be a bad application of ethical scientific conduct. In science, it is mandatory to quote an author, and sometimes questionable articles quote an author. But the original author is not responsible for being quoted, and iris is literally the most quoted dataset in the world.

Again, if I really need to, I will rewrite all the tutorials, but I think that the recommendation to boycott iris should no longer stand. My package is, in my opinion, a great fit with several rOpenSci packages.

antaldaniel avatar Feb 19 '25 08:02 antaldaniel

Thanks @antaldaniel for your willingness to reconsider, which is indeed what we intend to request in the review thread. We also do not intend to muddy that issue with debates on Fisher, so let me note here: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html has the "correct" citation for that dataset as Annals of Eugenics. The additional attribution to Anderson is an appropriate and necessary acknowledgement of the original source of the data, but Anderson's published manuscript is purely descriptive and has no direct relationship with the {iris} dataset. It was Fisher who published the data, with his name alone as author, and therefore Fisher is the original "author" of the dataset. Its appearance in that article is not a "quotation"; that article and that article alone represents the singular published origin of the dataset. And that article was published in Annals of Eugenics. rOpenSci considers the use of any source material from that journal to be inappropriate.

In short: The {iris} dataset was originally published by Fisher, and only Fisher, in a definitely problematic journal. We will update our guidelines to clearly indicate that we find any further usage of data from such a problematic source to be against our goals of inclusiveness. Our Code of Conduct states: "We prioritize marginalized people’s safety over privileged people’s comfort." We accordingly judge further usage of this dataset as unacceptable.

mpadge avatar Feb 19 '25 10:02 mpadge

@mpadge Thank you for this clarification. I would find it ethical to provide a sunset date because the review of my package started about 2 years ago, and the removal of the iris dataset would be a significant amount of work. The entire package has nothing to do with iris, botany, or eugenics, and I would greatly appreciate it if at least the actual review could go on, so that I am aware of the other requirements for inclusion in rOpenSci.

I understand that you will create a list of publications that cannot be cited, including Annals of Eugenics (today the Annals of Human Genetics). By the time that list of inadmissible citations is made public, I will have created a new tutorial. However, it would also be helpful to get feedback on what the package should do, so that the tutorial is created with the new features already in place.

antaldaniel avatar Feb 19 '25 12:02 antaldaniel

All right, I think that I will rewrite the package, even though this means changing about 40 examples and two vignettes, which were already in place in the first round of the review. I do, however, find your curation method flawed. In library science and digital humanities, but also in the natural and social sciences, we work a lot with similar issues, which are far more problematic. Scientific collections created in the 19th and early 20th centuries are full of racist and colonialist labels and descriptions. Yet we do not throw out those invaluable collections, specimens, and artefacts, but carefully manage the data provenance so that they can be presented without offence to a 21st century scientific community.

The iris dataset is far less problematic than most of the cases I know. Edgar Anderson is the undisputed creator of the dataset; Fisher never claimed authorship, and Edgar Anderson had nothing to do with racism. Disallowing quotation of his work is not in line with the ethics of scientific quotation. Then, by the time Fisher quoted his work, the Annals of Human Genetics had had nothing to do with eugenics for a long time. The journal changed its name 70 years ago, but already when Fisher published the iris dataset there, it did not deal with eugenics in the sense that we rightly find offensive. And of course, Fisher's article is pure statistical theory. So we are dealing with an issue similar to those scientific collections that have objectionable original labelling, but in those cases, the correct way is to change the labelling, not to delete the earlier scientific discovery.

It would be perfectly OK to cite Edgar Anderson directly, because that would fit the modern norm of citing datasets; that is exactly the way it was in my package for three years. It would also be completely OK to inform the reader of Fisher's article that it has nothing to do with what later made eugenics infamous, for example, by using the journal title that has been in use for the last 70 years, and by leaving trails for those who are really interested in the story of how Fisher got associated with eugenics, what eugenics meant before Nazism, etc.

I really do not think that it is a good curatorial policy to deem "sources" inadmissible; I think that certain publications should be inadmissible. Neither Fisher's article quoting Anderson's dataset nor Anderson's work is offensive to any modern scientist. On the other hand, it would be truly objectionable if somebody quoted datasets that were used in eugenics, regardless of whether they were published in a benevolently named journal or not.

I will remove the iris dataset in about a month as I work on the package, and I will keep this story as a reminder of the importance of the data provenance management features of the package, which are still in their infancy.

antaldaniel avatar Feb 23 '25 20:02 antaldaniel