
On data privacy and ethics in open-source data analysis

Open will-ww opened this issue 1 year ago • 0 comments

Description

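One of the papers we recently submitted to MSR involved users' geographic locations, email addresses, and similar information, and several different reviewers pointed out data security, privacy, and ethics problems. We need to take this into account when publishing papers from now on and reduce this kind of risk as much as possible: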

Paper title: Global Insights for GitHub Developers: A Fine-Grained Geographic Dataset and Analysis

Relevant review comments:

  • Data Anonymization (Reviewer A)

    • Releasing a dataset of GitHub developers without anonymizing the data, particularly when including email addresses and inferred locations, raises serious privacy and security concerns. Email addresses are sensitive personal information, and the release of such data without consent can violate user privacy. Additionally, inferred locations, when combined with other details, may pose risks such as stalking and harassment. The dataset could become a target for spammers and phishing attacks, exposing individuals to unwanted communication.
    • The security risks associated with releasing email addresses include an increased susceptibility to email-based attacks, potentially compromising user accounts. Beyond the immediate privacy implications, there are legal and ethical concerns, as violating privacy laws could result in legal consequences. Releasing sensitive information without proper anonymization may damage the reputation of GitHub among developers.
    • To address these concerns, it is recommended to anonymize the data by removing or replacing personally identifiable information and aggregating or generalizing location details. Obtaining explicit consent from individuals before including their information is crucial, and transparently communicating the purpose of the dataset can help build trust. Compliance with relevant data protection regulations, providing opt-out options, and prioritizing ethical considerations are essential to ensure responsible data handling and mitigate potential negative consequences for both individuals and organizations involved.
    • Even if this information is publicly available on GitHub and developers have willingly released certain details, the act of inferring and aggregating this information, making it accessible to everyone, may still raise concerns regarding privacy, security, and ethical data handling practices.
  • Ethical problems (Reviewer B)

    • The proposed dataset seems interesting and useful for future studies. Despite this, I have a major concern: in recent years, there has been a lot of discussion about the ethicality of mining accounts from GitHub, GitLab, etc. For this reason, Gold and Krinke [1] have provided guidelines to follow for ethical mining. The paper does not mention these guidelines and, in general, does not mention any ethical aspects. I strongly suggest adding a discussion of this point in the paper to reinforce the proposed dataset and give as much information as possible to help future practitioners understand whether or not they can use this dataset without having ethical problems.
    • [1] Gold, Nicolas E., and Jens Krinke. "Ethics in the mining of software repositories." Empirical Software Engineering 27.1 (2022): 17.
  • Anonymized problems (Reviewer C)

    • The data is not anonymized, which raises many privacy and ethical concerns. Please anonymise the data.
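
As a rough illustration of what Reviewer A's recommendation (remove or replace personally identifiable information, generalize and aggregate locations) could look like before a dataset release, here is a minimal Python sketch. It is not the pipeline used in the paper. The record fields (`login`, `email`, `city`, `country`, `lat`, `lon`), the key handling, and the `min_count` threshold are all assumptions made for the example. Note also that a keyed hash is pseudonymization rather than full anonymization, so it lowers the risk but does not remove it.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret key: it must stay private and never ship with the dataset.
SECRET_KEY = b"replace-with-a-private-random-key"


def pseudonymize_email(email: str, key: bytes = SECRET_KEY) -> str:
    """Replace an email address with a keyed SHA-256 digest (HMAC).

    Records stay linkable to each other, but the address itself is not
    released. This is pseudonymization, not full anonymization.
    """
    normalized = email.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict) -> dict:
    """Drop direct identifiers and keep only a country-level location."""
    return {
        "user_id": pseudonymize_email(record["email"]),
        # City and coordinates are deliberately dropped (generalization).
        "country": record.get("country"),
    }


def country_counts(records: list, min_count: int = 5) -> dict:
    """Aggregate developers per country, suppressing small groups.

    Countries with fewer than `min_count` developers are merged into
    "Other" so that rare locations do not single out individuals.
    """
    counts = Counter(r.get("country") or "Unknown" for r in records)
    result = {}
    for country, n in counts.items():
        if n >= min_count:
            result[country] = n
        else:
            result["Other"] = result.get("Other", 0) + n
    return result
```

A release built this way would contain only the `user_id` and `country` fields, or only the aggregated counts. Consent, opt-out handling, and compliance with data protection regulations, which the reviewers also raise, cannot be solved in code and still need to be addressed separately.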

The paper mentioned above looks quite good. When writing future papers, we can consider citing it and following it as a guideline:

will-ww · Jan 15 '24 01:01