starcoder2
What prevents you from thoroughly open-sourcing?
I noticed that although bigcode/starcoder(2) is much more open than Code Llama and DeepSeek-Coder (for example, open-sourced datasets and clearly described data processing and training), it is still not thoroughly open: the code used for pretraining and data processing has never been open-sourced. So, just out of curiosity, what prevents you from doing that?
Just want to point out that the data processing pipeline is open-source (https://github.com/bigcode-project/the-stack-v2). The same is true for StarCoder1 (https://github.com/bigcode-project/bigcode-dataset/). To my knowledge, the StarCoder models are the only code LLMs with such great transparency.
Wow I just found this repo and sorry for my ignorance...
#4 The reasons for not fully open-sourcing pretraining and data processing code in projects like BigCode/StarCoder(2) may include:
Intellectual Property: Protecting unique innovations or proprietary techniques.
Security Concerns: Preventing misuse of powerful AI models.
Quality and Reputation: Ensuring the quality of the code and avoiding negative impacts from misuse.
Resource Constraints: The high resource requirement for supporting an open-source project.
Legal Agreements: Restrictions due to collaborations or partnerships.
Data Privacy: Compliance with legal constraints related to data privacy and copyright.
These factors balance transparency against practical concerns such as security, legal obligations, and resource management.
Thanks a lot.