starcoder2 What prevents you from thoroughly open-sourcing?

I noticed that although bigcode/starcoder(2) is much more open than Code Llama and DeepSeek-Coder (e.g., open-sourced datasets, clearly described data processing and training, and so on), it is still not thoroughly open: the code used for pretraining and data processing has never been open-sourced. So, just out of curiosity, what prevents you from doing that?

Feb 29 '24 02:02 yucc-leon

Just want to point out that the data processing pipeline is open-source (https://github.com/bigcode-project/the-stack-v2). The same is true for StarCoder1 (https://github.com/bigcode-project/bigcode-dataset/). To my knowledge, the StarCoder models are the only code LLMs with such great transparency.

Feb 29 '24 03:02 UniverseFly

Wow, I just found this repo. Sorry for my ignorance...

Feb 29 '24 03:02 yucc-leon

#4 The reasons for not fully open-sourcing pretraining and data processing code in projects like BigCode/StarCoder(2) may include:

Intellectual Property: Protecting unique innovations or proprietary techniques.

Security Concerns: Preventing misuse of powerful AI models.

Quality and Reputation: Ensuring the quality of the code and avoiding negative impacts from misuse.

Resource Constraints: The high resource requirement for supporting an open-source project.

Legal Agreements: Restrictions due to collaborations or partnerships.

Data Privacy: Compliance with legal constraints related to data privacy and copyright.

These factors balance transparency against practical concerns such as security, legal compliance, and resource management.

Mar 01 '24 07:03 udaygiri

Thanks a lot.

Mar 04 '24 06:03 yucc-leon