filecoin-plus-large-datasets
[DataCap Application] Commoncrawl(3/3)
Data Owner Name
Commoncrawl
What is your role related to the dataset
Data Preparer
Data Owner Country/Region
United States
Data Owner Industry
Life Science / Healthcare
Website
https://commoncrawl.org/
Social Media
https://commoncrawl.org/
Total amount of DataCap being requested
15PiB
Expected size of single dataset (one copy)
2.5PiB
Number of replicas to store
6
Weekly allocation of DataCap requested
1PiB
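As a quick consistency check on the figures above, the requested total should equal the single-copy size times the replica count. The sketch below reproduces that arithmetic. The function name `datacap_plan` is hypothetical, and the minimum duration assumes the weekly rate is used in full every week.

```python
import math
from fractions import Fraction


def datacap_plan(single_copy_pib, replicas, weekly_pib):
    """Return (total PiB needed, minimum number of weeks at the weekly rate)."""
    total = Fraction(single_copy_pib) * replicas
    # Assumption: the full weekly allocation is consumed every week.
    weeks = math.ceil(total / Fraction(weekly_pib))
    return total, weeks


# Figures taken from the application: 2.5 PiB per copy, 6 replicas, 1 PiB per week.
total_pib, min_weeks = datacap_plan("2.5", 6, 1)
print(f"total: {total_pib} PiB, at least {min_weeks} weeks")
```

With the values from this application, the product matches the 15PiB requested.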
On-chain address for first allocation
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
Data Type of Application
Public, Open Dataset (Research/Non-Profit)
Custom multisig
- [ ] Use Custom Multisig
Identifier
No response
Share a brief history of your project and organization
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8,000 research papers. 3–5 billion new pages added each month.
Is this project associated with other projects/ecosystem stakeholders?
Yes
If answered yes, what are the other projects/ecosystem stakeholders
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2287
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2204
Describe the data being stored onto Filecoin
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8,000 research papers. 3–5 billion new pages added each month.
Where was the data currently stored in this dataset sourced from
AWS Cloud
If you answered "Other" in the previous question, enter the details here
No response
If you are a data preparer, what is your location (Country/Region)
China
If you are a data preparer, how will the data be prepared? Please include tooling used and technical details.
We use a script to package the files, originally stored on an nginx file server, into tar files. Each tar file is kept to around 17-30G. Each tar file is then converted into a CAR file. After the conversion is completed, a record of the CAR file and the metadata of its source files is stored in our local system for later queries.
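A minimal sketch of the tar batching step described above, not the applicant's actual script. It greedily groups files so that each tar stays at or below an upper size bound. The names `plan_batches` and `write_tar` are hypothetical, "G" is assumed to mean GiB, and the lower bound of 17G is not enforced (the last batch may be smaller). Conversion of each tar to a CAR file and the metadata record are left out because the tooling is not named in the answer.

```python
import tarfile

GIB = 1024 ** 3  # assumption: "G" in the answer above means GiB


def plan_batches(files, max_bytes=30 * GIB):
    """Greedily group (path, size_in_bytes) pairs into batches of at most max_bytes.

    A single file larger than max_bytes is placed in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for path, size in files:
        # Close the current batch if adding this file would exceed the bound.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def write_tar(batch, tar_path):
    """Write one batch of file paths into an uncompressed tar archive."""
    with tarfile.open(tar_path, "w") as tar:
        for path in batch:
            tar.add(path)


# Example with made-up file names and sizes.
example = [("a", 10 * GIB), ("b", 15 * GIB), ("c", 10 * GIB), ("d", 40 * GIB)]
print(plan_batches(example))
```

Planning is kept separate from writing so the batch layout can be checked, and recorded alongside the source-file metadata, before any archive is produced.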
If you are not preparing the data, who will prepare the data? (Provide name and business)
No response
Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.
This website hosts a large amount of data. As far as I know, no one has systematically stored all of it on the Filecoin network.
Please share a sample of the data
https://commoncrawl.org/
Confirm that this is a public dataset that can be retrieved by anyone on the Network
- [X] I confirm
If you chose not to confirm, what was the reason
No response
What is the expected retrieval frequency for this data
Yearly
For how long do you plan to keep this dataset stored on Filecoin
2 to 3 years
In which geographies do you plan on making storage deals
Greater China, Asia other than Greater China, Europe
How will you be distributing your data to storage providers
Cloud storage (i.e. S3), HTTP or FTP server, Shipping hard drives
How do you plan to choose storage providers
Slack, Big Data Exchange, Partners
If you answered "Others" in the previous question, what is the tool or platform you plan to use
No response
If you already have a list of storage providers to work with, fill out their names and provider IDs below
No response
How do you plan to make deals to your storage providers
Boost client, Lotus client
If you answered "Others/custom tool" in the previous question, enter the details here
No response
Can you confirm that you will follow the Fil+ guideline
Yes
Please provide ID, City, Country, Organization of each SP here.
| Provider | Location | SP Entity or Personal |
|---|---|---|
| f02199203 | Inner Mongolia | Richard |
| f02223170 | HK | tianyou |
| f02831201 | GuangDong | Juwu Mine |
| f02824157 | BeiJing | zhongchuangyun |
These are our partner SPs. Around January 15th, we will add 5-7 SPs from Japan, Vietnam and Hong Kong. We will list them once they are online. Thank you.
Hello, per the https://github.com/filecoin-project/notary-governance/issues/922 for Open, Public Dataset applicants, please complete the following Fil+ registration form to identify yourself as the applicant and also please add the contact information of the SP entities you are working with to store copies of the data.
This information will be reviewed by Fil+ Governance team to confirm validity and then the application will be allowed to move forward for additional notary review.
SP List provided: [{"providerID":"f02199203","City":"InnerMongolia","Country":"China","SPOrg":"Richard"}, {"providerID":"f02223170","City":"HK","Country":"China","SPOrg":"tianyou"}, {"providerID":"f02831201","City":"GuangDong","Country":"China","SPOrg":"JuwuMine"}, {"providerID":"f02824157","City":"BeiJing","Country":"China","SPOrg":"zhongchuangyun"}]
https://www.ipqualityscore.com/user/search is a public, well-known and unbiased geolocation detection tool. I paid to check the SPs we cooperate with, and the results show that their locations are real.
f02199203 116.136.130.130
f02824157 116.172.66.38
f02824140 116.172.66.38
f02841613 210.209.77.161
f02831202 14.29.124.50
f0122215 119.167.140.136
Detection method: find the IP address corresponding to the SP in Boost, enter the IP, and view the detection result. If an SP uses a VPN, the detection score may be greater than 70 points. A detection score of 0 points means no fraud, which shows that the SP's address is an honest address.
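The score interpretation described above can be sketched as a small helper. This is only an illustration of the stated thresholds and does not call any IPQualityScore API: the score is assumed to have been read manually from the search page, the 0-100 range is an assumption, and the function name and the "needs review" label for scores between 1 and 70 are hypothetical.

```python
def classify_score(score: int) -> str:
    """Interpret a fraud score using the thresholds from the comment above."""
    if not 0 <= score <= 100:  # assumption: scores range from 0 to 100
        raise ValueError(f"score out of range: {score}")
    if score == 0:
        return "clean"
    if score > 70:
        return "possible VPN or proxy"
    # The comment does not say how to treat scores from 1 to 70.
    return "needs review"


print(classify_score(0))
```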
Can you help us move forward? Thank you. @Sunnyiscoming
It has been two weeks since we applied, but the application still hasn’t been approved. In the meantime our partner SPs have changed, so we have updated the list:
f02199203 Richard Nei Mongol(Inner Mongolia) 116.136.130.130
f02824157 zhongchuangyun GuangDong 116.172.66.38
f02824140 zhongchuangyun GuangDong 116.172.66.38
f02831202 Juwu Mine GuangDong 14.29.124.50
f0122215 SuSuanYun ShanDong 119.167.140.136
This clearly shows the location of each SP. The checks show that the SPs we cooperate with are honest, and we hope to receive your approval. @Sunnyiscoming @Filplus-govteam @galen-mcandrew @Kevin-FF-USA @clriesco
Please tell me what else I need to do. @Sunnyiscoming
Deleting comment
@Sunnyiscoming does not have permission to post this comment.
Please, contact the assignee of this issue.
Datacap Request Trigger
Total DataCap requested
15PiB
Expected weekly DataCap usage rate
1PiB
Client address
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
DataCap Allocation requested
Multisig Notary address
f02049625
Client address
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
DataCap allocation requested
512TiB
Id
af9739bf-5ed7-4e71-a41a-9703387d3d7c
Request Proposed
Your Datacap Allocation Request has been proposed by the Notary
Message sent to Filecoin Network
bafy2bzacectswk4dafjr6af3r5yuizh7hbc4oimxjdup67cn35ti32cgibpj2
Address
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
Datacap Allocated
512.00TiB
Signer Address
f1n5wlrrhoxpkgwij25xrtt7w7g2k3fhbthmdn6ri
Id
af9739bf-5ed7-4e71-a41a-9703387d3d7c
You can check the status of the message here: https://filfox.info/en/message/bafy2bzacectswk4dafjr6af3r5yuizh7hbc4oimxjdup67cn35ti32cgibpj2
Request Approved
Your Datacap Allocation Request has been approved by the Notary
Message sent to Filecoin Network
bafy2bzaceazuzdny6ieipf5mg7gczzggk3tfspo7eqpkdvzp4ncxld64rufzw
Address
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
Datacap Allocated
512.00TiB
Signer Address
f12mckci3omexgzoeosjvstcfxfe4vqw7owdia3da
Id
af9739bf-5ed7-4e71-a41a-9703387d3d7c
You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceazuzdny6ieipf5mg7gczzggk3tfspo7eqpkdvzp4ncxld64rufzw
DataCap Allocation requested
Request number 2
Multisig Notary address
f02049625
Client address
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
DataCap allocation requested
512TiB
Id
d029cae8-cd31-43e7-a662-5bf32b3fdcda
checker:manualTrigger
DataCap and CID Checker Report Summary[^1]
Storage Provider Distribution
✔️ Storage provider distribution looks healthy.
Deal Data Replication
✔️ Data replication looks healthy.
Deal Data Shared with other Clients[^3]
✔️ No CID sharing has been observed.
[^1]: To manually trigger this report, add a comment with text checker:manualTrigger
[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger
[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...
Full report
Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.
Request Proposed
Your Datacap Allocation Request has been proposed by the Notary
Message sent to Filecoin Network
bafy2bzaced7b7bbm2bkzhsbjfcjeismiqilyfnkuzynhhckx233ya7dsaduz6
Address
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
Datacap Allocated
512.00TiB
Signer Address
f1mdk7s2vntzm6hu35yuo6vjubtrpfnb2awhgvrri
Id
d029cae8-cd31-43e7-a662-5bf32b3fdcda
You can check the status of the message here: https://filfox.info/en/message/bafy2bzaced7b7bbm2bkzhsbjfcjeismiqilyfnkuzynhhckx233ya7dsaduz6
Request Approved
Your Datacap Allocation Request has been approved by the Notary
Message sent to Filecoin Network
bafy2bzaceadklavzjgpkxq7doxuxqmffy6xrmm36osm2jx6tmw5b7zk4tj6re
Address
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
Datacap Allocated
512.00TiB
Signer Address
f1dnb3uz7sylxk6emti3ififcvu3nlufnnsjui6ea
Id
d029cae8-cd31-43e7-a662-5bf32b3fdcda
You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceadklavzjgpkxq7doxuxqmffy6xrmm36osm2jx6tmw5b7zk4tj6re
DataCap Allocation requested
Request number 2
Multisig Notary address
f02049625
Client address
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
DataCap allocation requested
512TiB
Id
edd783b6-a019-4a56-9dac-f19f3e94026a
checker:manualTrigger
DataCap and CID Checker Report Summary[^1]
Storage Provider Distribution
✔️ Storage provider distribution looks healthy.
Deal Data Replication
✔️ Data replication looks healthy.
Deal Data Shared with other Clients[^3]
✔️ No CID sharing has been observed.
[^1]: To manually trigger this report, add a comment with text checker:manualTrigger
[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger
[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...
Full report
Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.
Request Proposed
Your Datacap Allocation Request has been proposed by the Notary
Message sent to Filecoin Network
bafy2bzacecvo5lwxhrielwv6j6aed2orwhjvovc7cqqjpiigwwl4sjbuppcag
Address
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
Datacap Allocated
512.00TiB
Signer Address
f1pnmzlxj7cfeo2v6oj5nco46hkg2l46wj7o4xxui
Id
edd783b6-a019-4a56-9dac-f19f3e94026a
You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecvo5lwxhrielwv6j6aed2orwhjvovc7cqqjpiigwwl4sjbuppcag
Request Approved
Your Datacap Allocation Request has been approved by the Notary
Message sent to Filecoin Network
bafy2bzacecyzpjjrkm7pzvjphldae7xx6zuwrfrkzmagvrucqt7gnq5ntlxww
Address
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
Datacap Allocated
512.00TiB
Signer Address
f1c5non5yf35avgcpsqvxu4yj54yyvxorwyjochqq
Id
edd783b6-a019-4a56-9dac-f19f3e94026a
You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecyzpjjrkm7pzvjphldae7xx6zuwrfrkzmagvrucqt7gnq5ntlxww
DataCap Allocation requested
Request number 3
Multisig Notary address
f02049625
Client address
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
DataCap allocation requested
1PiB
Id
9a7cfd96-e241-4712-98c6-54a9708ba57f
Request Proposed
Your Datacap Allocation Request has been proposed by the Notary
Message sent to Filecoin Network
bafy2bzacebzyn3jc3xoghq5syfbuzdhccyfsge4jltrjeff6kar4bgy7lpp4k
Address
f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy
Datacap Allocated
1.00PiB
Signer Address
f1xrnysd4gimg64d4l6qi7ulzwwq22c6vfg6lpw3i
Id
9a7cfd96-e241-4712-98c6-54a9708ba57f
You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebzyn3jc3xoghq5syfbuzdhccyfsge4jltrjeff6kar4bgy7lpp4k
Please make sure to distribute data to SPs outside of GCR; otherwise LGTM.
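For reference, the allocations recorded in this thread can be tallied against the 15PiB requested. The sketch below is an illustration only: the sizes and Id prefixes are transcribed from the bot messages above, the last request is treated as "proposed" because only a proposal message appears for it here, and the function name `tally` is hypothetical.

```python
TIB_PER_PIB = 1024

# (Id prefix, size in TiB, status) transcribed from the allocation messages above.
requests = [
    ("af9739bf", 512, "approved"),
    ("d029cae8", 512, "approved"),
    ("edd783b6", 512, "approved"),
    ("9a7cfd96", 1024, "proposed"),  # only a "Request Proposed" message is shown
]


def tally(requests, total_requested_pib=15):
    """Return (approved, pending, remaining) in TiB."""
    approved = sum(size for _, size, status in requests if status == "approved")
    pending = sum(size for _, size, status in requests if status == "proposed")
    remaining = total_requested_pib * TIB_PER_PIB - approved - pending
    return approved, pending, remaining


approved, pending, remaining = tally(requests)
print(f"approved: {approved / TIB_PER_PIB} PiB")
print(f"pending: {pending / TIB_PER_PIB} PiB")
print(f"remaining: {remaining / TIB_PER_PIB} PiB")
```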