filecoin-plus-large-datasets

[DataCap Application] Commoncrawl(3/3)

Open nicelove666 opened this issue 1 year ago • 66 comments

Data Owner Name

Commoncrawl

What is your role related to the dataset

Data Preparer

Data Owner Country/Region

United States

Data Owner Industry

Life Science / Healthcare

Website

https://commoncrawl.org/

Social Media

https://commoncrawl.org/

Total amount of DataCap being requested

15PiB

Expected size of single dataset (one copy)

2.5PiB

Number of replicas to store

6

Weekly allocation of DataCap requested

1PiB
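
As a quick sanity check on the figures above, the requested total equals the single-copy size multiplied by the replica count, and the weekly allocation implies a minimum drawdown period. A minimal sketch using only the numbers from this application; the variable names are illustrative:

```python
from fractions import Fraction

# Figures taken from the application form.
single_copy_pib = Fraction(5, 2)  # 2.5 PiB for one copy
replicas = 6
weekly_allocation_pib = 1

# Total DataCap needed to store every replica.
total_pib = single_copy_pib * replicas

# Minimum number of weeks to draw down the total at the weekly rate.
weeks_to_allocate = total_pib / weekly_allocation_pib

print(f"total requested: {float(total_pib)} PiB")
print(f"weeks at the weekly rate: {float(weeks_to_allocate)}")
```

This yields 15 PiB, which matches the total requested, spread over at least 15 weeks.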

On-chain address for first allocation

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

  • [ ] Use Custom Multisig

Identifier

No response

Share a brief history of your project and organization

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8000 research papers. 3–5 billion new pages added each month.

Is this project associated with other projects/ecosystem stakeholders?

Yes

If answered yes, what are the other projects/ecosystem stakeholders

https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2287
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2204

Describe the data being stored onto Filecoin

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8000 research papers. 3–5 billion new pages added each month.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

If you are a data preparer. What is your location (Country/Region)

China

If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?

We use a script to package the files, originally stored on an nginx file server, into tar files of roughly 17-30G each. Each tar file is then converted into a CAR file. After the conversion is complete, a record of the CAR file and the metadata of its source files is stored in our local system for later queries.
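
The packaging step described above could look roughly like the sketch below. It is not the applicant's actual script: the function names, the JSON-lines manifest format, and the reading of "30G" as 30 GiB are all assumptions made for illustration. The CAR conversion is left out because the application does not name the tool used for it.

```python
import json
import tarfile

GIB = 1024 ** 3  # assumption: "G" in the application means GiB


def plan_batches(files, max_bytes=30 * GIB):
    """Group (path, size) pairs into batches of at most max_bytes each.

    A single file larger than max_bytes ends up in a batch of its own.
    The lower bound (about 17G) is not enforced here.
    """
    batches, current, current_size = [], [], 0
    for path, size in files:
        # Close the current batch if adding this file would exceed the cap.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append((path, size))
        current_size += size
    if current:
        batches.append(current)
    return batches


def write_batch(batch, tar_path, manifest_path):
    """Write one batch to an uncompressed tar and log one record per file."""
    with tarfile.open(tar_path, "w") as tar:
        for path, _size in batch:
            tar.add(str(path))
    # Append metadata so each source file can be traced to its tar later.
    with open(manifest_path, "a", encoding="utf-8") as manifest:
        for path, size in batch:
            record = {"tar": str(tar_path), "source": str(path), "bytes": size}
            manifest.write(json.dumps(record) + "\n")
```

Splitting planning from writing keeps the size logic testable without touching the disk; a caller would list the source files with their sizes, call `plan_batches`, and then call `write_batch` once per batch.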

If you are not preparing the data, who will prepare the data? (Provide name and business)

No response

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

This website hosts a large amount of data, and as far as I know, no one has systematically stored all of it on the Filecoin network.

Please share a sample of the data

https://commoncrawl.org/

Confirm that this is a public dataset that can be retrieved by anyone on the Network

  • [X] I confirm

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Yearly

For how long do you plan to keep this dataset stored on Filecoin

2 to 3 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Europe

How will you be distributing your data to storage providers

Cloud storage (i.e. S3), HTTP or FTP server, Shipping hard drives

How do you plan to choose storage providers

Slack, Big Data Exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

No response

How do you plan to make deals to your storage providers

Boost client, Lotus client

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

nicelove666 avatar Jan 02 '24 07:01 nicelove666

Please provide ID, City, Country, Organization of each SP here.

Sunnyiscoming avatar Jan 02 '24 13:01 Sunnyiscoming

| Provider | Location | SP Entity or Personal |
| --- | --- | --- |
| f02199203 | Inner Mongolia | Richard |
| f02223170 | HK | tianyou |
| f02831201 | GuangDong | Juwu Mine |
| f02824157 | BeiJing | zhongchuangyun |

These are the SPs we are cooperating with. Around January 15th, we will add 5-7 SPs from Japan, Vietnam, and Hong Kong. Once they are online, we will list them here. Thank you.

nicelove666 avatar Jan 03 '24 11:01 nicelove666

Hello. Per https://github.com/filecoin-project/notary-governance/issues/922 for Open, Public Dataset applicants, please complete the following Fil+ registration form to identify yourself as the applicant, and please also add the contact information of the SP entities you are working with to store copies of the data.

This information will be reviewed by the Fil+ Governance team to confirm its validity, and then the application will be allowed to move forward for additional notary review.

Sunnyiscoming avatar Jan 05 '24 14:01 Sunnyiscoming

SP List provided: [{"providerID":"f02199203","City":"InnerMongolia","Country":"China","SPOrg":"Richard"}, {"providerID":"f02223170","City":"HK","Country":"China","SPOrg":"tianyou"}, {"providerID":"f02831201","City":"GuangDong","Country":"China","SPOrg":"JuwuMine"}, {"providerID":"f02824157","City":"BeiJing","Country":"China","SPOrg":"zhongchuangyun"}]
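
A list in this shape can be tallied by country to see how concentrated the providers are. A minimal sketch, with the four entries above transcribed as Python dictionaries; the helper name `count_by_country` is hypothetical and not part of any Fil+ tooling:

```python
from collections import Counter

# The four providers from the SP list above.
sp_list = [
    {"providerID": "f02199203", "City": "InnerMongolia", "Country": "China", "SPOrg": "Richard"},
    {"providerID": "f02223170", "City": "HK", "Country": "China", "SPOrg": "tianyou"},
    {"providerID": "f02831201", "City": "GuangDong", "Country": "China", "SPOrg": "JuwuMine"},
    {"providerID": "f02824157", "City": "BeiJing", "Country": "China", "SPOrg": "zhongchuangyun"},
]


def count_by_country(providers):
    """Return how many providers are listed under each country."""
    return Counter(p["Country"] for p in providers)


counts = count_by_country(sp_list)
for country, n in counts.items():
    print(f"{country}: {n}")
```

With this list, every provider falls under a single country value.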

Sunnyiscoming avatar Jan 05 '24 14:01 Sunnyiscoming

[Screenshot: WX20240109-112150@2x]

We submitted it, thank you.

nicelove666 avatar Jan 09 '24 03:01 nicelove666

https://www.ipqualityscore.com/user/search is a public, well-known, and unbiased geolocation detection service. I paid to check the SPs we cooperate with, and the results show that their reported locations are real:

  • f02199203: 116.136.130.130
  • f02824157: 116.172.66.38
  • f02824140: 116.172.66.38
  • f02841613: 210.209.77.161
  • f02831202: 14.29.124.50
  • f0122215: 119.167.140.136

Detection method: find the IP address corresponding to the SP in Boost, enter the IP, and review the detection results. If an SP uses a VPN, the detection score may be greater than 70 points. A detection score of 0 points means no fraud, which shows that the SP's address is genuine.
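
The reading of the score described above can be written down as a small rule. This is only a sketch of the thresholds stated in this comment, not the service's own logic: the 0-100 range, the "inconclusive" label for scores from 1 to 70, and the function name are assumptions, since the comment does not say how intermediate scores should be read.

```python
def interpret_fraud_score(score: int) -> str:
    """Map a detection score to the reading given in the comment above."""
    if not 0 <= score <= 100:  # assumed score range
        raise ValueError(f"score out of range: {score}")
    if score == 0:
        return "no fraud signal"
    if score > 70:
        return "possible VPN"
    # The comment gives no guidance for 1-70, so this label is an assumption.
    return "inconclusive"


for example in (0, 35, 85):
    print(example, interpret_fraud_score(example))
```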

nicelove666 avatar Jan 11 '24 07:01 nicelove666

[Screenshots: WX20240111-145600@2x, WX20240111-145455@2x]

nicelove666 avatar Jan 11 '24 07:01 nicelove666

[Screenshots: WX20240111-144512@2x, WX20240111-145015@2x, WX20240111-145537@2x, WX20240111-182502@2x]

nicelove666 avatar Jan 11 '24 10:01 nicelove666

Could you help us move forward? Thank you. @Sunnyiscoming

nicelove666 avatar Jan 11 '24 10:01 nicelove666

We applied two weeks ago, but the application still has not been approved. In the meantime, the SPs we cooperate with have changed, so here is the updated list:

| Provider | SP Entity | Location | IP | Screenshot |
| --- | --- | --- | --- | --- |
| f02199203 | Richard | Nei Mongol (Inner Mongolia) | 116.136.130.130 | WX20240116-160041@2x |
| f02824157 | zhongchuangyun | GuangDong | 116.172.66.38 | 2@2x |
| f02824140 | zhongchuangyun | GuangDong | 116.172.66.38 | 2@2x |
| f02831202 | Juwu Mine | GuangDong | 14.29.124.50 | WX20240116-160408@2x |
| f0122215 | SuSuanYun | ShanDong | 119.167.140.136 | WX20240116-160454@2x |

nicelove666 avatar Jan 16 '24 08:01 nicelove666

This clearly shows the location of each SP. The results demonstrate that the SPs we cooperate with are honest, and we hope to get your approval. @Sunnyiscoming @Filplus-govteam @galen-mcandrew @Kevin-FF-USA @clriesco

nicelove666 avatar Jan 16 '24 08:01 nicelove666

Please tell me what else I need to do. @Sunnyiscoming

nicelove666 avatar Jan 17 '24 11:01 nicelove666

Deleting comment

@Sunnyiscoming hasn't the permissions to post this comment.

Please, contact the assignee of this issue.

Datacap Request Trigger

Total DataCap requested

15PiB

Expected weekly DataCap usage rate

1PiB

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Sunnyiscoming avatar Jan 17 '24 14:01 Sunnyiscoming

DataCap Allocation requested

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

512TiB

Id

af9739bf-5ed7-4e71-a41a-9703387d3d7c

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacectswk4dafjr6af3r5yuizh7hbc4oimxjdup67cn35ti32cgibpj2

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f1n5wlrrhoxpkgwij25xrtt7w7g2k3fhbthmdn6ri

Id

af9739bf-5ed7-4e71-a41a-9703387d3d7c

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacectswk4dafjr6af3r5yuizh7hbc4oimxjdup67cn35ti32cgibpj2

ipollo00 avatar Jan 22 '24 08:01 ipollo00

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceazuzdny6ieipf5mg7gczzggk3tfspo7eqpkdvzp4ncxld64rufzw

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f12mckci3omexgzoeosjvstcfxfe4vqw7owdia3da

Id

af9739bf-5ed7-4e71-a41a-9703387d3d7c

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceazuzdny6ieipf5mg7gczzggk3tfspo7eqpkdvzp4ncxld64rufzw

SuperChaiChai avatar Jan 22 '24 23:01 SuperChaiChai

DataCap Allocation requested

Request number 2

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

512TiB

Id

d029cae8-cd31-43e7-a662-5bf32b3fdcda

checker:manualTrigger

nicelove666 avatar Jan 29 '24 00:01 nicelove666

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...


Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzaced7b7bbm2bkzhsbjfcjeismiqilyfnkuzynhhckx233ya7dsaduz6

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f1mdk7s2vntzm6hu35yuo6vjubtrpfnb2awhgvrri

Id

d029cae8-cd31-43e7-a662-5bf32b3fdcda

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaced7b7bbm2bkzhsbjfcjeismiqilyfnkuzynhhckx233ya7dsaduz6

1ane-1 avatar Jan 29 '24 01:01 1ane-1

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceadklavzjgpkxq7doxuxqmffy6xrmm36osm2jx6tmw5b7zk4tj6re

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f1dnb3uz7sylxk6emti3ififcvu3nlufnnsjui6ea

Id

d029cae8-cd31-43e7-a662-5bf32b3fdcda

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceadklavzjgpkxq7doxuxqmffy6xrmm36osm2jx6tmw5b7zk4tj6re

mikezli avatar Jan 29 '24 01:01 mikezli

DataCap Allocation requested

Request number 2

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

512TiB

Id

edd783b6-a019-4a56-9dac-f19f3e94026a

checker:manualTrigger

AlanGreaterheat avatar Jan 29 '24 06:01 AlanGreaterheat

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...


Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacecvo5lwxhrielwv6j6aed2orwhjvovc7cqqjpiigwwl4sjbuppcag

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f1pnmzlxj7cfeo2v6oj5nco46hkg2l46wj7o4xxui

Id

edd783b6-a019-4a56-9dac-f19f3e94026a

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecvo5lwxhrielwv6j6aed2orwhjvovc7cqqjpiigwwl4sjbuppcag

AlanGreaterheat avatar Jan 29 '24 06:01 AlanGreaterheat

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacecyzpjjrkm7pzvjphldae7xx6zuwrfrkzmagvrucqt7gnq5ntlxww

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f1c5non5yf35avgcpsqvxu4yj54yyvxorwyjochqq

Id

edd783b6-a019-4a56-9dac-f19f3e94026a

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecyzpjjrkm7pzvjphldae7xx6zuwrfrkzmagvrucqt7gnq5ntlxww

Normalnoise avatar Jan 30 '24 00:01 Normalnoise

DataCap Allocation requested

Request number 3

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

1PiB

Id

9a7cfd96-e241-4712-98c6-54a9708ba57f

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacebzyn3jc3xoghq5syfbuzdhccyfsge4jltrjeff6kar4bgy7lpp4k

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

1.00PiB

Signer Address

f1xrnysd4gimg64d4l6qi7ulzwwq22c6vfg6lpw3i

Id

9a7cfd96-e241-4712-98c6-54a9708ba57f

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebzyn3jc3xoghq5syfbuzdhccyfsge4jltrjeff6kar4bgy7lpp4k
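
To keep track of how much of the 15PiB request has been covered so far, the tranches in this thread can be tallied. A minimal sketch based only on the amounts shown above: three 512TiB requests that were both proposed and approved, and one 1PiB request that has been proposed. The variable names are illustrative.

```python
TIB_PER_PIB = 1024

total_requested_tib = 15 * TIB_PER_PIB

# Tranches that were proposed and then approved by a second notary.
approved_tib = [512, 512, 512]
# Tranche that has been proposed but not yet approved at this point in the thread.
proposed_tib = [1 * TIB_PER_PIB]

approved_total = sum(approved_tib)
proposed_total = sum(proposed_tib)
remaining = total_requested_tib - approved_total - proposed_total

print(f"approved: {approved_total / TIB_PER_PIB} PiB")
print(f"proposed: {proposed_total / TIB_PER_PIB} PiB")
print(f"remaining: {remaining / TIB_PER_PIB} PiB")
```

On these figures, 1.5 PiB is approved, 1 PiB is pending, and 12.5 PiB of the request remains.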

Aaron01230 avatar Jan 30 '24 05:01 Aaron01230

Please make sure to distribute the data to SPs outside of GCR (the Greater China Region) as well; otherwise LGTM.

kernelogic avatar Jan 30 '24 06:01 kernelogic