
[DataCap Application] MongoStorage - CommonCrawl-2020-45

Open amughal opened this issue 2 years ago • 118 comments

Data Owner Name

Common Crawl

Data Owner Country/Region

United States

Data Owner Industry

Other

Website

https://commoncrawl.org/2020/11/october-2020-crawl-archive-now-available/

Social Media

None.

Total amount of DataCap being requested

1000 TiB

Weekly allocation of DataCap requested

50TiB

On-chain address for first allocation

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Custom multisig

  • [ ] Use Custom Multisig

Identifier

No response

Share a brief history of your project and organization

MongoStorage is an emerging Filecoin Service Provider based in Southern California, USA, and is working through a plan to become an ESPA-certified provider. The founders have extensive experience in networks and systems, and have attended multiple ESPA training sessions organized by PikNik in Las Vegas.

Is this project associated with other projects/ecosystem stakeholders?

Yes

If answered yes, what are the other projects/ecosystem stakeholders

MongoStorage is a participant in Slingshot V3, both as an SP and as a Data Preparer.

Describe the data being stored onto Filecoin

The Common Crawl project is a corpus of web crawl data composed of over 50 billion web pages. This is the crawl archive for October 2020. The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content.

Where was the data currently stored in this dataset sourced from

Other

If you answered "Other" in the previous question, enter the details here

The data is available through the Common Crawl website. It has already been prepared in CAR files according to the Slingshot V3 requirements. As stated in the Slingshot V3 rules, participants are allowed to place this data on the BDE exchange for bidding. Once this request is approved, I will talk with the BDE team about placing this dataset for bidding.

How do you plan to prepare the dataset

singularity

If you answered "other/custom tool" in the previous question, enter the details here

No response

Please share a sample of the data

Primary data available through:
https://commoncrawl.org/2020/11/october-2020-crawl-archive-now-available/

A list of the archived files is available in a compressed index file, e.g.
https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-45/warc.paths.gz
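The index file above is a gzipped text listing, one WARC path per line. A minimal sketch for fetching and inspecting it (the URL pattern comes from the link above; the helper names are illustrative):

```python
# Sketch: list the WARC files for the CC-MAIN-2020-45 crawl.
# The URL pattern is taken from the comment above; function names are illustrative.
import gzip
import urllib.request


def warc_paths_url(crawl_id: str) -> str:
    """Build the path-listing URL for a given Common Crawl crawl ID."""
    return f"https://data.commoncrawl.org/crawl-data/{crawl_id}/warc.paths.gz"


def list_warc_paths(crawl_id: str, limit: int = 5) -> list[str]:
    """Download and decompress the listing, returning the first `limit` entries."""
    with urllib.request.urlopen(warc_paths_url(crawl_id)) as resp:
        text = gzip.decompress(resp.read()).decode("utf-8")
    return text.splitlines()[:limit]
```

Calling `list_warc_paths("CC-MAIN-2020-45")` requires network access to data.commoncrawl.org.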

Confirm that this is a public dataset that can be retrieved by anyone on the Network

  • [X] I confirm

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Daily

For how long do you plan to keep this dataset stored on Filecoin

1.5 to 2 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Africa, North America, South America, Europe, Australia (continent)

How will you be distributing your data to storage providers

HTTP or FTP server

How do you plan to choose storage providers

Slack, Big data exchange

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

No response

How do you plan to make deals to your storage providers

Boost client, Singularity

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

amughal avatar Mar 04 '23 05:03 amughal

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

To unblock Mongo as a Data Preparer in absence of Spade, I asked Mongo to leverage BDE + LDN for the datasets he prepared months ago. These datasets from Common Crawl are deemed useful to store for preservation of humanity information by the Slingshot community.

xmcai2016 avatar Mar 06 '23 20:03 xmcai2016

The minimum Datacap requested is 500TB.

Sunnyiscoming avatar Mar 07 '23 06:03 Sunnyiscoming

@Sunnyiscoming My total request is 100 TB and my minimum weekly request is 50 TB. Are you saying that for BDE data publishing, the minimum requirement is 500 TiB?

amughal avatar Mar 07 '23 11:03 amughal

@amughal you should ask for 1000 TiB. Each dataset copy is ~100 TiB and Slingshot encourages 10 copies for disaster resiliency. If you want to apply on behalf of all your Common Crawl datasets that you prepared, this number would be even higher.

xmcai2016 avatar Mar 07 '23 19:03 xmcai2016
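The sizing suggested above can be sanity-checked in a few lines. The copy size, copy count, and weekly rate are taken from the thread; everything else is illustrative:

```python
# Illustrative sizing check based on the numbers discussed in this thread.
COPY_SIZE_TIB = 100    # approximate size of one prepared dataset copy
TARGET_COPIES = 10     # copies encouraged by Slingshot for disaster resiliency
WEEKLY_RATE_TIB = 50   # weekly DataCap allocation requested

total_tib = COPY_SIZE_TIB * TARGET_COPIES        # total DataCap to request
weeks_to_exhaust = total_tib / WEEKLY_RATE_TIB   # weeks to use it at the weekly rate

print(total_tib, weeks_to_exhaust)
```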

Thank you for the guidance. That makes sense, and now I understand what @Sunnyiscoming was suggesting. Let me update this request to reflect the current dataset, which is ready. I will post a request for the next round as more datasets become fully ready. Thank you.

amughal avatar Mar 07 '23 21:03 amughal

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

Datacap Request Trigger

Total DataCap requested

1000TiB

Expected weekly DataCap usage rate

50TiB

Client address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Sunnyiscoming avatar Mar 08 '23 12:03 Sunnyiscoming

DataCap Allocation requested

Multisig Notary address

f02049625

Client address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

DataCap allocation requested

25TiB

Id

04f1a47f-2177-44dd-a365-7e4c378aae43

This LDN is for the dataset related to the well-known MoonLanding project (Slingshot V3). The links and messages above preliminarily confirm that the data matches the requirements of the application. I support this, and wish you all the best.

Joss-Hua avatar Mar 10 '23 01:03 Joss-Hua

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacecivsf47ktrf5h4kww2ilfctpxoa5clkryuiqy32nfupqd47vvn5e

Address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Datacap Allocated

25.00TiB

Signer Address

f1tfg54zzscugttejv336vivknmsnzzmyudp3t7wi

Id

04f1a47f-2177-44dd-a365-7e4c378aae43

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecivsf47ktrf5h4kww2ilfctpxoa5clkryuiqy32nfupqd47vvn5e

Joss-Hua avatar Mar 10 '23 01:03 Joss-Hua

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceay26rg4lklh2jpwdmkeua5vzbs7zren6h6oft7ju7nmflyetk224

Address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Datacap Allocated

25.00TiB

Signer Address

f1pszcrsciyixyuxxukkvtazcokexbn54amf7gvoq

Id

04f1a47f-2177-44dd-a365-7e4c378aae43

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceay26rg4lklh2jpwdmkeua5vzbs7zren6h6oft7ju7nmflyetk224

liyunzhi-666 avatar Mar 10 '23 02:03 liyunzhi-666

I have heard of the Moon Landing project before and would like to support @amughal and @xmcai2016

liyunzhi-666 avatar Mar 10 '23 02:03 liyunzhi-666

Thank you all, appreciated.

amughal avatar Mar 10 '23 03:03 amughal

DataCap Allocation requested

Request number 2

Multisig Notary address

f02049625

Client address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

DataCap allocation requested

50TiB

Id

c88fb01f-6f1c-4717-a914-6c2c598edab5

Stats & Info for DataCap Allocation

Multisig Notary address

f02049625

Client address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Rule to calculate the allocation request amount

100% of weekly dc amount requested

DataCap allocation requested

50TiB

Total DataCap granted for client so far

25TiB

Datacap to be granted to reach the total amount requested by the client (1000 TiB)

975TiB

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
|---|---|---|---|---|
| 506 | 3 | 25TiB | 56.68 | 5.41TiB |

@Sunnyiscoming @Kevin-FF-USA @galen-mcandrew @raghavrmadya @simonkim0515 Hello all, there seems to be a DataCap issue with this approval. I started sending large deals from this LDN to the SPs over the last two days, but as of this morning they are failing, and the status is asking for a signature again. Is this a weekly-allocation issue or a tranche issue? I am trying to understand. Since I am using a SaaS provider, I need to send deals ASAP. Any help is appreciated. Thanks.

amughal avatar Jun 21 '23 13:06 amughal

My initial request was for an allocation of 50 TiB per week. Can I get that increased to 100 TiB, please?

amughal avatar Jun 21 '23 13:06 amughal

checker:manualTrigger

liyunzhi-666 avatar Jun 22 '23 02:06 liyunzhi-666

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

  • Overall Graphsync retrieval success rate: 83.01%
  • Overall HTTP retrieval success rate: 0.00%
  • Overall Bitswap retrieval success rate: 0.00%

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 3 storage providers.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval report.
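The replication warning above ("100.00% of deals are for data replicated across less than 3 storage providers") can be reproduced from a deal list. A minimal sketch; the sample deals below are fabricated for illustration:

```python
# Illustrative: fraction of deals whose piece is stored on fewer than
# `threshold` distinct storage providers. Sample deal data is made up.
from collections import defaultdict


def under_replicated_fraction(deals: list[tuple[str, str]], threshold: int = 3) -> float:
    """deals: (piece_cid, provider_id) pairs. Returns the fraction of deals
    whose piece is held by fewer than `threshold` distinct providers."""
    providers_by_piece: dict[str, set[str]] = defaultdict(set)
    for piece, provider in deals:
        providers_by_piece[piece].add(provider)
    under = sum(1 for piece, _ in deals if len(providers_by_piece[piece]) < threshold)
    return under / len(deals) if deals else 0.0


sample = [
    ("bafyPieceA", "f01001"), ("bafyPieceA", "f01002"),  # piece A: 2 providers
    ("bafyPieceB", "f01001"),                            # piece B: 1 provider
]
print(f"{under_replicated_fraction(sample):.2%}")  # every piece held by < 3 providers
```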

checker:manualTrigger

amughal avatar Jul 08 '23 17:07 amughal

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

  • Overall Graphsync retrieval success rate: 67.29%
  • Overall HTTP retrieval success rate: 0.00%
  • Overall Bitswap retrieval success rate: 0.00%

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 3 storage providers.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval report.

@fabriziogianni7 @liyunzhi-666 Hello Notaries. I need the next tranche for this LDN.

  1. I accidentally mixed a small set of CAR files with another LDN, but I will make sure this does not happen again.
  2. In the next round of data sealing, the next two SPs are fully geo-diverse (Asia and the US East Coast).

Please let me know if you have any questions.

Thanks

amughal avatar Jul 08 '23 17:07 amughal

That's OK. But I supported your application in the last round, and per the rules I shouldn't support you in two consecutive rounds, so you should look for another notary. @amughal

liyunzhi-666 avatar Jul 10 '23 09:07 liyunzhi-666

> That's OK. But I supported your application in the last round, and by definition I shouldn't support you in two consecutive rounds, so you should look for another notary. @amughal

Okay, thanks @liyunzhi-666, appreciated. I will reach out to others.

amughal avatar Jul 10 '23 10:07 amughal

Checking with other Notaries. Hello, @simonkim0515 @xinaxu @kevzak Could someone please approve the next tranche?

Thanks

amughal avatar Jul 10 '23 17:07 amughal

@amughal is a previous ESPA participant and reputable in the ecosystem. The dataset is a public dataset, so that checks out as well.

Approving the next DataCap tranche; however, I would like to see more replication across more SPs going forward.

jamerduhgamer avatar Jul 12 '23 00:07 jamerduhgamer

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacebzzgpb5yts6nspipawuwwlwt4hxkyd2sbn2uotwswgodi5elwxae

Address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Datacap Allocated

50.00TiB

Signer Address

f1kqdiokoeubyse4qpihf7yrpl7czx4qgupx3eyzi

Id

c88fb01f-6f1c-4717-a914-6c2c598edab5

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebzzgpb5yts6nspipawuwwlwt4hxkyd2sbn2uotwswgodi5elwxae

jamerduhgamer avatar Jul 12 '23 00:07 jamerduhgamer

Thank you @jamerduhgamer. The next tranche will definitely be hosted by another SP and will also demonstrate geo-redundancy.

amughal avatar Jul 12 '23 01:07 amughal

@amughal Hi there,

  1. It would be helpful if you could tell us the SP IDs you will work with for the next round.
  2. You mentioned that you will have 10 copies. How will you improve data replication?

ipollo00 avatar Jul 12 '23 09:07 ipollo00