filecoin-plus-large-datasets
[DataCap Application] MongoStorage - CommonCrawl-2020-45
Data Owner Name
Common Crawl
Data Owner Country/Region
United States
Data Owner Industry
Other
Website
https://commoncrawl.org/2020/11/october-2020-crawl-archive-now-available/
Social Media
None.
Total amount of DataCap being requested
1000 TiB
Weekly allocation of DataCap requested
50TiB
On-chain address for first allocation
f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy
Custom multisig
- [ ] Use Custom Multisig
Identifier
No response
Share a brief history of your project and organization
MongoStorage is an emerging Filecoin storage provider based in Southern California, USA, currently working toward ESPA certification. The founders have extensive experience in networks and systems and have attended multiple ESPA training sessions organized by PiKNiK in Las Vegas.
Is this project associated with other projects/ecosystem stakeholders?
Yes
If answered yes, what are the other projects/ecosystem stakeholders
MongoStorage is a participant in Slingshot V3, both as an SP and as a Data Preparer.
Describe the data being stored onto Filecoin
The Common Crawl project is a corpus of web crawl data composed of over 50 billion web pages. This is the crawl archive for October 2020. The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content.
Where was the data currently stored in this dataset sourced from
Other
If you answered "Other" in the previous question, enter the details here
The data is available through the Common Crawl website. It has already been prepared into CAR files according to the Slingshot V3 requirements. Per the Slingshot V3 guidelines, participants are allowed to place this data on the BDE (Big Data Exchange) for bidding. Once this request is approved, I will work with the BDE team to place this dataset for bidding.
How do you plan to prepare the dataset
singularity
If you answered "other/custom tool" in the previous question, enter the details here
No response
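For reference only (not part of the original application), a minimal sketch of driving the CAR preparation step from Python with the Singularity CLI named above. The `singularity prep create <name> <source> <output>` form is an assumption based on the v1 CLI and the paths are hypothetical; check the syntax against the installed Singularity version.

```python
# Hedged sketch: drive Singularity data preparation from Python via subprocess.
# The subcommand layout is an assumption and may differ between Singularity versions.
import subprocess

def prepare_dataset(name: str, source_dir: str, out_dir: str) -> None:
    """Pack a local copy of the dataset into CAR files for Filecoin deals."""
    subprocess.run(
        ["singularity", "prep", "create", name, source_dir, out_dir],
        check=True,  # raise if the preparation step fails
    )

if __name__ == "__main__":
    # Hypothetical local paths for the CC-MAIN-2020-45 crawl archive.
    prepare_dataset("CC-MAIN-2020-45", "/data/commoncrawl/CC-MAIN-2020-45", "/data/car-out")
```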
Please share a sample of the data
Primary data available through:
https://commoncrawl.org/2020/11/october-2020-crawl-archive-now-available/
A list of the archived files is available in the compressed paths file, e.g.
https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-45/warc.paths.gz
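As an illustration only, a minimal Python sketch for pulling the path listing above and downloading one WARC segment as a sample; the local file name is a placeholder.

```python
# Fetch the CC-MAIN-2020-45 path listing and download the first WARC segment.
# URLs come from the application above; the local output path is illustrative.
import gzip
import urllib.request

PATHS_URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-45/warc.paths.gz"
BASE_URL = "https://data.commoncrawl.org/"

# Download and decompress the list of WARC file paths for this crawl.
with urllib.request.urlopen(PATHS_URL) as resp:
    paths = gzip.decompress(resp.read()).decode().splitlines()

print(f"{len(paths)} WARC files in CC-MAIN-2020-45; first entry: {paths[0]}")

# Pull a single segment as a sample (each segment is on the order of a gigabyte compressed).
urllib.request.urlretrieve(BASE_URL + paths[0], "sample.warc.gz")
```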
Confirm that this is a public dataset that can be retrieved by anyone on the Network
- [X] I confirm
If you chose not to confirm, what was the reason
No response
What is the expected retrieval frequency for this data
Daily
For how long do you plan to keep this dataset stored on Filecoin
1.5 to 2 years
In which geographies do you plan on making storage deals
Greater China, Asia other than Greater China, Africa, North America, South America, Europe, Australia (continent)
How will you be distributing your data to storage providers
HTTP or FTP server
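As context for the answer above, a minimal sketch (not the applicant's actual setup) of exposing a directory of prepared CAR files to SPs over HTTP using Python's standard library; the directory and port are hypothetical placeholders.

```python
# Serve a directory of prepared CAR files over plain HTTP so storage providers
# can fetch deal data. Directory and port are illustrative, not real values.
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler

CAR_DIR = "/data/car-out"   # hypothetical output directory from data prep
PORT = 8080

handler = functools.partial(SimpleHTTPRequestHandler, directory=CAR_DIR)
HTTPServer(("0.0.0.0", PORT), handler).serve_forever()
```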
How do you plan to choose storage providers
Slack, Big data exchange
If you answered "Others" in the previous question, what is the tool or platform you plan to use
No response
If you already have a list of storage providers to work with, fill out their names and provider IDs below
No response
How do you plan to make deals to your storage providers
Boost client, Singularity
If you answered "Others/custom tool" in the previous question, enter the details here
No response
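For illustration, a hedged sketch of how a single verified deal might be proposed with the Boost client mentioned above, driven from Python. Flag names follow the public Boost client documentation at the time of writing and should be verified against the installed Boost version; every value is a placeholder.

```python
# Hedged sketch: propose one verified (DataCap-funded) deal via the Boost CLI.
# Flag names are taken from the Boost client docs and may vary by version;
# all argument values below are placeholders, not real deal data.
import subprocess

def propose_verified_deal(provider, commp, piece_size, car_size, payload_cid, http_url):
    cmd = [
        "boost", "deal",
        "--verified=true",             # spend DataCap instead of paying in FIL
        f"--provider={provider}",      # SP actor ID, e.g. f0xxxxxx
        f"--http-url={http_url}",      # where the SP downloads the CAR file
        f"--commp={commp}",            # piece CID produced during data prep
        f"--piece-size={piece_size}",
        f"--car-size={car_size}",
        f"--payload-cid={payload_cid}",
    ]
    subprocess.run(cmd, check=True)
```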
Can you confirm that you will follow the Fil+ guideline
Yes
Thanks for your request! Everything looks good. :ok_hand:
A Governance Team member will review the information provided and contact you back pretty soon.
To unblock Mongo as a Data Preparer in the absence of Spade, I asked Mongo to leverage BDE + LDN for the datasets he prepared months ago. The Slingshot community considers these Common Crawl datasets worth storing for the preservation of humanity's information.
The minimum DataCap request is 500 TB.
@Sunnyiscoming My total request is 100 TB and my weekly request is 50 TB. Are you saying that for BDE data publishing the minimum requirement is 500 TiB?
@amughal you should ask for 1000 TiB. Each dataset copy is ~100 TiB and Slingshot encourages 10 copies for disaster resiliency. If you want to apply on behalf of all your Common Crawl datasets that you prepared, this number would be even higher.
Thank you for the guidance; that makes sense, and I now understand what @Sunnyiscoming was suggesting. Let me update this request to reflect the current dataset, which is ready. I will post a request for the next round once more datasets are fully ready. Thank you
Thanks for your request! Everything looks good. :ok_hand:
A Governance Team member will review the information provided and contact you back pretty soon.
Datacap Request Trigger
Total DataCap requested
1000TiB
Expected weekly DataCap usage rate
50TiB
Client address
f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy
DataCap Allocation requested
Multisig Notary address
f02049625
Client address
f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy
DataCap allocation requested
25TiB
Id
04f1a47f-2177-44dd-a365-7e4c378aae43
This LDN covers a dataset related to the well-known Moon Landing project (Slingshot V3). The links and messages above preliminarily show that the data matches the requirements stated in the application. I support this and wish you all the best.
Request Proposed
Your Datacap Allocation Request has been proposed by the Notary
Message sent to Filecoin Network
bafy2bzacecivsf47ktrf5h4kww2ilfctpxoa5clkryuiqy32nfupqd47vvn5e
Address
f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy
Datacap Allocated
25.00TiB
Signer Address
f1tfg54zzscugttejv336vivknmsnzzmyudp3t7wi
Id
04f1a47f-2177-44dd-a365-7e4c378aae43
You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecivsf47ktrf5h4kww2ilfctpxoa5clkryuiqy32nfupqd47vvn5e
Request Approved
Your Datacap Allocation Request has been approved by the Notary
Message sent to Filecoin Network
bafy2bzaceay26rg4lklh2jpwdmkeua5vzbs7zren6h6oft7ju7nmflyetk224
Address
f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy
Datacap Allocated
25.00TiB
Signer Address
f1pszcrsciyixyuxxukkvtazcokexbn54amf7gvoq
Id
04f1a47f-2177-44dd-a365-7e4c378aae43
You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceay26rg4lklh2jpwdmkeua5vzbs7zren6h6oft7ju7nmflyetk224
I have heard of the Moon Landing project before and would like to support @amughal and @xmcai2016
Thank you all, appreciated.
DataCap Allocation requested
Request number 2
Multisig Notary address
f02049625
Client address
f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy
DataCap allocation requested
50TiB
Id
c88fb01f-6f1c-4717-a914-6c2c598edab5
Stats & Info for DataCap Allocation
Multisig Notary address
f02049625
Client address
f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy
Rule to calculate the allocation request amount
100% of weekly dc amount requested
DataCap allocation requested
50TiB
Total DataCap granted for client so far
25TiB
Datacap to be granted to reach the total amount requested by the client (1000 TiB)
975TiB
Stats
| Number of deals | Number of storage providers | Previous DC Allocated | Top provider (% of deals) | Remaining DC |
|---|---|---|---|---|
| 506 | 3 | 25TiB | 56.68 | 5.41TiB |
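To make the allocation numbers above concrete, a small worked example of the tranche arithmetic using the figures from this application:

```python
# Tranche arithmetic for this application, using the figures reported above.
total_requested_tib = 1000     # total DataCap requested by the client
granted_so_far_tib = 25        # first allocation already granted
weekly_rate_tib = 50           # weekly allocation requested

# Rule: this tranche is 100% of the weekly amount requested.
this_tranche_tib = 1.00 * weekly_rate_tib                  # -> 50 TiB
remaining_tib = total_requested_tib - granted_so_far_tib   # -> 975 TiB

print(f"tranche: {this_tranche_tib} TiB, remaining to grant: {remaining_tib} TiB")
```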
@Sunnyiscoming @Kevin-FF-USA @galen-mcandrew @raghavrmadya @simonkim0515 Hello all, it seems there is a DataCap issue with this approval. I started sending large deals from this LDN to SPs over the last two days, but as of this morning they are failing and the status is asking for a signature again. I am trying to understand whether this is a weekly allocation issue or a tranche issue. Since I am using a SaaS provider, I need to send deals ASAP. Any help is appreciated. Thanks
My initial request was for an allocation of 50 TB per week. Can I get that increased to 100 TB, please?
checker:manualTrigger
DataCap and CID Checker Report Summary[^1]
Retrieval Statistics
- Overall Graphsync retrieval success rate: 83.01%
- Overall HTTP retrieval success rate: 0.00%
- Overall Bitswap retrieval success rate: 0.00%
Storage Provider Distribution
✔️ Storage provider distribution looks healthy.
Deal Data Replication
⚠️ 100.00% of deals are for data replicated across less than 3 storage providers.
Deal Data Shared with other Clients[^3]
⚠️ CID sharing has been observed. (Top 3)
- 10.48 TiB - f1ktvy56lqhtlo754tcna7sy2iwtvn6saz26wkeya - ``
- 544.00 GiB - f1vurpgwsgteoi5ipdjf5akpvxrz7zvs7zm2oplyi - CERN
[^1]: To manually trigger this report, add a comment with text checker:manualTrigger
[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger
[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...
Full report
Click here to view the CID Checker report. Click here to view the Retrieval report.
checker:manualTrigger
DataCap and CID Checker Report Summary[^1]
Retrieval Statistics
- Overall Graphsync retrieval success rate: 67.29%
- Overall HTTP retrieval success rate: 0.00%
- Overall Bitswap retrieval success rate: 0.00%
Storage Provider Distribution
✔️ Storage provider distribution looks healthy.
Deal Data Replication
⚠️ 100.00% of deals are for data replicated across less than 3 storage providers.
Deal Data Shared with other Clients[^3]
⚠️ CID sharing has been observed. (Top 3)
- 10.48 TiB - f1ktvy56lqhtlo754tcna7sy2iwtvn6saz26wkeya - ``
- 544.00 GiB - f1vurpgwsgteoi5ipdjf5akpvxrz7zvs7zm2oplyi - CERN
[^1]: To manually trigger this report, add a comment with text checker:manualTrigger
[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger
[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...
Full report
Click here to view the CID Checker report. Click here to view the Retrieval report.
@fabriziogianni7 @liyunzhi-666 Hello Notaries, I need the next tranche for this LDN.
- I accidentally mixed a small set of CAR files with another LDN, but I will make sure this does not happen again.
- In the next round of data sealing, the next two SPs are fully geo-diverse (Asia and East Coast US).
Please let me know if you have any questions.
Thanks
That's OK. But I supported your application in the last round, and by definition I shouldn't support you in two consecutive rounds, so you should look for another notary. @amughal
Okay thanks @liyunzhi-666, appreciated. I will reach to others.
Checking with other Notaries. Hello, @simonkim0515 @xinaxu @kevzak Could someone please approve the next tranche?
Thanks
@amughal is a previous ESPA participant and is reputable in the ecosystem. The dataset is public, so that checks out as well.
Approving the next DataCap tranche; however, I would like to see more replication across more SPs going forward.
Request Proposed
Your Datacap Allocation Request has been proposed by the Notary
Message sent to Filecoin Network
bafy2bzacebzzgpb5yts6nspipawuwwlwt4hxkyd2sbn2uotwswgodi5elwxae
Address
f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy
Datacap Allocated
50.00TiB
Signer Address
f1kqdiokoeubyse4qpihf7yrpl7czx4qgupx3eyzi
Id
c88fb01f-6f1c-4717-a914-6c2c598edab5
You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebzzgpb5yts6nspipawuwwlwt4hxkyd2sbn2uotwswgodi5elwxae
Thank you @jamerduhgamer. The next tranche will definitely be hosted by another SP and will also demonstrate geo redundancy.
@amughal Hi there
- It would be helpful if you could tell us which SP IDs you will be working with for the next round.
- You mentioned that you will have 10 copies; how will you improve data replication?