filecoin-plus-large-datasets
[DataCap Application] <FogMeta Lab> - <Open Datasets on AWS - bioinformatics>[1/2]
Data Owner Name
FogMeta Lab
Data Owner Country/Region
China
Data Owner Industry
Web3 / Crypto
Website
https://fogmeta.com
Social Media
Twitter: https://twitter.com/FogMeta
GitHub: https://github.com/FogMeta
Total amount of DataCap being requested
5PiB
Weekly allocation of DataCap requested
1PiB
On-chain address for first allocation
f1gmkpkvrsxvwveyfe3c3y3xejw6flbdrowc5jv6i
Custom multisig
- [ ] Use Custom Multisig
Identifier
No response
Share a brief history of your project and organization
FogMeta Lab's research spans multiple levels, from system technology, infrastructure, and middleware to services and solutions. It covers future systems, network technology and business, distributed systems and management, information management, and interactive and innovative services. Drawing on industry perspectives and practices, FogMeta also addresses business complexity through operations optimization and related technologies.
Is this project associated with other projects/ecosystem stakeholders?
No
If answered yes, what are the other projects/ecosystem stakeholders
No response
Describe the data being stored onto Filecoin
These are all open datasets from the same subset (the "bioinformatics" category) on AWS. See https://registry.opendata.aws/tag/bioinformatics/.
Where was the data currently stored in this dataset sourced from
AWS Cloud
If you answered "Other" in the previous question, enter the details here
No response
How do you plan to prepare the dataset
IPFS, lotus, graphsplit, others/custom tool
If you answered "other/custom tool" in the previous question, enter the details here
We'd also like to use the Swan Client tool (https://github.com/filswan/go-swan-client#Graphsplit) to prepare the dataset.
Please share a sample of the data
1. 4D Nucleome (4DN)
s3://4dn-open-data-public/
2. Genome Aggregation Database (gnomAD)
s3://gnomad-public-us-east-1/
3. The Singapore Nanopore Expression Data Set
s3://sg-nex-data/
4. PubSeq - Public Sequence Resource
s3://pubseq-datasets/
5. Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)
s3://targetepigenomics/
6. Open Bioinformatics Reference Data for Galaxy
s3://biorefdata/
7. Basic Local Alignment Sequences Tool (BLAST) Databases
s3://ncbi-blast-databases/
8. Broad Genome References
s3://broad-references/
9. DNAStack COVID19 SRA Data
s3://dnastack-covid-19-sra-data/
Confirm that this is a public dataset that can be retrieved by anyone on the Network
- [X] I confirm
If you chose not to confirm, what was the reason
No response
What is the expected retrieval frequency for this data
Monthly
For how long do you plan to keep this dataset stored on Filecoin
2 to 3 years
In which geographies do you plan on making storage deals
Greater China, Asia other than Greater China, Africa, North America, South America, Europe, Australia (continent), Antarctica
How will you be distributing your data to storage providers
Cloud storage (i.e. S3), HTTP or FTP server, IPFS, Shipping hard drives, Others
How do you plan to choose storage providers
Slack, Partners, Others
If you answered "Others" in the previous question, what is the tool or platform you plan to use
We'd also like to use the FilSwan platform (https://filswan.com/) to choose storage providers who meet our requirements.
If you already have a list of storage providers to work with, fill out their names and provider IDs below
The storage providers we'd like to work with are presented below. Some of them are from the FilSwan platform.
f01955033
f02029115
f03624
f010088
f02301
f08399
f02401
f01955030
f0187709
f01163272
f01402814
f01390330
f01225882
f0717969
f03223
f01395673
f01072221
f0143858
f01786736
f0836160
f032824
f01443744
f01871352
f01907556
f01955028
f01947280
f01946551
f02012951
f01970630
f0240185
How do you plan to make deals to your storage providers
Boost client, Lotus client, Others/custom tool
If you answered "Others/custom tool" in the previous question, enter the details here
Swan Client tool
https://github.com/filswan/go-swan-client
Can you confirm that you will follow the Fil+ guideline
Yes
So this would be a matter of enhancing the go-data-transfer protocol with an extra signalling message, potentially: https://github.com/filecoin-project/go-data-transfer/blob/ad43f2d453f12b8f32663c6e25c3fdb6c042aabf/message/message.go
I agree that we should improve the UX for outputting the state of the transfer.
As I understand it the current behaviour is:
- Graphsync client requests data, and the transfer moves to the "Ongoing" state
- Graphsync server can process a limited number of concurrent requests, so it queues up the request to be processed later
- Client reports the state as "Ongoing" even though it hasn't started yet
If possible it would be nice to keep the protocol simple and avoid adding implementation-specific messages about queueing. I'd suggest instead that we just change the naming of our states in any output to the user:
- When the client sends a request output the state as "Queued"
- When the first byte is received output the state as "Ongoing"
Note that this would be a cosmetic change to the output, it doesn't require any logic changes in how we transition between states / fire events etc.
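A minimal sketch of that cosmetic mapping, assuming the client can see how many bytes have been received so far. The `Status` type and `DisplayState` function are hypothetical names for illustration and are not part of go-data-transfer:

```go
package main

import "fmt"

// Status is a hypothetical stand-in for the internal transfer state.
// It is NOT the go-data-transfer status type.
type Status int

const (
	Ongoing Status = iota
	Completed
)

// DisplayState derives a user-facing label from the internal state and the
// number of bytes received. Internal transitions and events are untouched;
// only the printed label changes.
func DisplayState(s Status, received uint64) string {
	switch {
	case s == Ongoing && received == 0:
		return "Queued" // request sent, no data seen yet
	case s == Ongoing:
		return "Ongoing" // at least one byte received
	default:
		return "Completed"
	}
}

func main() {
	fmt.Println(DisplayState(Ongoing, 0))
	fmt.Println(DisplayState(Ongoing, 1024))
}
```

Note that the label is inferred purely from the byte count, so it is a guess about what the responder is doing rather than something the responder reported.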
@dirkmc So while I agree that the simpler solution would immediately improve UX, from the correctness/safety perspective there is a difference between an explicit and implicit queued event.
- The original post talks about an explicit event, which acts like an ACK, signalling to the requester that the transfer request has been accepted, but it's queued due to throttling directives.
- The simplified solution guesses that a zero size transfer
Also, I think the simplified solution would end up being confusing/insufficient in the face of restarts after a partial transfer. Here, the transferred amount will be non-zero, but the request may well end up being queued on the responder side, which means that the peer will think it's ongoing when it isn't :-(
There are probably other edge cases like this, which is why I prefer state machines not to make assumptions and instead rely on explicit signalling of states between peers.
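To make the restart case concrete, here is a rough sketch that compares the two approaches on the requester side. It assumes a hypothetical `QueuedAck` message sent by the responder; no such message exists in the protocol today, and all type and function names below are illustrative only:

```go
package main

import "fmt"

// State is the requester's view of the transfer (hypothetical names).
type State string

const (
	Requested State = "Requested"
	Queued    State = "Queued"
	Ongoing   State = "Ongoing"
)

// Event is something the requester observes.
type Event int

const (
	QueuedAck    Event = iota // ASSUMPTION: explicit "queued" message from the responder
	DataReceived              // n bytes arrived
	Restarted                 // transfer or host restarted
)

// Transfer tracks the explicit state and the bytes received so far.
type Transfer struct {
	State    State
	Received uint64
}

// Apply updates the state only from signals that were actually observed.
func (tr *Transfer) Apply(e Event, n uint64) {
	switch e {
	case QueuedAck:
		tr.State = Queued
	case DataReceived:
		tr.Received += n
		tr.State = Ongoing
	case Restarted:
		tr.State = Requested // Received is kept: the transfer resumes
	}
}

// ImplicitLabel is the cosmetic approach: guess from the byte count alone.
func ImplicitLabel(tr Transfer) string {
	if tr.Received == 0 {
		return "Queued"
	}
	return "Ongoing"
}

func main() {
	tr := &Transfer{State: Requested}
	tr.Apply(DataReceived, 512) // partial transfer
	tr.Apply(Restarted, 0)      // restart after partial transfer
	tr.Apply(QueuedAck, 0)      // responder queues the restarted request
	fmt.Println("explicit state:", tr.State)
	fmt.Println("implicit guess:", ImplicitLabel(*tr))
}
```

After the restart the explicit state machine reports the transfer as queued, while the byte-count heuristic still reports it as ongoing because the received amount is non-zero.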
I agree that it's nice for the caller to have feedback on the state of their request. I suggest we think about it in terms of the trade-off between keeping the user informed and keeping the protocol simple. Each addition to the protocol needs to be supported by all future implementations.
As an example, when a client opens an HTTP request, the request is implicitly added to a queue. The server can signal that it's overloaded with the response code "too many requests", but I don't believe there's anything like "I've queued your request".
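For illustration, a small sketch of the HTTP behaviour described here, using Go's standard `net/http` package. The `limit`, `demoHandler`, and `probe` helpers are made up for this example. The server either starts work or rejects with 429; it has no way to say "queued":

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// limit admits at most cap(slots) concurrent requests and rejects the rest
// with 429 Too Many Requests instead of queueing them.
func limit(slots chan struct{}, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}: // a slot is free: take it
			defer func() { <-slots }()
			next.ServeHTTP(w, r)
		default: // no slot: signal overload
			http.Error(w, "too many requests", http.StatusTooManyRequests)
		}
	})
}

// demoHandler wraps a trivial handler with the limiter above.
func demoHandler(slots chan struct{}) http.Handler {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	return limit(slots, ok)
}

// probe sends one in-memory GET request to h and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

func main() {
	slots := make(chan struct{}, 1)
	h := demoHandler(slots)
	fmt.Println(probe(h)) // slot is free
	slots <- struct{}{}   // simulate another request in flight
	fmt.Println(probe(h)) // slot is taken
}
```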
HTTP is designed to be synchronous though, so the client has an expectation that the response will be almost immediate (and/or the server is starting work immediately). However, the graphsync/data-transfer protocol flows are asynchronous by design, and transfers could stay queued up for hours, which is one of the underlying reasons that this explicit state is necessary IMO.
If we do want to stick to the two-state approach, I would suggest different naming:
- "Queued" => "Requested" (client doesn't make assumptions about queueing, all it knows is that it sent the request)
- "Ongoing" => "Receiving"
However, with this two-state approach there is still the restart case, which will be relatively frequent and therefore should be non-ambiguous.
Agree with changing the naming, I think it's confusing at the moment 👍
@dirkmc On a restart of the transfer or the host, how hard would it be to reset the transfer back to "Requested" and then transition it to "Receiving" the moment that data starts flowing?
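A rough sketch of what that reset could look like on the client side, assuming the two-state naming proposed above. The `Channel` type and its methods are hypothetical and do not reflect the actual go-data-transfer implementation:

```go
package main

import "fmt"

// State uses the two-state naming proposed in this thread.
type State string

const (
	Requested State = "Requested"
	Receiving State = "Receiving"
)

// Channel is a hypothetical client-side record of one transfer.
type Channel struct {
	State    State
	Received uint64
}

// NewChannel starts in Requested: all the client knows is that it sent
// the request.
func NewChannel() *Channel { return &Channel{State: Requested} }

// OnRestart resets the label but keeps the byte count, so progress
// reporting survives the restart.
func (c *Channel) OnRestart() { c.State = Requested }

// OnData moves to Receiving as soon as any data arrives.
func (c *Channel) OnData(n uint64) {
	if n == 0 {
		return
	}
	c.Received += n
	c.State = Receiving
}

func main() {
	ch := NewChannel()
	ch.OnData(100)
	ch.OnRestart()
	fmt.Println(ch.State, ch.Received) // label reset, byte count preserved
	ch.OnData(1)
	fmt.Println(ch.State, ch.Received)
}
```

With this approach the label depends on data arriving since the last restart rather than on the cumulative byte count, which sidesteps the partial-transfer ambiguity without adding a protocol message.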