
[DataCap Application] <FogMeta Lab> - <Open Datasets on AWS - bioinformatics>[1/2]

Open hengdingy opened this issue 2 years ago • 84 comments

Data Owner Name

FogMeta Lab

Data Owner Country/Region

China

Data Owner Industry

Web3 / Crypto

Website

https://fogmeta.com

Social Media

Twitter: https://twitter.com/FogMeta
GitHub: https://github.com/FogMeta

Total amount of DataCap being requested

5PiB

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f1gmkpkvrsxvwveyfe3c3y3xejw6flbdrowc5jv6i

Custom multisig

  • [ ] Use Custom Multisig

Identifier

No response

Share a brief history of your project and organization

FogMeta Lab's research spans multiple levels from system technology, infrastructure, and middleware to services and solutions, and involves future systems, network technology and business, distributed systems and management, information management, and interactive and innovative services. Based on the views on and practices in the industry, FogMeta also solves the problem of business complexity through operations optimization and other technologies.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

These are all open datasets from the same category ("bioinformatics") on AWS. Please refer to the link here: https://registry.opendata.aws/tag/bioinformatics/.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

How do you plan to prepare the dataset

IPFS, lotus, graphsplit, others/custom tool

If you answered "other/custom tool" in the previous question, enter the details here

We'd also like to use the Swan Client tool (https://github.com/filswan/go-swan-client#Graphsplit) to prepare the dataset.

Please share a sample of the data

1. 4D Nucleome (4DN)
s3://4dn-open-data-public/

2. Genome Aggregation Database (gnomAD)
s3://gnomad-public-us-east-1/

3. The Singapore Nanopore Expression Data Set
s3://sg-nex-data/

4. PubSeq - Public Sequence Resource
s3://pubseq-datasets/

5. Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)
s3://targetepigenomics/

6. Open Bioinformatics Reference Data for Galaxy
s3://biorefdata/

7. Basic Local Alignment Sequences Tool (BLAST) Databases
s3://ncbi-blast-databases/

8. Broad Genome References
s3://broad-references/

9. DNAStack COVID19 SRA Data
s3://dnastack-covid-19-sra-data/

Confirm that this is a public dataset that can be retrieved by anyone on the Network

  • [X] I confirm

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Monthly

For how long do you plan to keep this dataset stored on Filecoin

2 to 3 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Africa, North America, South America, Europe, Australia (continent), Antarctica

How will you be distributing your data to storage providers

Cloud storage (i.e. S3), HTTP or FTP server, IPFS, Shipping hard drives, Others

How do you plan to choose storage providers

Slack, Partners, Others

If you answered "Others" in the previous question, what is the tool or platform you plan to use

We'd also like to use FilSwan platform (https://filswan.com/) to choose storage providers who meet our requirements.

If you already have a list of storage providers to work with, fill out their names and provider IDs below

The storage providers we'd like to work with are presented below. Some of them are from the FilSwan platform.
f01955033
f02029115
f03624
f010088
f02301
f08399
f02401
f01955030
f0187709
f01163272
f01402814
f01390330
f01225882
f0717969
f03223
f01395673
f01072221
f0143858
f01786736
f0836160
f032824
f01443744
f01871352
f01907556
f01955028
f01947280
f01946551
f02012951
f01970630
f0240185

How do you plan to make deals to your storage providers

Boost client, Lotus client, Others/custom tool

If you answered "Others/custom tool" in the previous question, enter the details here

Swan Client tool
https://github.com/filswan/go-swan-client

Can you confirm that you will follow the Fil+ guideline

Yes

hengdingy avatar Feb 22 '23 10:02 hengdingy

So this would be a matter of enhancing the go-data-transfer protocol with an extra signalling message, potentially: https://github.com/filecoin-project/go-data-transfer/blob/ad43f2d453f12b8f32663c6e25c3fdb6c042aabf/message/message.go

raulk avatar Aug 19 '21 12:08 raulk

I agree that we should improve the UX for outputting the state of the transfer.

As I understand it the current behaviour is:

  1. Graphsync client requests data; the transfer moves to the "Ongoing" state
  2. Graphsync server can process a limited number of concurrent requests, so it queues up the request to be processed later
  3. Client reports the state as "Ongoing" even though it hasn't started yet

If possible it would be nice to keep the protocol simple and avoid adding implementation-specific messages about queueing. I'd suggest instead that we just change the naming of our states in any output to the user:

  • When the client sends a request, output the state as "Queued"
  • When the first byte is received, output the state as "Ongoing"

Note that this would be a cosmetic change to the output; it doesn't require any logic changes in how we transition between states, fire events, etc.
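
To make the cosmetic nature of this suggestion concrete, here is a minimal sketch in Python. The `TransferStatus` enum and `display_label` function are hypothetical names invented for illustration; they are not the actual go-data-transfer status names or API. The internal status is left untouched and only the label shown to the user depends on whether any bytes have arrived.

```python
from enum import Enum


class TransferStatus(Enum):
    # Hypothetical internal states, for illustration only.
    REQUESTED = "requested"
    ONGOING = "ongoing"
    COMPLETED = "completed"


def display_label(status: TransferStatus, bytes_received: int) -> str:
    """Return the label shown to the user without changing the internal status."""
    if status is TransferStatus.ONGOING:
        # The client marks the transfer "Ongoing" as soon as the request is sent,
        # so the output reports "Queued" until the first byte arrives.
        return "Queued" if bytes_received == 0 else "Ongoing"
    return status.value.capitalize()


print(display_label(TransferStatus.ONGOING, 0))     # Queued
print(display_label(TransferStatus.ONGOING, 1024))  # Ongoing
```

Because the mapping lives only in the output layer, state transitions and events would stay exactly as they are.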

dirkmc avatar Aug 23 '21 08:08 dirkmc

@dirkmc So while I agree that the simpler solution would immediately improve UX, from the correctness/safety perspective there is a difference between an explicit and implicit queued event.

  • The original post talks about an explicit event, which acts like an ACK, signalling to the requester that the transfer request has been accepted, but it's queued due to throttling directives.
  • The simplified solution guesses that a zero size transfer

Also, I think the simplified solution would end up being confusing/insufficient in the face of restarts after a partial transfer. Here, the transferred amount will be non-zero, but the request may as well end up being queued on the responder side, which means that the peer will think it's ongoing when it isn't :-(

There are probably other edge cases like this, which is why I prefer state machines not to make assumptions and instead rely on explicit signalling of states between peers.
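
The restart case can be illustrated with a small Python sketch that compares the two approaches. Everything here is an assumption made for illustration: `implicit_label`, `ExplicitTransfer`, and the `on_queued_message` handler are hypothetical names, and the "queued" signalling message does not exist in the protocol today.

```python
from dataclasses import dataclass


def implicit_label(bytes_received: int) -> str:
    # Implicit approach: infer queueing from the byte count alone.
    return "Queued" if bytes_received == 0 else "Ongoing"


@dataclass
class ExplicitTransfer:
    bytes_received: int = 0
    state: str = "Requested"

    def on_queued_message(self) -> None:
        # Hypothetical signalling message: the responder accepted the request
        # but is holding it back because of throttling.
        self.state = "Queued"

    def on_data(self, n: int) -> None:
        self.bytes_received += n
        self.state = "Ongoing"


t = ExplicitTransfer()
t.on_queued_message()  # first request is queued by the responder
t.on_data(4096)        # partial transfer
t.on_queued_message()  # after a restart, the responder queues the request again

print(implicit_label(t.bytes_received))  # Ongoing (misleading: no data is flowing)
print(t.state)                           # Queued
```

After the restart the byte count is non-zero, so the implicit guess reports "Ongoing", while the explicitly signalled state still reflects that the request is waiting.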

raulk avatar Aug 25 '21 09:08 raulk

I agree that it's nice for the caller to have feedback on the state of their request. I suggest we think about it in terms of the trade-off between keeping the user informed and keeping the protocol simple. Each addition to the protocol needs to be supported by all future implementations.

As an example, when a client opens an HTTP request, the request is implicitly added to a queue. The server can signal that it's overloaded with the response code 429 "Too Many Requests", but I don't believe there's anything like "I've queued your request".

dirkmc avatar Aug 25 '21 11:08 dirkmc

HTTP is designed to be synchronous though, so the client has an expectation that the response will be almost immediate (and/or the server is starting work immediately). However, the graphsync/data-transfer protocol flows are asynchronous by design, and transfers could stay queued up for hours, which is one of the underlying reasons that this explicit state is necessary IMO.

If we do want to stick to the two-state approach, I would suggest different naming:

  • Queued => Requested (client doesn't make assumptions about queueing, all it knows is that it sent the request)
  • Ongoing => Receiving

However, with this two-state approach there is still the restart case, which will be relatively frequent and therefore should be non-ambiguous.

raulk avatar Aug 25 '21 11:08 raulk

Agree with changing the naming, I think it's confusing at the moment 👍

dirkmc avatar Aug 25 '21 12:08 dirkmc

@dirkmc On a restart of the transfer or the host, how hard would it be to reset the transfer back to Requested and then transition it to Receiving the moment that data starts flowing?
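
As a rough sketch of what that reset could look like with the two-state naming, assuming a hypothetical `TwoStateTransfer` class rather than the real go-data-transfer state machine:

```python
class TwoStateTransfer:
    """Hypothetical two-state model: Requested until data flows, then Receiving."""

    def __init__(self) -> None:
        self.state = "Requested"
        self.bytes_received = 0

    def on_data(self, n: int) -> None:
        # The first bytes after a request (or a restart) move the transfer to Receiving.
        self.bytes_received += n
        self.state = "Receiving"

    def on_restart(self) -> None:
        # A restart of the transfer or the host resets the state but keeps the progress.
        self.state = "Requested"


t = TwoStateTransfer()
t.on_data(4096)
t.on_restart()
print(t.state, t.bytes_received)  # Requested 4096
t.on_data(1024)
print(t.state, t.bytes_received)  # Receiving 5120
```

In this sketch the client makes no claim about queueing: after a restart it only knows that it has sent a request, and it reports Receiving once data actually arrives.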

raulk avatar Aug 25 '21 13:08 raulk