
Verify pieces are CAR before aggregation

Open Gozala opened this issue 2 years ago • 3 comments

Boost assumes that aggregate pieces are CARs

https://github.com/filecoin-project/boost/blob/baf26c6dc883dade04947ff4632eb1e664fb55b9/storagemarket/deal_commp.go#L148-L154

However, we do not currently verify that an uploaded CAR is actually a valid CAR; instead we rely on S3 / R2 presigned URLs to verify that the uploaded file matches the sha256 digest in the CAR CID.

As it turns out, some users have uploaded non-CAR files through the CAR upload pipeline, resulting in mistagged CIDs.

We should either start verifying that uploaded content is a CAR, or perhaps use raw blocks instead so content is not mistagged. If we do the latter, we also need to decide what to do when putting pieces into an aggregate, as they may not be CARs, which would violate assumptions in boost.

Gozala avatar Feb 06 '24 18:02 Gozala

@vasco-santos It occurred to me that we could validate that the input to piece CID derivation is a valid CAR, since we have to read the CAR bytes anyway to derive the piece CID. If we discover that a piece is not a CAR we could error, mark the content invalid, and take it out of the queue.

Alternatively we could attempt something more complicated: if the piece payload is not a CAR, simply prepend a CAR header + content CID and derive the piece from there. If we stored that information somewhere in content claims, we could also build a CAR on demand when serving. This option seems very complicated, however, and is probably not worth the effort. People really should not be putting non-CARs into the store anyway.
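The first option, rejecting non-CAR payloads while the piece CID pipeline is already reading the bytes, could be sketched as a cheap structural sniff of the CARv1 header (a varint length prefix followed by a DAG-CBOR map with `version` and `roots` keys). This is a hypothetical helper, not the pipeline's actual code, and a byte-substring check is deliberately loose rather than a full CAR validation:

```python
def read_varint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode an unsigned LEB128 varint, returning (value, next_offset)."""
    result = shift = 0
    while True:
        byte = data[offset]  # raises IndexError on truncated input
        result |= (byte & 0x7F) << shift
        offset += 1
        if not byte & 0x80:
            return result, offset
        shift += 7

def looks_like_car(payload: bytes) -> bool:
    """Structural sniff for a CARv1 payload: a varint length prefix
    followed by a DAG-CBOR header map containing "version" and "roots".
    Only rejects obvious non-CARs; it does not validate the blocks."""
    try:
        header_len, offset = read_varint(payload)
    except IndexError:
        return False
    header = payload[offset:offset + header_len]
    if len(header) != header_len:  # header truncated
        return False
    # DAG-CBOR text-string map keys appear verbatim in the encoded bytes.
    return b"version" in header and b"roots" in header
```

A payload failing this check could then be marked invalid and dropped from the queue, as described above.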

Gozala avatar Feb 07 '24 20:02 Gozala

It occurred to me that we could validate that the input to piece CID derivation is a valid CAR, since we have to read the CAR bytes anyway to derive the piece CID. If we discover that a piece is not a CAR we could error, mark the content invalid, and take it out of the queue.

Yes, we could check that there today. However, if we move to a decentralized run of piece computation, where this would run in a Filecoin Station for example, that would no longer be possible. The second option would also be problematic, given that a computation workload should not have side effects. Therefore, I would like to avoid that direction and discuss first how strongly we care about not storing non-CARs in aggregates, before making it difficult to decentralize these bits of infra.

I was thinking more about checking the CAR CID on store/add; with that we could already get rid of part of the "undesirable" uploads if the CID does not have the right codec for what we want.
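A codec check at store/add could look roughly like the sketch below, which reads the version and codec varints off a binary CIDv1 and compares the codec against the multicodec code registered for CAR (`0x0202`). The function name and the store/add wiring are hypothetical; only the CID layout and codec value come from the multicodec table:

```python
CAR_CODEC = 0x0202  # multicodec code registered for "car"

def read_varint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode an unsigned LEB128 varint, returning (value, next_offset)."""
    result = shift = 0
    while True:
        byte = data[offset]
        result |= (byte & 0x7F) << shift
        offset += 1
        if not byte & 0x80:
            return result, offset
        shift += 7

def is_car_cid(cid_bytes: bytes) -> bool:
    """Check the content codec of a binary CIDv1 (a version varint
    followed by a codec varint). CIDv0 is a bare sha2-256 multihash
    starting with 0x12 and can never carry the CAR codec."""
    if cid_bytes[:1] == b"\x12":
        return False
    version, offset = read_varint(cid_bytes)
    if version != 1:
        return False
    codec, _ = read_varint(cid_bytes, offset)
    return codec == CAR_CODEC
```

Rejecting uploads whose CID fails this check would catch content that was never tagged as a CAR in the first place, though not content mistagged with the CAR codec.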

I still think it is ok to aggregate non-CARs, so maybe we should align as a team first on whether that is desirable.

vasco-santos avatar Feb 08 '24 07:02 vasco-santos

Yes, we could check that there today. However, if we move to a decentralized run of piece computation, where this would run in a Filecoin Station for example, that would no longer be possible. The second option would also be problematic, given that a computation workload should not have side effects. Therefore, I would like to avoid that direction and discuss first how strongly we care about not storing non-CARs in aggregates, before making it difficult to decentralize these bits of infra.

I think it depends on how you define the interface. If the interface is "compute the piece CID for these bytes", I agree it is problematic. However, if the interface is "compute the piece CID for this CAR", I would argue that passing anything but a CAR should fail that computation, whether it runs in a Station or elsewhere.

There are no side effects; piece CID computation should simply fail whenever the input invariant is violated.

I was thinking more about checking the CAR CID on store/add; with that we could already get rid of part of the "undesirable" uploads if the CID does not have the right codec for what we want.

I actually would like to remove the assumption that we take CARs. In fact, directionally I think we all want to move to a system where we don't turn files into CARs which we then index in order to reassemble the files. Instead we want to take files as-is and optionally index them if we want to support block-level reads. I'll write more on this in a separate issue.

I still think it is ok to aggregate non-CARs, so maybe we should align as a team first on whether that is desirable.

I think this is not our decision to make. Regardless, we need to fix our assumptions, which are currently invalid: we assume that store/add takes CARs and that the aggregate builder aggregates pieces derived from arbitrary bytes. I think we should swap them around, that is, drop the assumption that stored bytes are CARs and, in aggregation, impose the invariant that pieces are derived from CARs rather than arbitrary bytes.

On a final note, if we do fix our assumptions we will also have an alternative way to deal with the problem: we will know which pieces were derived from CAR payloads and which from arbitrary bytes. We could retain that information and consider it in the piece sorting algorithm so that the aggregate starts with a CAR.
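The sorting idea could be sketched as below. The record shape (`piece_cid`, `from_car`) is entirely hypothetical; the point is just that a stable sort on a retained provenance flag puts a CAR-derived piece first, preserving boost's assumption about the start of the aggregate:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Piece:
    # Hypothetical record shape for illustration; the real pipeline's
    # piece records look different.
    piece_cid: str
    from_car: bool  # was the piece derived from a validated CAR payload?

def order_for_aggregate(pieces: list[Piece]) -> list[Piece]:
    """Stable sort that puts CAR-derived pieces first, so the aggregate
    starts with a CAR. Fails if no piece at all came from a CAR."""
    if not any(p.from_car for p in pieces):
        raise ValueError("aggregate would not start with a CAR")
    return sorted(pieces, key=lambda p: not p.from_car)
```

Because `sorted` is stable, pieces keep their relative order within each group, so this only changes where the non-CAR pieces land.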

That said, I personally don't think we should force non-CARs into the Filecoin pipeline if there is resistance (which there seems to be). I would instead collect evidence of the negative effects of that resistance so opinions can change, and if they do we can trivially remove the invariant that the payload is a CAR.

Gozala avatar Feb 08 '24 17:02 Gozala