rclone icon indicating copy to clipboard operation
rclone copied to clipboard

internetarchive: fix item's metadata support

Open CorentinB opened this issue 1 year ago β€’ 19 comments

What is the purpose of this change?

We want to start using rclone inside of the Internet Archive to upload files!

The goal of this PR is to bring support for item metadata fields for the Internet Archive backend. This is used to set collections, mediatype, item's title, etc.

It's my first time interacting with the rclone codebase, so I hope I didn't miss any obvious ways to do things. Please help me correct my work in case I did! :)

Was the change discussed in an issue or in the forum before?

Not as far as I know.

Checklist

  • [X] I have read the contribution guidelines.
  • [X] I have added tests for all changes in this PR if appropriate.
  • [X] I have added documentation for the changes if appropriate.
  • [X] All commit messages are in house style.
  • [X] I'm done, this Pull Request is ready for review :-)

CorentinB avatar Sep 09 '24 20:09 CorentinB

Metadata support for IA was added via the general rclone metadata framework (rather than a fs-specific feature) a few years back: https://github.com/rclone/rclone/pull/6300

Metadata Docs: https://rclone.org/docs/#metadata

Is there anything missing or wrong in the implementation?

darthShadow avatar Sep 10 '24 07:09 darthShadow

Metadata support for IA was added via the general rclone metadata framework (rather than a fs-specific feature) a few years back: #6300

Metadata Docs: https://rclone.org/docs/#metadata

Is there anything missing or wrong in the implementation?

Thank you. I think I'm confused about how this metadata thing work. πŸ˜…

If I use --metadata-set with e.g. --metadata-set=mediatype=web (or even x-archive-meta-mediatype=web) it doesn't put those metadata on the item. Looking at headers with --dump headers I see that with --metadata-set the metdata k/v aren't in the PUT headers. (where they should be to be effective on IA side when you upload)

This is with the --internetarchive-metadata flags that I added in use: image

And this is if I try to use --metadata-set: image

I (maybe wrongly) assumed that --metadata-set was for file's metadata, and not for "item" metadata. (if that makes sense)

CorentinB avatar Sep 10 '24 09:09 CorentinB

I see that with --metadata-set the metdata k/v aren't in the PUT headers

Just to be sure, you did also specify --metadata in the command to enable the feature?

If so, can you post the complete command?

darthShadow avatar Sep 10 '24 09:09 darthShadow

I see that with --metadata-set the metdata k/v aren't in the PUT headers

Just to be sure, you did also specify --metadata in the command to enable the feature?

If so, can you post the complete command?

No I wasn't, if --metadata is needed then it would be nice if rclone could warn about that when --metadata-set is being used I guess! haha

So, more data, using --metadata indeed the metadata is passed but with X-Amz-Filemeta- prefix, with the IA S3 backend doesn't understand for item's metadata. It needs x-archive-meta-KEY=VALUE. (or x-amz-meta) Finding that led me to find https://forum.rclone.org/t/archive-org-internetarchive-issue-setting-metadata-when-uploading/45210/10, so this is a known issue. Someone in the thread mention to use --header and it would work but that feels as hacky as my solution in this PR now feels :smile:

I guess the real way to solve this would be to fix the way --metadata-set makes the headers in the case of the IA backend. Right? (with x-archive-meta instead of x-amz-filemeta)

What do you think? If that's the way, should I close this PR and open a new one for the fix or use this one?

PS: full command was like: ./rclone copy XXX.warc.gz ia:random_item_name/ -v --stats 1s -P --metadata-set="collection=corentin-barreau-web-archives" --metadata-set="title=Rclone test upload" --metadata-set="mediatype=web" --metadata -vv --dump headers

CorentinB avatar Sep 10 '24 10:09 CorentinB

This is the existing metadata code: https://github.com/rclone/rclone/blob/master/backend/internetarchive/internetarchive.go#L787-L820

Since you will probably know the exact headers expected, feel free to update them to the ones used by IA. The same PR can be updated to that fix instead.

darthShadow avatar Sep 10 '24 10:09 darthShadow

This is the existing metadata code: https://github.com/rclone/rclone/blob/master/backend/internetarchive/internetarchive.go#L787-L820

Since you will probably know the exact headers expected, feel free to update them to the ones used by IA. The same PR can be updated to that fix instead.

Doing that! Thanks.

Posting for posterity: https://gist.github.com/jjjake/5571850

EDIT: done :)

CorentinB avatar Sep 10 '24 10:09 CorentinB

There are a few more instances that are using the wrong header:

https://github.com/rclone/rclone/blob/master/backend/internetarchive/internetarchive.go#L789-L790

https://github.com/rclone/rclone/blob/master/backend/internetarchive/internetarchive.go#L611-L617

darthShadow avatar Sep 10 '24 11:09 darthShadow

No I wasn't, if --metadata is needed then it would be nice if rclone could warn about that when --metadata-set is being used I guess! haha

I would agree :-) Want to send a PR?

Since you will probably know the exact headers expected, feel free to update them to the ones used by IA. The same PR can be updated to that fix instead.

Doing that! Thanks.

The PR looks quite simple now which is good!

I would appreciate it if you could look through the other metadata in use and see if it is correct.


One thing I wanted to ask someone at the IA....

We currently run daily integration tests with the IA - see https://pub.rclone.org/integration-tests/current/ and the internetarchive sections

This is essentially doing this with a 2 hour timeout.

cd backend/internetarchive
go test -v -verbose

These tests almost never pass though - the tests take longer than 2 hours :-( Here is a typical failure https://pub.rclone.org/integration-tests/current/internetarchive-backend.internetarchive-TestIArclone-integration-test-1.txt

Do you have a better suggestion for running these tests?

The config looks like this

[TestIA]
type = internetarchive
access_key_id = XXX
secret_access_key = XXX
wait_archive = 10m

And the remote rclone uses is defined here

https://github.com/rclone/rclone/blob/ad122c6f6fb2eb0d2c0cca1c7c7cfa2a454106a8/fstest/test_all/config.yaml#L162-L166

Any ideas on how to improve this?

ncw avatar Sep 10 '24 11:09 ncw

There are a few more instances that are using the wrong header:

https://github.com/rclone/rclone/blob/master/backend/internetarchive/internetarchive.go#L789-L790

https://github.com/rclone/rclone/blob/master/backend/internetarchive/internetarchive.go#L611-L617

@darthShadow I'm not sure about these. If we want rclone-mtime & rclone-update-track to be set as metadata fields in the item, yes we can change x-amz-filemeta to x-archive-meta but when it comes to the stuff in the Copy it looks like the function tries to set file-level metadata? If yes, that is not supported by IA, we can only have item-level metadata.

I would agree :-) Want to send a PR?

@ncw sure I can try to find the time to do that. :smiley: I'm also looking at your integration tests ASAP, not sure I can solve it but I can look. (I am not in the team that handle IA S3 / items stuff, I am in the Wayback Machine team, archiving the web! :smile:)

Are you happy with these changes now @CorentinB ?

If you all agree with what I just wrote about what's happening in the Copy function, then yes for me, regarding the original goal of the PR, I'm good with the changes. I think I lack knowledge about rclone server-side workflow to really understand what might or might not be working fine with the Copy function.

CorentinB avatar Sep 10 '24 11:09 CorentinB

There are a few more instances that are using the wrong header: https://github.com/rclone/rclone/blob/master/backend/internetarchive/internetarchive.go#L789-L790 https://github.com/rclone/rclone/blob/master/backend/internetarchive/internetarchive.go#L611-L617

@darthShadow I'm not sure about these. If we want rclone-mtime & rclone-update-track to be set as metadata fields in the item, yes we can change x-amz-filemeta to x-archive-meta but when it comes to the stuff in the Copy it looks like the function tries to set file-level metadata? If yes, that is not supported by IA, we can only have item-level metadata.

I'm not sure I understand the difference between file-level metadata or item-level metadata?

However if we can't set the file level metadata then setting the item level metadata seems like it might be sensible - what do you think @CorentinB ? Otherwise that metadata is just disappearing into the ether, never to be seen again.

I would agree :-) Want to send a PR?

@ncw sure I can try to find the time to do that. πŸ˜ƒ

:+1:

I'm also looking at your integration tests ASAP, not sure I can solve it but I can look. (I am not in the team that handle IA S3 / items stuff, I am in the Wayback Machine team, archiving the web! πŸ˜„)

Thanks for taking a look :-) And cool job :-)

Are you happy with these changes now @CorentinB ?

If you all agree with what I just wrote about what's happening in the Copy function, then yes for me, regarding the original goal of the PR, I'm good with the changes. I think I lack knowledge about rclone server-side workflow to really understand what might or might not be working fine with the Copy function.

And I think I lack the corresponding understanding of IA! But take a look over what I wrote above and tell me what you think.

ncw avatar Sep 10 '24 15:09 ncw

I'm not sure I understand the difference between file-level metadata or item-level metadata?

https://archive.org/details/ellibrodesancipr00surf -> ellibrodesancipr00surf is an item. Inside you can upload files, it's a directory basically.

x-archive-meta allow to set metadata on the items, the type of the items for example, web, data, movies.. The mediatype metadata value of an item for example will impact which derive process will run when the item is uploaded or updated. e.g. you upload an mkv file in an item with mediatype=movies, it will automatically generate mp4 files derived on this mkv file.

For a concrete example, x-archive-meta-rclone-mtime & x-archive-meta-rclone-update-track, those will have different values for every file / PUT request, but these are not metadata for a file, these are for the item. So on each PUT, the rclone-update-track metadata field of the ITEM will be updated.

All of that for saying that at the Internet Archive, everything is items, and we can put many kind of metadata that have many consequences, on ITEMS. Files in an item can't have different mediatype for example.

https://archive.org/developers/items.html https://archive.org/developers/metadata-schema/index.html

Note again: I don't know 100% of our own system, so that's why I'm cautious to not touch the Copy function too much. The person that implemented it that way maybe had a reason but I lack knowledge to modify it myself. I don't even know if it works right now. I will try to figure out some time in the near future to find the best person at IA to help me verify that Copy code, someone that know our S3 system perfectly well :smile:

I hope I'm clear enough. TL;DR is that I think that PR is fine for now and can be merged if it's OK for you, but someone (me or someone else) need to take a closer look at Copy in the future.

CorentinB avatar Sep 10 '24 16:09 CorentinB

I just reflected again on what I wrote and understood of Copy and I want to add that I mistakenly thought that Copy was trying to put individual metadata on files, I don't think it does that right now my bad, I've misread the code.

I think I just don't really understand what Copy is for. (compared to Update)

NB/edit: doesn't change the fact that I think this PR can be merged, but work is left to be done on understanding and possibly fixing of the Copy function.

CorentinB avatar Sep 10 '24 16:09 CorentinB

Thanks for your explanation @CorentinB

I'm not sure I understand the difference between file-level metadata or item-level metadata?

https://archive.org/details/ellibrodesancipr00surf -> ellibrodesancipr00surf is an item. Inside you can upload files, it's a directory basically.

So if I think of items as being directories of files then hopefully that won't be too far off.

What Is an Item? Archive.org is made up of β€œitems”. An item is a logical β€œthing” that we represent on one web page on archive.org. An item can be considered as a group of files that deserve their own metadata. If the files in an item have separate metadata, the files should probably be in different items. An item can be a book, a song, an album, a dataset, a movie, an image or set of images, etc. Every item has an identifier that is unique across archive.org.

Assuming that is correct, then rclone normally only acts on files and it would expect metadata to only apply to the file in question.

So I'd say using x-archive-filemeta is probably correct in rclone terms.

Whether that makes sense in IA terms I'm not so sure.

And does archive-filemeta actualy exist? I can't find mention of it anywhere in the docs.

x-archive-meta allow to set metadata on the items, the type of the items for example, web, data, movies.. The mediatype metadata value of an item for example will impact which derive process will run when the item is uploaded or updated. e.g. you upload an mkv file in an item with mediatype=movies, it will automatically generate mp4 files derived on this mkv file.

For a concrete example, x-archive-meta-rclone-mtime & x-archive-meta-rclone-update-track, those will have different values for every file / PUT request, but these are not metadata for a file, these are for the item. So on each PUT, the rclone-update-track metadata field of the ITEM will be updated.

OK, those need to be for the file not the item for rclone to work properly.

All of that for saying that at the Internet Archive, everything is items, and we can put many kind of metadata that have many consequences, on ITEMS. Files in an item can't have different mediatype for example.

https://archive.org/developers/items.html https://archive.org/developers/metadata-schema/index.html

Note again: I don't know 100% of our own system, so that's why I'm cautious to not touch the Copy function too much. The person that implemented it that way maybe had a reason but I lack knowledge to modify it myself. I don't even know if it works right now. I will try to figure out some time in the near future to find the best person at IA to help me verify that Copy code, someone that know our S3 system perfectly well πŸ˜„

I hope I'm clear enough. TL;DR is that I think that PR is fine for now and can be merged if it's OK for you, but someone (me or someone else) need to take a closer look at Copy in the future.

Now I'm not so sure the PR is correct!

I just reflected again on what I wrote and understood of Copy and I want to add that I mistakenly thought that Copy was trying to put individual metadata on files, I don't think it does that right now my bad, I've misread the code.

I think I just don't really understand what Copy is for. (compared to Update)

What Copy does is do a server side copy of a file. So if you had uploaded a file to an item you could copy it to a new name in the same item or a different item (hopefully I have the hang of the terminology now!).

NB/edit: doesn't change the fact that I think this PR can be merged, but work is left to be done on understanding and possibly fixing of the Copy function.

What I don't understand is why you'd want to change the item metadata for each file that gets uploaded? Maybe there is normally only one file in an item?

If we are treating an item as a directory maybe we should add the metadata at the time it is created? Though I don't think that is how the API works is it?

As far as the PR is concerned, I now think changing x-archive-filemeta-rclone* to x-archive-meta-rclone* is incorrect changing those will break rclone syncing I think. Rclone needs individual files to have their own timestamp so it can know if it has synced them.

I am less sure about the change in Update though.

Changing headers[fmt.Sprintf("x-amz-filemeta-%s", mk)] = mv to headers[fmt.Sprintf("x-archive-meta-%s", mk)] = mv will mean that what rclone thinks is file based metadata is set on the item.

This would definitely cause the integration tests to fail (if they were working!) as rclone expects to be able to read the metadata on files that it set.

However is this what users actually want?

For each file uploaded to set the item metadata? From my reading of the IA docs I think it probably is.

Maybe we do need a different way of setting item metadata? Something almost exactly like your original PR! Then we can explain that the metadata support is for file metadata.

How do you read item metadata? Does it come back when you HEAD a file? If not then maybe we need a way of reading it too.

It perhaps might make more sense in terms of rclone to pretend the item metadata is metadata belonging to the directory that the files are in.

What do you think @CorentinB ?

ncw avatar Sep 10 '24 17:09 ncw

Ok @ncw I just had a great chat with a colleague, he explained many things to me about the quirks of IA S3 implementation.

First: file-level metadata exist, and are working, so the rclone-specific metadata fields that rclone already sets -> this is going to the file's metadata. ALTHOUGH, the file-level metadata is NOT accessible through IA's metadata API but it is available when doing a HEAD request on the file using the S3 path (s3.archive.org). That is one of the quirks.

Also, one very important thing, any metadata set via headers is actually file-level first, BUT, if the item doesn't exist, the item is created (except if x-archive-auto-make-bucket = 0) AND the metadata is used for the item. (if the header is x-archive-meta-XXX)

So I think my PR's original code was actually fine. I think it would be good to have the 2 being distinct. And I should revert my changes to the x- prefix of the values in --metadata-set.

I also talked about the Copy function, so I'm not an S3-proficient person and he explained to me that basically, the COPY directive has its quirks too at IA, basically you can only do a COPY if the source item is public. If an item is public or not can be checked via the metadata API.

I apologize for my previous own misunderstanding about how we handle file-level metadata at IA :smile:

What should we do @ncw?

CorentinB avatar Sep 10 '24 19:09 CorentinB

First: file-level metadata exist, and are working, so the rclone-specific metadata fields that rclone already sets -> this is going to the file's metadata. ALTHOUGH, the file-level metadata is NOT accessible through IA's metadata API but it is available when doing a HEAD request on the file using the S3 path (s3.archive.org). That is one of the quirks.

That makes sense. Rclone is treating the backend as an almost S3 backend.

Also, one very important thing, any metadata set via headers is actually file-level first, BUT, if the item doesn't exist, the item is created (except if x-archive-auto-make-bucket = 0) AND the metadata is used for the item. (if the header is x-archive-meta-XXX)

Got it. I think we set x-archive-auto-make-bucket = 1.

So I think my PR's original code was actually fine. I think it would be good to have the 2 being distinct. And I should revert my changes to the x- prefix of the values in --metadata-set.

:+1:

I also talked about the Copy function, so I'm not an S3-proficient person and he explained to me that basically, the COPY directive has its quirks too at IA, basically you can only do a COPY if the source item is public. If an item is public or not can be checked via the metadata API.

OK

I apologize for my previous own misunderstanding about how we handle file-level metadata at IA πŸ˜„

No worries. There is a confusing mixture of terminology here!

Rclone IA
Metadata File Metadata
Bucket/Directory Item
Object/File File
? Metadata

Your PR is filling in the ?!

What should we do @ncw?

I did a review of your original commit with some suggestions. I suggest you drop the other commits, edit the commit then force push this branch so it is back to one commit.

I think this is going to need lots of docs :-)

ncw avatar Sep 10 '24 19:09 ncw

I think this also requires the rename of the x-amz-filemeta headers to x-archive-filemeta or does IA support them even with the x-amz prefix?

darthShadow avatar Sep 11 '24 10:09 darthShadow

I think this also requires the rename of the x-amz-filemeta headers to x-archive-filemeta or does IA support them even with the x-amz prefix?

No it's fine like that! e.g. https://ia802305.us.archive.org/30/items/test_rclone_upload_09_10_2024-2/test_rclone_upload_09_10_2024-2_files.xml

@ncw thanks I'm almost done with the changes. I just commented on one of your comments. LMK and then I'll push everything for a final review I hope :)

Edit: I also changed the default behaviour so that it triggers item derive, it was set to false previously. I added a flag to disable derive. I think it's better if it derives by default, it's what most people would expect I think. (when you upload a file via archive.org web UI, it derives)

CorentinB avatar Sep 11 '24 11:09 CorentinB

Also, I've been at IA for more than 5 years and I hope to stay there for many many more years, so if you want, feel free to add me as the maintainer of the IA backend on https://github.com/rclone/rclone/blob/master/MAINTAINERS.md. My plan is to use rclone at scale (as a Go module though) to upload to IA, inside IA. I'll do my best to enhance the backend more when I find time in the next coming weeks/months.

CorentinB avatar Sep 11 '24 12:09 CorentinB

Hi @ncw, do you happen to know when you will have the chance to review these changes please?

CorentinB avatar Sep 13 '24 15:09 CorentinB

Hi @ncw,

I think I beat the record for delay here :) Lot of things kept me very busy for the past weeks, I deeply apologize.

I think your idea is very acceptable, I don't see this being used in the config file, at all.

CorentinB avatar Dec 03 '24 11:12 CorentinB

No worries! Do you think this is ready to merge now?

ncw avatar Dec 03 '24 18:12 ncw

No worries! Do you think this is ready to merge now?

@ncw I think it is! :)

CorentinB avatar Dec 03 '24 23:12 CorentinB

Hi @ncw, do you think you can merge this PR now? Thanks!

CorentinB avatar Dec 09 '24 10:12 CorentinB

Hi @ncw, any chance this can be merged? Thanks!

CorentinB avatar Dec 31 '24 16:12 CorentinB

I think this looks fine now. Many apologies for dropping the ball with this PR @CorentinB

I will merge now.

Many thanks @ncw, and no worries, I know you are very busy. :)

CorentinB avatar Jan 17 '25 15:01 CorentinB