filesystem_spec
Add ctime/mtime to list of expected values in info
Created and/or modified time is returned in the file info of most backends. We should endeavour to surface these in the file info dict with a common format (datetime.datetime? unix timestamp?) and key names.
e.g.,
--- a/fsspec/implementations/local.py
+++ b/fsspec/implementations/local.py
@@ -78,6 +78,8 @@ class LocalFileSystem(AbstractFileSystem):
                 result["size"] = out2.st_size
             except IOError:
                 result["size"] = 0
+        result['created'] = datetime.datetime.utcfromtimestamp(result["created"])
+        result['modified'] = datetime.datetime.utcfromtimestamp(result["mtime"])
         return result
Marked as "good first issue" because this should be simple per implementation, but there are quite a few implementations to go through.
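As a rough illustration of the idea for the local backend, the sketch below builds the two values directly from `os.stat`. The helper name `local_time_info` and the key names `created` and `modified` are assumptions for illustration only, not an agreed convention. It uses `datetime.fromtimestamp(..., tz=timezone.utc)` so the results are timezone-aware, whereas `utcfromtimestamp` returns naive datetimes.

```python
import datetime
import os
import tempfile


def local_time_info(path):
    """Hypothetical helper: created/modified times as aware UTC datetimes."""
    st = os.stat(path)
    utc = datetime.timezone.utc
    return {
        # On Unix, st_ctime is the last metadata change, not the creation time
        "created": datetime.datetime.fromtimestamp(st.st_ctime, tz=utc),
        "modified": datetime.datetime.fromtimestamp(st.st_mtime, tz=utc),
    }


with tempfile.NamedTemporaryFile() as f:
    times = local_time_info(f.name)
    print(times["modified"].isoformat())
```

Whether `created` should mean `st_ctime` at all on Unix is one of the things a common convention would have to settle.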
A list of filesystems and their info keys
I collected some notes about the .info() dicts of the different filesystems.
Posting it here in case it might be useful:
AbstractFileSystem
"name", "size", "type"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/spec.py#L669-L670
arrow
"name", "size", "type", "mtime" (datetime | float | None)
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/arrow.py#L101-L118
https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileInfo.html#pyarrow.fs.FileInfo
dask
returns whatever the remote fs returns.
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/dask.py#L93-L97
data
"name", "size", "type", "mimetype"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/data.py#L31-L35
dbfs
"name", "size", "type"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/dbfs.py#L84-L90
dirfs
returns whatever the remote fs returns.
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/dirfs.py#L233-L241
ftp
"name", "size", "type", "modify", "unix.owner", "unix.group", "unix.mode", and others returned via FTP.mlsd()
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/ftp.py#L100-L118
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/ftp.py#L370-L384
git
"name", "size", "type", "hex", "mode" # mode is octal str, hex is str?
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/git.py#L90-L96
github
"name", "size", "type", "sha", "mode" # mode is octal str, sha is str
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/github.py#L167-L178
http
"name", "size", "type", "mimetype", "ETag", "Content-MD5", "Digest"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/http.py#L190-L194
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/http.py#L838-L856
jupyter
"name", "size", "type", "last_modified", "created", "format", "mimetype", "writable"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/jupyter.py#L47-L57
example:
{
"name": "slurm-22382538.out",
"last_modified": "2024-02-09T13:03:30.773865Z",
"created": "2024-02-09T13:03:30.773865Z",
"format": null,
"mimetype": null,
"size": 2896,
"writable": true,
"type": "file"
}
libarchive
"name", "size", "type", "created", "mode", "uid", "gid", "mtime"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/libarchive.py#L165-L172
libarchive mappings:
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/libarchive.py#L145-L153
local
"name", "size", "type", "created", "isLink", "mode", "uid", "gid", "mtime", "ino", "nlink", "destination"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/local.py#L97-L112
memory
"name", "size", "type", "created"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/memory.py#L41-L47
reference
"name", "size", "type"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/reference.py#L224-L235
sftp
"name", "size", "type", "uid", "gid", "time", "mtime"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/sftp.py#L108-L120
smb
"name", "size", "type", "uid", "gid", "time", "mtime"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/smb.py#L168-L176
tar
"name", "size", "type", "mode", "uid", "gid", "mtime", "chksum", "linkname", "uname", "gname", "devmajor", "devminor"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/tar.py#L112-L116
example:
_ = {
'name': 'somefile.md',
'mode': 420,
'uid': 501,
'gid': 20,
'size': 382,
'mtime': 1707314187,
'chksum': 8314,
'type': 'file',
'linkname': '',
'uname': 'andreaspoehlmann',
'gname': 'staff',
'devmajor': 0,
'devminor': 0
}
webhdfs
"name", "size", "type", "accessTime", "blockSize", "group", "modificationTime", "owner", "pathSuffix", "permission", "replication"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/webhdfs.py#L266-L270
https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#FileStatus
zip
"name", "size", "type"
https://github.com/fsspec/filesystem_spec/blob/2a8e0ee2b9e8fc4f24e9fa2b257c1599a9e4711a/fsspec/implementations/zip.py#L100-L104
adlfs
"name", "size", "type", "metadata", "creation_time", "deleted", "deleted_time", "last_modified", "content_time", "content_settings", "remaining_retention_days", "archive_status", "last_accessed_on", "etag", "tags", "tag_count", "version_id", "is_current_version"
https://github.com/fsspec/adlfs/blob/576fb7a6a53a55375b4458c09e5bb571d945d410/adlfs/spec.py#L49-L67
https://github.com/fsspec/adlfs/blob/576fb7a6a53a55375b4458c09e5bb571d945d410/adlfs/spec.py#L829C13-L846
gcsfs
https://cloud.google.com/storage/docs/json_api/v1/objects#resource
https://github.com/fsspec/gcsfs/blob/f526d96860c1422e7b4599b70b267607dae1af8a/gcsfs/core.py#L465-L477
s3fs
"name", "size", "type", "StorageClass", "VersionId", "ContentType", "ETag", "LastModified"
https://github.com/fsspec/s3fs/blob/74f4d95a62d7339a1af12db4339f22c5f3d73670/s3fs/core.py#L1310-L1319
alluxio
"name", "size", "type", "last_modification_time_ms"
https://github.com/fsspec/alluxiofs/blob/33489bcea618d6e934e5227be77be75b5ca105ff/alluxiofs/core.py#L134-L149
wandb
"name", "size", "type", "md5", "mimetype"
https://github.com/jkulhanek/wandbfs/blob/ccc7e4dceb45070de8c440b44ddee96fdd348057/wandbfs/_wandbfs.py#L63-L68
oci
"name", "size", "type", "etag", "md5", "timeCreated", "timeModified", "storageTier", "archivalState"
https://github.com/oracle/ocifs/blob/f0e1d3b7b26bc1c1b010abb11df6cd06ac318ed3/ocifs/core.py#L498-L509
asynclocal
same as local
gdrive
"name", "size", "type", and others returned via ??? https://developers.google.com/drive/api/reference/rest/v3/files#File
https://github.com/fsspec/gdrivefs/blob/8bbfa457605d60d40d2b09c8c93d493cf543100e/gdrivefs/core.py#L157-L160
dropbox
"name", "size", "type", and all public attr from FileMetadata
https://dropbox-sdk-python.readthedocs.io/en/latest/api/files.html#dropbox.files.FileMetadata
https://github.com/fsspec/dropboxdrivefs/blob/23463258eca49c10d77de33e9d07e4ee5caa090c/dropboxdrivefs/core.py#L163-L176
oss
"name", "size", "type", "LastModified"
https://github.com/fsspec/ossfs/blob/016ccbad6b90fe02cf613582bb8db3bb101f4438/src/ossfs/base.py#L186-L199
webdav
"name", "size", "type", and others returned via the webdav4 client, e.g.:
_ = {
'name': '/',
'href': '/',
'size': None,
'created': datetime.datetime(2024, 2, 9, 14, 40, 9, tzinfo=tzutc()),
'modified': datetime.datetime(2024, 2, 9, 14, 40, 9, tzinfo=datetime.timezone.utc),
'content_language': None,
'content_type': None,
'etag': None,
'type': 'directory',
'display_name': 'test_storage_options0'
}
https://github.com/skshetry/webdav4/blob/4c2046e2250f001bdad76541c0e877e4b40c332e/src/webdav4/fsspec.py#L51-L57
https://github.com/skshetry/webdav4/blob/4c2046e2250f001bdad76541c0e877e4b40c332e/src/webdav4/client.py#L54-L65
dvc
"name", "size", "type", "md5", "md5-dos2unix", "dvc_info", "isdvc", "isout", "fs_info", "isexec", "repo"
https://github.com/iterative/dvc/blob/953ae56536f03d915f396cd6cafd89aaa54fafc5/dvc/fs/dvc.py#L41-L69
root
"name", "size", "type"
https://github.com/CoffeaTeam/fsspec-xrootd/blob/f8c57cd7b0361425ee08a77096dd642ddeb1d987/src/fsspec_xrootd/xrootd.py#L320-L338
box
"name", "size", "type", "id", "modified_at", "created_at"
https://github.com/IBM/boxfs/blob/718fb0071d20a7004f44fe2fa0eac26dc9c3d5d5/src/boxfs/boxfs.py#L395-L402
lakefs
"name", "size", "type", "content-type", "checksum", "mtime"
https://github.com/aai-institute/lakefs-spec/blob/f05c5b6c57547e9f169e3b9c4ed5346f2d65bf35/src/lakefs_spec/spec.py#L356-L363
Thank you, @ap--, that is very useful. It is also worth adding that some backends that don't really have directories will make fake info dicts for them, typically {"name": "...", "size": 0, "type": "directory"}.
Your list makes it sound like any FS could do with an add_standard_info_fields(info_dict) static method, where we decide what those standard fields are. For example, it could convert whatever time unit the backend uses to a standard representation, which would help for rsync() in particular.
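To make the rsync() point concrete, here is a minimal sketch of the kind of comparison a standard time field would allow across two different backends. The function `needs_update` is hypothetical, and it assumes both info dicts already carry a standardized "mtime" key holding a timezone-aware datetime, which nothing guarantees today.

```python
import datetime


def needs_update(src_info, dst_info):
    """Hypothetical check: copy when sizes differ or the source is newer.

    Assumes a standardized "mtime" key (aware datetime) in both dicts.
    """
    if src_info["size"] != dst_info["size"]:
        return True
    return src_info["mtime"] > dst_info["mtime"]


utc = datetime.timezone.utc
src = {"name": "a/file", "size": 382, "type": "file",
       "mtime": datetime.datetime(2024, 2, 9, 14, 40, 9, tzinfo=utc)}
dst = {"name": "b/file", "size": 382, "type": "file",
       "mtime": datetime.datetime(2024, 2, 7, 13, 56, 27, tzinfo=utc)}
print(needs_update(src, dst))
```

Without a common key and type, the same check needs per-backend knowledge of names such as "LastModified" or "modificationTime" and of their units.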
Yes that would be a great step towards standardizing the info_dict.
AbstractFileSystem could even have a default implementation that tries various aliases for getting mtime (and potentially other fields) and converts the values to the standard datatype.
For completeness, I'm cross-referencing barneygale/pathlib-abc#3. I started looking into this because I need to convert info_dicts into an os.stat_result compatible type for universal_pathlib.
While you're at it, integer nanoseconds instead of float times would be good: https://docs.python.org/3/library/os.html#os.stat_result.st_mtime_ns
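A minimal sketch of what such a default could look like, using key names from the list above. The names `MTIME_ALIASES`, `MILLISECOND_KEYS`, `to_utc_datetime` and `standard_mtime` are made up for illustration, and the choice of an aware UTC datetime as the standard type is an assumption, as is treating the two `*_ms`-style keys as milliseconds and naive datetimes as UTC. The FTP "modify" value is left out because it uses a different string format.

```python
import datetime

UTC = datetime.timezone.utc

# Aliases collected from the list above; the lookup order is an arbitrary choice.
MTIME_ALIASES = (
    "mtime",
    "modified",
    "last_modified",
    "LastModified",
    "modified_at",
    "timeModified",
    "modificationTime",
    "last_modification_time_ms",
)
# Assumption: these keys hold milliseconds since the epoch, not seconds.
MILLISECOND_KEYS = {"modificationTime", "last_modification_time_ms"}


def to_utc_datetime(value, milliseconds=False):
    """Convert a datetime, epoch number, or ISO 8601 string to aware UTC."""
    if isinstance(value, (int, float)):
        seconds = value / 1000 if milliseconds else value
        return datetime.datetime.fromtimestamp(seconds, tz=UTC)
    if isinstance(value, str):
        # fromisoformat() only accepts a trailing "Z" from Python 3.11 onwards
        value = datetime.datetime.fromisoformat(value.replace("Z", "+00:00"))
    if not isinstance(value, datetime.datetime):
        raise TypeError(f"unsupported time value: {value!r}")
    if value.tzinfo is None:
        # Assumption: naive datetimes are already in UTC
        return value.replace(tzinfo=UTC)
    return value.astimezone(UTC)


def standard_mtime(info):
    """Hypothetical default: return the first usable mtime alias, or None."""
    for key in MTIME_ALIASES:
        if info.get(key) is not None:
            return to_utc_datetime(info[key], milliseconds=key in MILLISECOND_KEYS)
    return None


print(standard_mtime({"name": "somefile.md", "size": 382, "type": "file", "mtime": 1707314187}))
print(standard_mtime({"name": "a", "size": 0, "type": "file"}))
```

Implementations that know their own format better could override the default instead of relying on the alias guess.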
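A small sketch of why integers help here: a float cannot hold a present-day timestamp at nanosecond resolution, while integer arithmetic on a datetime stays exact down to its microsecond precision. The helper `datetime_to_ns` is hypothetical and assumes a timezone-aware input.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def datetime_to_ns(dt):
    """Return integer nanoseconds since the epoch for an aware datetime."""
    # Dividing one timedelta by another with // yields an int, so no float is involved
    microseconds = (dt - EPOCH) // datetime.timedelta(microseconds=1)
    return microseconds * 1000


# A float round trip cannot preserve a nanosecond timestamp
ns = 1707314187123456789
as_float = ns / 1e9
print(int(as_float * 1e9) == ns)

modified = datetime.datetime(2024, 2, 9, 13, 3, 30, 773865, tzinfo=datetime.timezone.utc)
print(datetime_to_ns(modified))
```

The resulting integer has the same meaning as `os.stat_result.st_mtime_ns`, which is also an integer count of nanoseconds.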