Change metadata storage from FileStorage to Database
Currently, metadata attributes are managed as a JSON file. As a long-term plan, however, we will change this so that metadata can be accessed concurrently through a database.
Example data:
{
  "additional": {
    "framework": "pytorch",
    "mode": "test"
  },
  "attributes": [
    {
      "itemsize": 0,
      "name": "image",
      "shape": [1, 28, 28],
      "type": "float32"
    },
    {
      "itemsize": 0,
      "name": "target",
      "shape": [1],
      "type": "int64"
    }
  ],
  "compressor": {
    "complevel": 0,
    "complib": "zlib"
  },
  "dataset_name": "mnist",
  "endpoint": "/Users/graykode/shared",
  "filetype": [],
  "indexer": {
    "3335": {
      "length": 3335,
      "name": "tmpuoetuutie1ec9bdf4cb142e8.h5"
    },
    "6670": {
      "length": 3335,
      "name": "tmpzzv9w4r94aac98a99ee74d52.h5"
    },
    "10000": {
      "length": 3330,
      "name": "tmp3qvp1bbtbf74db88d9a0499c.h5"
    }
  }
}
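To make the `indexer` layout concrete, here is a minimal sketch of how a global sample index could be mapped to a file. It assumes that each `indexer` key is the exclusive cumulative end index of its file (3335 + 3335 + 3330 = 10000 fits that reading); `locate` is a hypothetical helper, not a function from the library.

```python
from bisect import bisect_right

# Indexer section copied from the example metadata above.
indexer = {
    "3335": {"length": 3335, "name": "tmpuoetuutie1ec9bdf4cb142e8.h5"},
    "6670": {"length": 3335, "name": "tmpzzv9w4r94aac98a99ee74d52.h5"},
    "10000": {"length": 3330, "name": "tmp3qvp1bbtbf74db88d9a0499c.h5"},
}


def locate(indexer, sample_index):
    """Return (file name, offset inside that file) for a global sample index.

    Assumption: every key is the exclusive cumulative end index of its file.
    """
    ends = sorted(int(end) for end in indexer)
    if not 0 <= sample_index < ends[-1]:
        raise IndexError(f"sample index {sample_index} out of range")
    # First file whose end index is strictly greater than sample_index.
    end = ends[bisect_right(ends, sample_index)]
    entry = indexer[str(end)]
    start = end - entry["length"]
    return entry["name"], sample_index - start


print(locate(indexer, 0))
print(locate(indexer, 3335))
print(locate(indexer, 9999))
```

Under that assumption, index 3335 is the first sample of the second file, which is the kind of lookup a database-backed indexer would also have to answer.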
How about this ER Diagram?

- Bold type: primary key
- Italic type: foreign key
@seongpyoHong
I have a few comments on some of the DB types.
- The `bucket` `id` is hashed here and used as a string type. Therefore, it should be declared as a string rather than an integer.
- `filetype` is a list, and even when converted to a string it can exceed 255 characters. Therefore, an alternative type is needed.
- `additional` is a dict type, but like `filetype`, it can exceed 255 characters even when converted to a string. Therefore, a type such as Text seems more appropriate.
Additionally, for attribute names, it's better to use pothole notation (lowercase letters and underscores).
@graykode
- Are there hashed IDs in other tables?
- I'll change the data type from `varchar` to `text` and the naming convention to pothole notation.
@seongpyoHong I've updated the ER diagram as shown below:
CREATE TABLE bucket (
    id varchar(255) primary key not null,
    additional text not null,
    dataset_name varchar(255) not null,
    endpoint varchar(255) not null,
    compressor varchar(255) not null,
    sagemaker boolean not null default false
);
CREATE TABLE files (
    id serial primary key not null,
    name varchar(255) not null,
    bucket_id varchar(255),
    constraint bucket_id foreign key (bucket_id) references bucket(id)
);
CREATE TABLE attributes (
    id serial primary key not null,
    name varchar(255) not null,
    type varchar(255) not null,
    shape varchar(255) not null,
    itemsize integer not null,
    bucket_id varchar(255),
    constraint bucket_id foreign key (bucket_id) references bucket(id)
);
CREATE TABLE indexer (
    id serial primary key not null,
    indexer_end bigint not null,
    length integer not null,
    name varchar(255) not null,
    bucket_id varchar(255),
    constraint bucket_id foreign key (bucket_id) references bucket(id)
);
Since filetype is a list that is frequently modified, I decided it would be better to move this attribute into its own table.
One minor addition: as far as I know, variable-length strings (Text) can slow down the DB. So how about declaring `additional`, the only attribute that uses the Text type, as varchar(n) with a very large n?
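For reference, here is a rough sketch of how the example JSON metadata could be split across these four tables. It uses Python's built-in `sqlite3` as a stand-in for the real RDBMS, so the DDL is adapted (`serial` becomes `integer primary key`, `default false` becomes `default 0`, and the foreign keys are written inline). `make_bucket_id` is a hypothetical hashing scheme, and storing `shape`, `additional`, and `compressor` as JSON strings is also an assumption, not something decided in this thread.

```python
import hashlib
import json
import sqlite3

# Example metadata from this issue.
metadata = {
    "additional": {"framework": "pytorch", "mode": "test"},
    "attributes": [
        {"itemsize": 0, "name": "image", "shape": [1, 28, 28], "type": "float32"},
        {"itemsize": 0, "name": "target", "shape": [1], "type": "int64"},
    ],
    "compressor": {"complevel": 0, "complib": "zlib"},
    "dataset_name": "mnist",
    "endpoint": "/Users/graykode/shared",
    "filetype": [],
    "indexer": {
        "3335": {"length": 3335, "name": "tmpuoetuutie1ec9bdf4cb142e8.h5"},
        "6670": {"length": 3335, "name": "tmpzzv9w4r94aac98a99ee74d52.h5"},
        "10000": {"length": 3330, "name": "tmp3qvp1bbtbf74db88d9a0499c.h5"},
    },
}

# SQLite adaptation of the DDL (see the notes in the paragraph above).
SCHEMA = """
CREATE TABLE bucket (
    id varchar(255) primary key not null,
    additional text not null,
    dataset_name varchar(255) not null,
    endpoint varchar(255) not null,
    compressor varchar(255) not null,
    sagemaker boolean not null default 0
);
CREATE TABLE files (
    id integer primary key,
    name varchar(255) not null,
    bucket_id varchar(255) references bucket(id)
);
CREATE TABLE attributes (
    id integer primary key,
    name varchar(255) not null,
    type varchar(255) not null,
    shape varchar(255) not null,
    itemsize integer not null,
    bucket_id varchar(255) references bucket(id)
);
CREATE TABLE indexer (
    id integer primary key,
    indexer_end bigint not null,
    length integer not null,
    name varchar(255) not null,
    bucket_id varchar(255) references bucket(id)
);
"""


def make_bucket_id(meta):
    # Hypothetical hashing scheme; the project's real one may differ.
    key = json.dumps([meta["dataset_name"], meta["additional"]], sort_keys=True)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def store_metadata(conn, meta):
    """Split one JSON metadata document into rows of the four tables."""
    bucket_id = make_bucket_id(meta)
    conn.execute(
        "INSERT INTO bucket (id, additional, dataset_name, endpoint, compressor)"
        " VALUES (?, ?, ?, ?, ?)",
        (bucket_id, json.dumps(meta["additional"]), meta["dataset_name"],
         meta["endpoint"], json.dumps(meta["compressor"])),
    )
    conn.executemany(
        "INSERT INTO files (name, bucket_id) VALUES (?, ?)",
        [(name, bucket_id) for name in meta["filetype"]],
    )
    conn.executemany(
        "INSERT INTO attributes (name, type, shape, itemsize, bucket_id)"
        " VALUES (?, ?, ?, ?, ?)",
        [(a["name"], a["type"], json.dumps(a["shape"]), a["itemsize"], bucket_id)
         for a in meta["attributes"]],
    )
    conn.executemany(
        "INSERT INTO indexer (indexer_end, length, name, bucket_id)"
        " VALUES (?, ?, ?, ?)",
        [(int(end), e["length"], e["name"], bucket_id)
         for end, e in meta["indexer"].items()],
    )
    conn.commit()
    return bucket_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
bucket_id = store_metadata(conn, metadata)
print(conn.execute("SELECT name, shape FROM attributes").fetchall())
```

Each list or dict in the JSON that can grow (`attributes`, `indexer`, `filetype`) becomes one row per element, all tied back to the bucket through `bucket_id`.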
Versions prior to 0.4.0 manage metadata in JSON format, so S3 alone was enough to keep everything. However, once metadata is managed by an RDBMS, we will also need to write code for AWS RDS.