
metadata file change from FileStorage to Database

Open graykode opened this issue 5 years ago • 5 comments

Currently, the metadata attributes are managed as a JSON file. As a long-term plan, however, we will move them into a database so that metadata can be accessed concurrently.

e.g., the current metadata:

{
    "additional": {
        "framework": "pytorch",
        "mode": "test"
    },
    "attributes": [
        {
            "itemsize": 0,
            "name": "image",
            "shape": [
                1,
                28,
                28
            ],
            "type": "float32"
        },
        {
            "itemsize": 0,
            "name": "target",
            "shape": [
                1
            ],
            "type": "int64"
        }
    ],
    "compressor": {
        "complevel": 0,
        "complib": "zlib"
    },
    "dataset_name": "mnist",
    "endpoint": "/Users/graykode/shared",
    "filetype": [],
    "indexer": {
        "3335": {
            "length": 3335,
            "name": "tmpuoetuutie1ec9bdf4cb142e8.h5"
        },
        "6670": {
            "length": 3335,
            "name": "tmpzzv9w4r94aac98a99ee74d52.h5"
        },
        "10000": {
            "length": 3330,
            "name": "tmp3qvp1bbtbf74db88d9a0499c.h5"
        }
    }
}
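
To make the indexer layout concrete, here is a minimal sketch of how a global row index could be resolved to a file. It assumes that each indexer key is the cumulative (exclusive) end index of its file, which is inferred from the numbers above (3335 + 3335 = 6670, 6670 + 3330 = 10000); `locate` is a hypothetical helper, not a matorage function.

```python
from bisect import bisect_right

# Indexer copied from the metadata above.
# Assumption: key = cumulative end index (exclusive), "length" = rows in that file.
indexer = {
    "3335": {"length": 3335, "name": "tmpuoetuutie1ec9bdf4cb142e8.h5"},
    "6670": {"length": 3335, "name": "tmpzzv9w4r94aac98a99ee74d52.h5"},
    "10000": {"length": 3330, "name": "tmp3qvp1bbtbf74db88d9a0499c.h5"},
}


def locate(indexer, index):
    """Return (file name, offset inside that file) for a global row index."""
    ends = sorted(int(key) for key in indexer)
    if not 0 <= index < ends[-1]:
        raise IndexError(index)
    # First end index strictly greater than `index` identifies the file.
    end = ends[bisect_right(ends, index)]
    entry = indexer[str(end)]
    start = end - entry["length"]
    return entry["name"], index - start
```

With a JSON file, every reader has to load and sort these keys itself; in a database the same lookup becomes a query on an indexed column.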

graykode avatar Sep 03 '20 07:09 graykode

How about this ER diagram? (Screenshot 2020-09-27, 2:50:56 PM)

  • Bold type : Primary Key
  • Italic type : Foreign key

seongpyoHong avatar Sep 27 '20 05:09 seongpyoHong

@seongpyoHong

I have a few comments on the column types.

  1. The bucket id is a hash and is used as a string, so it should be declared as a string type rather than an integer.
  2. filetype is a list, and even when it is serialized to a string it can exceed 255 characters, so it needs a type other than varchar(255).
  3. additional is a dict, and like filetype it can exceed 255 characters when serialized to a string, so a type such as Text seems more appropriate.

Additionally, attribute names should use pothole notation (lowercase letters and underscores, i.e. snake_case).
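
As a rough illustration of points 2 and 3, the sketch below serializes a small filetype list and compares its length with a 255-character limit. The file names are hypothetical, modelled on the .h5 names in the indexer, and JSON is only one possible way to serialize the list.

```python
import json

# Hypothetical file names, modelled on the .h5 names in the indexer.
filetype = [f"tmp{i:04d}e1ec9bdf4cb142e8.h5" for i in range(10)]
serialized = json.dumps(filetype)

VARCHAR_LIMIT = 255
# Ten short names are already too long for a varchar(255) column.
fits_in_varchar = len(serialized) <= VARCHAR_LIMIT
```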

graykode avatar Oct 03 '20 07:10 graykode

@graykode

  • Are there hashed IDs in other tables?
  • I'll change the data type from varchar to text and the naming convention to pothole notation.

seongpyoHong avatar Oct 03 '20 09:10 seongpyoHong

@seongpyoHong I've revised the ER diagram as follows:

-- One row per bucket (dataset); id is the hashed bucket id, stored as a string.
CREATE TABLE bucket (
  id varchar(255) primary key not null,
  additional text not null,
  dataset_name varchar(255) not null,
  endpoint varchar(255) not null,
  compressor varchar(255) not null,
  sagemaker boolean not null default false
);

-- Entries of the filetype list, one row per file.
CREATE TABLE files (
  id serial primary key not null,
  name varchar(255) not null,
  bucket_id varchar(255),
  constraint files_bucket_id_fkey foreign key (bucket_id) references bucket(id)
);

-- Entries of the attributes list (name, type, shape, itemsize).
CREATE TABLE attributes (
  id serial primary key not null,
  name varchar(255) not null,
  type varchar(255) not null,
  shape varchar(255) not null,
  itemsize integer not null,
  bucket_id varchar(255),
  constraint attributes_bucket_id_fkey foreign key (bucket_id) references bucket(id)
);

-- Entries of the indexer dict; indexer_end is the dict key.
CREATE TABLE indexer (
  id serial primary key not null,
  indexer_end bigint not null,
  length integer not null,
  name varchar(255) not null,
  bucket_id varchar(255),
  constraint indexer_bucket_id_fkey foreign key (bucket_id) references bucket(id)
);

Since filetype is a list that is frequently modified, I decided it would be better to split it out into its own table.

One minor addition: as far as I know, variable-length strings (Text) can slow down the DB. So how about declaring additional, the only attribute that uses the Text type, as varchar(n) with a very large n instead?
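
To see how the JSON metadata would map onto these tables, here is a minimal sketch using Python's built-in sqlite3 module. It is an adaptation for illustration only: `serial` becomes `integer primary key`, the boolean default is written as 0, and `save_metadata` plus the SHA-256 bucket id are hypothetical, since the real hashing scheme is not shown in this thread.

```python
import hashlib
import json
import sqlite3

# SQLite adaptation of the schema above (assumption: serial -> integer primary key).
SCHEMA = """
CREATE TABLE bucket (
  id varchar(255) primary key not null, additional text not null,
  dataset_name varchar(255) not null, endpoint varchar(255) not null,
  compressor varchar(255) not null, sagemaker boolean not null default 0);
CREATE TABLE files (
  id integer primary key, name varchar(255) not null,
  bucket_id varchar(255) references bucket(id));
CREATE TABLE attributes (
  id integer primary key, name varchar(255) not null,
  type varchar(255) not null, shape varchar(255) not null,
  itemsize integer not null, bucket_id varchar(255) references bucket(id));
CREATE TABLE indexer (
  id integer primary key, indexer_end bigint not null,
  length integer not null, name varchar(255) not null,
  bucket_id varchar(255) references bucket(id));
"""

# Metadata from the first comment of this issue.
metadata = {
    "additional": {"framework": "pytorch", "mode": "test"},
    "attributes": [
        {"itemsize": 0, "name": "image", "shape": [1, 28, 28], "type": "float32"},
        {"itemsize": 0, "name": "target", "shape": [1], "type": "int64"},
    ],
    "compressor": {"complevel": 0, "complib": "zlib"},
    "dataset_name": "mnist",
    "endpoint": "/Users/graykode/shared",
    "filetype": [],
    "indexer": {
        "3335": {"length": 3335, "name": "tmpuoetuutie1ec9bdf4cb142e8.h5"},
        "6670": {"length": 3335, "name": "tmpzzv9w4r94aac98a99ee74d52.h5"},
        "10000": {"length": 3330, "name": "tmp3qvp1bbtbf74db88d9a0499c.h5"},
    },
}


def save_metadata(conn, meta):
    """Write one metadata dict into the four tables and return the bucket id."""
    # Hypothetical bucket id; the real hashing scheme may differ.
    key = json.dumps([meta["dataset_name"], meta["additional"]], sort_keys=True)
    bucket_id = hashlib.sha256(key.encode()).hexdigest()
    with conn:  # single transaction: either every row is written or none
        conn.execute(
            "INSERT INTO bucket (id, additional, dataset_name, endpoint, compressor)"
            " VALUES (?, ?, ?, ?, ?)",
            (bucket_id, json.dumps(meta["additional"]), meta["dataset_name"],
             meta["endpoint"], json.dumps(meta["compressor"])),
        )
        conn.executemany(
            "INSERT INTO files (name, bucket_id) VALUES (?, ?)",
            [(name, bucket_id) for name in meta["filetype"]],
        )
        conn.executemany(
            "INSERT INTO attributes (name, type, shape, itemsize, bucket_id)"
            " VALUES (?, ?, ?, ?, ?)",
            [(a["name"], a["type"], json.dumps(a["shape"]), a["itemsize"], bucket_id)
             for a in meta["attributes"]],
        )
        conn.executemany(
            "INSERT INTO indexer (indexer_end, length, name, bucket_id)"
            " VALUES (?, ?, ?, ?)",
            [(int(end), e["length"], e["name"], bucket_id)
             for end, e in meta["indexer"].items()],
        )
    return bucket_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
bucket_id = save_metadata(conn, metadata)
```

Wrapping the inserts in one transaction is what the JSON file cannot offer: a concurrent reader sees either the old metadata or the complete new metadata, never a partially written state.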

graykode avatar Oct 08 '20 09:10 graykode

Versions prior to 0.4.0 manage metadata in JSON format, so S3 alone was enough to store it and preserve its structure. However, once metadata is managed by an RDBMS, we will also need to write code for AWS RDS.

graykode avatar Oct 15 '20 05:10 graykode