bety icon indicating copy to clipboard operation
bety copied to clipboard

dbfiles: Add constraints for two candidate keys

Open gsrohde opened this issue 10 years ago • 16 comments

  • [x] Correct uniqueness violations
  • [x] Correct not-NULL violations
  • [ ] Add uniqueness constraints
  • [x] Decide what, if any, normalization and value constraints to use
    • no whitespaces, file paths / names should look like file paths / names (e.g. implement regex)
  • [ ] Add normalization and/or value constraints - READY TO BE DONE
  • [ ] Add not-NULL constraints - READY TO BE DONE

The two proposed uniqueness constraints in lines 1300-1301 of the gist at https://gist.github.com/dlebauer/59d9e44ec29838259fa4 are

ALTER TABLE dbfiles ADD CONSTRAINT unique_filename_and_path_per_machine UNIQUE (file_name, file_path, machine_id);
ALTER TABLE dbfiles ADD UNIQUE (container_type, container_id);

(I've left out quotes and schema qualifiers where unneeded.)

There are currently no violations of the first constraint on ebi_production, as this query returns 0 rows:

SELECT ARRAY_AGG(id), file_name, file_path, machine_id FROM dbfiles GROUP BY file_name, file_path, machine_id HAVING COUNT(*) > 1;

There are, however, many violations of the second constraint:

SELECT ARRAY_AGG(id), container_type, container_id FROM dbfiles GROUP BY container_type, container_id HAVING COUNT(*) > 1;

returns 65 rows, some representing as many as 26 rows of the dbfiles table. This leads me to question whether this is really the intended candidate key, or whether an additional column was meant to be included.

Normalization and/or value constraints

These queries:


SELECT file_name FROM dbfiles WHERE file_name ~ ' \s';
SELECT file_path FROM dbfiles WHERE file_path ~ '\s';

show that currently, no file_name or file_path value contain whitespace. If we wish to keep it this way, we can add the following check constraints:

ALTER TABLE dbfiles ADD CONSTRAINT no_space_in_file_name CHECK (NOT file_name ~ '\s');
ALTER TABLE dbfiles ADD CONSTRAINT no_space_in_file_path CHECK (NOT file_path ~ '\s');

To-do: Decide if these should be constraints or if perhaps we should use even more stringent constraints. We could check file_path against some regular expression, for example, to ensure that it really looks like a file path. For example, currently all file_path values match '^([a-z.-]+:)?//?([.a-z0-9]+/)+[.A-Za-z0-9_-]*/?$'. (SELECT file_path FROM dbfiles WHERE NOT file_path ~ '^([a-z.-]+:)?//?([.a-z0-9]+/)+[.A-Za-z0-9_-]*/?$'; returns zero rows.)

The other textual field involved in the uniqueness constraints is container_type. It currently has one of three values:

select distinct container_type from dbfiles;
 container_type 
----------------
 Model
 Posterior
 Input
(3 rows)

If this set is relatively stable, we could either (1) define an ENUM type to use for this column; (2) add the following check constraint:

ALTER TABLE dbfiles ADD CONSTRAINT valid_container_type CHECK (container_type IN ('Model', 'Posterior', 'Input'));

Not-NULL violations

The following queries show that only file_path and container_id columns sometimes contain NULL:

SELECT * FROM dbfiles WHERE file_name IS NULL;
SELECT * FROM dbfiles WHERE file_path IS NULL;
SELECT * FROM dbfiles WHERE machine_id IS NULL;
SELECT * FROM dbfiles WHERE container_type IS NULL;
SELECT * FROM dbfiles WHERE container_id IS NULL;

If there are cases where file_path is not relevant, the empty string could be used instead:

UPDATE dbfiles SET file_path = '' WHERE file_path IS NULL;

Non-NULL constraints


ALTER TABLE dbfiles ALTER COLUMN file_name SET NOT NULL;
ALTER TABLE dbfiles ALTER COLUMN file_path SET NOT NULL;
ALTER TABLE dbfiles ALTER COLUMN machine_id SET NOT NULL;
ALTER TABLE dbfiles ALTER COLUMN container_type SET NOT NULL;
ALTER TABLE dbfiles ALTER COLUMN container_id SET NOT NULL;

gsrohde avatar Jan 23 '15 23:01 gsrohde

  • I replaced null values of file_path in dbfiles with an empty string

@robkooper @mdietze

  • the remaining null container_id violation here is https://www.betydb.org/dbfiles/215 -
  • some of the records in dbfiles with a file path that starts /usr/local/ebi/paperclip/files/ have auto-generated file names that are included at the end of the file_path field. The file_name field contains readable text, but does not actually refer to the file name.
    • for dbfiiles.id = 215, the file_name is umb.clm and the file.path is /usr/local/ebi/paperclip/files/8e0/fe0/cd5/8e0fe0cd540da70c1803e9d001d5ff55.
more /usr/local/ebi/paperclip/files/8e0/fe0/cd5/8e0fe0cd540da70c1803e9d001d5ff55
  • are we actively using files in the paperclip directory? Should these perhaps be renamed and placed somewhere else?
  • note that umb.clm and US-UMB.CLM (dbfiles.id = 215 and 209) are similarly named but have different contents

dlebauer avatar Jan 26 '15 18:01 dlebauer

id 215: It is unclear to me from this message what the "violation" is in this record. It has a machine, file path, and file name. Is there something wrong with the values for any of these?

Paperclip: Yes, we still want those files. I'm fine if you migrate them to another location on the server, so long as you update all the file paths, but is that really worth the effort? (i.e. is the paperclip folder a problem somehow?)

UMB clim files: why is having two different files having SIMILAR (but different) names a problem?

mdietze avatar Jan 26 '15 20:01 mdietze

It is unclear to me from this message what the "violation" is in this record.

dbfiles.id = 215 has no container_id, so it violates a not-null constraint

Paperclip: Yes, we still want those files

The issue with the paperclip files is that the values in the file_path and file_name fields are incorrect. Paperclip gave a random name to each file, though this name is currently under `file_path This could be corrected by either a) renaming the file or b) updating the records, e.g.

update dbfiles 
  set file_name = '8e0fe0cd540da70c1803e9d001d5ff55', 
  file_path = '/usr/local/ebi/paperclip/files/8e0/fe0/cd5/8e0fe0cd540da70c1803e9d001d5ff55',
  updated_at = now() 
where id = 215

This should be okay, as long as the dbfiles.file_name field is not used to identify the file. If we use the 'inputs.name' field or other information to identify the file.

UMB clim files: why is having two different files having SIMILAR (but different) names a problem?

Its not a problem - I was only noting that these are not duplicates.

dlebauer avatar Jan 26 '15 21:01 dlebauer

id 215: Went through all the input records and I believe that the record associated with this was probably deleted so this dbfile record can be deleted

paperclip: I didn't realize the last bit in the PATH was the file name, not part of the path (that's not how that record is supposed to be used). I think it would be much better to change the file name back to the original (what's in the filename field) than to change the database record to the same random string that @mulroony generated. I think Pat's original code was going to give the user the file back with the original name when the file was requested.

mdietze avatar Jan 26 '15 21:01 mdietze

delete from dbfiles where id = 215;

files copied to /home/share/data/dbfiles/ and database updated (see gist)

@mdietze the /usr/local/ebi/paperclip/ does indeed have two directories - one file_names and the other files. There is also raws. In any case, redundant with what we have in dbfiles, and obsolete (the paperclip gem moved on).

dlebauer avatar Jan 26 '15 22:01 dlebauer

@gsrohde this should be ready to implement. I've updated the tasks

dlebauer avatar Jan 26 '15 22:01 dlebauer

@dlebauer Are we dispensing with this constraint:

ALTER TABLE dbfiles ADD UNIQUE (container_type, container_id);

As I mentioned, there are many violations.

gsrohde avatar Jan 27 '15 16:01 gsrohde

I somehow missed that you were adding the constraint that the container_type and container_id pair be unique. That constraint SHOULD BE DROPPED since it counter to the database design. The design of the dbfiles table REQUIRES that these be non-unique since they point of the dbfiles table is to allow a 1:many relationship between the "container" and the individual files. For example, a single Input record may refer to many instances of the same identical individual file, which, for example, may be located on different machines. Likewise, an individual Input record may refer to multiple different files.

mdietze avatar Jan 27 '15 17:01 mdietze

@dlebauer Is there ever a case where the same file will be referenced more than one in this table, such as the same file stored in two different locations? If not, we can also require uniqueness for md5.

gsrohde avatar Feb 09 '15 16:02 gsrohde

md5 won't be unique because the files won't be unique (indeed md5 was added to ensure that two files are identical, not unique). The same files will often be stored in multiple locations. I can't see any sensible uniqueness constraints on the dbfiles table other than two rows should not be identical in all entries. Is that even worth the effort?

mdietze avatar Feb 09 '15 17:02 mdietze

The first column set mentioned -- (file_name, file_path, machine_id) -- should be unique. Unless perhaps a given file can be associated with multiple containers.

gsrohde avatar Feb 10 '15 15:02 gsrohde

that sounds reasonable, though file_name should be able to be either NULL or default to blank

mdietze avatar Feb 10 '15 15:02 mdietze

What does it mean for file_name to be blank or NULL? There currently is only one case of this.

gsrohde avatar Feb 12 '15 06:02 gsrohde

if it is a directory that is referred to the filename can be blank.

robkooper avatar Feb 12 '15 13:02 robkooper

So if there is more than one file in that directory, how will you know which one it is? Or will it just be assumed there is exactly one file in the directory if file_name is blank? More to the point, what is the use case for leaving file_name blank?

gsrohde avatar Feb 13 '15 04:02 gsrohde

The entry would be referring to ALL the files in the directory, not a single file. Even when file_name isn't blank it is often referring to multiple files and file_name is actually just the common prefix to all the files. A good example would be a GIS shapefile, which is actually 4 files, each with the same prefix but with different extensions. Another common example in PEcAn is meteorology, where a folder might contain files that look like PREFIX_YEAR.nc where there is a common prefix and extension but one file per year rather than the data being all in one file. A met folder might commonly contain a few years (Ameriflux), many decades (reanalysis), or even millennia (PalEON)

mdietze avatar Feb 13 '15 04:02 mdietze