bety
bety copied to clipboard
dbfiles: Add constraints for two candidate keys
- [x] Correct uniqueness violations
- [x] Correct not-NULL violations
- [ ] Add uniqueness constraints
- [x] Decide what, if any, normalization and value constraints to use
- no whitespaces, file paths / names should look like file paths / names (e.g. implement regex)
- [ ] Add normalization and/or value constraints - READY TO BE DONE
- [ ] Add not-NULL constraints - READY TO BE DONE
The two proposed uniqueness constraints in lines 1300-1301 of the gist at https://gist.github.com/dlebauer/59d9e44ec29838259fa4 are
ALTER TABLE dbfiles ADD CONSTRAINT unique_filename_and_path_per_machine UNIQUE (file_name, file_path, machine_id);
ALTER TABLE dbfiles ADD UNIQUE (container_type, container_id);
(I've left out quotes and schema qualifiers where unneeded.)
There are currently no violations of the first constraint on ebi_production, as this query returns 0 rows:
SELECT ARRAY_AGG(id), file_name, file_path, machine_id FROM dbfiles GROUP BY file_name, file_path, machine_id HAVING COUNT(*) > 1;
There are, however, many violations of the second constraint:
SELECT ARRAY_AGG(id), container_type, container_id FROM dbfiles GROUP BY container_type, container_id HAVING COUNT(*) > 1;
returns 65 rows, some representing as many as 26 rows of the dbfiles table. This leads me to question whether this is really the intended candidate key, or whether an additional column was meant to be included.
Normalization and/or value constraints
These queries:
SELECT file_name FROM dbfiles WHERE file_name ~ ' \s';
SELECT file_path FROM dbfiles WHERE file_path ~ '\s';
show that currently, no file_name or file_path value contain whitespace. If we wish to keep it this way, we can add the following check constraints:
ALTER TABLE dbfiles ADD CONSTRAINT no_space_in_file_name CHECK (NOT file_name ~ '\s');
ALTER TABLE dbfiles ADD CONSTRAINT no_space_in_file_path CHECK (NOT file_path ~ '\s');
To-do: Decide if these should be constraints or if perhaps we should use even more stringent constraints. We could check file_path against some regular expression, for example, to ensure that it really looks like a file path. For example, currently all file_path values match '^([a-z.-]+:)?//?([.a-z0-9]+/)+[.A-Za-z0-9_-]*/?$'. (SELECT file_path FROM dbfiles WHERE NOT file_path ~ '^([a-z.-]+:)?//?([.a-z0-9]+/)+[.A-Za-z0-9_-]*/?$'; returns zero rows.)
The other textual field involved in the uniqueness constraints is container_type. It currently has one of three values:
select distinct container_type from dbfiles;
container_type
----------------
Model
Posterior
Input
(3 rows)
If this set is relatively stable, we could either (1) define an ENUM type to use for this column; (2) add the following check constraint:
ALTER TABLE dbfiles ADD CONSTRAINT valid_container_type CHECK (container_type IN ('Model', 'Posterior', 'Input'));
Not-NULL violations
The following queries show that only file_path and container_id columns sometimes contain NULL:
SELECT * FROM dbfiles WHERE file_name IS NULL;
SELECT * FROM dbfiles WHERE file_path IS NULL;
SELECT * FROM dbfiles WHERE machine_id IS NULL;
SELECT * FROM dbfiles WHERE container_type IS NULL;
SELECT * FROM dbfiles WHERE container_id IS NULL;
If there are cases where file_path is not relevant, the empty string could be used instead:
UPDATE dbfiles SET file_path = '' WHERE file_path IS NULL;
Non-NULL constraints
ALTER TABLE dbfiles ALTER COLUMN file_name SET NOT NULL;
ALTER TABLE dbfiles ALTER COLUMN file_path SET NOT NULL;
ALTER TABLE dbfiles ALTER COLUMN machine_id SET NOT NULL;
ALTER TABLE dbfiles ALTER COLUMN container_type SET NOT NULL;
ALTER TABLE dbfiles ALTER COLUMN container_id SET NOT NULL;
- I replaced null values of file_path in dbfiles with an empty string
@robkooper @mdietze
- the remaining null container_id violation here is https://www.betydb.org/dbfiles/215 -
- some of the records in dbfiles with a file path that starts
/usr/local/ebi/paperclip/files/have auto-generated file names that are included at the end of the file_path field. The file_name field contains readable text, but does not actually refer to the file name.- for dbfiiles.id = 215, the file_name is
umb.clmand the file.path is/usr/local/ebi/paperclip/files/8e0/fe0/cd5/8e0fe0cd540da70c1803e9d001d5ff55.
- for dbfiiles.id = 215, the file_name is
more /usr/local/ebi/paperclip/files/8e0/fe0/cd5/8e0fe0cd540da70c1803e9d001d5ff55
- are we actively using files in the paperclip directory? Should these perhaps be renamed and placed somewhere else?
- note that umb.clm and US-UMB.CLM (dbfiles.id = 215 and 209) are similarly named but have different contents
id 215: It is unclear to me from this message what the "violation" is in this record. It has a machine, file path, and file name. Is there something wrong with the values for any of these?
Paperclip: Yes, we still want those files. I'm fine if you migrate them to another location on the server, so long as you update all the file paths, but is that really worth the effort? (i.e. is the paperclip folder a problem somehow?)
UMB clim files: why is having two different files having SIMILAR (but different) names a problem?
It is unclear to me from this message what the "violation" is in this record.
dbfiles.id = 215 has no container_id, so it violates a not-null constraint
Paperclip: Yes, we still want those files
The issue with the paperclip files is that the values in the file_path and file_name fields are incorrect. Paperclip gave a random name to each file, though this name is currently under `file_path This could be corrected by either a) renaming the file or b) updating the records, e.g.
update dbfiles
set file_name = '8e0fe0cd540da70c1803e9d001d5ff55',
file_path = '/usr/local/ebi/paperclip/files/8e0/fe0/cd5/8e0fe0cd540da70c1803e9d001d5ff55',
updated_at = now()
where id = 215
This should be okay, as long as the dbfiles.file_name field is not used to identify the file. If we use the 'inputs.name' field or other information to identify the file.
UMB clim files: why is having two different files having SIMILAR (but different) names a problem?
Its not a problem - I was only noting that these are not duplicates.
id 215: Went through all the input records and I believe that the record associated with this was probably deleted so this dbfile record can be deleted
paperclip: I didn't realize the last bit in the PATH was the file name, not part of the path (that's not how that record is supposed to be used). I think it would be much better to change the file name back to the original (what's in the filename field) than to change the database record to the same random string that @mulroony generated. I think Pat's original code was going to give the user the file back with the original name when the file was requested.
delete from dbfiles where id = 215;
files copied to /home/share/data/dbfiles/ and database updated (see gist)
@mdietze the /usr/local/ebi/paperclip/ does indeed have two directories - one file_names and the other files. There is also raws. In any case, redundant with what we have in dbfiles, and obsolete (the paperclip gem moved on).
@gsrohde this should be ready to implement. I've updated the tasks
@dlebauer Are we dispensing with this constraint:
ALTER TABLE dbfiles ADD UNIQUE (container_type, container_id);
As I mentioned, there are many violations.
I somehow missed that you were adding the constraint that the container_type and container_id pair be unique. That constraint SHOULD BE DROPPED since it counter to the database design. The design of the dbfiles table REQUIRES that these be non-unique since they point of the dbfiles table is to allow a 1:many relationship between the "container" and the individual files. For example, a single Input record may refer to many instances of the same identical individual file, which, for example, may be located on different machines. Likewise, an individual Input record may refer to multiple different files.
@dlebauer Is there ever a case where the same file will be referenced more than one in this table, such as the same file stored in two different locations? If not, we can also require uniqueness for md5.
md5 won't be unique because the files won't be unique (indeed md5 was added to ensure that two files are identical, not unique). The same files will often be stored in multiple locations. I can't see any sensible uniqueness constraints on the dbfiles table other than two rows should not be identical in all entries. Is that even worth the effort?
The first column set mentioned -- (file_name, file_path, machine_id) -- should be unique. Unless perhaps a given file can be associated with multiple containers.
that sounds reasonable, though file_name should be able to be either NULL or default to blank
What does it mean for file_name to be blank or NULL? There currently is only one case of this.
if it is a directory that is referred to the filename can be blank.
So if there is more than one file in that directory, how will you know which one it is? Or will it just be assumed there is exactly one file in the directory if file_name is blank? More to the point, what is the use case for leaving file_name blank?
The entry would be referring to ALL the files in the directory, not a single file. Even when file_name isn't blank it is often referring to multiple files and file_name is actually just the common prefix to all the files. A good example would be a GIS shapefile, which is actually 4 files, each with the same prefix but with different extensions. Another common example in PEcAn is meteorology, where a folder might contain files that look like PREFIX_YEAR.nc where there is a common prefix and extension but one file per year rather than the data being all in one file. A met folder might commonly contain a few years (Ameriflux), many decades (reanalysis), or even millennia (PalEON)