r4ds icon indicating copy to clipboard operation
r4ds copied to clipboard

Primary key of `nycflights13::flights` and `nycflights13::weather`

Open jrnold opened this issue 5 years ago • 3 comments

Primary key of flights

In the chapter on Relational Data, section Keys, the text describes the flights table as having no primary key.

Sometimes a table doesn’t have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. For example, what’s the primary key in the flights table? You might think it would be the date plus the flight or tail number, but neither of those are unique:

However, (year, month, day, hour, carrier, flight) is a primary key.

library("tidyverse")
library("nycflights")
#> Error in library("nycflights"): there is no package called 'nycflights'

flights %>% 
  count(year, month, day, hour, carrier, flight) %>% 
  filter(n > 1)
#> Error in eval(lhs, parent, parent): object 'flights' not found

Created on 2019-01-19 by the reprex package (v0.2.1)

Primary key of weather

The text mentions (year, month, day, hour, origin) as possibly being a primary key for weather but shows in the code that it is not. However, there is no discussion in prose of this failure. Are the duplicated observations intentional?

weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1)
#> # A tibble: 3 x 6
#>    year month   day  hour origin     n
#>   <dbl> <dbl> <int> <int> <chr>  <int>
#> 1  2013    11     3     1 EWR        2
#> 2  2013    11     3     1 JFK        2
#> 3  2013    11     3     1 LGA        2

Problems with nycflights database diagam

In either, case the relational diagram is incorrect: image

  • In flights, the gray shading indicates that (year, month, day, hour, flight) is a primary key, which it is not. The combination of (year, month, day, hour, flight, carrier) would work.
  • In weather, the gray shading of (year, month, day, hour, origin) indicates that it is a primary key, which it is not; with the counterexamples given earlier.

I'm not sure about the relationship between flights and weather. The arrows indicate that (year, month, day, hour, origin) of flights is a foreign key of weather. However, for that relation to hold - those columns need to be a primary key of flights and all values in weather should have a match in flights. Neither of those are true. There are many hours in weather that don't have any flights.

library("tidyverse")
library("nycflights13")

anti_join(weather, flights,
          by = c("year", "month", "day", "hour", "origin")) %>%
  nrow()
#> [1] 6737

I maybe totally wrong on all this though. You may be using a looser definition of these, or another form of relation.

jrnold avatar Jan 20 '19 04:01 jrnold

I came to the Issues forum immediately after reading this part of the book.

It is a well-known fact among those familiar with air travel that each carrier numbers its flights itself. Therefore, the full flight number, with the carrier prefix, should be used to identify a flight departing on a given day worldwide. So (year, month, day, hour, carrier, flight) should be a primary key for flights.

As for weather, the duplicate issue only appears for one hour in the entire database (November 3, 2013, in the 1 AM hour), across all three airports, and there are no reports in the hour prior (12 AM hour). It is likely that the set of observations taken at 12 AM were mistakenly labeled "1 AM". So apart from this glitch in the data, the general rule that (year, month, day, hour, origin) identifies a unique observation still holds.

The last part about the relationship between flights and weather appears to be backwards. In principle, every flight should have a corresponding weather report from the departure airport for the date and time of departure. (year, month, day, hour, origin) identifies this weather report. And as stated above, (year, month, day, hour, origin) can be a primary key for weather (ignoring a data bug). That is what it means for (year, month, day, hour, origin) to be a foreign key to weather in flights.

TroyVan avatar Jul 19 '22 07:07 TroyVan

Hmmmm, daylight savings time ended on Nov 3 2013, so that's probably the source of the duplicates.

hadley avatar Aug 09 '22 18:08 hadley

It looks like you can use the time_hour column to resolve the ambiguity. It does look like the problem is that the year->hour columns don't contain information about whether or not daylight saving time is active, and that means you can have ambiguous times when the clocks rollback an hour.

library(dplyr)
library(nycflights13)

dups <- weather %>%
  filter(year == 2013, month == 11, day == 3, hour == 1) %>%
  select(year, month, day, hour, origin, time_hour)

dups$time_hour
#> [1] "2013-11-03 01:00:00 EDT" "2013-11-03 01:00:00 EST"
#> [3] "2013-11-03 01:00:00 EDT" "2013-11-03 01:00:00 EST"
#> [5] "2013-11-03 01:00:00 EDT" "2013-11-03 01:00:00 EST"

dups %>%
  mutate(is_dst = lubridate::dst(time_hour))
#> # A tibble: 6 × 7
#>    year month   day  hour origin time_hour           is_dst
#>   <int> <int> <int> <int> <chr>  <dttm>              <lgl> 
#> 1  2013    11     3     1 EWR    2013-11-03 01:00:00 TRUE  
#> 2  2013    11     3     1 EWR    2013-11-03 01:00:00 FALSE 
#> 3  2013    11     3     1 JFK    2013-11-03 01:00:00 TRUE  
#> 4  2013    11     3     1 JFK    2013-11-03 01:00:00 FALSE 
#> 5  2013    11     3     1 LGA    2013-11-03 01:00:00 TRUE  
#> 6  2013    11     3     1 LGA    2013-11-03 01:00:00 FALSE

Created on 2022-08-09 by the reprex package (v2.0.1)

DavisVaughan avatar Aug 09 '22 18:08 DavisVaughan