
Reduce number of variables

Open michellemho opened this issue 7 years ago • 11 comments

I think there are too many variables in the Data Observatory. This makes searching, organizing, and navigating the measurements nearly impossible with the systems we have now (the Catalog and the Builder UI). I think we should consider reducing the number of variables to just a few hundred key variables per country to ensure quality before adding more quantity.

There are thousands of detailed variables that are unnecessary. Brazil and Australia are especially bad. For example, there is a variable in Brazil for "Daughters of 38 year-old Guardian and Spouse" (br.data.Pessoa07_V039) ... this is too much detail. There are thousands more that are similar. Just look at how massive this page is! http://cartodb.github.io/bigmetadata/br/age_gender.html

michellemho avatar Jun 14 '17 20:06 michellemho

This reduction issue should be fixed before tackling the names and descriptions (described in this issue: https://github.com/CartoDB/bigmetadata/issues/168).

michellemho avatar Jun 14 '17 20:06 michellemho

I'll need more feedback on this, but I believe users are most interested in getting total population, population breakdowns by age and gender, median income, and maybe just a handful of other variables.

John hand-curated the variables available for the United States from the American Community Survey (ACS). There are thousands of variables to choose from, but he included only a few hundred. These names and descriptions were manually written into the ETL process in the ColumnsTask of the acs.py file.

michellemho avatar Jun 14 '17 20:06 michellemho

Yep, we can probably begin with a set of generic values that are available for most of the countries and leave the most specific ones aside. In the future, we might want to let users add those to their accounts only if they're interested in them.

juanignaciosl avatar Jun 15 '17 05:06 juanignaciosl

@juanignaciosl @saleiva @stuartlynn @ethervoid @javitonino @kevin-reilly This is my first pass at a list of bare minimum variables + key variables for the Data Observatory. The main purposes are: 1) consistency across countries and 2) better names and descriptions. When we start adding in censuses, we should align as closely as possible to these variables first. I'll use this list to "prune" the existing Data Observatory (especially Australia, Canada, and Brazil). If other variables exist and are available, they should be added on an ad-hoc basis by country.

Minimum demographic variables for all countries:

Age & Gender

  • Total Population
  • Male Population
  • Female Population
  • Population by age groups (varies country to country)
  • Population by age groups and gender (varies country to country)

Households (Families)

  • Total households

Housing

  • Total housing units

Key demographic variables (availability varies by country)

Households (Families)

  • Average household size
  • Number of households by size
  • Number of people by marriage status (single, married, divorced, separated, widowed)
  • Number of households or families with children

Housing

  • Occupied housing
  • Owner-occupied housing
  • Renter-occupied housing
  • Vacant housing
  • Number of housing units by type (apartment, semi-attached, etc.)
  • Number of housing units by year built
  • Number of housing units by size (1 bedroom, 2 bedroom, etc.)

Income

  • Median household income
  • Number of people in poverty or receiving public assistance

Employment

  • Economically active population
  • Employed population
  • Unemployed population
  • Economically inactive population

Education

  • Number of enrolled students by level
  • Number of people by educational attainment

Nationality

  • Population by place of birth

Race and Ethnicity

  • Population by race and ethnicity groups

Religion

  • Number of people by religion

Language

  • Number of people by language spoken at home

Commerce & Economy

  • Number of businesses by industry

Health

  • Life expectancy
  • Birth rate
  • Death or mortality rate
  • Number of people with health insurance by type

michellemho avatar Jun 26 '17 19:06 michellemho

One thing @stuartlynn and I just spoke about was introducing a "public" tag to the data in the DO. The dataset Michelle lists above would be "public" and everything else would be private. This would allow us to ingest any data we may want for internal purposes but only publish certain sets to Builder users.

This would improve UI performance and, I think, allow us to do some of the smarter filtering we wanted to do.
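To make the "public" tag idea concrete, here is a minimal sketch. It assumes a hypothetical `Measurement` record with a boolean `public` field; neither the class nor the field name comes from the bigmetadata code, and the real DO metadata schema may look quite different.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Measurement:
    # Hypothetical record for illustration only.
    id: str
    name: str
    public: bool = False  # private unless explicitly published


def visible_to_builder(measurements: Iterable[Measurement]) -> List[Measurement]:
    """Return only the measurements tagged as public."""
    return [m for m in measurements if m.public]


# Both ids below appear earlier in this thread; the tagging is made up.
catalog = [
    Measurement("us.census.acs.B01003001", "Total Population", public=True),
    Measurement("br.data.Pessoa07_V039",
                "Daughters of 38 year-old Guardian and Spouse"),
]
public_ids = [m.id for m in visible_to_builder(catalog)]
print(public_ids)
```

Defaulting `public` to `False` means anything newly ingested stays internal until someone deliberately publishes it.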

(cc: @noguerol)

kevin-reilly avatar Jun 26 '17 21:06 kevin-reilly

Do we have any data or log about the actual consumption of measurements?

saleiva avatar Jun 27 '17 10:06 saleiva

@saleiva AFAIK we don't have metrics about the consumption of measurements. We have to add them to the DS metrics in order to be able to query them.

As a temporary solution, we could go through the named maps in Redis and gather the data from the analysis config.
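A rough sketch of the counting step, assuming the named map templates have already been pulled out of Redis as JSON strings. The `measurement` key name and the shape of the sample configs are assumptions for illustration, not the actual analysis config format.

```python
import json
from collections import Counter
from typing import Any, Iterable, Iterator

# Assumption: the measurement id is stored under this key in the config.
MEASUREMENT_KEY = "measurement"


def iter_measurement_ids(node: Any) -> Iterator[str]:
    """Recursively yield every string stored under MEASUREMENT_KEY."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == MEASUREMENT_KEY and isinstance(value, str):
                yield value
            else:
                yield from iter_measurement_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_measurement_ids(item)


def count_measurement_hits(raw_configs: Iterable[str]) -> Counter:
    """Count how often each measurement id appears across all configs."""
    hits: Counter = Counter()
    for raw in raw_configs:
        try:
            config = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip templates that are not valid JSON
        hits.update(iter_measurement_ids(config))
    return hits


# Made-up sample templates; real ones would come from Redis.
sample_configs = [
    '{"params": {"measurement": "us.census.acs.B01003001"}}',
    '{"analyses": [{"params": {"measurement": "es.ine.t1_1"}},'
    ' {"params": {"measurement": "us.census.acs.B01003001"}}]}',
    "not json",
]
hits = count_measurement_hits(sample_configs)
print(hits.most_common())
```

Walking the whole JSON tree avoids depending on exactly where the measurement node sits inside nested analyses.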

ethervoid avatar Jun 27 '17 10:06 ethervoid

As a stopgap, I've written a script and gathered some stats from the named maps in Redis. Here is a CSV file with the id, name, description, and number of uses of each measure in analyses in production.

The top five most used measures in analyses are (in descending order):

id          | us.census.acs.B01003001
hits        | 299
name        | Total Population
description | The total number of all people living in a given geographic area.  This is a very useful catch-all denominator when calculating rates.
id          | es.ine.t1_1
hits        | 229
name        | Total population
id          | us.census.acs.B19301001
hits        | 179
name        | Per Capita Income in the past 12 Months
description | Per capita income is the mean income computed for every man, woman, and child in a particular group. It is derived by dividing the total income of a particular group by the total population.
id          | us.census.acs.B19013001
hits        | 105
name        | Median Household Income in the past 12 Months
description | Within a geographic area, the median income received by every household on a regular basis before payments for personal income taxes, social security, union dues, medicare deductions, etc.  It includes income received from wages, salary, commissions, bonuses, and tips; self-employment income from own nonfarm or farm businesses, including proprietorships and partnerships; interest, dividends, net rental income, royalty income, or income from estates and trusts; Social Security or Railroad Retirement income; Supplemental Security Income (SSI); any cash public assistance or welfare payments from the state or local welfare office; retirement, survivor, or disability benefits; and any other sources of income received regularly such as Veterans' (VA) payments, unemployment and/or worker's compensation, child support, and alimony.
id          | us.census.acs.B23025004
hits        | 76
name        | Employed Population
description | The number of civilians 16 years old and over in each geography who either (1) were "at work," that is, those who did any work at all during the reference week as paid employees, worked in their own business or profession, worked on their own farm, or worked 15 hours or more as unpaid workers on a family farm or in a family business; or (2) were "with a job but not at work," that is, those who did not work during the reference week but had jobs or businesses from which they were temporarily absent due to illness, bad weather, industrial dispute, vacation, or other personal reasons. Excluded from the employed are people whose only activity consisted of work around the house or unpaid volunteer work for religious, charitable, and similar organizations; also excluded are all institutionalized people and people on active duty in the United States Armed Forces.

Hope this helps give a better picture of the current status of the DO. Happy weekend!

// @saleiva @juanignaciosl @noguerol for awareness

ethervoid avatar Jun 30 '17 21:06 ethervoid

Another round of data:

Top five:

id,name,hits
us.census.acs.B01003001,Total Population,331
es.ine.t1_1,Total population,274
us.census.acs.B19301001,Per Capita Income in the past 12 Months,192
us.census.acs.B19013001,Median Household Income in the past 12 Months,136
us.census.acs.B23025004,Employed Population,86

Complete file is here
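For a quick comparison of the two rounds, something like the following sketch would work. The numbers are copied from the top-five lists posted in this thread; the function and variable names are just illustrative.

```python
import csv
import io

# Hit counts from the first round (Jun 30 '17), copied from this thread.
JUNE_HITS = {
    "us.census.acs.B01003001": 299,
    "es.ine.t1_1": 229,
    "us.census.acs.B19301001": 179,
    "us.census.acs.B19013001": 105,
    "us.census.acs.B23025004": 76,
}

# Second round (Oct 16 '17), in the same CSV layout as posted above.
OCTOBER_CSV = """id,name,hits
us.census.acs.B01003001,Total Population,331
es.ine.t1_1,Total population,274
us.census.acs.B19301001,Per Capita Income in the past 12 Months,192
us.census.acs.B19013001,Median Household Income in the past 12 Months,136
us.census.acs.B23025004,Employed Population,86
"""


def load_hits(csv_text: str) -> dict:
    """Parse an id,name,hits CSV into a {measurement id: hits} mapping."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["id"]: int(row["hits"]) for row in reader}


october_hits = load_hits(OCTOBER_CSV)

# Increase in hits between the two rounds for each measure.
growth = {mid: october_hits[mid] - JUNE_HITS[mid] for mid in JUNE_HITS}

for mid, delta in sorted(growth.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{mid}: +{delta}")
```

The same `load_hits` helper could be pointed at the complete CSV file to rank every measure, not just the top five.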

ethervoid avatar Oct 16 '17 15:10 ethervoid

@ethervoid that's awesome.

Is it possible to get the full list? Those are the ones that I would guess are most used but the ones further down the list would be interesting as well.

stuartlynn avatar Oct 17 '17 14:10 stuartlynn

Ah sorry, my bad. I didn't see the CSV file

stuartlynn avatar Oct 17 '17 14:10 stuartlynn