Duplicated variable columns in geospatial data

Open pitkant opened this issue 2 years ago • 2 comments

Currently different geospatial datasets have the following columns:

year / variable 2003 2006 2010 2013 2016 2021
id x x x x x x
LEVL_CODE x x x x x x
NUTS_ID x x x x x x
CNTR_CODE x x x x x x
NAME_LATN x x x x x
NUTS_NAME x x x x x x
MOUNT_TYPE x x
URBN_TYPE x x
COAST_TYPE x x
FID x x x x x x
geometry x x x x x x
geo x x x x x x

Of these, at least in years 2016 and 2021, the following variables contain identical information: id, NUTS_ID, FID and geo. The id column is the unique identifier from the GeoJSON file and is not included in the CSV file. The geo column is generated at the end of get_eurostat_geospatial "for easier joins with dplyr", as well as in the data generation script data_spatial.R.
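A quick way to check this kind of overlap, sketched in base R. The `nuts` data frame below is a made-up stand-in for the attribute table of a geospatial dataset, not real output, and `same_as_nuts_id` is just an illustrative name:

```r
# Toy stand-in for the attribute columns of a geospatial dataset.
# The values are invented for illustration only.
nuts <- data.frame(
  id        = c("AT", "BE", "FI"),
  NUTS_ID   = c("AT", "BE", "FI"),
  FID       = c("AT", "BE", "FI"),
  geo       = c("AT", "BE", "FI"),
  CNTR_CODE = c("AT", "BE", "FI"),
  stringsAsFactors = FALSE
)

# Candidate identifier columns to compare against NUTS_ID
id_cols <- c("id", "NUTS_ID", "FID", "geo")

# TRUE for each column whose values match NUTS_ID element by element
same_as_nuts_id <- vapply(
  nuts[id_cols],
  function(col) identical(as.character(col), as.character(nuts$NUTS_ID)),
  logical(1)
)
print(same_as_nuts_id)
```

Running the same comparison on the real objects for each year would show which of the four columns are truly redundant.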

While some of this overlap is due to the eurostat data itself containing duplicated columns, is the geo column still necessary?

pitkant avatar Jul 01 '22 14:07 pitkant

If this can be easily retrieved otherwise when needed (example, maybe?) then I guess it could also be removed.
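As a rough sketch of what such an example might look like, assuming geo really is just a copy of NUTS_ID: the data frames `nuts` and `stats` below are invented toy data, and base R `merge()` stands in for a dplyr join so that the snippet runs without extra packages.

```r
# Toy stand-ins: `nuts` mimics the geospatial attribute table without geo,
# `stats` mimics a statistics table keyed by geo. Values are invented.
nuts <- data.frame(
  NUTS_ID   = c("AT", "BE", "FI"),
  LEVL_CODE = c(0L, 0L, 0L),
  stringsAsFactors = FALSE
)
stats <- data.frame(
  geo    = c("AT", "BE", "FI"),
  values = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# Option 1: join directly on differently named key columns
joined <- merge(nuts, stats, by.x = "NUTS_ID", by.y = "geo")

# Option 2: recreate geo from NUTS_ID, then join on the shared name
nuts$geo <- nuts$NUTS_ID
joined2 <- merge(nuts, stats, by = "geo")
```

With dplyr, the first option would correspond to `left_join(nuts, stats, by = c("NUTS_ID" = "geo"))`. With the real sf objects the same idea should apply, though that is not tested here.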

antagomir avatar Jul 01 '22 14:07 antagomir

Partly addressed in the v4-dev branch and PR #264. The geo column is now marked as "Questioning" in the get_eurostat_geospatial function documentation, which gives us more time to discuss whether to remove it or keep it in the future.

I will close this issue when v4 is released.

pitkant avatar Aug 01 '23 08:08 pitkant