eurostat Duplicated variable columns in geospatial data

Open pitkant opened this issue 2 years ago • 2 comments

Currently different geospatial datasets have the following columns:

year / variable	2003	2006	2010	2013	2016	2021
id	x	x	x	x	x	x
LEVL_CODE	x	x	x	x	x	x
NUTS_ID	x	x	x	x	x	x
CNTR_CODE	x	x	x	x	x	x
NAME_LATN		x	x	x	x	x
NUTS_NAME	x	x	x	x	x	x
MOUNT_TYPE					x	x
URBN_TYPE					x	x
COAST_TYPE					x	x
FID	x	x	x	x	x	x
geometry	x	x	x	x	x	x
geo	x	x	x	x	x	x

Of these, at least in 2016 and 2021, the variables id, NUTS_ID, FID and geo contain identical information. The id column is the unique identifier from the GeoJSON file and is not included in the CSV file. The geo column is generated at the end of get_eurostat_geospatial "for easier joins with dplyr", and also in the data generation script data_spatial.R.
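To illustrate the overlap, here is a minimal sketch of how the duplication could be checked. The `nuts` data frame is a hypothetical toy stand-in with made-up rows, not real output; on real data it would be the attribute table returned by get_eurostat_geospatial.

```r
# Toy stand-in for the attribute table of a geospatial dataset.
# Column names follow the table above; the rows are illustrative only.
nuts <- data.frame(
  id        = c("FI", "SE", "DE"),
  NUTS_ID   = c("FI", "SE", "DE"),
  FID       = c("FI", "SE", "DE"),
  geo       = c("FI", "SE", "DE"),
  LEVL_CODE = c(0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Columns suspected to carry the same identifier
id_cols <- c("id", "NUTS_ID", "FID", "geo")

# For each candidate column, test whether it matches NUTS_ID exactly
same_as_nuts_id <- vapply(
  nuts[id_cols],
  function(col) identical(col, nuts$NUTS_ID),
  logical(1)
)
print(same_as_nuts_id)
```

If every element is TRUE for a given year, the columns are interchangeable as join keys in that dataset.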

While some of this overlap is due to the eurostat data itself containing duplicated columns, is the geo column still necessary?

Jul 01 '22 14:07 pitkant

If this can be easily retrieved otherwise when needed (example, maybe?) then I guess it could also be removed.
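As a possible example, here is a rough sketch of two ways a join could work without a pre-built geo column, assuming NUTS_ID holds the same codes. Both `shapes` and `stats` are hypothetical toy tables, not package output.

```r
library(dplyr)

# Toy geospatial attribute table without a geo column (illustrative rows)
shapes <- data.frame(
  NUTS_ID   = c("FI", "SE", "DE"),
  LEVL_CODE = c(0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Toy statistical table keyed by geo (illustrative values)
stats <- data.frame(
  geo    = c("FI", "SE"),
  values = c(5.5, 10.4),
  stringsAsFactors = FALSE
)

# Option 1: name the key pair explicitly in the join
joined <- left_join(shapes, stats, by = c("NUTS_ID" = "geo"))

# Option 2: recreate geo on demand, then join on the shared name
shapes_geo <- mutate(shapes, geo = NUTS_ID)
joined2 <- left_join(shapes_geo, stats, by = "geo")
```

Either way the user adds one short step, so the trade-off is convenience against a redundant column.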

Jul 01 '22 14:07 antagomir

Partly addressed in the v4-dev branch and PR #264. The geo column is now marked as "Questioning" in the get_eurostat_geospatial function documentation, which gives us more time to discuss whether to remove it or keep it in the future.

I will close this issue when v4 is released.

Aug 01 '23 08:08 pitkant
