arcgis-python-api
Implement pandas categorical to domain
FeatureLayer could have an option to create dataframe columns as categorical using the layer properties
```python
layer.properties['fields'][17]
{'name': 'type_field',
 'type': 'esriFieldTypeString',
 'alias': 'Type:',
 'sqlType': 'sqlTypeOther',
 'length': 255,
 'nullable': True,
 'editable': True,
 'domain': {'type': 'codedValue',
            'name': 'cvd_type_field',
            'codedValues': [{'name': 'Type1', 'code': 1},
                            {'name': 'Type2', 'code': 2},
                            {'name': 'Type3', 'code': 3},
                            {'name': 'Type4', 'code': 4},
                            {'name': 'No Type', 'code': 5}]},
 'defaultValue': None}
```
Describe the solution you'd like
Using pd.Categorical reduces the memory overhead for large data. I'd like arcgis.features.FeatureLayer
to have an option to handle it.
The implementation seems feasible for online data; for a feature class it can be done with the arcpy backend too.
It would be good to be able to read domain fields as categorical, and to write a feature class with a domain when the dataframe has a categorical column.
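For the write direction, a minimal sketch of turning a categorical column back into a coded-value domain definition could look like the following. `categorical_to_domain` is a hypothetical helper, not part of the arcgis API. It assumes each category is used as both the code and the display name, and it only builds a dictionary in the same shape as the `domain` entry shown above. Applying that definition to a feature class or layer is left out.

```python
import pandas as pd


def categorical_to_domain(column: pd.Series, domain_name: str) -> dict:
    """Build a coded-value domain dict from a categorical column (hypothetical helper)."""
    if not isinstance(column.dtype, pd.CategoricalDtype):
        raise TypeError(f"{column.name!r} is not categorical")
    coded_values = [
        # Assumption: the category serves as the code, its string form as the name.
        {"name": str(category), "code": category}
        for category in column.cat.categories.tolist()
    ]
    return {"type": "codedValue", "name": domain_name, "codedValues": coded_values}


# Stand-in column: categories 3 and 4 are declared but unused in the data.
types = pd.Series(
    pd.Categorical([1, 2, 2, 5], categories=[1, 2, 3, 4, 5]),
    name="type_field",
)
domain = categorical_to_domain(types, "cvd_type_field")
```

Because the domain is built from the declared categories rather than the observed values, unused categories would still end up in the domain.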
Describe alternatives you've considered
    from typing import Optional

    import pandas as pd
    from arcgis.features import FeatureLayer

    def domain_to_categorical(feature_layer: FeatureLayer, sdf: Optional[pd.DataFrame] = None) -> pd.DataFrame:
        """Return a copy of sdf with coded-value domain fields cast to categorical."""
        # Query the whole layer when no dataframe is supplied.
        if sdf is None or sdf.empty:
            sdf = feature_layer.query().sdf
        sdf = sdf.copy(deep=True)
        for field in feature_layer.properties['fields']:
            field_name = field['name']
            domain = field.get('domain')
            # Only coded-value domains on columns present in the dataframe apply.
            if not domain or domain.get('type') != "codedValue" or field_name not in sdf.columns:
                continue
            # The column stores domain codes, so the codes are the categories.
            codes = pd.Series([cv['code'] for cv in domain['codedValues']], dtype=sdf[field_name].dtype.name)
            sdf[field_name] = pd.Categorical(sdf[field_name], categories=codes)
        return sdf
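The same idea can be tried without a live layer. The sketch below uses a hand-written field dictionary in the shape of a `layer.properties['fields']` entry and a small stand-in dataframe; both are made up for illustration. It also shows one thing to watch for: pandas turns any value that is not among the categories into a missing value, so data that falls outside the domain would be dropped silently.

```python
import pandas as pd

# Field definition in the same shape as a layer.properties['fields'] entry.
field = {
    "name": "type_field",
    "domain": {
        "type": "codedValue",
        "name": "cvd_type_field",
        "codedValues": [
            {"name": "Type1", "code": 1},
            {"name": "Type2", "code": 2},
            {"name": "No Type", "code": 5},
        ],
    },
}

# Stand-in for a queried dataframe; 9 is deliberately not in the domain.
df = pd.DataFrame({"type_field": [1, 2, 9, 5]})

coded_values = field["domain"]["codedValues"]
codes = [cv["code"] for cv in coded_values]
df["type_field"] = pd.Categorical(df["type_field"], categories=codes)

# Values outside the domain become missing after the conversion.
out_of_domain = int(df["type_field"].isna().sum())

# Optionally swap the codes for their display names.
code_to_name = {cv["code"]: cv["name"] for cv in coded_values}
labels = df["type_field"].cat.rename_categories(code_to_name)
```

Checking `out_of_domain` before and after the conversion is a cheap way to notice rows whose stored value is not a valid domain code.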
Additional context
This would save some memory. In a simple comparison on a small dataset, memory usage already goes down (246.1 KB to 216.0 KB).
Normal
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1145 entries, 0 to 1144
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 objectid 1145 non-null Int64
1 globalid 1145 non-null string
2 r_survey_deviceid 1145 non-null string
3 r_survey_id 1145 non-null string
4 r_survey_user_name 1117 non-null string
5 tipo_trabalho 1145 non-null string
6 filtro_tipo_trabalho 1145 non-null string
7 acesso_reabertura 117 non-null string
8 tipo_geometria 1145 non-null string
9 dhpl 1144 non-null string
10 furo 10 non-null string
11 alvo 3 non-null string
12 data_key 1145 non-null string
13 largura_m 1141 non-null Float64
14 comprimento_m 1141 non-null Float64
15 area_m2 1141 non-null Float64
16 tipo_abertura 1140 non-null string
17 vegetacao 1116 non-null string
18 abertura_status 1145 non-null string
19 abertura_rec_obs 62 non-null string
...
25 abertura_rec_datahora 762 non-null datetime64[ns]
26 SHAPE 1145 non-null geometry
dtypes: Float64(3), Int64(1), datetime64[ns](3), geometry(1), string(19)
memory usage: 246.1 KB
Categorical
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1145 entries, 0 to 1144
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 objectid 1145 non-null Int64
1 globalid 1145 non-null string
2 r_survey_deviceid 1145 non-null string
3 r_survey_id 1145 non-null string
4 r_survey_user_name 1117 non-null string
5 tipo_trabalho 1145 non-null category
6 filtro_tipo_trabalho 1145 non-null string
7 acesso_reabertura 117 non-null string
8 tipo_geometria 1145 non-null string
9 dhpl 1144 non-null string
10 furo 10 non-null string
11 alvo 1 non-null category
12 data_key 1145 non-null string
13 largura_m 1141 non-null Float64
14 comprimento_m 1141 non-null Float64
15 area_m2 1141 non-null Float64
16 tipo_abertura 1140 non-null category
17 vegetacao 1116 non-null category
18 abertura_status 1145 non-null string
19 abertura_rec_obs 62 non-null string
...
25 abertura_rec_datahora 762 non-null datetime64[ns]
26 SHAPE 1145 non-null geometry
dtypes: Float64(3), Int64(1), category(4), datetime64[ns](3), geometry(1), string(15)
memory usage: 216.0 KB
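To get a feel for how the saving scales, a rough comparison on synthetic data could look like this. The values and row count are invented for illustration and are not taken from the layer above. `memory_usage(deep=True)` is used so the string payloads are counted; `DataFrame.info(memory_usage="deep")` reports the equivalent figure for a whole dataframe.

```python
import pandas as pd

# Synthetic stand-in for a domain-coded text column: few distinct values, many rows.
values = ["Type1", "Type2", "Type3", "Type4", "No Type"] * 20_000
as_string = pd.Series(values, dtype="string")
as_category = as_string.astype("category")

# deep=True includes the memory held by the string values themselves.
string_bytes = as_string.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"string:   {string_bytes / 1024:.1f} KB")
print(f"category: {category_bytes / 1024:.1f} KB")
```

The categorical column stores one small integer code per row plus a single copy of each distinct value, which is why the gap grows with the number of rows.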