Implement pandas categorical to domain

Open hildermesmedeiros opened this issue 1 year ago • 0 comments

FeatureLayer could have an option to create dataframe columns as categorical, using the coded-value domains found in the layer properties. For example:

layer.properties['fields'][17]

{'name': 'type_field',
 'type': 'esriFieldTypeString',
 'alias': 'Type:',
 'sqlType': 'sqlTypeOther',
 'length': 255,
 'nullable': True,
 'editable': True,
 'domain': {'type': 'codedValue',
  'name': 'cvd_type_field',
  'codedValues': [{'name': 'Type1', 'code':  1},
   {'name': 'Type2', 'code':  2},
   {'name': 'Type3', 'code':  3},
   {'name': 'Type4', 'code': 4},
   {'name': 'No Type', 'code': 5}]},
 'defaultValue': None}
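
For reference, a coded-value domain like the one above can be turned into a code-to-name lookup with plain Python. This is only a minimal sketch: `field` is a hand-copied stand-in for `layer.properties['fields'][17]`, and `coded_value_lookup` is a hypothetical helper, not something the arcgis package provides.

```python
# Stand-in for layer.properties['fields'][17], copied from the output above.
field = {
    "name": "type_field",
    "type": "esriFieldTypeString",
    "domain": {
        "type": "codedValue",
        "name": "cvd_type_field",
        "codedValues": [
            {"name": "Type1", "code": 1},
            {"name": "Type2", "code": 2},
            {"name": "Type3", "code": 3},
            {"name": "Type4", "code": 4},
            {"name": "No Type", "code": 5},
        ],
    },
}


def coded_value_lookup(field):
    """Return {code: name} for a coded-value domain, or {} for any other field."""
    domain = field.get("domain") or {}
    if domain.get("type") != "codedValue":
        return {}
    return {cv["code"]: cv["name"] for cv in domain.get("codedValues", [])}


lookup = coded_value_lookup(field)
print(lookup)
```

Looking entries up by key avoids relying on the order of the `name` and `code` keys in each coded value.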

Describe the solution you'd like

Using pd.Categorical reduces the memory overhead for large data. I'd like arcgis.features.FeatureLayer to have an option to handle it. The implementation seems feasible for online data, and for feature classes it could be done with the arcpy backend too.

It would be good to be able to get categorical columns when reading, and to write a feature class with a domain when the dataframe has categorical columns.
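
For the write direction, something along these lines could derive a domain definition from a categorical column. It is only a sketch under assumptions: `categorical_to_domain` is a hypothetical helper, each category is used as both the stored code and the label, and how the resulting dict would be applied to a feature class or a hosted layer is left out.

```python
import pandas as pd


def categorical_to_domain(series, domain_name):
    """Build a coded-value domain dict from a categorical column (hypothetical helper)."""
    if not isinstance(series.dtype, pd.CategoricalDtype):
        raise TypeError(f"{series.name!r} is not a categorical column")
    return {
        "type": "codedValue",
        "name": domain_name,
        # Assumption: the category value doubles as code and display name.
        "codedValues": [
            {"name": str(cat), "code": cat} for cat in series.cat.categories
        ],
    }


col = pd.Series(
    pd.Categorical(["Type1", "Type3", "Type1"], categories=["Type1", "Type2", "Type3"]),
    name="type_field",
)
domain = categorical_to_domain(col, "cvd_type_field")
print(domain)
```

Unused categories (here `Type2`) are kept, so the domain reflects the full set of allowed values rather than only the ones present in the data.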

Describe alternatives you've considered

from typing import Optional

import pandas as pd
from arcgis.features import FeatureLayer


def domain_to_categorical(feature_layer: FeatureLayer, sdf: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    # Query the layer only when no dataframe was passed in.
    if sdf is None or sdf.empty:
        sdf = feature_layer.query().sdf

    sdf = sdf.copy(deep=True)
    for field in feature_layer.properties['fields']:
        field_name = field['name']
        domain = field.get('domain')
        # Convert only coded-value domains whose field is present in the dataframe.
        if not domain or domain.get('type') != 'codedValue' or field_name not in sdf.columns:
            continue
        # The column stores the domain codes, so the codes become the categories.
        codes = [cv['code'] for cv in domain.get('codedValues', [])]
        codes = pd.Series(codes, dtype=sdf[field_name].dtype.name)
        sdf[field_name] = pd.Categorical(sdf[field_name], categories=codes)
    return sdf
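
The same idea can be tried without a live layer. The sketch below is a pure-pandas variant of the function above; `apply_coded_domains` and the sample `fields` list are made up for illustration and stand in for `feature_layer.properties['fields']`.

```python
import pandas as pd


def apply_coded_domains(fields, df):
    """Convert columns backed by a coded-value domain to categorical (illustrative)."""
    df = df.copy(deep=True)
    for field in fields:
        name = field["name"]
        domain = field.get("domain") or {}
        if domain.get("type") == "codedValue" and name in df.columns:
            # The column holds codes, so the codes are used as categories.
            codes = [cv["code"] for cv in domain["codedValues"]]
            df[name] = pd.Categorical(df[name], categories=codes)
    return df


# Hypothetical field list shaped like feature_layer.properties['fields'].
fields = [
    {
        "name": "type_field",
        "domain": {
            "type": "codedValue",
            "name": "cvd_type_field",
            "codedValues": [{"name": "Type1", "code": 1}, {"name": "Type2", "code": 2}],
        },
    },
    {"name": "other"},
]
df = pd.DataFrame({"type_field": [1, 2, 9], "other": ["a", "b", "c"]})
out = apply_coded_domains(fields, df)
print(out["type_field"].dtype)
print(out["type_field"].isna().sum())
```

One thing to keep in mind: `pd.Categorical` turns any value that is not in `categories` into a missing value (the `9` here), so it may be worth checking the data whenever a column's non-null count drops after conversion.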

Additional context

This would save some memory. As a simple comparison, even with small data we can already see memory usage going down (246.1 KB to 216.0 KB in the info() output below).

Normal

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1145 entries, 0 to 1144
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   objectid               1145 non-null   Int64         
 1   globalid               1145 non-null   string        
 2   r_survey_deviceid      1145 non-null   string        
 3   r_survey_id            1145 non-null   string        
 4   r_survey_user_name     1117 non-null   string        
 5   tipo_trabalho          1145 non-null   string        
 6   filtro_tipo_trabalho   1145 non-null   string        
 7   acesso_reabertura      117 non-null    string        
 8   tipo_geometria         1145 non-null   string        
 9   dhpl                   1144 non-null   string        
 10  furo                   10 non-null     string        
 11  alvo                   3 non-null      string        
 12  data_key               1145 non-null   string        
 13  largura_m              1141 non-null   Float64       
 14  comprimento_m          1141 non-null   Float64       
 15  area_m2                1141 non-null   Float64       
 16  tipo_abertura          1140 non-null   string        
 17  vegetacao              1116 non-null   string        
 18  abertura_status        1145 non-null   string        
 19  abertura_rec_obs       62 non-null     string        
...
 25  abertura_rec_datahora  762 non-null    datetime64[ns]
 26  SHAPE                  1145 non-null   geometry      
dtypes: Float64(3), Int64(1), datetime64[ns](3), geometry(1), string(19)
memory usage: 246.1 KB

Categorical

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1145 entries, 0 to 1144
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   objectid               1145 non-null   Int64         
 1   globalid               1145 non-null   string        
 2   r_survey_deviceid      1145 non-null   string        
 3   r_survey_id            1145 non-null   string        
 4   r_survey_user_name     1117 non-null   string        
 5   tipo_trabalho          1145 non-null   category      
 6   filtro_tipo_trabalho   1145 non-null   string        
 7   acesso_reabertura      117 non-null    string        
 8   tipo_geometria         1145 non-null   string        
 9   dhpl                   1144 non-null   string        
 10  furo                   10 non-null     string        
 11  alvo                   1 non-null      category      
 12  data_key               1145 non-null   string        
 13  largura_m              1141 non-null   Float64       
 14  comprimento_m          1141 non-null   Float64       
 15  area_m2                1141 non-null   Float64       
 16  tipo_abertura          1140 non-null   category      
 17  vegetacao              1116 non-null   category      
 18  abertura_status        1145 non-null   string        
 19  abertura_rec_obs       62 non-null     string        
...
 25  abertura_rec_datahora  762 non-null    datetime64[ns]
 26  SHAPE                  1145 non-null   geometry      
dtypes: Float64(3), Int64(1), category(4), datetime64[ns](3), geometry(1), string(15)
memory usage: 216.0 KB
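
The effect can also be reproduced on synthetic data, independent of any layer. This is a rough sketch with made-up labels and row counts. The exact numbers depend on the pandas version and the string dtype, so none are quoted here.

```python
import pandas as pd

# Made-up column: five distinct labels repeated many times.
labels = ["Type1", "Type2", "Type3", "Type4", "No Type"]
as_text = pd.Series(labels * 20_000, dtype="object")
as_category = as_text.astype("category")

# deep=True also counts the string payloads, not only the pointers to them.
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {text_bytes / 1024:.1f} KB")
print(f"category: {category_bytes / 1024:.1f} KB")
```

The saving grows with the number of rows per distinct value, because a categorical column stores each label once plus a small integer code per row.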

hildermesmedeiros, Jun 06 '23 15:06