
Default field size and decimal length when writing shapefiles

Open · karimbahgat opened this issue on Aug 29, 2017 · 2 comments

Due to recent changes since 1.2.10, the issue of field and value types has been raised as a concern by several users. Most recently, @klasko2 pointed out in #99 that saving a float value to an 'F' field will save it as an integer, because the default number of decimals is 0 when defining a new field. This raises a more general question for the next version of PyShp:

What should be the default field 'size' and 'decimal' for different field types?

I hope this thread can be used as a place for people to voice their concerns and share their experiences and expectations regarding shapefiles and dbf field types.
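To make the behaviour concrete, here is a minimal sketch (assuming the pyshp 2.x Writer API; the file and field names are only for illustration) of the situation @klasko2 describes, where the default decimal=0 drops the fractional part:

# minimal sketch, assuming the pyshp 2.x API; names are illustrative only
import shapefile

w = shapefile.Writer("decimal_demo", shapeType=shapefile.POINT)
w.field("VAL_DFLT", "F")              # size and decimal left at their defaults
w.field("VAL_6DEC", "F", decimal=6)   # decimal set explicitly
w.point(0, 0)
w.record(1.2345, 1.2345)
w.close()

r = shapefile.Reader("decimal_demo")
print(r.record(0))   # with decimal=0 the first value is stored without its fraction
r.close()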

The Issue

Until now, field size (i.e. how many bytes) has always been set to 50, and decimal always to 0. Instead, I think the case can be made that any numeric field should default to a decimal number. This leaves us with some open questions:

  1. ...what's a good default size number? Is 50 big enough to store most numbers that an average user would need and at the same time small enough to not waste filesize. For a negative decimal number, this could store a value as low as -100000000000000000000000000000000000000000000000.0, or as detailed as -0.000000000000000000000000000000000000000000000001 (provided the decimal arg is set accordingly)? That might actually seem excessively high for most users so perhaps it should be lowered to produce smaller shapefiles? What's the default in other software?
  2. ...what's a good default decimal number? Would 6 decimal places retain enough information for the average user not to feel they are losing information? This would mean floats being rounded to e.g. 0.123456. Perhaps this is too small, should it be instead 12 or 16? What's the default in other software?
  3. ...should size and decimal be the same for 'F' and 'N' fields? Float fields are decimals by definition, but Numeric fields can be either ints or floats. One might argue that both should default to decimal numbers, since defaulting to ints would result in lost information for unsuspecting users. Manually setting decimal=0 can be done if the user is certain they just want to save ints (a sketch of how size and decimal interact follows this list).
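To illustrate the size/decimal trade-off, here is a rough sketch of how a numeric value has to fit into a fixed-width character field. This is not pyshp's actual writing code, just an approximation of the DBF storage rule:

# rough approximation of DBF numeric storage: a right-justified ASCII string
# of `size` characters with `decimal` digits after the point
def format_numeric(value, size, decimal):
    if decimal:
        text = "%*.*f" % (size, decimal, value)
    else:
        text = "%*d" % (size, int(value))
    if len(text) > size:
        raise ValueError("%r does not fit in a field of size %d" % (value, size))
    return text

print(repr(format_numeric(0.123456789, 12, 6)))   # '    0.123457' -- rounded to 6 decimals
print(repr(format_numeric(1.9, 12, 0)))           # '           1' -- fraction silently lost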

For the remaining field types I think the following would be non-controversial:

  • Type 'C': size=80, decimal irrelevant. Text fields are typically longer than numeric fields, and I believe that's the default QGIS text field size. This would save text values as long as abcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcde.
  • Type 'L': size=1, decimal irrelevant.
  • Type 'D': size=8, decimal irrelevant.

Any and all thoughts are appreciated!

karimbahgat commented on Aug 29, 2017

I have no idea if this is still active, as it was created in 2017, but it's open and it looks like it was added to a milestone in 2022, so here we go:

First, I see from the docs that 'F' and 'N' are the same -- that is a pity; it would be really good to support an actual integer type. For example, we are writing shapefiles that have truly integer fields, but they are getting detected as Real by other software, for example OGR (GDAL) via Python:

print(feature)
OGRFeature(Test-Model2023-11-18T06):1
  Time (String) = 2023-11-18T06:00:00
  LE_id (Real) = 1
  Depth (Real) = 0
  Mass (Real) = 15
  Age (Real) = 43200
  Surf_Conc (Real) = 0.00164
  Status_Cod (Real) = 2
  POINT (-89.3187777840845 28.8072723576694)
print(feature.items())
{'Time': '2023-11-18T06:00:00',
 'LE_id': 1.0,
 'Depth': 0.0,
 'Mass': 15.0,
 'Age': 43200.0,
 'Surf_Conc': 0.00164,
 'Status_Cod': 2.0}

It would be really nice if the integers could come through as integers. I can convert in the client code, but that's a limitation in the discoverability of the data.

(In the above, 'Depth': 0.0 really should be a float, whereas 'LE_id': 1.0 should really be an integer -- as an ID, the type really matters.)
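The client-side conversion I mean is something like this (a hypothetical sketch; the list of ID fields has to be maintained by hand, which is exactly the discoverability problem):

# hypothetical workaround: coerce whole-number floats back to int, but only for
# fields we already know are identifiers -- the reader cannot discover this itself
ID_FIELDS = {"LE_id", "Status_Cod"}

def fix_types(items):
    return {k: (int(v) if k in ID_FIELDS and float(v).is_integer() else v)
            for k, v in items.items()}

print(fix_types({'LE_id': 1.0, 'Depth': 0.0, 'Mass': 15.0}))
# {'LE_id': 1, 'Depth': 0.0, 'Mass': 15.0}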

And making 'N' and 'F' be different would help with the defaults -- an integer would be an integer :-)

I haven't looked carefully at the DBF format yet -- the ESRI shapefile spec helpfully (not) just references the DBASE spec -- without even a link :-(

But according to Wikipedia:

""" Supported field types are: floating point (13 character storage), integer (4 or 9 character storage), date (no time storage; 8 character storage), and text (maximum 254 character storage) """

If this is correct, then:

what's a good default size number? Is 50

It doesn't sound like anything over 12 is supported anyway. But if you are correct that it's an option to go larger, maybe these values are reasonable defaults?

...what's a good default decimal number? Would 6 decimal places retain enough information for the average user not to feel they are losing information? This would mean floats being rounded to e.g. 0.123456.

This is a serious challenge -- there simply is no good default if you have to have a fixed number of decimal places; it depends on the order of magnitude of the number. Is actual floating point not an option? (That is: 1.234e10 and 1.234e-10 -- the same amount of precision, but a totally different number of places after the decimal point.) If it does have to be fixed, I think there should be no default -- it depends on what data you are trying to store, and only the person writing the data can know what's appropriate.
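A quick illustration of why a single decimal default can't work across magnitudes (plain Python formatting, nothing pyshp-specific):

# the same relative precision needs very different numbers of decimal places
for value in (1.234e10, 1.234, 1.234e-10):
    print("%.6f" % value)
# 12340000000.000000   -> fine, though wide
# 1.234000             -> fine
# 0.000000             -> all information lost at six decimal places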

(I've got to say -- it is really bad that we are so dependent on such an ancient file format! -- but what can you do?)

Perhaps this is too small, should it be instead 12 or 16? What's the default in other software?

Wait! Looking now at your docs, it seems it DOES support true floats.

e.g.:

>>> r = shapefile.Reader('shapefiles/test/dtype')
>>> assert r.record(0) == [1, 1.32, 1.3217328, -3.2302e-25, 1.3217328, ...

In that case, a C float is about 8 significant decimal digits, and a double about 16 -- a Python float is a C double, so 16 digits. So 8 or 16 digits would be reasonable defaults.

For integers, 64-bit ints need up to 20 characters, but those are really big; 32-bit ints are, I think, 10 digits, so that's not a bad default.
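For reference, the digit counts can be checked directly in Python (a small sketch, nothing shapefile-specific):

import sys

# a Python float is a C double: about 15-16 significant decimal digits
print(sys.float_info.dig)        # 15
# integer ranges: 10 digits for 32-bit, 19 digits (plus sign) for 64-bit
print(2**31 - 1)                 # 2147483647
print(2**63 - 1)                 # 9223372036854775807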

If I'm totally wrong here, do you have a pointer to the spec for the DBF format as used by shapefiles? I haven't been able to find it yet.

I did find this:

http://www.manmrk.net/tutorials/database/xbase/data_types.html#DATA_TYPES but it looks more extensive than what shapefiles support. But it does limit "N" to 18 chars.

ChrisBarker-NOAA commented on Nov 22, 2023

Looking a bit more, perhaps you could follow similar defaults, etc., to those of the OGR Shapefile writer:

(https://gdal.org/drivers/vector/shapefile.html)

Shapefile feature attributes are stored in an associated .dbf file, and so attributes suffer a number of limitations:

Attribute names can only be up to 10 characters long. The OGR Shapefile driver tries to generate unique field names. Successive duplicate field names, including those created by truncation to 10 characters, will be truncated to 8 characters and appended with a serial number from 1 to 99.

For example:

a → a, a → a_1, A → A_2;

abcdefghijk → abcdefghij, abcdefghijkl → abcdefgh_1

Only Integer, Integer64, Real, String and Date (not DateTime, just year/month/day) field types are supported. The various list, and binary field types cannot be created.

The field width and precision are directly used to establish storage size in the .dbf file. This means that strings longer than the field width, or numbers that don't fit into the indicated field format will suffer truncation.

Integer fields without an explicit width are treated as width 9, and extended to 10 or 11 if needed.

Integer64 fields without an explicit width are treated as width 18, and extended to 19 or 20 if needed.

Real (floating point) fields without an explicit width are treated as width 24 with 15 decimal places of precision.

String fields without an assigned width are treated as 80 characters.
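If those defaults were adopted, the equivalent pyshp field definitions would look roughly like this (a sketch, assuming the pyshp 2.x field() signature; the file and field names are made up):

import shapefile

# hypothetical mapping of the OGR default widths onto pyshp field definitions
w = shapefile.Writer("ogr_like", shapeType=shapefile.POINT)
w.field("INT_FLD",  "N", size=9,  decimal=0)    # OGR Integer: width 9
w.field("INT64FLD", "N", size=18, decimal=0)    # OGR Integer64: width 18
w.field("REAL_FLD", "N", size=24, decimal=15)   # OGR Real: width 24, precision 15
w.field("STR_FLD",  "C", size=80)               # OGR String: width 80
w.point(0, 0)
w.record(1, 123456789012345678, 1.234567890123456, "hello")
w.close()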

ChrisBarker-NOAA commented on Nov 22, 2023