osm2pgsql icon indicating copy to clipboard operation
osm2pgsql copied to clipboard

Proposal for new data types

Open joto opened this issue 1 year ago • 6 comments

Several issues have come up related to data types that can be used in an define_table() command and how data is converted from Lua to osm2pgsql. In all these cases it is possible to use the generic sql_type mechanism of osm2pgsql to reach the intended goal, but it is a bit awkward, always needs explaining and exposes the user to possible errors which are hard to understand (cryptic error messages from COPY). It would be nice, if we can say: "Yes, osm2pgsql supports these common constructs out of the box".

Geography

Osm2pgsql has always supported the geometry datatype with its subtypes like Point, Linestring, etc. and the setting of the SRID. But it doesn't natively support geography data types with their variants. It is easy enough to work around this using the sql_type setting, but still a bit awkward. This has been discussed here.

Proposal: Create new datatypes geography, geography-point, geography-linestring, and so on. Default projection would be 4326. Generate an error if the projection is not valid for a geography data type.

Variant A: Also add geometry-point as alias for point geometry and so on, giving us a consistent naming scheme.

Implementation: Is trivial, just the new types need to be recognized and the different default for the projection. There is no need to write special WKB or so.

Identity

Unique IDs on tables are often useful. Sometimes they need to be generated. How to do this is documented in the manual. But this is a bit awkward and we always get questions on this and the use of the serial type.

Proposal: Create new types id2, id4, and id8 (with aliases smallid, id, and bigid) that create integer identity columns of the specified size with GENERATED ALWAYS AS IDENTITY. It will also set create_only to true and not_null, i.e. osm2pgsql will not try to fill this column.

Variant A: Use identity instead of id in the type names. Not so easy to confuse with int.

Variant B: Also automatically generate a unique index for all ID columns. Might be too "magic", and not every use case needs one.

Variant C: Add a sequence option, which allows setting the sequence name instead of creating a default one. For special use cases (say using the same ID space for several tables), users can do a CREATE SEQUENCE before running osm2pgsql and then refer to that.

Implementation: Only the new types need to be recognized and a few SQL templates extended. A bit more to do if a sequence can be set, but still easily done.

Arrays

PostgreSQL can store arrays of any type. We use this in osm2pgsql to store the list of member node IDs of a way, for instance. Currently you need to define them in Lua as sql_type = 'int8[]' etc. and then build the context as text yourself: nodes = '{' .. table.concat(object.nodes, ',') .. '}'. See also the question in discussions forum on this.

Proposal: Add new data types int-array and its variants for the different integer types and add an automatic conversion from Lua tables with integers. The user doesn't have to write the conversion any more and we can generate better error message if the data is invalid for the type.

Variant A: Allow scalar values in conversion, will result in a single-element array.

Variant B: Also add this for some other scalar types. The only ones that make sense are probably real and text.

Variant C: Use the PostgreSQL syntax int[] and its variants instead.

Variant D: Make this a general feature that works for all data types, i.e. adding [] makes any type into an array.

Implementation: Add the new data types and new conversion functions. Depends on the variant how much work this will be.

joto avatar Dec 09 '24 16:12 joto

My 2 cents:

Geography: Implementation is easy but the type definition looks somewhat hacky to me. I'd leave that be for now, as it is rather niche.

** Identity **: Shouldn't that be rather a sub-function of ids, i.e. it is another ID column next to the one based on OSM IDs. Maybe we need to rethink here how rows are identified.

** Arrays**: automatic array building from Lua tables sounds useful. I'd go for variant D and error out on types, where Posgresql doesn't support arrays.

lonvia avatar Jan 06 '25 09:01 lonvia

Regarding the ids column: I don't think this is something that has to do with ids, because columns defined with ids are the ones that osm2pgsql uses for its own needs (when updating the data). They must be tied to something osm2pgsql knows, i.e. the OSM object type/id or a tile x/y for generalized data. Creating unique ids is not useful for osm2pgsql, it is only used downstream.

joto avatar Jan 06 '25 10:01 joto

I like the idea of being able to add IDENTITY columns. I like Variant B because I don't see the point of an identity column without a UNIQUE constraint, or even PK. It seems if you're okay with the overhead of adding a surrogate key, you'd take it the next step too.

I have a hard time imagining where variant C has much value. That seems like a pretty niche edge case and I suspect any instances with that setup already have code to do exactly what they need. If Postgres had IDENTITY support long ago I don't think sequences would ever have been talked about.

rustprooflabs avatar Jan 15 '25 03:01 rustprooflabs

Hello,

I liked to be able to define a column in a table giving the ratio "cartesian distance" / "real = spheroid distance" on a GIS object.

My use case: Based on OSM + 0sm2pgsql I calculate "global" pseudo-tags for a bike-routing project. This is very fast using srid 3857, but n some situations distances have to be corrected depending on the latitude.

  • I tested using geography type (works well with osm2pgsql!) but the computing-times are at least 10 higher
  • So in Postgis I currently add manually a column a.e. in the lines table with this code: ........................... select osm_id, highway, maxspeed, way, waterway, li.natural, width, oneway, st_length(way) / st_length(ST_Transform(way,4326)::geography) as merca_coef into table lines_bis from lines li; select now(); drop table lines; alter table lines_bis rename to lines; create index lines_osm_id_idx on lines(osm_id) WITH (fillfactor='100'); CREATE INDEX lines_way_idx ON public.lines USING gist (way) WITH (fillfactor='100'); analyze lines; ....................................

by need, the column merca_coef is then used in the sql´s, very much faster as working in geography mode.
(distance-real = distance(3857) / merca_coef)

Could it become possible to add on a way clolumn an associated "merca_coef"? (so defining a kind of "scale" related to the way)

Thank for the very good programm and regards

EssBee59 avatar Mar 17 '25 10:03 EssBee59

@EssBee59 Please do not reuse an existing issue for something unrelated, open a new discussion instead.

For your problem: You can not run SQL queries on import, that's not possible with osm2pgsql and most likely will never be, because we are using the COPY command. Have a look at the chapter on geometry objects in the manual. Maybe you can use the length and transform functions to do what you need. If not, create a real column in your table and use the create_only setting. Then after running osm2pgsql use an SQL UPDATE merca_coef = ... WHERE merca_coef IS NULL to fill in that column.

joto avatar Mar 17 '25 13:03 joto

Arrays

Array datatypes would be useful and reduces boilerplate code in Lua Osm2Psql import scripts.

Both numeric and string array content is useful. Numeric for referencing collections of OSM IDs, and string arrays for ;-split OSM tag values.

For example I use

osm2pgsql.define_table({
  # ...
  columns = {
    # ...
    { column = 'operator', sql_type = 'text[]' },
  }
})

with pre-generated SQL array syntax to fill the column.

Having specific int8[] and text[] types that automatically convert Lua tables to valid (escaped!) SQL in the correct type would be useful. I use a conversion function like

function to_sql_array(items)
  -- Put the items in a table into a raw SQL array string (quoted and comma-delimited)
  if not items then
    return nil
  end

  local result = '{'

  for index, item in ipairs(items) do
    if index > 1 then
      result = result .. ','
    end

    -- Raw SQL array syntax
    result = result .. "\"" .. item:gsub("\\", "\\\\"):gsub("\"", "\\\"") .. "\""
  end

  return result .. '}'
end

For the proposal on the implementation variants: Variant D sounds most general and easy to use. The implementation could accept only tables, and perform the same conversion for each item into the target SQL type. This would be obvious to Osm2Psql users because the array types would have the same conversion behaviour as singular types.

I expect integrating array-type-support into Osm2Psql would also be more performant, to avoid allocating many strings just to convert a Lua table into an SQL string. Streaming this directly from the input table to Postgres should (theoretically) lower allocations.

hiddewie avatar Jul 27 '25 14:07 hiddewie