node-mapnik icon indicating copy to clipboard operation
node-mapnik copied to clipboard

Datasource *sometimes* adds `null` attribute to a feature when it lacks an attribute another feature has

Open davidtheclark opened this issue 9 years ago • 23 comments
trafficstars

I'm noticing this oddity when I create a Datasource and iterate through the features.

Let's say feature FA in my GeoJSON has one attribute: foo: 1. And FB has one attribute: bar: 2. When I create a Datasource and check these features within that structure, FA might have two attributes now: foo: 1, bar: null; and same for FB: bar: 2, foo: null.

So it might seem that Mapnik is giving each feature all the attributes available but with null as the value if the feature did not actually have that feature in the original data.

But this is not happening consistently. In some sample data I've been using, I've found that Mapnik adds the null attributes only sometimes. Whether or not this happens does not seem connected with data type, or the number of times the attribute shows up.

I'm not sure what to make of this, or whether I should file this upstream in Mapnik itself.

davidtheclark avatar Jul 11 '16 16:07 davidtheclark

More info: I'm iterating through the datasource's features according to the example in the docs:

var datasource = new mapnik.Datasource(datasourceOptions);
var features = datasource.featureset();
var feature = features.next();
while (feature) {..}

Also, here's the sample GeoJSON that is producing these odd results:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "name": "foo",
        "power": 6,
        "mixed": 7,
        "numbers": [1, 2, 3],
        "astonishing": true
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          84.72656249999999,
          55.7765730186677
        ]
      }
    },
    {
      "type": "Feature",
      "properties": {
        "name": "bar",
        "mixed": "7",
        "power": 99
      },
      "geometry": {
        "type": "LineString",
        "coordinates": [
          [
            84.462890625,
            55.32914440840507
          ],
          [
            86.0009765625,
            55.751849391735284
          ],
          [
            86.044921875,
            56.04749958329888
          ]
        ]
      }
    },
    {
      "type": "Feature",
      "properties": {
        "name": "baz"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          83.49609375,
          56.48676175249086
        ]
      }
    },
    {
      "type": "Feature",
      "properties": {
        "name": "haha"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          85.341796875,
          56.8249328650072
        ]
      }
    },
    {
      "type": "Feature",
      "properties": {
        "name": "hoho",
        "numbers": [1, 2, 3],
        "data": 6
      },
      "geometry": {
        "type": "LineString",
        "coordinates": [
          [
            85.8251953125,
            55.50374985927514
          ],
          [
            86.4404296875,
            55.677584411089505
          ],
          [
            86.7919921875,
            55.65279803318956
          ]
        ]
      }
    },
    {
      "type": "Feature",
      "properties": {
        "name": "hehe",
        "power": 86,
        "data": {
          "foo": "bar"
        },
        "onlyNull": null,
        "description": "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              80.947265625,
              57.397624055000456
            ],
            [
              80.5078125,
              56.145549500679074
            ],
            [
              81.650390625,
              56.145549500679074
            ],
            [
              81.650390625,
              56.559482483762245
            ],
            [
              80.947265625,
              57.397624055000456
            ]
          ]
        ]
      }
    }
  ]
}

Attributes that are having null values added: astonishing, mixed, numbers, power

Attributes that not not having null values added: name, data, description, onlyNull

davidtheclark avatar Jul 11 '16 16:07 davidtheclark

@davidtheclark Mapnik (core) reads the first 5 features, by default, to assemble the schema of possible attributes. The default of 5 is conservative for performance reasons on very large feature collections. Can you try passing num_features_to_query=100 to see if this makes the behavior consistent? Refs https://github.com/mapnik/mapnik/issues/3238 | https://github.com/mapnik/mapnik/commit/f44b5ccfd93c0f30b739bc05887412a48771102f

springmeyer avatar Jul 11 '16 16:07 springmeyer

Seems like the same thing is happening when I add num_features_to_query = 100 to my options.

davidtheclark avatar Jul 11 '16 16:07 davidtheclark

@davidtheclark - I'll take a look tomorrow, thanks for reporting.

artemp avatar Jul 11 '16 18:07 artemp

Great! To be clear: what I'd like would be for no null-valued attributes to be inserted into the features. That way I can distinguish between an attribute that was actually set to null in the feature and an attribute that was not a part of (undefined for) the feature.

On Jul 11, 2016, at 11:59 AM, Artem Pavlenko [email protected] wrote:

@davidtheclark - I'll take a look tomorrow, thanks for reporting.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

davidtheclark avatar Jul 11 '16 19:07 davidtheclark

Using mapnik-python

for f in ds.all_features():
...     print(f["astonishing"],f["mixed"],f["numbers"],f["power"],f["name"])
... 
True 7 [1,2,3] 6 foo
None 7 None 99 bar
None None None None baz
None None None None haha
None None [1,2,3] None hoho
None None None 86 hehe

^ looks reasonable to me - features that don't have particular attribute return None in Python

>>> for f in ds.all_features():
...     f.to_geojson()
... 
'{"type":"Feature","id":1,"geometry":{"type":"Point","coordinates":[84.7265625,55.7765730186677]},"properties":{"astonishing":true,"mixed":7,"name":"foo","numbers":"[1,2,3]","power":6}}'
'{"type":"Feature","id":2,"geometry":{"type":"LineString","coordinates":[[84.462890625,55.3291444084051],[86.0009765625,55.7518493917353],[86.044921875,56.0474995832989]]},"properties":{"mixed":"7","name":"bar","power":99}}'
'{"type":"Feature","id":3,"geometry":{"type":"Point","coordinates":[83.49609375,56.4867617524909]},"properties":{"name":"baz"}}'
'{"type":"Feature","id":4,"geometry":{"type":"Point","coordinates":[85.341796875,56.8249328650072]},"properties":{"name":"haha"}}'
'{"type":"Feature","id":5,"geometry":{"type":"LineString","coordinates":[[85.8251953125,55.5037498592751],[86.4404296875,55.6775844110895],[86.7919921875,55.6527980331896]]},"properties":{"data":6,"name":"hoho","numbers":"[1,2,3]"}}'
'{"type":"Feature","id":6,"geometry":{"type":"Polygon","coordinates":[[[80.947265625,57.3976240550005],[80.5078125,56.1455495006791],[81.650390625,56.1455495006791],[81.650390625,56.5594824837622],[80.947265625,57.3976240550005]]]},"properties":{"data":"{foo:\\"bar\\"}","description":"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.","name":"hehe","power":86}}'

Note, "astonishing" appears only once as in original ^

I think some logic is faulty on JS (mapnik.node) side. /cc @davidtheclark @springmeyer

artemp avatar Jul 12 '16 09:07 artemp

After some digging it turned out that the best place to fix this is in https://github.com/mapnik/mapnik feature_kv_iterator

https://github.com/mapnik/mapnik/commit/3397b8f14fa649f8907bd7b2974f1b5792cf99ef#diff-7e10cfbdff7fca38fb1824d1ce771d88R42

^ I modified feture_kv_iterator to skip over non-existent values and also fix feature generator logic to allow null values in output which are valid JSON values https://github.com/mapnik/mapnik/commit/8ce58ea29c61cb9cc75eb660b26f969a39097854

/cc @davidtheclark @springmeyer

artemp avatar Jul 14 '16 10:07 artemp

@artemp to prepare for @davidtheclark testing this I tried node-mapnik master with latest mapnik master but I'm seeing:

 624 passing (7s)
  1 pending
  3 failing

  1) mapnik.Feature  should output the same geojson that it read:
     SyntaxError: Unexpected token }
      at Object.parse (native)
      at Context.<anonymous> (test/feature.test.js:123:28)

  2) mapnik.Feature  should output the same geojson that it read (point):
     SyntaxError: Unexpected token }
      at Object.parse (native)
      at Context.<anonymous> (test/feature.test.js:148:28)

  3) mapnik.Feature  should output the same geojson that it read (line):
     SyntaxError: Unexpected token }
      at Object.parse (native)
      at Context.<anonymous> (test/feature.test.js:173:28)

Can you replicate these failures?

springmeyer avatar Jul 14 '16 12:07 springmeyer

@springmeyer - yes, confirming I'm seeing those as well ^. Back to debugging :)

artemp avatar Jul 14 '16 13:07 artemp

Doh ;) Summary: our trick to use mapnik::value_null as a placeholder for Features that don’t actually have particular attribute is getting problematic. Specially if we want to maintain 1:1 matching of input and output (GeoJSON) A proper solution would be to abandon ​_schema_​ based features (aka Context) in favour of “scheme-less” Features like GeoJSON - back to how it used to be.

Or add “special” value type, lets call it “undefined” where value doesn’t exists. Latter feels like a hack and will muddy internal logic further so not sure.. Sounds like we need to sync before choosing a path

/cc @springmeyer @davidtheclark

artemp avatar Jul 14 '16 14:07 artemp

To distill even further: If we want to support "null" values (i think we do because they are valid in GeoJSON) then we can't rely on mapnik::value_null as a special default value for 'schema' (context) based feature implementation. /cc @springmeyer @davidtheclark

artemp avatar Jul 14 '16 14:07 artemp

> d={"name":null}
{ name: null }
> d["name"]
null
> d["undefined"]
undefined
> 

In JavaScript null is a distinct type that can be assigned ^ /cc @springmeyer @davidtheclark

artemp avatar Jul 14 '16 14:07 artemp

To properly support null values in JSON input and output we need to change mapnik::feature_impl - I'm planning to work on this next week.

Meanwhile, I added properties method which can optionally filter out null values as in feature json generator. This is OK because from data point of view following JSONs are equal

{ "first": 123, "second":null }
{ "first": 123}
var mapnik = require('mapnik');
mapnik.register_datasource('/opt/mapnik/lib/mapnik/input/geojson.input');
var ds = new mapnik.Datasource({type:'geojson',file:'test.json'});
var featureset = ds.featureset();
var feat = featureset.next();
while (feat)
{
    var props = feat.properties();
    console.log(props);
    console.log("------------NO NULL----------");
    props = feat.properties(false);
    console.log(props);
    console.log("-----------------------------");
    feat = featureset.next();
}

Sample output

{ astonishing: true,
  data: null,
  description: null,
  mixed: 7,
  name: 'foo',
  null: null,
  numbers: '[1,2,3]',
  onlyNull: null,
  power: 6 }
------------NO NULL----------
{ astonishing: true,
  mixed: 7,
  name: 'foo',
  numbers: '[1,2,3]',
  power: 6 }
-----------------------------
{ astonishing: null,
  data: null,
  description: null,
  mixed: '7',
  name: 'bar',
  null: null,
  numbers: null,
  onlyNull: null,
  power: 99 }
------------NO NULL----------
{ mixed: '7', name: 'bar', power: 99 }
-----------------------------
{ astonishing: null,
  data: null,
  description: null,
  mixed: null,
  name: 'baz',
  null: null,
  numbers: null,
  onlyNull: null,
  power: null }
------------NO NULL----------
{ name: 'baz' }
-----------------------------
{ astonishing: null,
  data: null,
  description: null,
  mixed: null,
  name: 'haha',
  null: null,
  numbers: null,
  onlyNull: null,
  power: null }
------------NO NULL----------
{ name: 'haha' }
-----------------------------
{ astonishing: null,
  data: 6,
  description: null,
  mixed: null,
  name: 'hoho',
  null: null,
  numbers: '[1,2,3]',
  onlyNull: null,
  power: null }
------------NO NULL----------
{ data: 6, name: 'hoho', numbers: '[1,2,3]' }
-----------------------------
{ astonishing: null,
  data: '{foo:"bar"}',
  description: 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.',
  mixed: null,
  name: 'hehe',
  null: null,
  numbers: null,
  onlyNull: null,
  power: 86 }
------------NO NULL----------
{ data: '{foo:"bar"}',
  description: 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.',
  name: 'hehe',
  power: 86 }
-----------------------------

/cc @springmeyer @davidtheclark

artemp avatar Jul 15 '16 16:07 artemp

Thanks for plugging away at this @artemp.

davidtheclark avatar Jul 15 '16 16:07 davidtheclark

@davidtheclark: quick update: @artemp and I worked today to get the latest Mapnik core version (needed for his fix for you) working against latest node-mapnik. Still a few things needed to test, but getting close. Once there we'll be ready to further test the patch here and get you a dev build. ETA later this week. Feel free to ping for an update anytime you are thinking about this.

springmeyer avatar Jul 26 '16 16:07 springmeyer

@springmeyer Any update on this?

davidtheclark avatar Aug 11 '16 13:08 davidtheclark

@davidtheclark - dug into this issue just now. There is some other bug lurking so it is not as simple as updating the Mapnik core version like I thought (which is now done). I'm seeing unpredictable results when the geojson is indexed (index created with mapnik-index command) that makes me uncomfortable about the changes. So, this needs a closer look and I'm strapped on time at the moment. Can you live with your workaround for another week or two? If not hit me up on chat to scheme.

springmeyer avatar Aug 11 '16 23:08 springmeyer

@artemp as part of testing this (https://gist.github.com/springmeyer/5f9ad5831402587c258116f1e627895c) I also noticed that:

"data": { "foo": "bar" }

Is round tripping as "data":"{foo:\"bar\"}" in mapnik v3.0.12 but "data":"{\"foo\":\"bar\"}" in mapnik v.3.0.11. Although in mapnik v3.0.12 it also roundtrips as "data":"{\"foo\":\"bar\"}" when not indexed. It seems we have some different behavior when indexing that needs to be unit tested in core C++ with this file. Can you take a look at pulling the test file into mapnik core and testing how features are round tripped with and without indexing?

springmeyer avatar Aug 12 '16 04:08 springmeyer

@springmeyer - great catch before mapnik v3.0.12 release!

The issue was missing quoting for nested objects (https://github.com/mapnik/mapnik/issues/3491) and it's not related to indexing so please verify your setup ^. We already have comprehensive tests covering all JSON properties types but the expected data was bogus (corrected in https://github.com/mapnik/mapnik/commit/5c11fe49f9d428d140f73d513d43cebed4678a81) The issue is fixed in https://github.com/mapnik/mapnik/commit/8dca305e7ecd504cddebe5fd4f395011f56432bf

artemp avatar Aug 12 '16 10:08 artemp

Can you live with your workaround for another week or two?

Absolutely.

davidtheclark avatar Aug 12 '16 13:08 davidtheclark

@artemp @springmeyer Has there been any progress on this issue? Asking because I'm starting work on another project where this has come up.

davidtheclark avatar Nov 10 '16 14:11 davidtheclark

@davidtheclark In my mind what still stands here is the problem of null reporting. Mapnik creates (as it parses features) a "context". This "context" is a map of every property key found and an index to where it is. This allows for more efficient storage of strings to reduce memory in featuresets with a lot of features. The impact of this is that we loose the ability to decern null properties from missing properties. @artemp and I have previously felt this tradeoff was good, since perf/low memory > round tripping null features. However your usecase indicates round tripping nulls matters. The basic problem is we can't by design. Second you also saw inconsistency in when nulls were reported. This inconsistency has to do with two factors: 1) the order of features and 2) whether the data is indexed. In both cases the inconsistency is by design (based on the order things are encountered) and I think we've rooted out all the unintended bugs.

So overall I think the question is whether round tripping is important enough to change the way Mapnik works. The perf/memory optimizations that originally motivated the use of a "context" in Mapnik only apply when catching features in memory. Most modern use cases of Mapnik don't actually do this anymore, so we could consider changing this behavior. Let's schedule a time to talk more about this in 2017? In the meantime ponder if you think you can workaround this fully or if you'd like it fixed.

springmeyer avatar Dec 21 '16 19:12 springmeyer

So overall I think the question is whether round tripping is important enough to change the way Mapnik works.

I don't see a strong reason why the distinction between a null property and a missing property should matter anywhere mapnik-related. It's a language-specific thing; sure mapnik is tightly connected with JavaScript and Python, which do distinguish those; but if you wanted to bind to Lua or Perl, you'd have trouble. The GeoJSON spec requires "properties" to be either null or an object; the interpretation of its contents is entirely up to the application.

That being said...

This "context" is a map of every property key found and an index to where it is. This allows for more efficient storage of strings to reduce memory in featuresets with a lot of features.

... this becomes counter-productive when each feature has a handful of properties from a much larger set of possible properties. Think of fully expanding a sparse matrix. I don't know whether such use case is common, but it might be worth considering.

Regarding feature memory usage, replacing vector<value> with unordered_map<index_t, value> won't make it much higher because value can hold a UnicodeString with its SSO buffer. index_t can be unsigned, IMO there won't be a need for the range of size_t this century; or const std::string* (and then context can be a set instead of map).

This would make round-tripping possible, but I'd rather not store null values in the feature at all. get has to return null for missing properties, anyway.

lightmare avatar Dec 23 '16 17:12 lightmare