Weights sprint planning
I'm starting this thread to begin discussions about the upcoming weights sprint. The current experimental implementation is in the `geographs` branch, in the `weights/experimental` module. The `ROADMAP.md` will need updating...
Things we have right now:
- Base class
- Generic constructors (`from_arrays`, `from_sparse`, `from_old_w` (for ease of testing), `from_weights_dict`, `from_dicts`), all channeled through `from_arrays`
- Contiguity constructors (`from_contiguity`)
- A bunch of properties
- Transformation
- Handling of islands
- basic tooling for kernel constructors
- unfinished tooling for triangulation constructors
- some bits of set ops, but @ljwolf will know the state of that better
- spatial lag
What needs to be done:
- hook kernel constructors to the base class
- finish triangulation and hook to the base class
- tests
  - tests of the functionality, independent of the current codebase
  - tests against the current implementations. These will be gone at some point, but we need to ensure that the results from old and new match (see the parity-test sketch below this list)
- I/O
- plotting
- lag_categorical
- finish set operations
- fuzzy_contiguity
- a bunch of other utils we have now
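For the old-vs-new parity tests, something along these lines could work (a sketch only; the experimental constructor's name and signature below are assumptions, not the settled API):

```python
import numpy as np
from libpysal import weights

def test_queen_matches_old_implementation(gdf):
    # `gdf` is assumed to come from a pytest fixture loading a test dataset
    old = weights.Queen.from_dataframe(gdf)
    # hypothetical experimental constructor; adjust once the API settles
    new = weights.experimental.W.from_contiguity(gdf, rook=False)
    # old and new should produce identical adjacency structures
    np.testing.assert_array_equal(old.sparse.toarray(), new.sparse.toarray())
```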
Where we're blocked by `scipy.sparse` development:
- `components`
- `higher_order`
My suggestion would be to focus on testing right now. The code is becoming too large, and it should be thoroughly covered before we move forward.

I also think that we should start making PRs into the `geographs` branch rather than committing directly, to ensure proper review.

I will try to finish scipy/scipy#18544, which should unblock `higher_order`. We will also need to identify why connected components fails on the csgraph side, but that should be doable tomorrow.
If necessary, we can build on top of the matrix classes for now and migrate once the arrays are feature complete?
csgraph does seem to generally require matrix for now.
> If necessary, we can build on top of the matrix classes for now and migrate once the arrays are feature complete?
Maybe as on-the-fly conversion when necessary?
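Something like this conversion at the csgraph boundary could work as a stopgap (a sketch, not a settled plan):

```python
import numpy as np
from scipy import sparse
from scipy.sparse import csgraph

# the W would keep its adjacency as a sparse *array* internally...
adjacency = sparse.coo_array(
    (np.ones(4), (np.array([0, 1, 2, 3]), np.array([1, 0, 3, 2]))),
    shape=(4, 4),
)

# ...and convert to the matrix interface only when calling into csgraph
n_components, labels = csgraph.connected_components(sparse.csr_matrix(adjacency))
```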
We are currently not sure whether an "adaptive" bandwidth parameter is allowed in the `kernel()` function.

The `kernel()` function may also need to implement a "threshold" option that lets you keep only pairs of points closer than the threshold.
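A rough sketch of what such a "threshold" option could do internally, using `scipy.spatial.KDTree` (the option name and behaviour are under discussion, not settled):

```python
import numpy as np
from scipy.spatial import KDTree

coordinates = np.random.default_rng(0).uniform(size=(100, 2))
threshold = 0.2

# only pairs within `threshold` are kept; everything else never enters
# the kernel computation (self-pairs and the kernel weighting itself
# would still be handled downstream)
tree = KDTree(coordinates)
distances = tree.sparse_distance_matrix(
    tree, max_distance=threshold, output_type="coo_matrix"
)
```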
Not going to be able to complete scipy/scipy#18544 today, but I will aim for next sprint.
TL;DR:

- what's the intended signature for instantiating a kernel-based W from a dataframe?
- what's the signature for instantiating a kernel-based W from a precomputed distance matrix?

I'm trying to wrap my head around the idea of input data structures versus desired output as first-class citizens for constructing $W$s.
So, after our call, I think the idea is to always treat the input data structure as primary. Right now we have a `from_contiguity` constructor, but I'm sure I heard Levi say we won't have such a thing in the new structure (and analogously we won't have `from_kernel`). So, while we need to "hook kernel constructors to the base class", that doesn't mean creating a `from_kernel` classmethod that's analogous to `from_contiguity`.

Instead, the idea is that `from_contiguity` will become private, and we will create a larger classmethod that's `from_dataframe`? So, currently, the kernel functions take an argument called `coordinates`, which gets passed to a geometry validator. That means the input must be a geodataframe, not, say, a distance matrix. So even though we want to allow `distance=precomputed`, that won't work, because L109 is forcing an array of geoms.

So if we want to create a kernel-based W, and we are treating the input data structure as first class, then I need to do something like `W.from_dataframe(**kernel_args?)`, because we want to avoid `W.from_kernel(**input_data_args?)`? If that's the case, then the primary purpose of `W.from_dataframe()` is to dispatch to helper functions that generate the correct data structure for different types of weights? Then it will end up subsuming `from_contiguity`, and to create a queen weight I'd do something like `W.from_dataframe(gdf, 'contiguity', queen=True)`; conversely, if I needed a kernel from precomputed distances I'd do something like `W.from_csc(distance_matrix, 'kernel', kernel_function='gaussian')`, whereas if I needed it built on the fly with euclidean distances I'd do `W.from_dataframe(gdf, 'kernel', kernel_function='gaussian')`. This all applies to triangulation as well.

The alternative would be to leave the desired output as first class, which would mean leaving `from_contiguity` intact, then adding analogous `from_kernel`, `from_triangulation`, etc.
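For concreteness, a hypothetical sketch of the dispatching `from_dataframe` that the first option implies (all names here are illustrative, not an agreed API; `_from_contiguity`, `_from_kernel`, and `_from_triangulation` are assumed private helpers that do not exist under these names):

```python
class W:
    @classmethod
    def from_dataframe(cls, gdf, kind, **kwargs):
        # dispatch on the *weight type*; every type's keywords funnel through
        # **kwargs, which is what makes this signature hard to document cleanly
        if kind == "contiguity":
            return cls._from_contiguity(gdf, **kwargs)     # e.g. queen=True
        elif kind == "kernel":
            return cls._from_kernel(gdf, **kwargs)         # e.g. kernel_function="gaussian"
        elif kind == "triangulation":
            return cls._from_triangulation(gdf, **kwargs)  # e.g. method="delaunay"
        raise ValueError(f"unknown weight type: {kind!r}")
```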
I did not recall that we had already implemented some of them. That's my misunderstanding. Sorry!

If `from_contiguity` is already there, run with that and add the others for now.
From a design angle, I feel like `from_kernel()` and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. `to_kernel()` would imply that, but I think then "kernel" should probably be a type? I hoped to avoid typing like this in favor of the construction functions outputting plain W objects, since we get into tricky inheritance-consistency issues (like, is a kernel-weighted Voronoi an instance of rook, kernel, or Delaunay, or all of them?). But we may need to chat about this next sprint?
> From a design angle, I feel like `from_kernel()` and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. `to_kernel()` would imply that, but I think then "kernel" should probably be a type?

Yeah, completely agree. You don't do `from_kernel`, because your input data isn't "a kernel".
So the issue is that we want a generic W class, and it should be instantiated explicitly from different data structures. But there are also genuinely different subtypes of W, as they mean different things and require different computations to create them. Before, we had subclasses for each type of W, and they inherited the same constructors for different input formats. I agree we want to avoid weird inheritance, and subclassing W more generally, but then the only way I can think of to handle the combination of w_type * data_structure is to slightly overload the arguments for data-structure-based classmethods, like `W.from_dataframe(gdf, 'contiguity', queen=True)`. That gets messy because it needs to accept all the kwargs for constructing all subtypes of W from that data structure, but it's conceptually straightforward (and the converse of our previous design, where the W subtype was first class, and it inherited constructors for the shared input data types).
Or do we take a generic W from input, `W.from_dataframe(gdf)`, and define conversions `to_kernel`, `to_voronoi`, etc.?
Maybe `from_dataframe` should be the constructor for an arbitrary adjacency table?
@ljwolf, question on the signature of `delaunay`: should the first arg be an array or a geoseries/gdf?
https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_triangulation.py#L25-L36

since later on `geoms` is passed for validation:
https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_triangulation.py#L77-L79
Not sure whether the docs need updating or the check requires some preprocessing, since `geoms` is undefined as currently implemented. This made me think the first argument is intended to be something other than what is documented at present?
I don't think having one `from_dataframe` method is a good idea. It would have a ton of conditional keywords and would be messy and hard to understand.

I think that constructors like `from_triangulation` could easily accept both GeoPandas objects and an array of coords.
> Or do we take a generic W from input, `W.from_dataframe(gdf)`, and define conversions `to_kernel`, `to_voronoi`, etc.?

No. Once you create W, you have the adjacency. From there, it shouldn't matter whether that table was created based on contiguity, a kernel, or manually encoded weights. You don't have any information that could be used to do a conversion to Voronoi at that point, and you shouldn't, imho.

I'm not sure I properly understand the problem, though.
> @ljwolf, question on the signature of `delaunay`: should the first arg be an array or a geoseries/gdf?
The input should be able to be either an (n, 2) array of coordinates, a geoseries of POINT, or a numpy array of shapely POINT geometries. The input coordinates should be validated using:
https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_utils.py#L25-L29
and the output of that is used for the computation.
So, everywhere we use `_validate_geometry_input()` in `_triangulation`, for example:
https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_triangulation.py#L77-L79
`geoms` should be `coordinates`.
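For reference, a rough sketch of what that coercion might look like (this is not the actual `_validate_geometry_input`, which lives in `weights/experimental/_utils.py`; the helper name here is illustrative):

```python
import numpy as np
import geopandas
import shapely

def _coerce_to_coordinates(data):
    """Return an (n, 2) float array from the accepted input types."""
    if isinstance(data, geopandas.GeoDataFrame):
        data = data.geometry
    if isinstance(data, geopandas.GeoSeries):
        data = data.values
    data = np.asarray(data)
    if data.dtype == object:  # an array of shapely geometries
        if not (shapely.get_type_id(data) == 0).all():  # 0 == POINT
            raise ValueError("only POINT geometries are supported")
        return shapely.get_coordinates(data)
    return np.asarray(data, dtype=float)  # already (n, 2) coordinates
```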
My preferred option would be:

- the constructor functions (`kernel()`, `vertex_set_intersection()`, `rook()`, `delaunay()`, `gabriel()`) take `GeoSeries` or arrays of coordinates, and return plain `W` objects. So, calls look like:
  - `my_w = kernel(my_gdf.geometry, k=3)`
  - `my_w = delaunay(my_coordinates)`
- the `W` class itself uses `from_*` to read/convert existing objects into a `W` structure. So, calls look like:
  - `W.from_networkx(nx_graph)`
  - `W.from_file('./columbus.gal')`
  - `W.from_sparse(my_coo)`
  - `W.from_adjacency(my_df, focal_col='here', neighbor_col='there', weight_col='strength')`
  - `W.from_arrays(my_df.here, my_df.there, my_df.strength)`

No `W.from_dataframe()`, so that there's no confusion about what "dataframe" means.
So, instead of `W.from_dataframe(gdf, 'contiguity', queen=True)`, we document/recommend `w = weights.contiguity.rook(gdf)`.
> We document/recommend `w = weights.contiguity.rook(gdf)`
Mmm, not sure I am a fan of that. The solution with `from_contiguity` and similar is elegant in the sense that you always create a W with some `W.from_*` constructor, and you don't have to fish through the documentation and different submodules to figure out how to create one.

What is the issue with this API design? I'm still not getting the original problem that needs solving.
> > We document/recommend `w = weights.contiguity.rook(gdf)`
>
> Mmm, not sure I am a fan of that. The solution with `from_contiguity` and similar is elegant in the sense that you always create a W with some `W.from_*` constructor, and you don't have to fish through the documentation and different submodules to figure out how to create one.
What about pulling them up into the higher-level namespace? `weights.rook(gdf)` / `weights.kernel(gdf)`?

> What is the issue with this API design? I'm still not getting the original problem that needs solving.
From @ljwolf https://github.com/pysal/libpysal/issues/534#issuecomment-1644268455:

> I feel like `from_kernel()` and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. `to_kernel()` would imply that, but I think then "kernel" should probably be a type?

From @knaaptime https://github.com/pysal/libpysal/issues/534#issuecomment-1644278338:

> Yeah, completely agree. You don't do `from_kernel`, because your input data isn't "a kernel".
> My preferred option would be:
>
> - the constructor functions (`kernel()`, `vertex_set_intersection()`, `rook()`, `delaunay()`, `gabriel()`) take `GeoSeries` or arrays of coordinates, and return plain `W` objects. So, calls look like `my_w = kernel(my_gdf.geometry, k=3)`, `my_w = delaunay(my_coordinates)`.
> - the `W` class itself uses `from_*` to read/convert existing objects into a `W` structure. So, calls look like: `W.from_networkx(nx_graph)`, `W.from_file('./columbus.gal')`, `W.from_sparse(my_coo)`, `W.from_adjacency(my_df, focal_col='here', neighbor_col='there', weight_col='strength')`, `W.from_arrays(my_df.here, my_df.there, my_df.strength)`.
>
> No `W.from_dataframe()`, so that there's no confusion about what "dataframe" means. So, instead of `W.from_dataframe(gdf, 'contiguity', queen=True)`, we document/recommend `w = weights.contiguity.rook(gdf)`.
I get it. So we have constructor functions rather than classmethod constructors when you need to generate the encoding, and `from_*` classmethods when you already have the encoding laid out in an existing data structure. That's a change of pace, but it makes sense to me.
> What about pulling them up into the higher-level namespace? `weights.rook(gdf)` / `weights.kernel(gdf)`?
Maybe this. I do think it would make things easier to do `from libpysal import weights` and be able to tab-complete to see there's a `rook` function, etc. That makes it a little clearer that `rook` is definitely a weights builder.
> What is the issue with this API design? I'm still not getting the original problem that needs solving.
@martinfleis if we still need to "hook kernel (and triangulation) constructors up to the base class", what does that mean? What's the intended signature for instantiating a kernel-based W from a dataframe? What's the signature for instantiating a kernel-based W from a precomputed distance matrix (since we agree `from_kernel` doesn't make sense)?
> kernel weight from a precomputed distance matrix

That should be `weights.kernel(D, metric="precomputed", kernel="gaussian")`.
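For example, end to end (a sketch under the proposed signature; `coordinates` is assumed to be an (n, 2) array, and `weights.kernel` is the proposed, not yet merged, constructor):

```python
from libpysal import weights  # assuming the proposed constructor lands here
from scipy.spatial.distance import pdist, squareform

D = squareform(pdist(coordinates))  # dense, precomputed pairwise distances
w = weights.kernel(D, metric="precomputed", kernel="gaussian")
```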
I originally assumed that it means `from_kernel` and `from_triangulation` methods. But if you don't think that's a good idea, let's figure out a better option.

I suggest we try to ensure we have the internal functions that create those adjacency tables finished first, and then we can discuss the public API during a call on Friday during the next sprint? (Note that I have only my phone with me until Sunday, so I haven't checked the actual state of the code.)
This is what I had in mind.
```python
@classmethod
def from_kernel(
    cls,
    data,
    bandwidth=None,
    metric="euclidean",
    kernel="gaussian",
    k=None,
    ids=None,
    p=2,
):
    """Construct a kernel-based W from point data or a precomputed distance matrix.

    Parameters
    ----------
    data : numpy.ndarray, geopandas.GeoSeries, geopandas.GeoDataFrame
        geometries over which to compute a kernel. If a geopandas object with Point
        geometry is provided, the .geometry attribute is used. If a numpy.ndarray
        with shapely geometry is used, then the coordinates are extracted and used.
        If a numpy.ndarray of shape (n, 2) is used, it is assumed to contain x, y
        coordinates. If metric="precomputed", data is assumed to contain a
        precomputed distance matrix.
    bandwidth : float (default: None)
        distance to use in the kernel computation. Should be on the same scale as
        the input coordinates.
    metric : string or callable (default: 'euclidean')
        distance function to apply over the input coordinates. Supported options
        depend on whether or not scikit-learn is installed. If so, then any
        distance function supported by scikit-learn is supported here. Otherwise,
        only euclidean, minkowski, and manhattan/cityblock distances are admitted.
    kernel : string or callable (default: 'gaussian')
        kernel function to apply over the distance matrix computed by `metric`.
        The following kernels are supported:
        - triangular
        - parabolic
        - gaussian
        - bisquare
        - cosine
        - boxcar/discrete: all distances less than `bandwidth` are 1, and all
          other distances are 0
        - identity/None: do nothing, weight similarity based on raw distance
        - callable: a user-defined function that takes the distance vector and
          the bandwidth and returns the kernel: kernel(distances, bandwidth)
    k : int (default: None)
        number of nearest neighbors used to truncate the kernel. This is assumed
        to be constant across samples. If None, no truncation is conducted.
    ids : numpy.ndarray (default: None)
        ids to use for each sample in coordinates. Generally, construction
        functions that are accessed via W.from_kernel() will set this
        automatically from the index of the input. Do not use this argument
        directly unless you intend to set the indices separately from your input
        data. Otherwise, use data.set_index(ids) to ensure ordering is respected.
        If None, the index from the input coordinates will be used.
    p : int (default: 2)
        parameter for minkowski metric, ignored if metric != "minkowski".

    Returns
    -------
    W
        libpysal.weights.experimental.W
    """
```
In theory this also supports DistanceBand and KNN, but it may be worth exposing those as separate constructors for the sake of user-friendliness. We can pipe those through this internally.
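If the classmethod sketched above were adopted, usage might look like this (illustrative only; `gdf` is a point GeoDataFrame and `D` a precomputed distance matrix):

```python
# from point geometries, truncated to 10 nearest neighbours
w = W.from_kernel(gdf, bandwidth=500, kernel="triangular", k=10)

# from a precomputed distance matrix
w = W.from_kernel(D, metric="precomputed", kernel="gaussian")
```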
```python
import numpy as np  # needed for the np.inf default below

@classmethod
def from_triangulation(
    cls,
    data,
    method="delaunay",
    bandwidth=np.inf,
    kernel="boxcar",
    clip="extent",
    contiguity_type="rook",
):
    """Construct a W from a triangulation of point data.

    Parameters
    ----------
    data : numpy.ndarray, geopandas.GeoSeries, geopandas.GeoDataFrame
        geometries containing locations to compute the delaunay triangulation.
        If a geopandas object with Point geometry is provided, the .geometry
        attribute is used. If a numpy.ndarray with shapely geometry is used,
        then the coordinates are extracted and used. If a numpy.ndarray of
        shape (n, 2) is used, it is assumed to contain x, y coordinates.
    method : str (default: "delaunay")
        method of extracting the weights from the triangulation. Supports:
        - delaunay
        - gabriel
        - relative_neighborhood
        - voronoi
    bandwidth : float, optional
        distance to use in the kernel computation. Should be on the same scale
        as the input coordinates. By default numpy.inf.
    kernel : str, optional
        kernel function to use in order to weight the output graph. See
        :meth:`W.from_kernel` for details. By default "boxcar".
    clip : str (default: 'extent')
        An overloaded option about how to clip the voronoi cells passed to
        cg.voronoi_frames() when method="voronoi". Ignored otherwise.
        Options are as follows.

        * ``'none'``/``None`` -- No clip is applied. Voronoi cells may be
          arbitrarily larger than the source map. Note that this may lead to
          cells that are many orders of magnitude larger in extent than the
          original map. Not recommended.
        * ``'bbox'``/``'extent'``/``'bounding box'`` -- Clip the voronoi cells
          to the bounding box of the input points.
        * ``'chull'``/``'convex hull'`` -- Clip the voronoi cells to the convex
          hull of the input points.
        * ``'ashape'``/``'ahull'`` -- Clip the voronoi cells to the tightest
          hull that contains all points (e.g. the smallest alphashape, using
          ``libpysal.cg.alpha_shape_auto``).
        * Polygon -- Clip to an arbitrary Polygon.
    contiguity_type : str, optional
        What kind of contiguity to apply to the voronoi diagram when
        method="voronoi". Ignored otherwise. Supports "rook" and "queen",
        by default "rook".

    Returns
    -------
    W
        libpysal.weights.experimental.W
    """
```
@ljwolf I think that for Voronoi, we don't need to expose rook and queen constructors and can go with the vertex method only, because the diagram is always topologically precise.
Was there any discussion on these last time that concluded that this is not an optimal API?
> Was there any discussion on these last time that concluded that this is not an optimal API?
Yeah, let's discuss the tradeoffs on Friday. Given the initial layout in the experimental branch, what you sketched above was my original understanding too. But I think we decided the `from_*` nomenclature is a bit misleading. When you do `from_kernel` above, you pass the function a geodataframe, not a kernel. The same with triangulation; the first argument is either a geometry object or a list of coordinates, then an intermediate kernel/triangulation is constructed immediately before giving you a weights object. So it's not really weights object `from_kernel`; it's kernel `from_dataframe` to weights object.

I can see arguments both ways. It feels natural to do `.from_kernel` in line with `from_contiguity`, but the semantics of that aren't actually accurate.
Got it. Let's pick it up on Friday. We may keep the signature and just change the names to something more fitting eventually.
So, on the sprint call, we considered the following two APIs:
**All constructors "live" on the `W`**
In this example, the core design principle is:

> `W` is self-contained. Most of what you need to do can be done with the `W` object alone, not requiring the `weights` namespace.
The `W.build_...()` idiom constructs a `W` under a known graph construction rule, such as:

```python
W.build_rook(geodataframe.geometry)  # calls weights._contiguity.rook(geodataframe.geometry)
W.build_kernel(coordinate_array)     # calls weights._kernel.kernel(coordinate_array)
```
The `W.from_...()` idiom constructs a `W` from a specific data structure:

```python
W.from_file("columbus.gal")
W.from_arrays(head, tail, weight)
W.from_sparse(my_coo_array)
```
To make good on this design decision, we should also implement a `.higher_order()`, a `.lag()`, etc. This is a pandas-style API, rather than expecting users to switch between functions in `weights` and methods on `W`. This allows for method chaining, such as:

```python
W.build_kernel(geoseries).transform('r').lag(geoseries.home_value)
```
**All constructors "live" in a non-object namespace**
In this example, the core design principle is:

> `W` is only a results container. Functions in `weights` operate on top of `W` objects and generally construct or mutate them.
We would have to implement a `construct` namespace, so that:

```python
weights.construct.contiguity(geodataframe.geometry)  # calls weights._contiguity.rook(geodataframe.geometry) by default
# or the constructor actually lives at `weights.construct.contiguity`
weights.construct.kernel(coordinate_array)           # calls weights._kernel.kernel(coordinate_array)
```
An `ingest` (`convert`?) namespace takes an existing data structure and "ingests" it into a `W`:

```python
weights.ingest.file("columbus.gal")
weights.ingest.arrays(head, tail, weight)
weights.ingest.sparse(my_coo_array)
```
Any of our functions in `weights` then act on `W` objects to build (or mutate) `W` objects. This is a numpy-style API.
@knaaptime @martinfleis @ljwolf @rongboxu agree on the method strategy, option 1.
On the coincident-points issue identified in Delaunay, we need to be consistent about how the new implementation deals with coincident points. Natively, Qhull implements a bunch of options, but we need to define a consistent way to deal with coincident points in KNN, too.
For handling coincident points, we discussed the following options:

- **jitter input upon coincidence**: randomly permute/jitter the input points if there is a coincident point.
- **construct a clique**: reduce the dataset to unique locations, construct the weight, and then introduce a "clique" into the graph connecting all coincident points together and also to other neighbours according to the construction rule.
- **raise every time there is a coincident point**: detect that there is a coincident point and raise an error if that is the case.
@knaaptime @sjsrey @martinfleis @TaylorOshan @rongboxu @ljwolf agree that the implementation should generally follow the "raise" strategy.
All weights constructors should take a "coincident" argument that can take three values:

- `"raise"` (default): raise an error when the coincident points would create an invalid output. This affects all Delaunay/Voronoi-type weights, and KNN when the number of duplicated points is greater than the requested `k`.
- `"clique"`: compute the weights on unique locations in the data. Induce a "clique" into the graph, connecting all coincident points to one another and to the other values for that location.
- `"jitter"`: add a very small random displacement (should use a fraction of the smallest pairwise distance) to all coordinates.
@sjsrey @knaaptime @ljwolf @martinfleis agreed to rename `W` to `Graph` and move the namespace.
I've just realised: for islands, the `nonzero` property will count them as nonzero, since `self.sparse.nnz` counts "explicit" zeros as nonzero. This might throw off calculations in join counts, since a self-join is always going to contribute to positive autocorrelation (in other cases, like Getis-Ord and Moran, we need it)...

We might want to think about using `n_edges`/`n_nodes` and then defining `nonzero` (the number of non-self loops) as `self.n_edges - len(self.isolates)`.
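A quick illustration of the `nnz` behaviour (a minimal sketch; `n_edges` and `isolates` above are proposed, not yet existing, attributes):

```python
import numpy as np
from scipy import sparse

# An explicit zero (e.g. an island's self-loop weight of 0) is a *stored*
# value, so it is counted by .nnz:
row = np.array([0, 1])
col = np.array([1, 1])
data = np.array([1.0, 0.0])  # the second entry is an explicitly stored zero
s = sparse.coo_array((data, (row, col)), shape=(2, 2))

print(s.nnz)                     # 2, even though only one weight is nonzero
print((s.toarray() != 0).sum())  # 1, the "true" number of nonzero edges
```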