libpysal icon indicating copy to clipboard operation
libpysal copied to clipboard

Weights sprint planning

Open ljwolf opened this issue 2 years ago • 58 comments

I'm starting this thread to begin discussions about the upcoming weights sprint. The current experimental implementation is in the geographs branch, the weights/experimental module. The ROADMAP.md will need updating...

ljwolf avatar Jul 18 '23 08:07 ljwolf

Things we have right now:

  • Base class
  • Generic constructors (from_arrays, from_sparse, from_old_w (for ease of testing), from_weights_dict, from_dicts), all channeled through from_arrays.
  • Contiguity constructors (from_contiguity)
  • A bunch of properties
  • Transformation
  • Handling of islands
  • basics tooling for kernel constructors
  • unfinished tooling for triangulation constructors
  • some bit of set ops but @ljwolf will know the state of that better
  • spatial lag

What needs to be done:

  • hook kernel constructors to the base class
  • finish triangulation and hook to the base class
  • tests
    • tests of the functionality independent of current codebase
    • tests against the current implementations. these will be gone at some point but we need to ensure that the results from old and new match
  • I/O
  • plotting
  • lag_categorical
  • finish set operations
  • fuzzy_contiguity
  • bunch of there utils we have now

Where we're blocked by scipy.sparse development

  • components
  • higher_order

My suggestion would be to focus on testing right now. The code is becoming too large and it shall be thoroughly covered before we start moving forward.

martinfleis avatar Jul 18 '23 15:07 martinfleis

I also think that we should start making PRs into the geography branch rather than direct commit to ensure proper review.

martinfleis avatar Jul 18 '23 15:07 martinfleis

I will try to finish scipy/scipy#18544, which should unblock higher order. We will need to also identify why connected components fails in the csgraph side, but that should be doable tomorrow.

If necessary, we can build on top of the matrix classes for now and migrate once the arrays are feature complete?

ljwolf avatar Jul 18 '23 16:07 ljwolf

csgraph does seem to generally require matrix for now.

If necessary, we can build on top of the matrix classes for now and migrate once the arrays are feature complete?

Maybe as on-the-fly conversion when necessary?

martinfleis avatar Jul 18 '23 16:07 martinfleis

We are currently not sure if there is an "adaptive" bandwidth parameter allowed in the kernel() function.

ljwolf avatar Jul 20 '23 14:07 ljwolf

the kernel() function may also need to implement a "threshold" option that lets you keep only points closer than threshold as pairs.

ljwolf avatar Jul 20 '23 14:07 ljwolf

Not going to be able to complete scipy/scipy#18544 today, but will aim to next sprint.

ljwolf avatar Jul 20 '23 15:07 ljwolf

TLDR:

  • whats a the intended signature for instantiating a kernel-based W from a dataframe?
  • whats the signature for instantiating a kernel-based W from a precomputed distance matrix?

I'm trying to wrap my head around the idea of input data structures versus desired output as first-class citizens for constructing $W$s

so, after our call, i think the idea is to always treat the input data structure as primary. Right now, we have a from_contiguity constructor, but i'm sure I heard Levi say we won't have such a thing in the new structure (and analogously we won't have from_kernel. So, while we need to "hook kernel constructors to the base class", that doesnt mean creating a from_kernel class method thats analogous to from_contiguity

Instead, the idea is that from_contiguity will become private, and we will create a larger class method that's from_dataframe? So like, currently, the kernel functions take an argument called coordinates which gets passed to a geometry validator. That means the input must be a geodataframe, not, say a distance matrix. So even though we want to allow distance=precomputed, that wont work because L109 is forcing an array of geoms

so if we want to create a kernel-based W, and we are treating input data structure as first class, then i need to do something like

W.from_dataframe(**kernel_args?), because we want to avoid W.from_kernel(**input_data_args?)? if that's the case, then the primary purpose of W.from_dataframe() is to dispatch to helper functions that generate the correct data structure for different types of weights? Then it will end up subsuming from_contiguity and to create a queen weight, i'd do something like W.from_dataframe(gdf, 'contiguity', queen=True)?, conversely if i needed a kernel from precomputed distances id do something like W.from_csc(distance_matrix, 'kernel', kernel_function='gaussian')?, whereas if i needed it to build on the fly with euclidean distances i'd dl W.from_dataframe(gdf, 'kernel', kernel_function='gaussian'). This all applies to triangulation as well.

The alternative would be to leave desired output as first class, which would mean leaving from_contiguity intact, then adding analogous from_kernel, from_triangulation, etc

knaaptime avatar Jul 20 '23 16:07 knaaptime

I did not recall we had implemented some of them already. That's my misunderstanding, so I was mistaken. Sorry!

If from_contiguity is already there, run with that and add them for now.

From a design angle, I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type? I hoped to avoid typing like this in favor of the construction functions outputting plain W objects, since we get into tricky inheritance consistency issues (like, is a kernel weighed voronoi an instance of rook, kernel, Delaunay, or all?). But, we may need to chat about this next sprint?

ljwolf avatar Jul 20 '23 16:07 ljwolf

From a design angle, I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type?

yeah, completely agree. You dont do from_kernel, because your input data isnt "a kernel".

so the issue is we want a generic W class, and it should be instantiated explicitly from different data structures. But there are also genuinely different subtypes of W, as they mean different things and require different computations to create them. Before, we had subclasses for each type of W, and they inhereited the same constructors for different input formats. I agree we want to avoid weird inheritance, and subclassing W more generally, but then the only way i can think to handle the combination of w_type * data_structure is to slightly overload the arguments for data-structure-based class methods, like

W.from_dataframe(gdf, 'contiguity', queen=True)

that gets messy because it needs to accept all the kwargs for constructing all subtypes of W from that data structure, but its conceptually strauightforward (and the converse of our previous design, where w-subtype was first-class, and it inhereted constructors for the shared inuput data types)

knaaptime avatar Jul 20 '23 17:07 knaaptime

or do we take a generic w from input W.from_dataframe(gdf)

and define a conversion to_kernel, to_voronoi, etc?

knaaptime avatar Jul 20 '23 17:07 knaaptime

Maybe “from_dataframe” should be the constructor for an arbitrary adjacency table?

Get Outlook for iOShttps://aka.ms/o0ukef


From: eli knaap @.> Sent: Thursday, July 20, 2023 6:12:03 PM To: pysal/libpysal @.> Cc: Levi John Wolf @.>; Mention @.> Subject: Re: [pysal/libpysal] Weights sprint planning (Issue #534)

This message could be from someone attempting to impersonate a member of UoB. Please do not share information with the sender without verifying their identity. If in doubt, please contact the IT Service Desk for advice. --

or do we take a generic w from input W.from_dataframe(gdf)

and define a conversion to_kernel, to_voronoi, etc?

— Reply to this email directly, view it on GitHubhttps://github.com/pysal/libpysal/issues/534#issuecomment-1644290716, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AARFR4YZZN42OQ5YHBEZKW3XRFRGHANCNFSM6AAAAAA2OCBVBU. You are receiving this because you were mentioned.Message ID: @.***>

ljwolf avatar Jul 20 '23 17:07 ljwolf

@ljwolf question on the signature of delaunay

Should the first arg be an array or a geoseries/gdf? https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_triangulation.py#L25-L36

as later on geoms is passed for validation:

https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_triangulation.py#L77-L79

Not sure if the docs were to be updated or the check requires some preprocessing as geoms is undefined as currently implemented. This made me think the first argument is intended to be something other than documented at present?

sjsrey avatar Jul 20 '23 22:07 sjsrey

I don't think having one from_dataframe method is a good idea. It would have a ton of conditional keywords and would be messy and hard to understand.

I think that constructors like from_triangulation could easily accept both GeoPandas objects and an array of coords.

or do we take a generic w from input W.from_dataframe(gdf)

and define a conversion to_kernel, to_voronoi, etc?

No. Once you create W you have the adjacency. From there, it shouldn't matter if that table was created based on contiguity, kernel or using manually encoded weights. You don't have any information that could be used to do conversion to voronoi at that point and you should not have imho.

I'm not sure I properly understand the problem though.

martinfleis avatar Jul 21 '23 06:07 martinfleis

@ljwolf question on the signature of delaunay

Should the first arg be an array or a geoseries/gdf?

The input should be able to be either an n,2 array of coordinates, a geoseries of POINT, or a numpy array with POINT shapely geometries. The input coordinates should be validated using:

https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_utils.py#L25-L29

and the output of that is used for the computation.

So, everywhere where we use _validate_geometry_input() in _triangulation, for example:

https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_triangulation.py#L77-L79

geoms should be coordinates

ljwolf avatar Jul 21 '23 10:07 ljwolf

I don't think having one from_dataframe method is a good idea. It would have a ton of conditional keywords and would be messy and hard to understand.

I think that constructors like from_triangulation could easily accept both GeoPandas objects and an array of coords.

or do we take a generic w from input W.from_dataframe(gdf) and define a conversion to_kernel, to_voronoi, etc?

No. Once you create W you have the adjacency. From there, it shouldn't matter if that table was created based on contiguity, kernel or using manually encoded weights. You don't have any information that could be used to do conversion to voronoi at that point and you should not have imho.

I'm not sure I properly understand the problem though.

My preferred option would be:

  1. the constructor functions (kernel(), vertex_set_intersection(), rook(), delaunay(), gabriel()) take GeoSeries or arrays of coordinates, and return plain W objects. So, calls look like
  • my_w = kernel(my_gdf.geometry, k=3)
  • my_w = delaunay(my_coordinates).
  1. the W class itself uses from_* to read/convert existing objects into a W structure. So, calls look like:
  • W.from_networkx(nx_graph)
  • W.from_file('./columbus.gal')
  • W.from_sparse(my_coo)
  • W.from_adjacency(my_df, focal_col='here', neighbor_col='there', weight_col='strength')
  • W.from_arrays(my_df.here, my_df.there, my_df.strength)

No W.from_dataframe(), so that there's no confusion about what "dataframe" means.

So, instead of

W.from_dataframe(gdf, 'contiguity', queen=True)

We document/recommend

w = weights.contiguity.rook(gdf)

ljwolf avatar Jul 21 '23 10:07 ljwolf

We document/recommend

w = weights.contiguity.rook(gdf)

Mmm, not sure I am fan of that. The solution with from_contiguity and similar is elegant in a sense that you always create W with some W.from* constructor and you don't have fish in the documentation and different sub modules how to create one.

What is the issue with this API design? I'm still not getting the original problem with that that needs solving.

martinfleis avatar Jul 21 '23 11:07 martinfleis

We document/recommend w = weights.contiguity.rook(gdf)

Mmm, not sure I am fan of that. The solution with from_contiguity and similar is elegant in a sense that you always create W with some W.from* constructor and you don't have fish in the documentation and different sub modules how to create one.

What about pulling them up into the higher-level namespace? weights.rook(gdf)/weights.kernel(gdf)?

What is the issue with this API design? I'm still not getting the original problem with that that needs solving.

From @ljwolf https://github.com/pysal/libpysal/issues/534#issuecomment-1644268455

I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type?

From @knaaptime https://github.com/pysal/libpysal/issues/534#issuecomment-1644278338

yeah, completely agree. You dont do from_kernel, because your input data isnt "a kernel".

ljwolf avatar Jul 21 '23 11:07 ljwolf

My preferred option would be:

  1. the constructor functions (kernel(), vertex_set_intersection(), rook(), delaunay(), gabriel()) take GeoSeries or arrays of coordinates, and return plain W objects. So, calls look like
  • my_w = kernel(my_gdf.geometry, k=3)
  • my_w = delaunay(my_coordinates).
  1. the W class itself uses from_* to read/convert existing objects into a W structure. So, calls look like:
  • W.from_networkx(nx_graph)
  • W.from_file('./columbus.gal')
  • W.from_sparse(my_coo)
  • W.from_adjacency(my_df, focal_col='here', neighbor_col='there', weight_col='strength')
  • W.from_arrays(my_df.here, my_df.there, my_df.strength)

No W.from_dataframe(), so that there's no confusion about what "dataframe" means.

So, instead of

W.from_dataframe(gdf, 'contiguity', queen=True)

We document/recommend

w = weights.contiguity.rook(gdf)

i get it. So we have constructor functions rather than classmethod constructors when you need to generate the encoding, and _from classmethods when you already have the encoding laid out in an existing data structure. Thats a change of pace, but it makes sense to me

What about pulling them up into the higher-level namespace? weights.rook(gdf)/weights.kernel(gdf)?

maybe this. I do think it would make things easier to do from libpysal import weights and be able to tab complete to see there's a rook function, etc. That makes it a little clearer rook is definitely a weights builder

knaaptime avatar Jul 21 '23 15:07 knaaptime

What is the issue with this API design? I'm still not getting the original problem with that that needs solving.

@martinfleis if we still need to "hook kernel (and triangulation) constructors up to the base class", what does that mean? whats a the intended signature for instantiating a kernel-based W from a dataframe? whats the signature for instantiating a kernel-based W from a precomputed distance matrix? (since we agree _from_kernel doesnt make sense)?

knaaptime avatar Jul 21 '23 16:07 knaaptime

kernel weight from a precomputed distance matrix

That should be

weights.kernel(D, metric="precomputed", kernel="gaussian")

ljwolf avatar Jul 21 '23 16:07 ljwolf

I originally assumed that it means from_kernel and from_triangulation methods. But if you don't think that it is a good idea let's figure out a better option.

I suggest we try to ensure we have the internal functions that create those adjacency tables finished first and the we can discuss the public API during a call on Friday during the next sprint? (Note that I have only my phone with me until Sunday so I haven't checked the actual state of the code)

martinfleis avatar Jul 21 '23 16:07 martinfleis

This is what I had in mind.

    @classmethod
    def from_kernel(
        cls,
        data,
        bandwidth=None,
        metric="euclidean",
        kernel="gaussian",
        k=None,
        p=2,
    ):
        """_summary_

        Parameters
        ----------
        data : numpy.ndarray, geopandas.GeoSeries, geopandas.GeoDataFrame
            geometries over which to compute a kernel. If a geopandas object with Point
            geoemtry is provided, the .geometry attribute is used. If a numpy.ndarray
            with shapely geoemtry is used, then the coordinates are extracted and used.
            If a numpy.ndarray of a shape (2,n) is used, it is assumed to contain x, y
            coordinates. If metric="precomputed", data is assumed to contain a
            precomputed distance metric.
        bandwidth : float (default: None)
            distance to use in the kernel computation. Should be on the same scale as
            the input coordinates.
        metric : string or callable (default: 'euclidean')
            distance function to apply over the input coordinates. Supported options
            depend on whether or not scikit-learn is installed. If so, then any
            distance function supported by scikit-learn is supported here. Otherwise,
            only euclidean, minkowski, and manhattan/cityblock distances are admitted.
        kernel : string or callable (default: 'gaussian')
            kernel function to apply over the distance matrix computed by `metric`.
            The following kernels are supported:
                - triangular:
                - parabolic:
                - gaussian:
                - bisquare:
                - cosine:
                - boxcar/discrete: all distances less than `bandwidth` are 1, and all
                    other distances are 0
                - identity/None : do nothing, weight similarity based on raw distance
                - callable : a user-defined function that takes the distance vector and
                    the bandwidth and returns the kernel: kernel(distances, bandwidth)
        k : int (default: None)
            number of nearest neighbors used to truncate the kernel. This is assumed
            to be constant across samples. If None, no truncation is conduted.
        ids : numpy.narray (default: None)
            ids to use for each sample in coordinates. Generally, construction functions
            that are accessed via W.from_kernel() will set this automatically from
            the index of the input. Do not use this argument directly unless you intend
            to set the indices separately from your input data. Otherwise, use
            data.set_index(ids) to ensure ordering is respected. If None, then the index
            from the input coordinates will be used.
        p : int (default: 2)
            parameter for minkowski metric, ignored if metric != "minkowski".

        Returns
        -------
        W
            libpysal.weights.experimental.W
        """

This in theory supports also DistanceBand and KNN but it may be worth exposing those as separate constructors for the sake of user friendliness. We can pipe those through this internally.

    @classmethod
    def from_triangulation(
        cls,
        data,
        method="delaunay",
        bandwidth=np.inf,
        kernel="boxcar",
        clip="extent",
        contiguity_type="rook",
    ):
        """_summary_

        Parameters
        ----------
        data : numpy.ndarray, geopandas.GeoSeries, geopandas.GeoDataFrame
            geometries containing locations to compute the
            delaunay triangulation. If a geopandas object with Point
            geoemtry is provided, the .geometry attribute is used. If a numpy.ndarray
            with shapely geoemtry is used, then the coordinates are extracted and used.
            If a numpy.ndarray of a shape (2,n) is used, it is assumed to contain x, y
            coordinates.
        method : str, (default "delaunay")
            method of extracting the weights from triangulation. Supports:
                - delaunay
                - gabriel
                - relative_neighborhood
                - voronoi
        bandwidth : _type_, optional
            distance to use in the kernel computation. Should be on the same scale as
            the input coordinates, by default numpy.inf
        kernel : str, optional
            kernel function to use in order to weight the output graph. See
            :meth:`W.from_kernel` for details. By default "boxcar"
        clip :str (default: 'bbox')
            An overloaded option about how to clip the voronoi cells passed to
            cg.voronoi_frames() when method="voronoi. Ignored otherwise.
            Default is ``'extent'``. Options are as follows.

            * ``'none'``/``None`` -- No clip is applied. Voronoi cells may be
                arbitrarily larger that the source map. Note that this may lead to
                cells that are many orders of magnitude larger in extent than the
                original map. Not recommended.
            * ``'bbox'``/``'extent'``/``'bounding box'`` -- Clip the voronoi cells to
                the bounding box of the input points.
            * ``'chull``/``'convex hull'`` -- Clip the voronoi cells to the convex hull
                of the input points.
            * ``'ashape'``/``'ahull'`` -- Clip the voronoi cells to the tightest hull
                that contains all points (e.g. the smallest alphashape, using
                ``libpysal.cg.alpha_shape_auto``).
            * Polygon -- Clip to an arbitrary Polygon.
        contiguity_type : str, optional
            What kind of contiguity to apply to the voronoi diagram when ethod="voronoi.
            Ignored otherwise. Supports "rook" and "queen", by default "rook".

        Returns
        -------
        W
            libpysal.weights.experimental.W
        """

@ljwolf I think that for voronoi, we don't need to expose rook and queen constructors and can go with the vertex method only because the diagram is always topologically precise.

Was there any discussion on these last time that concluded that this is not an optimal API?

martinfleis avatar Jul 26 '23 09:07 martinfleis

Was there any discussion on these last time that concluded that this is not an optimal API?

yeah, lets discuss the tradeoffs on friday. Given the initial layout in the experimental branch, what you sketched above was my original understanding too. But I think we decided the from_* nomenclature is a bit misleading. When you do from_kernel above, you pass the function a geodataframe, not a kernel. The same with triangulation; the first argument is either a geometry object or a list of coordinates, then an intermediate kernel/triangulation is constructed immediately before giving you a weights object. So it's not really weights object from_kernel, it's kernel from_dataframe to weights object.

I can see arguments both ways. It feels natural to do .from_kernel in line with from_contiguity, but the semantics of that aren't actually accurate

knaaptime avatar Jul 26 '23 22:07 knaaptime

Got it. Let's pick it up on Friday. We may keep the signature and just change the names to something more fitting eventually.

martinfleis avatar Jul 26 '23 22:07 martinfleis

So, on the sprint call, we considered two following APIs:

all constructors "live" on the W

In this example, the core design principle is:

W is self-contained. Most of what you need to do can be done only with the W object, not requiring the weights namespace.

The W.build_...() idiom constructs a W under a known graph construction rule, such as:

W.build_rook(geodataframe.geometry) # calls weights._contiguity.rook(geodataframe.geometry)
W.build_kernel(coordinate_array) # calls weights._kernel.kernel(coordinate_array)

The W.from_...() idiom constructs a W from a specific datastructure

W.from_file("columbus.gal")
W.from_arrays(head, tail, weight)
W.from_sparse(my_coo_array)

To make good on this design decision, we should also implement a .higher_order(), a .lag(), etc. This is a pandas-style API, rather than expecting users to switch between functions in weights and methods on W. This allows for method-chaining, such as:

W.build_kernel(geoseries).transform('r').lag(geoseries.home_value)

all constructors "live" in a non-object namespace

In this example, the core design principle is

W is only a results container. Functions in weights operate on top of W objects and generally construct or mutate them.

We would have to implement a construct namespace, so that:

weights.construct.contiguity(geodataframe.geometry) # calls weights._contiguity.rook(geodataframe.geometry) by default
# or the constructor actually lives at `weights.construct.contiguity`
weights.construct.kernel(coordinate_array) # calls weights._kernel.kernel(coordinate_array)

An ingest (convert?) namespace takes a dataframe and "ingests" it to a W:

weights.ingest.file("columbus.gal")
weights.ingest.arrays(head, tail, weight)
weights.ingest.sparse(my_coo_array)

and any of our functions in weights act on W objects to build (or mutate) W objects. This is a numpy-style API.

@knaaptime @martinfleis @ljwolf @rongboxu agree on the method strategy, option 1.

ljwolf avatar Jul 28 '23 14:07 ljwolf

On the coincident points issue identified in Delaunay, we need to be consistent about how the new implementation deals with coincident points. Naively, Qhull implements a bunch of options. but we need to define a consistent way to deal with coincident points in KNN, too.

ljwolf avatar Jul 28 '23 14:07 ljwolf

For handling coincident points, we discussed the following options:

jitter input upon coincidence

randomly permute/jitter the input points if there is a coincident point

construct a clique

reduce the dataset to unique locations, construct the weight, and then introduce a "clique" into the graph connecting all coincident points together and also to other neighbours according to the construction rule.

raise every time that there is a coincident point

detect that there is a coincident point and raise an error if that is the case.

@knaaptime @sjsrey @martinfleis @TaylorOshan @rongboxu @ljwolf agree that the implementation should generally follow the "raise" strategy.

All weights constructors should take a "coincident" argument that can take three values:

  • "raise" Default, raise an error when the coincident points would create an invalid output. This affects all Delaunay/Voronoi-type weights, and KNN when the number of duplicated points is greater than the requested k.
  • "clique" Compute the weights on unique locations in the data. Induce a "clique" into the graph, connecting all coincident points to one another and to the other values for that location.
  • "jitter" Add a very small random displacement (should use a fraction of the smallest pairwise distance) to all coordinates.

ljwolf avatar Jul 28 '23 15:07 ljwolf

@sjsrey @knaaptime @ljwolf @martinfleis agreed to rename W to Graph and move the namespace.

sjsrey avatar Jul 28 '23 23:07 sjsrey

I've just realised: for islands, the nonzero property will count them as a nonzero, since self.sparse.nnz counts "explicit" zeros as nonzero. This might throw off calculations in join counts, since a self-join is always going to contribute to positive autocorrelation (other cases, like Getis-Ord and Moran, we need it)...

We might want to think about using n_edges/n_nodes and then define nonzero (number of non-self loops) as self.n_edges - len(self.isolates)

ljwolf avatar Aug 11 '23 15:08 ljwolf