Example of plotting points with associated probabilities
Currently, datashader's scatterplot/heatmap approach for points data partitions the set of points, allocating each one into non-overlapping pixel-shaped bins. Some types of data come with associated probabilities, such as a known measurement error bound or an estimated uncertainty per point.
It would be good to have an example of how to aggregate such data, such that the value of each datapoint is assigned to multiple bins in the aggregate array, according to some kernel function (e.g. a 2D Gaussian, where errors are specified as stddevs).
For the special case of a square error kernel, this approach is equivalent to implementing support for raster data (see #86), where each raster datapoint represents a specified area of the X,Y plane with equal probability or weighting within that square.
We'll need a suitable dataset of this type, preferably one with widely varying error estimates across the datapoints, such that some points have tight bounds and others are less constrained.
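To make the proposed aggregation concrete, here is a minimal numpy sketch (purely illustrative, not datashader code; `gaussian_aggregate` and its parameters are hypothetical, and errors are assumed to be isotropic stddevs in data units, all greater than zero):

```python
import numpy as np

def gaussian_aggregate(xs, ys, vals, sigmas, width, height, x_range, y_range):
    """Spread each point's value over the aggregate with a 2D Gaussian kernel."""
    agg = np.zeros((height, width))
    xmin, xmax = x_range
    ymin, ymax = y_range
    xscale = width / (xmax - xmin)    # pixels per data unit
    yscale = height / (ymax - ymin)
    for x, y, v, s in zip(xs, ys, vals, sigmas):
        # Center and radius of this point's footprint, in pixel coordinates
        cx, cy = (x - xmin) * xscale, (y - ymin) * yscale
        rx, ry = 3 * s * xscale, 3 * s * yscale   # truncate at 3 sigma
        x0, x1 = max(int(cx - rx), 0), min(int(cx + rx) + 1, width)
        y0, y1 = max(int(cy - ry), 0), min(int(cy + ry) + 1, height)
        if x0 >= x1 or y0 >= y1:
            continue                  # footprint falls entirely off-canvas
        px, py = np.meshgrid(np.arange(x0, x1), np.arange(y0, y1))
        kernel = np.exp(-(((px - cx) / (s * xscale)) ** 2 +
                          ((py - cy) / (s * yscale)) ** 2) / 2)
        kernel /= kernel.sum()        # each point contributes its value once
                                      # (ignoring edge clipping)
        agg[y0:y1, x0:x1] += v * kernel
    return agg
```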
Thank you, @jbednar. Two questions.
First: will this feature help to crossplot data like this?
```
X  Y  VAL
1  1  0.2
2  1  0.3
...
1  2  0.3
2  2  0.4
...
5  5  1.0
```
Where for each pair (X, Y) there is a unique value VAL, and the result is a scatter plot of these points colored by some mapping of VAL to RGB?
Basically the equivalent of:
```python
df.plot(kind='scatter', x='X', y='Y', c='VAL', s=50);
```
Second: is there (or will there be) any way to define the size of the points in datashader?
Thanks!
We're working on making point sizing more flexible and automatic, and on properly documenting how to do it, but in the meantime you can apply the tf.spread function to your final image, as shown in this notebook: https://gist.github.com/jcrist/62b366727886561356d8
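For a quick illustration, here is a minimal sketch of tf.spread using the current datashader API (tf.shade has since replaced tf.interpolate, as noted later in this thread); the DataFrame is made-up example data:

```python
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# Made-up example data, just to have something to shade
df = pd.DataFrame({'X': np.random.rand(1000), 'Y': np.random.rand(1000)})
cvs = ds.Canvas(plot_width=400, plot_height=400)
img = tf.shade(cvs.points(df, 'X', 'Y'))        # single-pixel points
img_big = tf.spread(img, px=2, shape='circle')  # grow each point by 2 pixels
```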
The code is already available for the application you describe above; just pass the field you want to the appropriate aggregation function:
```python
cvs = ds.Canvas(plot_width=800, plot_height=500, x_range=x_range, y_range=y_range)
agg = cvs.points(df, 'X', 'Y', ds.mean('VAL'))
img = tf.interpolate(agg, low="white", high='darkblue', how='linear')
```
where mean tells datashader that you want to average the VAL of all points falling into each pixel; you could instead take the max, median, etc.
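For instance, swapping in a different reduction is a one-line change (a sketch reusing the cvs and df names from the snippet above; ds.max and ds.count are standard datashader reductions):

```python
agg_max = cvs.points(df, 'X', 'Y', ds.max('VAL'))  # largest VAL per pixel
agg_cnt = cvs.points(df, 'X', 'Y', ds.count())     # plain point density
```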
Thanks, @jbednar. I was able to colorize my plot - thanks for the example! It was quite easy, and my understanding of datashader is more solid now.
But it looks like tf.spread is not available (in the version from conda -c conda) - I guess I need to use the GitHub version instead...
Oops, yes -- spread requires the GitHub master version.
Thanks,
When I run
```python
import datashader as ds
```
I get this error:
```
OSError: [Errno 13] Permission denied: '/opt/dist/anaconda/lib/python2.7/site-packages/datashader-0.1.0-py2.7.egg/datashader/__pycache__'
```
DatashaderImportError.txt
The reason is that I installed this package as system admin, but I run it as my regular user. Is there any way to prevent file creation like that in your library? Or at least isolate it so that one user does not affect another?
The version from conda -c conda never had this problem.
For now, I just gave rwx permissions for all users on the datashader directory, and it seems to work. Other than that, all the features are perfect! Thank you!
P.S. I'm curious whether, by design of the spread API, shape + px = mask. If so, why not just generalize the shape parameter to accept numpy masks and ignore px in that case... Or even better - somehow scale the mask based on px... but I'm just curious - not demanding anything here :-)
I don't think that issues with __pycache__ would be due to datashader per se, as we don't access that directly ourselves (though it looks like the separate numba library that we use does access it). So I'd assume that there's a different way to install it that would avoid permission errors, but I don't know how you originally installed it, and thus what change to suggest.
For the shape, we often want to specify a circular mask at different radius values, which the px argument makes easy to do; it would be painful to make a new mask for every px value we wanted to try. Yes, scaling the mask based on the px value would be handy, but there are lots of ways to scale matrices, so we'd rather leave that up to the user, based on any of the many libraries available for it.
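For what it's worth, current datashader releases do accept an explicit mask argument to tf.spread (a square 2D array with odd side length, which overrides shape and px); a minimal sketch, assuming an img produced by one of the snippets above:

```python
import numpy as np
import datashader.transfer_functions as tf

# Hypothetical plus-shaped footprint; `img` is assumed to be a shaded image
plus = np.array([[0, 1, 0],
                 [1, 1, 1],
                 [0, 1, 0]], dtype=bool)
img_plus = tf.spread(img, mask=plus)  # mask overrides the shape/px defaults
```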
> The reason is that I installed this package as system admin, but I run it as my regular user. Is there any way to prevent file creation like that in your library? Or at least isolate it so that one user does not affect another?
We started caching code compilation in numba, which writes a cache file on first import. I've filed an issue, see https://github.com/numba/numba/issues/1771.
For now, try running python -c "import datashader" with admin privileges after install. This should cause the compilation to happen once, while you have permission to write those files. Subsequent imports should only read the cache, which should be fine.
That all makes sense! Thank you for the ticket at numba - I'll watch it.
In the comment above, tf.interpolate is deprecated. The new code would be:
```python
cvs = ds.Canvas(plot_width=800, plot_height=500, x_range=x_range, y_range=y_range)
agg = cvs.points(df, 'X', 'Y', ds.mean('VAL'))
img = tf.shade(agg, cmap=["white", 'darkblue'], how='linear')
```
Hi! I have been trying to use this method for plotting data points with associated probabilities/weights, but bumped into something I do not understand. If I pass all-zero values in the column used as the weighting factor, I expect the image to become empty, yet it does not! Is it a bug, or am I misunderstanding something?
Below is minimal code to reproduce it with datashader 0.13.0:
```python
import datashader as ds
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

num_datapoints = 1000
xs = 200 * np.random.rand(num_datapoints)
ys = 200 * np.random.rand(num_datapoints)
weights = np.random.rand(num_datapoints)

# Uncommenting the line below should probably
# result in a black image, yet it doesn't?
# weights = np.zeros((num_datapoints,))

df = pd.DataFrame(np.array([xs, ys, weights]).T, columns=['x', 'y', 'weight'])
cvs = ds.Canvas(plot_width=200, plot_height=200, x_range=(0, 200), y_range=(0, 200))
agg = cvs.points(df, 'x', 'y', ds.sum('weight'))
img = ds.tf.shade(agg, cmap='white')
plt.imshow(img, origin='lower', cmap='gray')
plt.show()
```
And below is what I see if I uncomment the line that sets all the weights to zero.
In my other work, the outputs of cvs.points(df, 'x', 'y', ds.sum('weight')) and a Matplotlib scatter plot with the weights used as colors or sizes look very different at the moment, so maybe I'm misunderstanding how it is supposed to work in Datashader. I assume using the ds.sum('weight') aggregator would make the brightness of each bin/pixel equal to the sum of the weights of the data points that land in that bin.
@naavis If you look at the contents of agg when you are using your zero weights, you will see that it contains two values, 0 and np.nan. Zeros correspond to where you have data points with a weight of zero, np.nan to where there are no data points. If there is only a single finite data value in agg, it is mapped to the top end of the cmap, hence white.
Secondly, your combination of ds.tf.shade() and plt.imshow() is almost certainly not doing what you want. ds.tf.shade() outputs a 200x200 array containing RGBA values encoded into uint32, and if you pass an MxN array to imshow it will treat it as scalar data and apply a colormap. Hence you are applying a colormap twice. For debugging purposes I recommend replacing your matplotlib code with a call to ds.utils.export_image(), and it should all be easier to understand.
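Concretely, that suggestion looks something like this (a sketch assuming img from the reproduction code above; export_image lives in datashader.utils, and the filename is arbitrary):

```python
from datashader.utils import export_image

# Writes weighted_points.png to the current directory (and returns the image)
export_image(img, "weighted_points")
```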
Anyway, this is really a usage question and should have been posted to https://discourse.holoviz.org/ rather than appended to a 6-year-old GitHub issue. If you have further questions about this, please ask on the Discourse instead. Thanks!
Thanks, and sorry. This GitHub issue was the only place I found that mentions using per-point weights/probabilities with Datashader. The documentation isn't exactly abundant on this: https://datashader.org/user_guide/Points.html https://datashader.org/api.html#definitions
I was not aware of the Discourse page. I'll post any further thoughts there.