Providing Data for Plots and Tables¶
No data visualization is possible without the underlying data to be represented.
In this section, the various ways of providing data for plots are explained, from
passing data values directly to creating a ColumnDataSource and filtering using
a CDSView.
Providing data directly¶
In Bokeh, it is possible to pass lists of values directly into plotting functions.
In the example below, the data, x_values and y_values, are passed directly
to the circle plotting method (see Plotting with Basic Glyphs for more examples).
from bokeh.plotting import figure
x_values = [1, 2, 3, 4, 5]
y_values = [6, 7, 2, 3, 6]
p = figure()
p.circle(x=x_values, y=y_values)
When you pass in data like this, Bokeh works behind the scenes to make a
ColumnDataSource for you. But learning to create and use the ColumnDataSource
will enable you to access more advanced capabilities, such as streaming data,
sharing data between plots, and filtering data.
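As a quick check of the behavior described above, the following sketch inspects the renderer returned by a glyph method. It assumes that the renderer exposes the generated source as data_source and that the generated columns are named after the glyph parameters (x and y). It uses the line method, but the same should hold for circle. The name auto_source is illustrative only.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

x_values = [1, 2, 3, 4, 5]
y_values = [6, 7, 2, 3, 6]

p = figure()
renderer = p.line(x=x_values, y=y_values)

# The renderer keeps a reference to the data source built from the lists.
auto_source = renderer.data_source
print(type(auto_source).__name__)
print(sorted(auto_source.data.keys()))
```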
ColumnDataSource¶
The ColumnDataSource is the core of most Bokeh plots, providing the data
that is visualized by the glyphs of the plot. With the ColumnDataSource,
it is easy to share data between multiple plots and widgets, such as the
DataTable. When the same ColumnDataSource is used to drive multiple
renderers, selections of the data source are also shared. Thus it is possible
to use a select tool to choose data points from one plot and have them automatically
highlighted in a second plot (Linked selection).
At the most basic level, a ColumnDataSource is simply a mapping between column
names and lists of data. The ColumnDataSource takes a data parameter which is a dict,
with string column names as keys and lists (or arrays) of data values as values. If one positional
argument is passed in to the ColumnDataSource initializer, it will be taken as data. Once the
ColumnDataSource has been created, it can be passed into the source parameter of
plotting methods, which allows you to pass a column’s name as a stand-in for the data values:
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
data = {'x_values': [1, 2, 3, 4, 5],
'y_values': [6, 7, 2, 3, 6]}
source = ColumnDataSource(data=data)
p = figure()
p.circle(x='x_values', y='y_values', source=source)
The data parameter can also be a Pandas DataFrame or GroupBy object.
source = ColumnDataSource(df)
If a DataFrame is used, the CDS will have columns corresponding to the columns of
the DataFrame. If the DataFrame has a named index column, then the CDS will also have
a column with this name. However, if the index name (or any subname of a MultiIndex)
is None, then the CDS will have a column generically named index for the index.
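The index handling can be checked with a small sketch. The DataFrame contents and the index name car_id are made up for illustration, and the sketch assumes Pandas is installed alongside Bokeh.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

df = pd.DataFrame({'mpg': [18.0, 26.0, 31.0]})

# Unnamed index: the CDS should get a generic 'index' column.
unnamed = ColumnDataSource(df)

# Named index: the CDS column should take the index name instead.
named = ColumnDataSource(df.rename_axis('car_id'))

print(sorted(unnamed.data.keys()))
print(sorted(named.data.keys()))
```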
group = df.groupby(['colA', 'ColB'])
source = ColumnDataSource(group)
If a GroupBy object is used, the CDS will have columns corresponding to the result of
calling group.describe(). The describe method generates columns for statistical measures
such as mean and count for all the non-grouped original columns. The CDS columns are
formed by joining original column names with the computed measure. For example, if a
DataFrame has columns 'year' and 'mpg', then passing df.groupby('year')
to a CDS will result in columns such as 'mpg_mean'.
Note this capability to adapt GroupBy objects may only work with Pandas >=0.20.0.
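A minimal sketch of the 'year' and 'mpg' case follows. The data values are invented for illustration, and the sketch assumes a Pandas version recent enough for the GroupBy adaptation to work.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

df = pd.DataFrame({
    'year': [70, 70, 71, 71],
    'mpg': [18.0, 22.0, 25.0, 27.0],
})

# Each non-grouped column is expanded into one column per describe() measure.
source = ColumnDataSource(df.groupby('year'))

print(sorted(source.data.keys()))
print(list(source.data['mpg_mean']))
```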
Note
There is an implicit assumption that all the columns in a given ColumnDataSource
have the same length at all times. For this reason, it is usually preferable to
update the .data property of a data source “all at once”.
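One way to follow this advice is to build the complete replacement dict first and then assign it in a single step, rather than changing one column at a time. A minimal sketch with illustrative column names and values:

```python
from bokeh.models import ColumnDataSource

source = ColumnDataSource(data=dict(x=[1, 2, 3], y=[4, 5, 6]))

# Build every column of the replacement first, then assign once, so the
# columns never have mismatched lengths, even briefly.
new_data = dict(x=[1, 2, 3, 4], y=[4, 5, 6, 7])
source.data = new_data

lengths = {name: len(column) for name, column in source.data.items()}
print(lengths)
```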
Streaming¶
ColumnDataSource streaming is an efficient way to append new data to a CDS. By using the
stream method, Bokeh only sends new data to the browser instead of the entire dataset.
The stream method takes a new_data parameter containing a dict mapping column names
to sequences of data to be appended to the respective columns. It additionally takes an optional
argument rollover, which is the maximum length of data to keep (data from the beginning of the
column will be discarded). The default rollover value of None allows data to grow unbounded.
source = ColumnDataSource(data=dict(foo=[], bar=[]))
# has new, identical-length updates for all columns in source
new_data = {
'foo' : [10, 20],
'bar' : [100, 200],
}
source.stream(new_data)
For an example that uses streaming, see examples/app/ohlc.
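The effect of rollover can be sketched on a source that is not attached to any document. The column values are illustrative, and the sketch assumes that stream updates the Python-side data in this standalone setting.

```python
from bokeh.models import ColumnDataSource

source = ColumnDataSource(data=dict(foo=[1, 2], bar=[10, 20]))

# Append three rows but keep at most the three most recent rows per column.
source.stream(dict(foo=[3, 4, 5], bar=[30, 40, 50]), rollover=3)

print(list(source.data['foo']))
print(list(source.data['bar']))
```

Here the two original rows are discarded from the beginning of each column because the combined length of five exceeds the rollover of three.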
Patching¶
ColumnDataSource patching is an efficient way to update slices of a data source. By using the
patch method, Bokeh only needs to send new data to the browser instead of the entire dataset.
The patch method should be passed a dict mapping column names to lists of tuples that represent
a patch change to apply.
The tuples that describe patch changes are of the form:
(index, new_value) # replace a single column value
# or
(slice, new_values) # replace several column values
For a full example, see examples/howto/patch_app.py.
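Both tuple forms can be combined in a single call. The following is a minimal sketch with illustrative columns and values, applied to a source that is not attached to a document:

```python
from bokeh.models import ColumnDataSource

source = ColumnDataSource(data=dict(foo=[10, 20, 30, 40], bar=[1, 2, 3, 4]))

source.patch({
    # replace foo[0], then replace foo[2:4] with two new values
    'foo': [(0, 11), (slice(2, 4), [31, 41])],
    # replace bar[1]
    'bar': [(1, 22)],
})

print(list(source.data['foo']))
print(list(source.data['bar']))
```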
Filtering data with CDSView¶
It’s often desirable to focus in on a portion of data that has been subsampled or filtered from a larger dataset. Bokeh allows you to specify a view of a data source that represents a subset of data. By having a view of the data source, the underlying data doesn’t need to be changed and can be shared across plots. The view consists of one or more filters that select the rows of the data source that should be bound to a specific glyph.
To plot with a subset of data, you can create a CDSView and pass it in as a view
argument to the renderer-adding methods on the Figure, such as figure.circle. The
CDSView has two properties, source and filters. source is the ColumnDataSource
that the view is associated with. filters is a list of Filter objects, listed and
described below.
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, CDSView
source = ColumnDataSource(some_data)
view = CDSView(source=source, filters=[filter1, filter2])
p = figure()
p.circle(x="x", y="y", source=source, view=view)
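To build intuition for how a list of filters narrows down the rows, here is a plain-Python sketch. It assumes that the filters in a view are combined by intersection, so that a row is shown only if every filter keeps it. The helper functions are hypothetical and are not part of Bokeh; this illustrates the idea and is not Bokeh's implementation.

```python
def index_filter_rows(indices, n_rows):
    """Rows kept by an index-style filter (hypothetical helper)."""
    return {i for i in indices if 0 <= i < n_rows}


def boolean_filter_rows(booleans):
    """Rows kept by a boolean-style filter (hypothetical helper)."""
    return {i for i, keep in enumerate(booleans) if keep}


def view_rows(row_sets, n_rows):
    """Intersect the rows kept by each filter, starting from all rows."""
    rows = set(range(n_rows))
    for kept in row_sets:
        rows &= kept
    return sorted(rows)


y = [1, 2, 3, 4, 5]
by_index = index_filter_rows([0, 2, 4], len(y))
by_value = boolean_filter_rows([value > 2 for value in y])

visible = view_rows([by_index, by_value], len(y))
print(visible)  # [2, 4]
```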
IndexFilter¶
The IndexFilter is the simplest filter type. It has an indices property which is a
list of integers that are the indices of the data you want to be included in the plot.
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource, CDSView, IndexFilter
from bokeh.layouts import gridplot
output_file("index_filter.html")
source = ColumnDataSource(data=dict(x=[1, 2, 3, 4, 5], y=[1, 2, 3, 4, 5]))
view = CDSView(source=source, filters=[IndexFilter([0, 2, 4])])
tools = ["box_select", "hover", "reset"]
p = figure(plot_height=300, plot_width=300, tools=tools)
p.circle(x="x", y="y", size=10, hover_color="red", source=source)
p_filtered = figure(plot_height=300, plot_width=300, tools=tools)
p_filtered.circle(x="x", y="y", size=10, hover_color="red", source=source, view=view)
show(gridplot([[p, p_filtered]]))
BooleanFilter¶
A BooleanFilter selects rows from a data source through a list of True or False values
in its booleans property.
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource, CDSView, BooleanFilter
from bokeh.layouts import gridplot
output_file("boolean_filter.html")
source = ColumnDataSource(data=dict(x=[1, 2, 3, 4, 5], y=[1, 2, 3, 4, 5]))
booleans = [y_val > 2 for y_val in source.data['y']]
view = CDSView(source=source, filters=[BooleanFilter(booleans)])
tools = ["box_select", "hover", "reset"]
p = figure(plot_height=300, plot_width=300, tools=tools)
p.circle(x="x", y="y", size=10, hover_color="red", source=source)
p_filtered = figure(plot_height=300, plot_width=300, tools=tools,
x_range=p.x_range, y_range=p.y_range)
p_filtered.circle(x="x", y="y", size=10, hover_color="red", source=source, view=view)
show(gridplot([[p, p_filtered]]))
GroupFilter¶
The GroupFilter allows you to select rows from a dataset that have a specific value for
a categorical variable. The GroupFilter has two properties: column_name, the name of
a column in the ColumnDataSource, and group, the value of the column to select for.
In the example below, flowers contains a categorical variable species which is
either setosa, versicolor, or virginica.
from bokeh.plotting import figure, output_file, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource, CDSView, GroupFilter
from bokeh.sampledata.iris import flowers
output_file("group_filter.html")
source = ColumnDataSource(flowers)
view1 = CDSView(source=source, filters=[GroupFilter(column_name='species', group='versicolor')])
plot_size_and_tools = {'plot_height': 300, 'plot_width': 300,
'tools':['box_select', 'reset', 'help']}
p1 = figure(title="Full data set", **plot_size_and_tools)
p1.circle(x='petal_length', y='petal_width', source=source, color='black')
p2 = figure(title="Versicolor only", x_range=p1.x_range, y_range=p1.y_range, **plot_size_and_tools)
p2.circle(x='petal_length', y='petal_width', source=source, view=view1, color='red')
show(gridplot([[p1, p2]]))
CustomJSFilter¶
You can also create a CustomJSFilter with your own functionality. To do this, use JavaScript
or CoffeeScript to write code that returns either a list of indices or a list of
booleans that represents the filtered subset. The ColumnDataSource that is associated
with the CDSView this filter is added to will be available at render time with the
variable source.
Javascript¶
To create a CustomJSFilter with custom functionality written in JavaScript,
pass in the JavaScript code as a string to the parameter code:
from bokeh.models import CustomJSFilter
custom_filter = CustomJSFilter(code='''
var booleans = [];
// iterate through rows of data source and see if each satisfies some constraint
for (var i = 0; i < source.get_length(); i++){
    if (source.data['some_column'][i] == 'some_value'){
        booleans.push(true);
    } else {
        booleans.push(false);
    }
}
return booleans;
''')
Coffeescript¶
You can also write code for the CustomJSFilter in CoffeeScript, and
use the from_coffeescript class method, which accepts the code parameter:
custom_filter_coffee = CustomJSFilter.from_coffeescript(code='''
z = source.data['z']
indices = (i for i in [0...source.get_length()] when z[i] == 'b')
return indices
''')
AjaxDataSource¶
Bokeh server applications make it simple to update and stream data to data
sources, but sometimes it is desirable to have similar functionality in
standalone documents. The AjaxDataSource provides this capability.
The AjaxDataSource is configured with a URL to a REST endpoint and a
polling interval. In the browser, the data source will request data from the
endpoint at the specified interval and update the data locally. Existing
data may either be replaced entirely, or appended to (up to a configurable
max_size). The endpoint that is supplied should return a JSON dict that
matches the standard ColumnDataSource format:
{
    "x" : [1, 2, 3, ...],
    "y" : [9, 3, 2, ...]
}
Otherwise, using an AjaxDataSource is identical to using a standard
ColumnDataSource:
from bokeh.models import AjaxDataSource
from bokeh.plotting import figure
source = AjaxDataSource(data_url='http://some.api.com/data',
                        polling_interval=100)
# Use just like a ColumnDataSource
p = figure()
p.circle('x', 'y', source=source)
A full example can be seen at examples/howto/ajax_source.py.
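For orientation, here is a sketch of an endpoint that returns data in the expected shape, written with only the Python standard library. It is not the code from examples/howto/ajax_source.py. The handler, the make_payload helper, the port, and the generated values are all assumptions made for illustration. The sketch answers both GET and POST and sends permissive cross-origin headers, which a real deployment would likely want to restrict.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_payload(step):
    """Return a JSON object with equal-length 'x' and 'y' columns."""
    xs = [step, step + 1, step + 2]
    ys = [x * x for x in xs]
    return json.dumps({'x': xs, 'y': ys})


class DataHandler(BaseHTTPRequestHandler):
    step = 0  # advanced on every request so each poll sees new values

    def _send_data(self):
        body = make_payload(DataHandler.step).encode('utf-8')
        DataHandler.step += 1
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        # Let pages served from another origin read the response.
        self.send_header('Access-Control-Allow-Origin', '*')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    do_GET = _send_data
    do_POST = _send_data

    def do_OPTIONS(self):
        # Answer cross-origin preflight requests.
        self.send_response(200)
        self.send_header('Access-Control-Allow-Origin', '*')
        self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
        self.send_header('Access-Control-Allow-Headers', 'Content-Type')
        self.end_headers()


def serve(port=5050):
    """Run the endpoint until interrupted (hypothetical port)."""
    HTTPServer(('localhost', port), DataHandler).serve_forever()
```

Calling serve() starts the endpoint. An AjaxDataSource could then be pointed at it through data_url, for instance 'http://localhost:5050/' under the assumptions above.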
Linked selection¶
Using the same ColumnDataSource in the two plots below allows their selections to be
shared.
from bokeh.io import output_file, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
output_file("brushing.html")
x = list(range(-20, 21))
y0 = [abs(xx) for xx in x]
y1 = [xx**2 for xx in x]
# create a column data source for the plots to share
source = ColumnDataSource(data=dict(x=x, y0=y0, y1=y1))
TOOLS = "box_select,lasso_select,help"
# create a new plot and add a renderer
left = figure(tools=TOOLS, plot_width=300, plot_height=300, title=None)
left.circle('x', 'y0', source=source)
# create another new plot and add a renderer
right = figure(tools=TOOLS, plot_width=300, plot_height=300, title=None)
right.circle('x', 'y1', source=source)
p = gridplot([[left, right]])
show(p)
Linked selection with filtered data¶
With the ability to specify a subset of data to be used for each glyph renderer, it is
easy to share data between plots even when the plots use different subsets of data.
By using the same ColumnDataSource, selections and hovered inspections of that data source
are automatically shared.
In the example below, a CDSView is created for the second plot that specifies the subset
of data in which the y values are either greater than 250 or less than 100. Selections in either
plot are automatically reflected in the other. And hovering on a point in one plot will highlight
the corresponding point in the other plot if it exists.
from bokeh.plotting import figure, output_file, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource, CDSView, BooleanFilter
output_file("linked_selection_subsets.html")
x = list(range(-20, 21))
y0 = [abs(xx) for xx in x]
y1 = [xx**2 for xx in x]
# create a column data source for the plots to share
source = ColumnDataSource(data=dict(x=x, y0=y0, y1=y1))
# create a view of the source for one plot to use
view = CDSView(source=source, filters=[BooleanFilter([y > 250 or y < 100 for y in y1])])
TOOLS = "box_select,lasso_select,hover,help"
# create a new plot and add a renderer
left = figure(tools=TOOLS, plot_width=300, plot_height=300, title=None)
left.circle('x', 'y0', size=10, hover_color="firebrick", source=source)
# create another new plot, add a renderer that uses the view of the data source
right = figure(tools=TOOLS, plot_width=300, plot_height=300, title=None)
right.circle('x', 'y1', size=10, hover_color="firebrick", source=source, view=view)
p = gridplot([[left, right]])
show(p)
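To see which rows the boolean condition in the example above keeps, the same list comprehension can be evaluated in plain Python without any plotting. The names kept_x and hidden_x are illustrative.

```python
x = list(range(-20, 21))
y1 = [xx**2 for xx in x]

# Same condition as the filter: keep rows whose y1 is above 250 or below 100.
booleans = [y > 250 or y < 100 for y in y1]

kept_x = [xx for xx, keep in zip(x, booleans) if keep]
hidden_x = [xx for xx, keep in zip(x, booleans) if not keep]

print(len(kept_x), len(hidden_x))  # 29 12
print(hidden_x)
```

The hidden rows are those with x between 10 and 15 in absolute value, so points at those positions in the left plot have no counterpart to highlight in the right plot.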
Other Data Types¶
Bokeh also has the capability to render network graph data and geographical data. For more information about how to set up the data for these types of plots, see Visualizing Network Graphs and Mapping Geo Data.