Handling Categorical Data¶
Data Representation¶
Pandas Integrations¶
Note
Several examples in this chapter use Pandas, for ease of presentation
and because it is a common tool for data manipulation. However, Pandas
is not required to create anything shown here.
Range Padding¶
Bars¶
Basic¶
Bokeh makes it simple to create basic bar charts using the
hbar()
and
vbar()
glyph methods. In the example
below, we have the following sequence of simple 1-level factors:
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
To inform Bokeh that the x-axis is categorical, we pass this list of factors
as the x_range
argument to figure():
p = figure(x_range=fruits, ... )
Note that passing the list of factors is a convenient shorthand notation for
creating a FactorRange
. The equivalent explicit
notation is:
p = figure(x_range=FactorRange(factors=fruits), ... )
This more explicit form is useful when you want to customize the
FactorRange
, e.g. by changing the
Range Padding.
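As a minimal sketch of the two notations side by side, the snippet below builds one figure from the shorthand list and one from an explicit FactorRange. The names p_short and p_explicit and the padding value 0.1 are illustrative assumptions, not part of the original example:

```python
from bokeh.models import FactorRange
from bokeh.plotting import figure

fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']

# Shorthand: a plain list of factors is turned into a FactorRange for us.
p_short = figure(x_range=fruits)

# Explicit: build the FactorRange ourselves so extra properties can be set.
# range_padding=0.1 is an illustrative value only.
p_explicit = figure(x_range=FactorRange(factors=fruits, range_padding=0.1))

print(type(p_short.x_range).__name__)
print(list(p_explicit.x_range.factors))
```

Both figures end up with a categorical x-axis. The explicit form only becomes necessary once a property other than the factors needs to be set.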
Next we can call vbar
with the list of fruit name factors as the x
coordinate, the bar height as the top
coordinate, and optionally any
width
or other properties that we would like to set:
p.vbar(x=fruits, top=[5, 3, 4, 2, 4, 6], width=0.9)
Putting it all together, we see the output:
from bokeh.io import show, output_file
from bokeh.plotting import figure
output_file("bars.html")
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
p = figure(x_range=fruits, plot_height=250, title="Fruit Counts",
toolbar_location=None, tools="")
p.vbar(x=fruits, top=[5, 3, 4, 2, 4, 6], width=0.9)
p.xgrid.grid_line_color = None
p.y_range.start = 0
show(p)
As usual, the data could also be put into a ColumnDataSource
and supplied as
the source
parameter to vbar
instead of passing the data directly
as parameters. The next example will demonstrate this.
Colors¶
Often we want to shade the bars with different colors. This can be
accomplished in different ways. One way is to supply all the colors up front.
This can be done by putting all the data, including the colors for each bar,
in a ColumnDataSource
. Then the name of the column containing the colors
is passed to vbar
as the color
(or line_color
/fill_color
)
arguments. This is shown below:
from bokeh.io import show, output_file
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral6
from bokeh.plotting import figure
output_file("colormapped_bars.html")
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
counts = [5, 3, 4, 2, 4, 6]
source = ColumnDataSource(data=dict(fruits=fruits, counts=counts, color=Spectral6))
p = figure(x_range=fruits, y_range=(0,9), plot_height=250, title="Fruit Counts",
toolbar_location=None, tools="")
p.vbar(x='fruits', top='counts', width=0.9, color='color', legend="fruits", source=source)
p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"
show(p)
Another way to shade the bars is to use a CategoricalColorMapper
that
colormaps the bars inside the browser. There is a function
factor_cmap()
that makes this simple to do:
factor_cmap('fruits', palette=Spectral6, factors=fruits)
This can be passed to vbar
in the same way as the column name in the
previous example. Putting everything together we obtain the same plot in
a different way:
from bokeh.io import show, output_file
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral6
from bokeh.plotting import figure
from bokeh.transform import factor_cmap
output_file("colormapped_bars.html")
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
counts = [5, 3, 4, 2, 4, 6]
source = ColumnDataSource(data=dict(fruits=fruits, counts=counts))
p = figure(x_range=fruits, plot_height=250, toolbar_location=None, title="Fruit Counts")
p.vbar(x='fruits', top='counts', width=0.9, source=source, legend="fruits",
line_color='white', fill_color=factor_cmap('fruits', palette=Spectral6, factors=fruits))
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.y_range.end = 9
p.legend.orientation = "horizontal"
p.legend.location = "top_center"
show(p)
Grouped¶
When creating bar charts, it is often desirable to visually display the data according to sub-groups. There are two basic methods that can be used, depending on your use case: using nested categorical coordinates, or applying visual dodges.
Nested Categories¶
If the coordinates of a plot range and data have two or three levels, then Bokeh will automatically group the factors on the axis, including a hierarchical tick labeling with separators between the groups. In the case of bar charts, this results in bars grouped together by the top-level factors. This is probably the most common way to achieve grouped bars, especially if you are starting from “tidy” data.
The example below shows this approach by creating a single column of
coordinates that are each 2-tuples of the form (fruit, year)
. Accordingly,
the plot groups the axes by fruit type, with a single call to vbar
:
from bokeh.io import show, output_file
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure
output_file("bars.html")
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years = ['2015', '2016', '2017']
data = {'fruits' : fruits,
'2015' : [2, 1, 4, 3, 2, 4],
'2016' : [5, 3, 3, 2, 4, 6],
'2017' : [3, 2, 4, 4, 5, 3]}
# this creates [ ("Apples", "2015"), ("Apples", "2016"), ("Apples", "2017"), ("Pears", "2015"), ... ]
x = [ (fruit, year) for fruit in fruits for year in years ]
counts = sum(zip(data['2015'], data['2016'], data['2017']), ()) # like an hstack
source = ColumnDataSource(data=dict(x=x, counts=counts))
p = figure(x_range=FactorRange(*x), plot_height=250, title="Fruit Counts by Year",
toolbar_location=None, tools="")
p.vbar(x='x', top='counts', width=0.9, source=source)
p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1
p.xgrid.grid_line_color = None
show(p)
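The pairing between the nested factors and the flattened counts is easy to get wrong, so here is a small standalone sketch (plain Python, no Bokeh needed) that rebuilds both sequences and prints the first few pairs. The list comprehension used for counts is an alternative spelling of the sum(zip(...), ()) idiom above, chosen here only for readability:

```python
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years = ['2015', '2016', '2017']

data = {'fruits': fruits,
        '2015': [2, 1, 4, 3, 2, 4],
        '2016': [5, 3, 3, 2, 4, 6],
        '2017': [3, 2, 4, 4, 5, 3]}

# One (fruit, year) tuple per bar, with the year varying fastest.
x = [(fruit, year) for fruit in fruits for year in years]

# Interleave the per-year columns so that counts[i] belongs to x[i].
counts = [data[year][i] for i in range(len(fruits)) for year in years]

for factor, count in list(zip(x, counts))[:4]:
    print(factor, count)
```

The important invariant is that both sequences have the same length and the same ordering, since vbar pairs them up row by row.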
We can also apply a color mapping, similar to the earlier example. To obtain
the same grouped bar plot of fruits data as above, except with the bars shaded by
the year, change the vbar
function call to use factor_cmap
for the
fill_color
:
p.vbar(x='x', top='counts', width=0.9, source=source, line_color="white",
# use the palette to colormap based on the x[1:2] values
fill_color=factor_cmap('x', palette=palette, factors=years, start=1, end=2))
Recall that the factors are of the form (fruit, year)
. The start=1
and end=2
in the call to factor_cmap
select the second part of the data
factors to use when color mapping.
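To make the effect of start=1 and end=2 concrete, the sketch below mimics the selection in plain Python. The helper color_for is hypothetical and is not how Bokeh implements the mapping (the real color mapping happens in the browser); the palette reuses the three colors from the visual dodge example in this chapter:

```python
years = ['2015', '2016', '2017']
palette = ["#c9d9d3", "#718dbf", "#e84d60"]  # assumed palette, one color per year

def color_for(factor, start=1, end=2):
    """Hypothetical helper: choose a color from the sliced part of a factor."""
    # With start=1 and end=2 the slice holds only the second level (the year).
    (key,) = factor[start:end]
    return palette[years.index(key)]

print(("Apples", "2016")[1:2])
print(color_for(("Apples", "2016")))
```

Every fruit therefore gets the same color for a given year, which is what shades the bars by year rather than by fruit.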
Visual Dodge¶
Another method for achieving grouped bars is to explicitly specify a visual displacement for the bars. Such a visual offset is also referred to as a dodge.
In this scenario, our data is not “tidy”. Instead of a single table with
rows indexed by factors (fruit, year)
, we have separate series for each
year. We can plot all the year series using separate calls to vbar
but
since every bar in each group has the same fruit
factor, the bars would
overlap visually. We can prevent this overlap and distinguish the bars
visually by using the dodge()
function to provide an
offset for each different call to vbar
:
from bokeh.core.properties import value
from bokeh.io import show, output_file
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import dodge
output_file("dodged_bars.html")
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years = ['2015', '2016', '2017']
data = {'fruits' : fruits,
'2015' : [2, 1, 4, 3, 2, 4],
'2016' : [5, 3, 3, 2, 4, 6],
'2017' : [3, 2, 4, 4, 5, 3]}
source = ColumnDataSource(data=data)
p = figure(x_range=fruits, y_range=(0, 10), plot_height=250, title="Fruit Counts by Year",
toolbar_location=None, tools="")
p.vbar(x=dodge('fruits', -0.25, range=p.x_range), top='2015', width=0.2, source=source,
color="#c9d9d3", legend=value("2015"))
p.vbar(x=dodge('fruits', 0.0, range=p.x_range), top='2016', width=0.2, source=source,
color="#718dbf", legend=value("2016"))
p.vbar(x=dodge('fruits', 0.25, range=p.x_range), top='2017', width=0.2, source=source,
color="#e84d60", legend=value("2017"))
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"
show(p)
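A quick way to check that a set of dodge offsets and bar widths will not overlap is to compute each bar's extent relative to the center of its category. The sketch below does this for the values used above, under the assumption that offsets and widths are expressed in units where one category spans 1.0:

```python
# Offsets and width taken from the three vbar calls above.
offsets = {'2015': -0.25, '2016': 0.0, '2017': 0.25}
width = 0.2

# Extent of each bar relative to the center of its category.
extents = {year: (off - width / 2, off + width / 2)
           for year, off in offsets.items()}

# Sort left to right and measure the gap between neighbouring bars.
ordered = sorted(extents.values())
gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]

# Total width occupied by the group of bars.
span = ordered[-1][1] - ordered[0][0]

print(extents)
print(gaps, span)
```

Positive gaps mean the bars within a group stay separate, and a span below 1.0 means neighbouring groups do not collide under the stated assumption.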
Stacked¶
Another common operation on bar charts is to stack bars on top of one
another. Bokeh makes this easy to do with the specialized
hbar_stack()
and
vbar_stack()
functions. The example
below shows the fruits data from above, but with the bars for each
fruit type stacked instead of grouped:
from bokeh.core.properties import value
from bokeh.io import show, output_file
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
output_file("stacked.html")
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years = ["2015", "2016", "2017"]
colors = ["#c9d9d3", "#718dbf", "#e84d60"]
data = {'fruits' : fruits,
'2015' : [2, 1, 4, 3, 2, 4],
'2016' : [5, 3, 4, 2, 4, 6],
'2017' : [3, 2, 4, 4, 5, 3]}
source = ColumnDataSource(data=data)
p = figure(x_range=fruits, plot_height=250, title="Fruit Counts by Year",
toolbar_location=None, tools="")
p.vbar_stack(years, x='fruits', width=0.9, color=colors, source=source,
legend=[value(x) for x in years])
p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"
show(p)
Note that behind the scenes, these functions work by stacking up the
successive columns in separate calls to vbar
or hbar
. This kind of
operation is akin to the dodge example above (i.e. the data in this case is
not in a “tidy” data format).
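The stacking arithmetic can be sketched in plain Python: each layer starts where the previous one ended. This is only an illustration of the idea, using the data from the stacked example above; the names bottoms, tops, and layers are assumptions made for this sketch and not part of Bokeh's API:

```python
years = ["2015", "2016", "2017"]

data = {'fruits': ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries'],
        '2015': [2, 1, 4, 3, 2, 4],
        '2016': [5, 3, 4, 2, 4, 6],
        '2017': [3, 2, 4, 4, 5, 3]}

# Running baseline for every fruit, starting at zero.
bottoms = [0] * len(data['fruits'])
layers = {}

for year in years:
    # Each layer sits on top of everything stacked so far.
    tops = [b + v for b, v in zip(bottoms, data[year])]
    layers[year] = (bottoms, tops)
    bottoms = tops

for year, (bottom, top) in layers.items():
    print(year, bottom, top)
```

The tops of the last layer are the per-fruit totals, which is a convenient sanity check when choosing a y-range.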
Sometimes we may want to stack bars that have both positive and negative extents. The example below shows how it is possible to create such a stacked bar chart that is split by positive and negative values:
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource
from bokeh.palettes import GnBu3, OrRd3
from bokeh.plotting import figure
output_file("stacked_split.html")
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years = ["2015", "2016", "2017"]
exports = {'fruits' : fruits,
'2015' : [2, 1, 4, 3, 2, 4],
'2016' : [5, 3, 4, 2, 4, 6],
'2017' : [3, 2, 4, 4, 5, 3]}
imports = {'fruits' : fruits,
'2015' : [-1, 0, -1, -3, -2, -1],
'2016' : [-2, -1, -3, -1, -2, -2],
'2017' : [-1, -2, -1, 0, -2, -2]}
p = figure(y_range=fruits, plot_height=250, x_range=(-16, 16), title="Fruit import/export, by year",
toolbar_location=None)
p.hbar_stack(years, y='fruits', height=0.9, color=GnBu3, source=ColumnDataSource(exports),
legend=["%s exports" % x for x in years])
p.hbar_stack(years, y='fruits', height=0.9, color=OrRd3, source=ColumnDataSource(imports),
legend=["%s imports" % x for x in years])
p.y_range.range_padding = 0.1
p.ygrid.grid_line_color = None
p.legend.location = "top_left"
p.axis.minor_tick_line_color = None
p.outline_line_color = None
show(p)
Mixed Factors¶
When dealing with hierarchical categories of two or three levels, it’s possible to use just the “higher level” portion of a coordinate to position glyphs. For example, if you have a range with the hierarchical factors
factors = [
("East", "Sales"), ("East", "Marketing"), ("East", "Dev"),
("West", "Sales"), ("West", "Marketing"), ("West", "Dev"),
]
Then it is possible to use just “East” and “West” as positions
for glyphs. In this case the position is the center of the entire group. The
example below shows bars for each month, grouped by financial quarter, and
also adds a line (perhaps for a quarterly average) at the coordinates for
Q1
, Q2
, etc.:
from bokeh.io import show, output_file
from bokeh.models import FactorRange
from bokeh.plotting import figure
output_file("mixed.html")
factors = [
("Q1", "jan"), ("Q1", "feb"), ("Q1", "mar"),
("Q2", "apr"), ("Q2", "may"), ("Q2", "jun"),
("Q3", "jul"), ("Q3", "aug"), ("Q3", "sep"),
("Q4", "oct"), ("Q4", "nov"), ("Q4", "dec"),
]
p = figure(x_range=FactorRange(*factors), plot_height=250,
toolbar_location=None, tools="")
x = [ 10, 12, 16, 9, 10, 8, 12, 13, 14, 14, 12, 16 ]
p.vbar(x=factors, top=x, width=0.9, alpha=0.5)
p.line(x=["Q1", "Q2", "Q3", "Q4"], y=[12, 9, 13, 14], color="red", line_width=2)
p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1
p.xgrid.grid_line_color = None
show(p)
This example also demonstrates that other glyphs such as lines also function with categorical coordinates.
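The line in the example above uses hard-coded values for each quarter. If the line is meant to be a quarterly average, one possible way to derive it from the monthly values is to group them by the top-level factor, as in this plain-Python sketch (the variable names are illustrative):

```python
from collections import defaultdict

factors = [
    ("Q1", "jan"), ("Q1", "feb"), ("Q1", "mar"),
    ("Q2", "apr"), ("Q2", "may"), ("Q2", "jun"),
    ("Q3", "jul"), ("Q3", "aug"), ("Q3", "sep"),
    ("Q4", "oct"), ("Q4", "nov"), ("Q4", "dec"),
]
values = [10, 12, 16, 9, 10, 8, 12, 13, 14, 14, 12, 16]

# Collect the monthly values under their top-level factor (the quarter).
by_quarter = defaultdict(list)
for (quarter, _month), value in zip(factors, values):
    by_quarter[quarter].append(value)

# The quarter names alone are valid x coordinates for the line glyph.
quarters = list(by_quarter)
means = [sum(v) / len(v) for v in by_quarter.values()]

print(quarters)
print(means)
```

The resulting quarters and means could then be passed as the x and y of p.line. The computed means land close to the hard-coded values, with Q1 coming out somewhat above 12.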
Pandas¶
Pandas is a powerful and common tool for doing data analysis on tabular and timeseries data in Python. Although it is not required by Bokeh, Bokeh tries to make life easier when you do use it.
Below is a plot that demonstrates some advantages when using Pandas with Bokeh:
- Pandas GroupBy objects can be used to initialize a ColumnDataSource, automatically creating columns for many statistical measures such as the group mean or count.
- GroupBy objects may also be passed directly as a range argument to figure.
from bokeh.io import show, output_file
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral5
from bokeh.plotting import figure
from bokeh.sampledata.autompg import autompg as df
from bokeh.transform import factor_cmap
output_file("groupby.html")
df.cyl = df.cyl.astype(str)
group = df.groupby('cyl')
source = ColumnDataSource(group)
cyl_cmap = factor_cmap('cyl', palette=Spectral5, factors=sorted(df.cyl.unique()))
p = figure(plot_height=350, x_range=group, title="MPG by # Cylinders",
toolbar_location=None, tools="")
p.vbar(x='cyl', top='mpg_mean', width=1, source=source,
line_color=cyl_cmap, fill_color=cyl_cmap)
p.y_range.start = 0
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Number of cylinders"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None
show(p)
Note that in the example above, we grouped by the column 'cyl'
so our CDS
has a column 'cyl'
for this index. Additionally, other non-grouped columns
like 'mpg'
have had associated columns such as 'mpg_mean'
added, that
give the mean MPG value for each group.
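To illustrate the naming scheme, the sketch below builds a comparable set of summary columns by hand, using only the standard library. The rows are hypothetical stand-ins rather than the real autompg data, and the sketch only covers a few of the statistics (count, mean, min, max) mentioned in this chapter:

```python
from statistics import mean

# Hypothetical rows standing in for a DataFrame; not the real autompg data.
rows = [
    {"cyl": "4", "mpg": 30.0},
    {"cyl": "4", "mpg": 26.0},
    {"cyl": "6", "mpg": 20.0},
    {"cyl": "8", "mpg": 15.0},
    {"cyl": "8", "mpg": 13.0},
]

# Group the mpg values by the 'cyl' key.
groups = {}
for row in rows:
    groups.setdefault(row["cyl"], []).append(row["mpg"])

# One column for the group key, plus '<column>_<statistic>' summary columns.
columns = {
    "cyl": list(groups),
    "mpg_count": [len(v) for v in groups.values()],
    "mpg_mean": [mean(v) for v in groups.values()],
    "mpg_min": [min(v) for v in groups.values()],
    "mpg_max": [max(v) for v in groups.values()],
}

for name, values in columns.items():
    print(name, values)
```

Each summary column has one entry per group, aligned with the group key column, which is why 'cyl' and 'mpg_mean' can be used together as the x and top of vbar.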
This usage also works when the grouping is multi-level. The example below shows
how grouping the same data by ('cyl', 'mfr')
results in a hierarchical
nested axis. In this case, the index column name 'cyl_mfr'
is made by
joining the names of the grouped columns together.
from bokeh.io import show, output_file
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure
from bokeh.palettes import Spectral5
from bokeh.sampledata.autompg import autompg_clean as df
from bokeh.transform import factor_cmap
output_file("bars.html")
df.cyl = df.cyl.astype(str)
df.yr = df.yr.astype(str)
group = df.groupby(['cyl', 'mfr'])
source = ColumnDataSource(group)
index_cmap = factor_cmap('cyl_mfr', palette=Spectral5, factors=sorted(df.cyl.unique()), end=1)
p = figure(plot_width=800, plot_height=300, title="Mean MPG by # Cylinders and Manufacturer",
x_range=group, toolbar_location=None, tools="")
p.vbar(x='cyl_mfr', top='mpg_mean', width=1, source=source,
line_color="white", fill_color=index_cmap, )
p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Manufacturer grouped by # Cylinders"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None
p.add_tools(HoverTool(tooltips=[("MPG", "@mpg_mean"), ("Cyl, Mfr", "@cyl_mfr")]))
show(p)
Intervals¶
So far we have seen the bar glyphs used to create bar charts, which imply bars drawn from a common baseline. However, the bar glyphs can also be used to represent arbitrary intervals across a range.
The example below uses hbar
with both left
and right
properties
supplied, to show the spread in times between bronze and gold medalists in
Olympic sprinting over many years:
from bokeh.io import show, output_file
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.sampledata.sprint import sprint
output_file("sprint.html")
sprint.Year = sprint.Year.astype(str)
group = sprint.groupby('Year')
source = ColumnDataSource(group)
p = figure(y_range=group, x_range=(9.5,12.7), plot_width=400, plot_height=550, toolbar_location=None,
title="Time Spreads for Sprint Medalists (by Year)")
p.hbar(y="Year", left='Time_min', right='Time_max', height=0.4, source=source)
p.ygrid.grid_line_color = None
p.xaxis.axis_label = "Time (seconds)"
p.outline_line_color = None
show(p)
Scatters¶
Adding Jitter¶
When plotting many scatter points in a single category, it is
common for points to start to visually overlap. In this case, Bokeh provides
a jitter()
function that can automatically apply
a random dodge to every point.
The example below shows a scatter plot of every commit time for a GitHub user
between 2012 and 2016, grouped by day of the week. A naive plot of this data
would result in thousands of points overlapping in a narrow line for each day.
By using jitter
we can differentiate the points to obtain a useful plot:
from bokeh.io import show, output_file
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.sampledata.commits import data
from bokeh.transform import jitter
output_file("bars.html")
DAYS = ['Sun', 'Sat', 'Fri', 'Thu', 'Wed', 'Tue', 'Mon']
source = ColumnDataSource(data)
p = figure(plot_width=800, plot_height=300, y_range=DAYS, x_axis_type='datetime',
title="Commits by Time of Day (US/Central) 2012—2016")
p.circle(x='time', y=jitter('day', width=0.6, range=p.y_range), source=source, alpha=0.3)
p.xaxis[0].formatter.days = ['%Hh']
p.x_range.range_padding = 0
p.ygrid.grid_line_color = None
show(p)
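The idea behind jitter can be sketched without Bokeh: each point receives a small random offset around the center of its category. The sketch assumes a uniform distribution and treats width as the total spread, so offsets fall within half the width on either side. The helper jitter_offset is hypothetical, and the real transform is applied in the browser:

```python
import random

rng = random.Random(0)  # seeded so the sketch is reproducible
width = 0.6             # same width as in the example above

def jitter_offset():
    """Hypothetical helper: uniform offset within +/- width/2 of the center."""
    return (rng.random() - 0.5) * width

# One offset per point; many points in the same category now spread out.
offsets = [jitter_offset() for _ in range(1000)]

print(min(offsets), max(offsets))
```

Keeping the width below 1.0 leaves a visible gap between neighbouring categories, under the assumption that one category spans 1.0.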
Heat Maps¶
In all of the cases above, we have had one categorical axis, and one continuous axis. It is possible to have plots with two categorical axes. If we shade the rectangle that defines each pair of categories, we end up with a categorical heatmap.
The example below shows such a plot, where the x-axis categories are a list of
years from 1948 to 2016, and the y-axis categories are the months of the
year. Each rectangle corresponding to a (year, month)
combination is
color mapped by the unemployment rate for that month and year. Since the
unemployment rate is a continuous variable, a LinearColorMapper
is used
to colormap the plot, and is also passed to a color bar to provide a visual
legend on the right:
import pandas as pd
from bokeh.io import output_file, show
from bokeh.models import BasicTicker, ColorBar, ColumnDataSource, LinearColorMapper, PrintfTickFormatter
from bokeh.plotting import figure
from bokeh.sampledata.unemployment1948 import data
from bokeh.transform import transform
output_file("unemployment.html")
data.Year = data.Year.astype(str)
data = data.set_index('Year')
data.drop('Annual', axis=1, inplace=True)
data.columns.name = 'Month'
# reshape to 1D array of rates with a month and year for each row.
df = pd.DataFrame(data.stack(), columns=['rate']).reset_index()
source = ColumnDataSource(df)
# this is the colormap from the original NYTimes plot
colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"]
mapper = LinearColorMapper(palette=colors, low=df.rate.min(), high=df.rate.max())
p = figure(plot_width=800, plot_height=300, title="US Unemployment 1948—2016",
x_range=list(data.index), y_range=list(reversed(data.columns)),
toolbar_location=None, tools="", x_axis_location="above")
p.rect(x="Year", y="Month", width=1, height=1, source=source,
line_color=None, fill_color=transform('rate', mapper))
color_bar = ColorBar(color_mapper=mapper, location=(0, 0),
ticker=BasicTicker(desired_num_ticks=len(colors)),
formatter=PrintfTickFormatter(format="%d%%"))
p.add_layout(color_bar, 'right')
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_text_font_size = "5pt"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = 1.0
show(p)
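As a rough illustration of what a linear color mapping does with a continuous value, the sketch below divides the interval between low and high into equal-width bins, one per palette color. The helper linear_color is hypothetical and only approximates the idea; the real mapper runs in the browser and may treat edge cases differently. The low and high values in the usage lines are made up for the example:

```python
colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce",
          "#ddb7b1", "#cc7878", "#933b41", "#550b1d"]

def linear_color(value, low, high, palette=colors):
    """Hypothetical helper: map value onto palette using equal-width bins."""
    if high == low:
        return palette[0]
    # Position of the value between low and high, as a fraction.
    fraction = (value - low) / (high - low)
    index = int(fraction * len(palette))
    # Clamp so that out-of-range values (and value == high) stay in the palette.
    index = max(0, min(index, len(palette) - 1))
    return palette[index]

# Illustrative rates only; not taken from the unemployment data set.
print(linear_color(0.0, 0.0, 10.0))
print(linear_color(5.0, 0.0, 10.0))
print(linear_color(10.0, 0.0, 10.0))
```

Because the bins have equal width, the same palette passed to the color bar gives a legend whose segments correspond directly to ranges of the rate.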
A final example combines many of the techniques in this chapter: color mappers, visual dodges, and Pandas DataFrames. These are used to create a different sort of “heatmap” that results in a periodic table of the elements. A hover tool has also been added so that additional information about each element can be inspected:
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure
from bokeh.sampledata.periodic_table import elements
from bokeh.transform import dodge, factor_cmap
output_file("periodic.html")
periods = ["I", "II", "III", "IV", "V", "VI", "VII"]
groups = [str(x) for x in range(1, 19)]
df = elements.copy()
df["atomic mass"] = df["atomic mass"].astype(str)
df["group"] = df["group"].astype(str)
df["period"] = [periods[x-1] for x in df.period]
df = df[df.group != "-"]
df = df[df.symbol != "Lr"]
df = df[df.symbol != "Lu"]
cmap = {
"alkali metal" : "#a6cee3",
"alkaline earth metal" : "#1f78b4",
"metal" : "#d93b43",
"halogen" : "#999d9a",
"metalloid" : "#e08d49",
"noble gas" : "#eaeaea",
"nonmetal" : "#f1d4Af",
"transition metal" : "#599d7A",
}
source = ColumnDataSource(df)
p = figure(plot_width=900, plot_height=500, title="Periodic Table (omitting LA and AC Series)",
x_range=groups, y_range=list(reversed(periods)), toolbar_location=None, tools="")
p.rect("group", "period", 0.95, 0.95, source=source, fill_alpha=0.6, legend="metal",
color=factor_cmap('metal', palette=list(cmap.values()), factors=list(cmap.keys())))
text_props = {"source": source, "text_align": "left", "text_baseline": "middle"}
x = dodge("group", -0.4, range=p.x_range)
r = p.text(x=x, y="period", text="symbol", **text_props)
r.glyph.text_font_style="bold"
r = p.text(x=x, y=dodge("period", 0.3, range=p.y_range), text="atomic number", **text_props)
r.glyph.text_font_size="8pt"
r = p.text(x=x, y=dodge("period", -0.35, range=p.y_range), text="name", **text_props)
r.glyph.text_font_size="5pt"
r = p.text(x=x, y=dodge("period", -0.2, range=p.y_range), text="atomic mass", **text_props)
r.glyph.text_font_size="5pt"
p.text(x=["3", "3"], y=["VI", "VII"], text=["LA", "AC"], text_align="center", text_baseline="middle")
p.add_tools(HoverTool(tooltips = [
("Name", "@name"),
("Atomic number", "@{atomic number}"),
("Atomic mass", "@{atomic mass}"),
("Type", "@metal"),
("CPK color", "$color[hex, swatch]:CPK"),
("Electronic configuration", "@{electronic configuration}"),
]))
p.outline_line_color = None
p.grid.grid_line_color = None
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_standoff = 0
p.legend.orientation = "horizontal"
p.legend.location ="top_center"
show(p)