Categorical plots#

Bokeh offers multiple ways to handle and visualize categorical data. Categorical refers to data that can be divided into distinct groups or categories, with or without a natural order or numerical value. Examples include data representing countries, product names, or colors. Unlike continuous data, which might represent values like temperatures or distances, categorical data is about labeling and grouping.

Many data sets contain both continuous and categorical data. For example, a data set of the number of sales of different products in different countries over a period of time.

It is also possible for categorical data to have multiple values per category. Previous chapters such as Bar charts and Scatter plots have already introduced some of the ways to visualize categorical data with single values per category.

This chapter is focused on more complex categorical data with series of values for each category and data sets with one or multiple categorical variables.

One categorical variable#

Categorical scatter plots with jitter#

The chapter on scatter plots contains examples of visualizing data with single values for each category.

In case your data contains multiple values per category, you can visualize your data using a categorical scatter plot. This can be useful if you have different series of measurements for different days of the week, for example.

To avoid overlap between numerous scatter points for a single category, use the jitter() function to give each point a random offset.

The example below shows a scatter plot of every commit time for a GitHub user between 2012 and 2016. It uses days of the week as categories to groups the commits. By default, this plot would show thousands of points overlapping in a narrow line for each day. The jitter function lets you differentiate the points to produce a useful plot:

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show
from bokeh.sampledata.commits import data
from bokeh.transform import jitter

DAYS = ['Sun', 'Sat', 'Fri', 'Thu', 'Wed', 'Tue', 'Mon']

source = ColumnDataSource(data)

p = figure(width=800, height=300, y_range=DAYS, x_axis_type='datetime',
           title="Commits by Time of Day (US/Central) 2012-2016")

p.scatter(x='time', y=jitter('day', width=0.6, range=p.y_range),  source=source, alpha=0.3)

p.xaxis.formatter.days = '%Hh'
p.x_range.range_padding = 0
p.ygrid.grid_line_color = None

show(p)

Categorical series with offsets#

A simple example of visualizing categorical data is using bar charts to represent a single value per category.

However, if you want to represent ordered series of data per category, you can use categorical offsets to position the glyphs for the values of each category. Other than visual offsets with dodge, categorical offsets afford explicit control over positioning “within” a category.

To supply an offset to a categorical location explicitly, add a numeric value to the end of a category. For example: ["Jan", 0.2] gives the category “Jan” an offset of 0.2.

For multi-level categories, add the value at the end of the existing list: ["West", "Sales", -0,2]. Bokeh interprets any numeric value at the end of a list of categories as an offset.

Take the fruit example from the “Bar charts” chapter and modify it by adding a list of offsets:

fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']

offsets = [-0.5, -0.2, 0.0, 0.3, 0.1, 0.3]

# This results in [ ['Apples', -0.5], ['Pears', -0.2], ... ]
x = list(zip(fruits, offsets))

p.vbar(x=x, top=[5, 3, 4, 2, 4, 6], width=0.8)

This will shift each bar horizontally by the corresponding offset.

Below is a more sophisticated example of a ridge plot. It uses categorical offsets to specify patch coordinates for each category:

import colorcet as cc
from numpy import linspace
from scipy.stats import gaussian_kde

from bokeh.models import ColumnDataSource, FixedTicker, PrintfTickFormatter
from bokeh.plotting import figure, show
from bokeh.sampledata.perceptions import probly


def ridge(category, data, scale=20):
    return list(zip([category]*len(data), scale*data))

cats = list(reversed(probly.keys()))

palette = [cc.rainbow[i*15] for i in range(17)]

x = linspace(-20, 110, 500)

source = ColumnDataSource(data=dict(x=x))

p = figure(y_range=cats, width=900, x_range=(-5, 105), toolbar_location=None)

for i, cat in enumerate(reversed(cats)):
    pdf = gaussian_kde(probly[cat])
    y = ridge(cat, pdf(x))
    source.add(y, cat)
    p.patch('x', cat, color=palette[i], alpha=0.6, line_color="black", source=source)

p.outline_line_color = None
p.background_fill_color = "#efefef"

p.xaxis.ticker = FixedTicker(ticks=list(range(0, 101, 10)))
p.xaxis.formatter = PrintfTickFormatter(format="%d%%")

p.ygrid.grid_line_color = None
p.xgrid.grid_line_color = "#dddddd"
p.xgrid.ticker = p.xaxis.ticker

p.axis.minor_tick_line_color = None
p.axis.major_tick_line_color = None
p.axis.axis_line_color = None

p.y_range.range_padding = 0.12

show(p)

Slopegraphs#

Slopegraphs are plots for visualizing the relative change between two or more data points. This can be useful to visualize the difference between two categories or the change over time of a variable within a category, for example.

In a slopegraph, you visualize individual measurements as dots arranged into two columns and indicate pairings by connecting the paired dots with a line. The slope of each line highlights the magnitude and direction of change.

The following slopegraph visualizes the relative change in CO2 emissions per person in different countries over a period of years or decades. It uses the Segment glyph to draw the line connecting the paired dots:

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show
from bokeh.sampledata.emissions import data

countries = [
    "Trinidad and Tobago",
    "Qatar",
    "United Arab Emirates",
    "Oman",
    "Bahrain",
    "Singapore",
    "Netherlands Antilles",
    "Kazakhstan",
    "Equatorial Guinea",
    "Kuwait",
]

df = data[data["country"].isin(countries)].reset_index(drop=True)
years = (df["year"] == 2000.0) | (df["year"] == 2010.0)

df = df[years].reset_index(drop=True)
df["year"] = df.year.astype(int)
df["year"] = df.year.astype(str)

# create separate columns for the different years
a = df[df["year"] == "2000"]
b = df[df["year"] == "2010"]
new_df = a.merge(b, on="country")

source = ColumnDataSource(new_df)

p = figure(x_range=("2000", "2010"), y_range=(0, 60), x_axis_location="above", y_axis_label="CO2 emissions (tons / person)")

p.scatter(x="year_x", y="emissions_x", source=source, size=7)

p.scatter(x="year_y", y="emissions_y", source=source, size=7)

p.segment(x0="year_x", y0="emissions_x", x1="year_y", y1="emissions_y", source=source, color="black")

p.text(x="year_y", y="emissions_y", text="country", source=source, x_offset=7, y_offset=8, text_font_size="12px")

p.xaxis.major_tick_line_color = None
p.xaxis.major_tick_out = 0
p.xaxis.axis_line_color = None

p.yaxis.minor_tick_out = 0
p.yaxis.major_tick_in = 0
p.yaxis.ticker = [0, 20, 40, 60]

p.grid.grid_line_color = None
p.outline_line_color = None

show(p)

Two or more categorical variables#

Categorical Heatmaps#

It is possible to have values associated with pairs of categories. In this situation, applying different color shades to rectangles that represent a pair of categories will produce a categorical heatmap. This is a plot with two categorical axes.

The following plot lists years from 1948 to 2016 on its x-axis and months of the year on the y-axis. Each rectangle of the plot corresponds to a (year, month) pair. The color of the rectangle indicates the rate of unemployment in a given month of a given year.

This example uses linear_cmap() to map the colors of the plot because the unemployment rate is a continuous variable. This plot also uses construct_color_bar() to provide a visual legend on the right:

from math import pi

import pandas as pd

from bokeh.models import BasicTicker, PrintfTickFormatter
from bokeh.plotting import figure, show
from bokeh.sampledata.unemployment1948 import data
from bokeh.transform import linear_cmap

data['Year'] = data['Year'].astype(str)
data = data.set_index('Year')
data.drop('Annual', axis=1, inplace=True)
data.columns.name = 'Month'

years = list(data.index)
months = list(reversed(data.columns))

# reshape to 1D array or rates with a month and year for each row.
df = pd.DataFrame(data.stack(), columns=['rate']).reset_index()

# this is the colormap from the original NYTimes plot
colors = ["#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce", "#ddb7b1", "#cc7878", "#933b41", "#550b1d"]

TOOLS = "hover,save,pan,box_zoom,reset,wheel_zoom"

p = figure(title=f"US Unemployment ({years[0]} - {years[-1]})",
           x_range=years, y_range=months,
           x_axis_location="above", width=900, height=400,
           tools=TOOLS, toolbar_location='below',
           tooltips=[('date', '@Month @Year'), ('rate', '@rate%')])

p.grid.grid_line_color = None
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_text_font_size = "7px"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = pi / 3

r = p.rect(x="Year", y="Month", width=1, height=1, source=df,
           fill_color=linear_cmap("rate", colors, low=df.rate.min(), high=df.rate.max()),
           line_color=None)

p.add_layout(r.construct_color_bar(
    major_label_text_font_size="7px",
    ticker=BasicTicker(desired_num_ticks=len(colors)),
    formatter=PrintfTickFormatter(format="%d%%"),
    label_standoff=6,
    border_line_color=None,
    padding=5,
), 'right')

show(p)

The following periodic table uses several of the techniques in this chapter:

from bokeh.plotting import figure, show
from bokeh.sampledata.periodic_table import elements
from bokeh.transform import dodge, factor_cmap

periods = ["I", "II", "III", "IV", "V", "VI", "VII"]
groups = [str(x) for x in range(1, 19)]

df = elements.copy()
df["atomic mass"] = df["atomic mass"].astype(str)
df["group"] = df["group"].astype(str)
df["period"] = [periods[x-1] for x in df.period]
df = df[df.group != "-"]
df = df[df.symbol != "Lr"]
df = df[df.symbol != "Lu"]

cmap = {
    "alkali metal"         : "#a6cee3",
    "alkaline earth metal" : "#1f78b4",
    "metal"                : "#d93b43",
    "halogen"              : "#999d9a",
    "metalloid"            : "#e08d49",
    "noble gas"            : "#eaeaea",
    "nonmetal"             : "#f1d4Af",
    "transition metal"     : "#599d7A",
}

TOOLTIPS = [
    ("Name", "@name"),
    ("Atomic number", "@{atomic number}"),
    ("Atomic mass", "@{atomic mass}"),
    ("Type", "@metal"),
    ("CPK color", "$color[hex, swatch]:CPK"),
    ("Electronic configuration", "@{electronic configuration}"),
]

p = figure(title="Periodic Table (omitting LA and AC Series)", width=1000, height=450,
           x_range=groups, y_range=list(reversed(periods)),
           tools="hover", toolbar_location=None, tooltips=TOOLTIPS)

r = p.rect("group", "period", 0.95, 0.95, source=df, fill_alpha=0.6, legend_field="metal",
           color=factor_cmap('metal', palette=list(cmap.values()), factors=list(cmap.keys())))

text_props = dict(source=df, text_align="left", text_baseline="middle")

x = dodge("group", -0.4, range=p.x_range)

p.text(x=x, y="period", text="symbol", text_font_style="bold", **text_props)

p.text(x=x, y=dodge("period", 0.3, range=p.y_range), text="atomic number",
       text_font_size="11px", **text_props)

p.text(x=x, y=dodge("period", -0.35, range=p.y_range), text="name",
       text_font_size="7px", **text_props)

p.text(x=x, y=dodge("period", -0.2, range=p.y_range), text="atomic mass",
       text_font_size="7px", **text_props)

p.text(x=["3", "3"], y=["VI", "VII"], text=["LA", "AC"], text_align="center", text_baseline="middle")

p.outline_line_color = None
p.grid.grid_line_color = None
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_standoff = 0
p.legend.orientation = "horizontal"
p.legend.location ="top_center"
p.hover.renderers = [r] # only hover element boxes

show(p)

Correlograms#

When you have more than three to four quantitative variables per category, it can be more useful to quantify the amount of association between pairs of variables and visualize this quantity rather than the raw data. One common way to do this is to calculate correlation coefficients. Visualizations of correlation coefficients are called correlograms.

The following correlogram is another good example of the techniques in this chapter.

This plot displays the correlations as colored circles. The scale of the circles corresponds to the absolute value of the correlation coefficient. This way, low correlations are suppressed and high correlations stand out better.

This example uses linear_cmap() to map the colors of the plot in order to highlight the correlations between the pair of elements. This mapper is also uses construct_color_bar() to provide a visual legend below:

from itertools import combinations

import numpy as np
import pandas as pd

from bokeh.models import ColumnDataSource, FixedTicker
from bokeh.plotting import figure, show
from bokeh.sampledata.forensic_glass import data as df
from bokeh.transform import linear_cmap

elements = ("Mg", "Ca", "Fe", "K", "Na", "Al", "Ba")
pairs = list(combinations(elements, 2))

correlations = []
for x, y in pairs:
    matrix = np.corrcoef(df[x], df[y])
    correlations.append(matrix[0, 1])

x, y = list(zip(*pairs))

new_df = pd.DataFrame({
    "oxide_1": x,
    "oxide_2": y,
    "correlation": correlations,
    "dot_size": [(1+ 10 * abs(corr)) * 10 for corr in correlations],
})

x_range = new_df["oxide_1"].unique()
y_range = list(new_df["oxide_2"].unique())

source = ColumnDataSource(new_df)

p = figure(x_axis_location="above", toolbar_location=None, x_range=x_range, y_range=y_range, background_fill_color="#fafafa")

c = p.scatter(x="oxide_1", y="oxide_2", size="dot_size", source=source, fill_color=linear_cmap("correlation", "RdYlGn9", -0.5, 0.5), line_color="#202020")

color_bar = c.construct_color_bar(
    location=(200, 0),
    ticker=FixedTicker(ticks=[-0.5, 0.0, 0.5]),
    title="correlation",
    major_tick_line_color=None,
    width=150,
    height=20,
)

p.add_layout(color_bar, "below")

p.axis.major_tick_line_color = None
p.axis.major_tick_out = 0
p.axis.axis_line_color = None
p.grid.grid_line_color = None
p.outline_line_color = None

show(p)