Gallery of Data Visualization

Michael Friendly
Statistical Consulting Service and Psychology Department
York University

Like good writing, good graphical displays of data communicate ideas with clarity, precision, and efficiency. Like poor writing, bad graphical displays distort or obscure the data, make it harder to understand or compare, or otherwise thwart the communicative effect which the graph should convey.

This Gallery of Data Visualization displays some examples of the Best and Worst in statistical graphics, with the view that the contrast may be useful, inform current practice, and provide some pointers to current work. The term "Data Visualization" means rather different things to people in different disciplines. The emphasis here is somewhat more on the Statistical Graphics side than on the side reflected in, say, the comp.graphics.visualization newsgroup. Do you know of other examples of the Best or Worst in Statistical Graphics on the Web? Let me know through this Image submission form.

The page is organized as a collection of images, along with a few of the 1000 words each may be worth and some links to original sources. To reduce transmission time, most of the images are presented as thumbnails, with links to larger originals. Click on the thumbnail image or on the words "Full size".

The page is best viewed with an HTML3-capable browser.

Laurels

Historical milestones

Picture Words
Mathematica graphic of Minard's depiction of the fate of Napoleon's army (from Shaw & Tigg, 1993). Full size (297x183), or see Minard's original, Full size (569x273) [30K] (scanned from Tufte, 1983).

The French engineer, Charles Minard, illustrated the disastrous result of Napoleon's failed Russian campaign of 1812. The graph shows the size of the army by the width of the band across the map of the campaign on its outward and return legs, with temperature on the retreat shown on the line graph at the bottom. Many consider Minard's original the best statistical graphic ever drawn.

The first image shown here is drawn by a Mathematica function, NapoleonicMarchOnMoscowAndBackAgainPlot[], a tour-de-force of Mathematica graphics, described by Shaw and Tigg in Chapter 4 of Applied Mathematica: Getting Started, Getting it Done.

Playfair's charts. (a) Balance of Trade, Full size (510x356) [41K]; (b) Prices, wages and reigns, Full size (504x267) [109K] (from Julie Scott's William Playfair page)

William Playfair (1759-1823) is generally viewed as the inventor of most of the common graphical forms used to display data: the line plot, bar chart, and pie chart. His The Commercial and Political Atlas, published in 1786, contained a number of interesting time-series charts such as these.

In the first, the area between two time-series curves was emphasized to show the difference between them, representing the balance of trade. The second graph plots three parallel time series: prices, wages, and the reigns of British kings and queens. Among the benefits of graphical display, Playfair said, "On inspecting any one of these Charts attentively, a sufficiently distinct impression will be made, to remain unimpaired for a considerable time, and the idea which does remain will be simple and complete, at once including the duration and the amount."

Florence Nightingale's Coxcomb, Full size (486x501) [5K]

Florence Nightingale is remembered as the mother of modern nursing, but few realize that her place in history is at least partly linked to her use, following Playfair, of graphical methods to convey complex statistical information dramatically to a broad audience.

After witnessing deplorable sanitary conditions in the Crimea, she wrote Notes on Matters Affecting the Health, Efficiency and Hospital Administration of the British Army (1858), including several graphs of her own design, which she called "Coxcombs". This figure (reproduced with SAS/GRAPH) makes it abundantly clear that far more deaths were attributable to non-battle than to battle-related causes.

Aside from its historical interest, Nightingale's Coxcomb is notable for its display of frequency by area, like the pie chart. Unlike the pie chart, the Coxcomb keeps angles constant and varies radius (proportional to sqrt(frequency)), a principle used in the FourFold Display for 2x2xk tables.
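The area principle behind the Coxcomb can be checked numerically. The following is a minimal sketch, assuming equal-angle wedges and invented counts (the function names and the numbers are illustrative, not Nightingale's data): taking the radius as the square root of the frequency makes each wedge's area proportional to the frequency itself.

```python
import math

def coxcomb_radii(frequencies):
    """Radius of each equal-angle wedge, proportional to sqrt(frequency)."""
    return [math.sqrt(f) for f in frequencies]

def wedge_area(radius, n_wedges):
    """Area of a circular sector that spans 1/n_wedges of a full turn."""
    angle = 2 * math.pi / n_wedges
    return 0.5 * angle * radius ** 2

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
counts = [100, 400, 900]
radii = coxcomb_radii(counts)
areas = [wedge_area(r, len(counts)) for r in radii]

# Area ratios relative to the first wedge match the frequency ratios 1 : 4 : 9.
ratios = [a / areas[0] for a in areas]
print(radii, ratios)
```

Had the radius been made proportional to the frequency itself, the areas would grow as the square of the counts and exaggerate the larger categories.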

The 1854 London Cholera Epidemic. Full size (368x320)

One of the first uses of a map to display epidemiological data was this dot chart (from Tufte, 1983, p. 24) by Dr. John Snow showing deaths from cholera (dots) in relation to the locations of public water pumps. Tufte says, "Snow observed that cholera occurred almost entirely among those who lived near (and drank from) the Broad Street water pump. He had the handle of the contaminated pump removed, ending the neighborhood epidemic which had taken more than 500 lives."

Album de statistique graphique. Full size (246x326) [19K]

By the mid 1800s, many new forms of statistical graphics were being used to display data of economic and national interest in England and in France. This image, brought to my attention by Antoine de Falguerolles, is one of many beautiful statistical maps of France from the Album de statistique graphique.

Bright ideas

Boxplot of the NJ Pick-it Lottery. Full size (640x495) (from the S-PLUS book)

An important principle of graphical display is to focus the viewer's attention on what should be seen in the data. Tukey's boxplot suppresses all detail in the middle of a distribution, but displays individual observations in the extremes, where they may need to be noticed.

This boxplot shows the payoff of winning numbers in the New Jersey Pick-It Lottery, grouped by leading digit of the winning number. Players pick a 3-digit number, and the payoff is divided by the number of winners. The graph shows clearly that payoffs for numbers 000-099 are substantially higher, presumably because fewer people picked numbers less than 100.

US Visibility Map, Full size (531 x 335)

Data maps, particularly of the United States, are difficult to do because the area of each geographic unit serves as the visual container for the data to be displayed; our visual understanding of the data is confounded with the geographic boundaries.

One solution, suggested by Mark Monmonier (How to Lie With Maps), is to use a schematic map which partially equates the areas of geographic regions. The resulting Visibility Map sacrifices some visual fidelity in state boundaries, but helps the viewer see the symbols for small states like Rhode Island and Delaware.

Multivariate comparisons of means. Full size (504x433)

It is difficult to compare the means of several groups on many variables. Profile or parallel coordinate plots are often confusing when the curves for different groups cross a great deal. The multivariate star plot shows each of an arbitrary number of variables on radial axes from the origin, here for the means of automobile models, classified by region of manufacture.

In this plot, the variables Price, Gear Ratio and Turning Circle are first reflected so that larger values represent "better" for all variables; all variables are then scaled to a 0-1 range. Variables are arranged around the circle by a multivariate effect ordering according to their order on the largest discriminant dimension. The error bars next to each radial axis show the smallest difference between means required for a (univariate) .05 significant difference.
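The two preprocessing steps described above, reflection and scaling to a 0-1 range, can be sketched as follows. This is only an illustration with invented group means; the helper names `reflect` and `scale_01` are hypothetical and are not taken from the software that drew the plot.

```python
def reflect(values):
    """Negate values so that a smaller original value becomes a larger one."""
    return [-v for v in values]

def scale_01(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: no spread to display, so place them mid-axis.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical group means. Price is "better" when low, so it is reflected first.
price_means = [4000.0, 6000.0, 12000.0]
mpg_means = [20.0, 30.0, 40.0]

price_axis = scale_01(reflect(price_means))  # cheapest group plots at 1.0
mpg_axis = scale_01(mpg_means)               # highest-mileage group plots at 1.0
print(price_axis, mpg_axis)
```

After both steps, a larger radial distance means "better" on every axis, which is what lets the star shapes be compared at a glance.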

Enhanced scatterplot matrices. Full size (394x290)

The scatterplot matrix displays the relationships among all pairs of many variables. This example shows the relation among three measures of social competence, but the data in each plot are stratified by the type of setting. To aid perception of how the relations differ across setting, each subplot is enhanced with a data ellipse showing the strength of the relationship. The diagonal panels show the univariate distribution of each variable, again stratified by type of setting. Color is used effectively to keep the settings visually distinct.

Tile Maps for Temporal Patterns Full size (469x602) [18K]. From Mintz, D., Fitz-Simons, T. & Wayland, M. "Tracking Air Quality Trends with SAS/GRAPH", SUGI 22 Proceedings, 807-812.

Description: The tile map is a useful semi-graphical display for data with seasonal variation. One square (tile) is plotted for each day of the year; the color of the tile shows the level of ozone concentration in Los Angeles for that day, with lighter shades indicating lower concentrations and darker shades showing higher concentrations. (Ed. note: this is true of the B/W version in the printed paper, but not true of the color version shown here, which uses 'elevation mapping' of colors to ozone concentration.)

The figure shows the data for the 10 years, 1982 - 1991. Within each year, ozone concentrations are higher in the summer months; over the years, the concentrations in the summer months have decreased.

Graphical Excellence

Turning Tables into Graphs

Complex, high-dimensional data present special challenges to graphical display. Dan Carr describes the construction of this graph in the Statistical Computing and Graphics Newsletter, V6(3) [v63.ps.gz] of the ASA Statistical Computing and Graphics Section. Carr says: "A little effort went into splitting the data set into cells, some went into making a function to plot the data in a cell, and a great deal of effort went into attending to details."

Steve Majewski's Boxplot. Full size (465 x 300) [6k] (Links to LispStat code and other examples here)

Description: The graph represents elemental concentrations of Calcium derived from least squares fit of filtered X-ray Energy Spectra - measurements taken from Cytoplasm, Mitochondria, and over the whole cell from treated and control samples.

The data points are printed with jittered/randomized X coordinates, to keep them from overprinting and obscuring each other. The boxes are "standard" boxplots (using a modification of the standard XlispStat boxplot function) showing the median value, boxing the inner quartiles, and showing the max and min range.

The white and gray ovals are centered around the weighted means of the data, with the vertical radius being one and two standard deviations, respectively. The thin gray horizontal line is the mean of the combined data.

Trellis plot of Barley data Full size (384 x 987) [15k]

Description: The figure is a Trellis display of data from an agricultural field trial of barley yields at six sites in Minnesota; ten varieties of barley were grown in each of two years. The data were presented by R. A. Fisher in The Design of Experiments and analyzed subsequently by many others.

William Cleveland's display of these data shows an apparent surprise missed by previous investigators, which occurs at the Morris site: for all other sites, 1931 produced a significantly higher overall yield than 1932. The reverse is true at Morris. But most importantly, the amount by which 1932 exceeds 1931 at Morris is similar to the amounts by which 1931 exceeds 1932 at the other sites. More displays, a statistical modeling of the data, and some background checks on the experiment led to the conclusion that the data are in error -- the years for Morris were inadvertently reversed. The background of the data, and analysis with Trellis are described in more detail in these Case Studies and (in PostScript format) in The Visual Design and Control of Trellis Display.

The graph uses main effect ordering to arrange the 6 sites and 10 barley varieties from bottom to top according to increasing values of the median yields (collapsed over other factors). This greatly aids perception of trends in the data and makes the Morris data stand out as unusual.
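Main effect ordering amounts to sorting the levels of a factor by a summary of the response. A minimal sketch, assuming yields grouped by site and using invented site names and values (not the barley data):

```python
from statistics import median

def main_effect_order(groups):
    """Return group labels sorted by increasing median of their values."""
    return sorted(groups, key=lambda label: median(groups[label]))

# Hypothetical yields pooled over varieties and years.
yields_by_site = {
    "Site A": [30.0, 35.0, 40.0],
    "Site B": [20.0, 22.0, 27.0],
    "Site C": [45.0, 50.0, 48.0],
}

# Panels would be drawn bottom to top in this order.
order = main_effect_order(yields_by_site)
print(order)
```

The same function applied to values grouped by variety would give the ordering of the rows within each panel.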

Visual delights

Math Flavored Amusements

A collection of visual delights from the University of Pennsylvania, candy for the mathematical eye.

Darts

The Lie Factor

Graphs rely on our understanding that a number is represented visually by the magnitude of some graphical element. Tufte states the principle that: "The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the quantities represented." He measures the violation of this principle by the Lie Factor, defined as the ratio of the size of an effect shown in the graphic to the size of the effect in the data.

The Lie Factor. Full size (from Tufte, 1983, p.57; gif image by Clay Helberg, Pitfalls of Data Analysis)

This graph, from the NY Times, purports to show the mandated fuel economy standards set by the US Department of Transportation. The standard required an increase in mileage from 18 to 27.5, an increase of 53%. The magnitude of increase shown in the graph is 783%, for a whopping lie factor = (783/53) = 14.8!
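The arithmetic above can be reproduced in a few lines. This is a sketch using only the figures quoted in the text (18 and 27.5 for the data, 783% for the graphic); the function names are illustrative.

```python
def relative_change(first, second):
    """Size of an effect, expressed as a fraction of the starting value."""
    return (second - first) / first

def lie_factor(effect_in_graphic, effect_in_data):
    """Ratio of the effect shown in the graphic to the effect in the data."""
    return effect_in_graphic / effect_in_data

# Effect in the data: the standard rises from 18 to 27.5.
data_effect = relative_change(18.0, 27.5)

# Effect in the graphic: the 783% increase quoted above, as a fraction.
graphic_effect = 7.83

factor = lie_factor(graphic_effect, data_effect)
print(f"data effect {data_effect:.1%}, lie factor {factor:.1f}")
```

A lie factor of 1 means the graphic shows the effect at its true size; values well above or below 1 indicate distortion.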

The Lie Factor. Full size (from Tufte, 1983, p.69)

Another key element in making informative graphs is to avoid confounding design variation with data variation. This means that changes in the scale of the graphic should always correspond to changes in the data being represented. This graph violates that principle by using area to show one-dimensional data, giving a lie factor = 2.8.

Missed Opportunities

The Challenger Disaster Full size (451x228) [3K].

The Space Shuttle Challenger exploded shortly after take-off in January 1986. Subsequent investigation determined that the cause was failure of the O-ring seals used to isolate the fuel supply from burning gases. This figure (scanned badly from Wainer, 1995) shows a graph accompanying the Report of the Presidential Commission on the Space Shuttle Challenger Accident, 1986 (vol 1, p. 145) in the aftermath of the disaster.

NASA staff had analysed the data on the relation between ambient temperature and number of O-ring failures (out of 6), but they had excluded observations where no O-rings failed, believing that they were uninformative. Unfortunately, those observations had occurred when the launch temperature was relatively warm (65-80 degF).

The Challenger Disaster Full size (494x424)[4K].

Reanalysis of the O-ring data involved fitting a logistic regression model. This provides a predicted extrapolation (black curve) of the probability of failure to the low (31 degF) temperature at the time of the launch and confidence bands on that extrapolation (red curves). See also Tappin, L. (1994). "Analyzing data relating to the Challenger disaster". Mathematics Teacher, 87, 423-426.

There's not much data at low temperatures (the confidence band is quite wide), but the predicted probability of failure is uncomfortably high. Would you take a ride on Challenger when the weather is cold?
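To make the idea of extrapolating from a logistic regression concrete, here is a minimal sketch that fits a two-parameter model by Newton-Raphson in plain Python. The temperatures and outcomes below are invented for illustration and are not the O-ring data; the outcome is simplified to a yes/no indicator, and the confidence bands shown in the figure are not computed.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, iterations=25):
    """Fit P(y=1) = sigmoid(a + b*(x - mean(x))) by Newton-Raphson."""
    mean = sum(x) / len(x)
    xc = [v - mean for v in x]  # centering keeps the 2x2 system well scaled
    a = b = 0.0
    for _ in range(iterations):
        p = [sigmoid(a + b * v) for v in xc]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(v * (yi - pi) for v, yi, pi in zip(xc, y, p))
        # Information matrix X'WX with weights p(1-p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * v for wi, v in zip(w, xc))
        h11 = sum(wi * v * v for wi, v in zip(w, xc))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b, mean

def predict(temp, a, b, mean):
    """Predicted probability of failure at a given temperature."""
    return sigmoid(a + b * (temp - mean))

# Hypothetical launches: temperature (degF) and whether any failure occurred.
temps_f = [50, 55, 60, 65, 70, 75, 80]
failed = [1, 1, 0, 1, 0, 0, 0]

a, b, mean = fit_logistic(temps_f, failed)
p_cold = predict(31, a, b, mean)  # extrapolation far below the observed range
p_warm = predict(70, a, b, mean)
print(f"slope {b:.3f}, P(fail | 31F) {p_cold:.3f}, P(fail | 70F) {p_warm:.3f}")
```

Note that dropping the launches with no failures, as described above, would remove exactly the observations that anchor the warm end of the fitted curve.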

Context: Compared to What?

Display data in the proper context Full size (412x374) [10K] (from Tufte, 1983, p. 74)

Does stricter enforcement of speed limits lead to a decline in traffic fatalities? You surely can't tell from this graph.

What does it mean to show a decline in traffic fatalities over two years? What was the trend before the change in enforcement? after? Failure to show the relevant context produces a thoroughly misleading display.

... That context tells a different story Full size (412x333) [8K] (from Tufte, 1983, p. 74)

Now we can see that there must be important factors other than stricter enforcement. This information would be completely missed if all you had to look at was the former graph.

The Great Currency Mountain Full size (354 x 217) [4K] (From A.J. Jaffe & H. F. Spirer, Misused Statistics, p. 77)

A Wall Street Journal article headlined "Americans hold increasing amounts of cash despite inflation and many other drawbacks" displays this graph. Unfortunately, the data plotted have not been adjusted for inflation.

... Cut Down to Size Full size (354 x 200) [2K]

Redrawn, adjusting for inflation. We now see that there has been a slight rise in the amount of currency in circulation, but hardly anything to get excited about.

What else has not been adjusted for?
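Adjusting a money series for inflation means dividing each nominal amount by a price index and rescaling to a base period. A minimal sketch with invented figures (neither the currency series nor the index values come from the article or the book):

```python
def to_constant_dollars(nominal, price_index, base_index):
    """Convert nominal amounts to constant dollars of the base period."""
    return [n * base_index / p for n, p in zip(nominal, price_index)]

# Hypothetical nominal amounts and price index for three periods.
nominal_currency = [100.0, 160.0, 220.0]
price_index = [100.0, 150.0, 200.0]

# Express everything in dollars of the first period (index = 100).
real_currency = to_constant_dollars(nominal_currency, price_index, base_index=100.0)
print(real_currency)
```

In this made-up example the nominal series more than doubles while the deflated series rises by only 10 percent, the same kind of contrast as between the two graphs.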

Have Something to Say

Art or Artifice? Full size (197x336) [6K].

As a substitute for substance, one can try lots of color, 3D effects, or disguised redundancy. This graph uses all three techniques to display just five numbers. Note the clever use of mirror-imaging -- the top series is just (100 - the bottom series) -- and the interesting use of curved lines, front and back, to avoid the appearance that there's a lot less here than meets the eye.

Tufte (1983, p.118) says, "This may well be the worst graphic ever to find its way into print."