Scatterplot

A scatter plot, also known as a scatter chart or a scatter graph, is a two-dimensional data visualization that uses dots to represent the values obtained for two different variables – one plotted along the x-axis and the other plotted along the y-axis. Scatter plots are primarily used to indicate the relationship between two variables (linear, parabolic, hyperbolic, etc.).

Quick details

What: Discover Change, Distribution

Why: Determine if your cause and effect are interrelated and predict the future

History of Scatterplot

According to statistician Edward Tufte, more than 70% of all charts in scientific publications are scatterplots. Although no physical evidence remains, the research of Friendly and Denis indicates that the first scatterplot was made in 1833 by the English scientist John Frederick W. Herschel. For his study on the orbits of double stars, Herschel plotted the position angle of a double star against the year the measurement was taken. This may sound like a line chart, given that one of the variables is time, but it was different: he was using the data to understand a fundamental relationship between two measurements, rather than just tracking a trend.

Herschel’s data on the orbits of γ Virginis, together with his eye-smoothed, interpolated curve (solid line, hollow circles) and a less-smoothed curve (gray, dashed). The circle around each data point has a size proportional to the weight of that observation.


When to Use a Scatterplot?

1. When you need to discover if your data expresses a trend

Use a scatter plot when you need to see how the data are distributed and determine whether a trend exists. Depending on how tightly the points cluster together, a trend may be clear by eye, or it can be made explicit by adding a regression line, a statistical tool that expresses the trend mathematically. A scatter plot can also display an additional (third) variable if the points are coded by color, shape, or size.

The Atlantic Cities (2012) plots a city’s “Metro Health Index” (a factor measuring the share of people who smoke or are obese) as it correlates to the city’s median income.


2. When you need to invest in inferential statistics and predict a future trend

Use scatter plots when you need to validate a hypothesis about the relationship between the data points. A regression line (linear, quadratic, cubic, etc., depending on one’s hypothesis) expresses a mathematical relationship between the independent and dependent variables. It also helps quantify how confidently one can say that the line truly describes the trend in the data. If a good fit is established, scatter plots can be used for interpolation, where we estimate a value inside the range of our data points, and for extrapolation, where we estimate a value outside that range.
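As a rough illustration of interpolation and extrapolation, the Python sketch below fits a straight line by ordinary least squares and then evaluates it inside and outside the observed x-range. The data values and the helper names `fit_line` and `predict` are made up for this example, and a linear fit is assumed only for simplicity; a quadratic or cubic hypothesis would need a different fit.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Hypothetical paired observations (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
inside = predict(slope, intercept, 2.5)  # interpolation: 2.5 lies within 1..5
outside = predict(slope, intercept, 8)   # extrapolation: 8 lies beyond the data

print(f"y = {slope:.2f}x + {intercept:.2f}")
print(f"interpolated y(2.5) = {inside:.2f}, extrapolated y(8) = {outside:.2f}")
```

The extrapolated value deserves more caution than the interpolated one, since nothing in the data confirms that the trend continues beyond the observed range.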

Scatterplots of reported versus extrapolated annual number of drinks from (a), 1-month (b), 3-month


3. When you need to determine the degree of correlation which exists in your dataset

When you need an accurate measure of whether your data has a linear relationship, a correlation coefficient can be calculated for the plotted data. When the two sets of data are strongly linked, we say they have a high correlation. Correlation is positive when the values increase together and negative when one value decreases as the other increases. This can be imagined as drawing a straight line or curve through the data so that it “fits” as well as possible. The more closely the points cluster around this imaginary line of best fit, the stronger the relationship between the two variables.
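One common choice for this calculation is the Pearson correlation coefficient, which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). The sketch below computes it in plain Python; the function name `pearson_r` and the sample data are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data (illustrative only).
x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 5]     # tends to increase with x
falling = [10, 8, 6, 4, 2]   # decreases along an exact straight line

print(pearson_r(x, rising))   # positive correlation
print(pearson_r(x, falling))  # perfect negative correlation
```

Note that this coefficient only measures linear association: points that follow a tight curve can still produce a value near zero.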

Regression coefficient expressing different types of correlation existing in a scatterplot


Types of Scatterplots

1. Bubble Chart

A type of chart that displays three dimensions of data, where the area of each bubble expresses the third variable.
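Because the third variable is read from the bubble's area rather than its radius, the scaling step matters. The sketch below is one possible way to do it in Python: a hypothetical helper `bubble_areas` scales values so that area is proportional to value, and `draw_bubble_chart` passes those areas to matplotlib's `scatter`, whose `s` argument is the marker area in points squared. It assumes matplotlib is installed; the data and the output file name are made up.

```python
def bubble_areas(values, max_area=800.0):
    """Scale non-negative values so that marker AREA is proportional to value."""
    if not values or min(values) < 0:
        raise ValueError("values must be a non-empty list of non-negative numbers")
    largest = max(values)
    if largest == 0:
        return [0.0 for _ in values]
    return [max_area * v / largest for v in values]


def draw_bubble_chart(xs, ys, values, path="bubble.png"):
    """Save a bubble chart to `path` (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # `s` is the marker area in points squared, so pass the scaled areas directly.
    ax.scatter(xs, ys, s=bubble_areas(values), alpha=0.5)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.savefig(path)
    plt.close(fig)


# Hypothetical data: two plotted variables plus a third shown as bubble size.
xs = [30, 45, 60, 75]
ys = [0.42, 0.55, 0.61, 0.70]
third = [1.2, 3.5, 0.8, 2.0]

print(bubble_areas(third))
```

Calling `draw_bubble_chart(xs, ys, third)` would write the chart to `bubble.png`. Scaling the radius instead of the area would visually exaggerate large values, which is why the helper works with area.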

2. Rug Plot

A plot of data for a single quantitative variable, displayed as marks along an axis. Used to visualise the distribution of the data, it is analogous to a histogram with zero-width bins, or a one-dimensional scatter plot.

3. Line Chart

A line chart, also called a line graph or curve chart, displays information as a series of data points called ‘markers’ connected by straight line segments.

When Not to Use a Scatterplot?

1. When you do not have paired numerical data but labels

Use a bar graph when your data includes non-numeric (categorical) data, such as department names plotted against revenue, or a line chart if the data is ordinal (where numbers are assigned to non-numeric concepts, e.g. bad = 1, good = 2, very good = 3). In such scenarios scatter plots do not serve the purpose of discovering a trend, which is best done with continuous measured quantities, i.e. interval variables.

2. When you need to understand the rate of change between individual data points

Although scatter plots are similar to line graphs in that both start by mapping quantitative data points, in a scatter plot the individual points are not connected directly with a line; instead, together they express a trend. If one needs to view the rate of change (slope) between individual data points, a line graph is visually more coherent.