History of the Scatterplot
According to statistician Edward Tufte, more than 70% of all charts in scientific publications are scatterplots. Although no physical evidence remains, research by Friendly and Denis indicates that the first scatterplot was made in 1833 by the English scientist John Frederick W. Herschel. For his study of the orbits of double stars, Herschel plotted the position angle of a double star against the year in which the measurement was taken. This may sound like a line chart, given that one of the variables is time, but it was different: he was using the data to understand a fundamental relationship between two measurements, rather than just tracking a trend.
When to Use a Scatterplot?
1
When you need to discover if your data expresses a trend
Use a scatter plot when you need to see how the data is distributed and determine whether a trend exists in it. If the points cluster tightly, the trend may be visible directly; otherwise, adding a regression line, a statistical tool that expresses the trend mathematically, can make it clear. You can also use a scatter plot when a third variable needs to be displayed, by coding the points by color, shape, or size.
2
When you need to apply inferential statistics and predict a future trend
Use scatter plots when you need to validate a hypothesis about the relationship between two variables. A regression line (linear, quadratic, cubic, etc., depending on one’s hypothesis) expresses a mathematical relationship between the independent and dependent variables. It also lets one quantify how well the line describes the trend in the data. If a good fit is established, the fitted line can be used for interpolation, where we estimate a value inside the range of our data points, and extrapolation, where we estimate a value outside that range.
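As a rough illustration of fitting a regression line and using it for interpolation and extrapolation, here is a minimal sketch in plain Python. It assumes the simplest case, a straight line fitted by ordinary least squares; the function names (fit_line, r_squared, predict) and the sample data are made up for this example and are not taken from any particular library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys explained by the line (closer to 1 is better)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def predict(x, slope, intercept):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Hypothetical sample data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
fit_quality = r_squared(xs, ys, slope, intercept)

inside = predict(3.5, slope, intercept)   # interpolation: 3.5 lies within 1..5
outside = predict(8, slope, intercept)    # extrapolation: 8 lies beyond the data
```

Extrapolated values deserve more caution than interpolated ones, since nothing in the data confirms that the relationship keeps the same shape outside the observed range.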
3
When you need to determine the degree of correlation which exists in your dataset
When you need an accurate measure of whether your data has a linear relationship, a correlation coefficient can be calculated for the paired values in a scatter plot. When the two sets of data are strongly linked together, we say they have a high correlation. Correlation is positive when the values increase together and negative when one value decreases as the other increases. This can be imagined as drawing a straight line or curve through the data so that it “fits” as well as possible. The more closely the points cluster around this imaginary line of best fit, the stronger the relationship between the two variables.
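One common choice of correlation coefficient is Pearson's r, which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). The sketch below computes it in plain Python; the function name pearson_r and the sample values are assumptions made for this illustration, and other coefficients (for example rank-based ones) may suit non-linear data better.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired values xs[i], ys[i]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: values that rise together, and values that move oppositely.
rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

Here rising is positive and falling is negative, matching the definitions of positive and negative correlation given above.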
Types of Scatterplots
1. Bubble Chart
A type of chart that displays three dimensions of data, where the area of each bubble expresses the third variable.
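Because the third variable is encoded by area rather than radius, the radius should grow with the square root of the value. A minimal sketch of that scaling follows, assuming non-negative values; the function name bubble_radius and the max_radius parameter are hypothetical, chosen only for this example.

```python
import math


def bubble_radius(value, max_value, max_radius=20.0):
    """Radius such that bubble area is proportional to value.

    The largest value in the data (max_value) is drawn with max_radius.
    """
    return max_radius * math.sqrt(value / max_value)


# A value four times smaller gets half the radius, hence a quarter of the area.
large = bubble_radius(100, 100)
small = bubble_radius(25, 100)
```

Scaling the radius directly by the value would instead make a bubble with twice the value look four times as large.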
2. Rug Plot
A plot of data for a single quantitative variable, displayed as marks along an axis. Used to visualise the distribution of the data, it is analogous to a histogram with zero-width bins, or a one-dimensional scatter plot.
3. Line Chart
A line chart, also called a line graph or curve chart, is a type of chart that displays information as a series of data points called ‘markers’ connected by straight line segments.
When Not to Use a Scatterplot?
1
When you have labels rather than paired numerical data
Use a bar graph when your data includes non-numeric (categorical) data, such as department names plotted against revenue, or a line chart if the data is ordinal (where numeric scores are assigned to non-numeric concepts, e.g. bad = 1, good = 2, very good = 3). In such scenarios scatter plots do not serve the purpose of discovering a trend, which is best done with continuous measured quantities, i.e. interval variables.
2
When you need to understand the rate of change between individual data points
Although scatter plots are similar to line graphs in that both start by mapping quantitative data points, the difference is that in a scatter plot the individual points are not connected directly with a line but are instead left to express an overall trend. If one needs to view the rate of change (slope) between individual data points, a line graph is visually more coherent.
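The rate of change that a line graph makes visible is the slope of each segment between neighbouring points. A small sketch of that calculation in plain Python, assuming every point has a distinct x value; the function name segment_slopes and the sample points are invented for illustration.

```python
def segment_slopes(points):
    """Slope of each straight segment joining consecutive points, ordered by x."""
    ordered = sorted(points)
    return [
        (y2 - y1) / (x2 - x1)  # rise over run; assumes x2 != x1
        for (x1, y1), (x2, y2) in zip(ordered, ordered[1:])
    ]


# Hypothetical (x, y) measurements.
slopes = segment_slopes([(0, 0), (1, 2), (3, 3)])
```

A regression line on a scatter plot summarises all points with one overall slope, whereas this gives one slope per pair of neighbouring points.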