One of the most basic things you might want to do with a multivariate dataset is check out which variables are correlated with each other. The New York Times featured stats program/language/way of life R has a bunch of tools for such correlation plots (see especially this package) but I make the rest of my graphs with ggplot2 and I since I’m slightly obsessive about maintaining the same formatting for all the plots in a document I decided I should try and recreate a plot like this one from the corrgram package:
This one does a nice job of showing the strength of the correlation (the intensity of each cell in the matrix) and direction of the correlation (in this case blue for positive, red for negative). I also wanted to be able to represent the significance of each correlation and, because I’m a ‘show your data’ kind of a guy, at least have the option to display the value of the correlation coefficient. And there began a journey through some very strange seas of thought. Since I only managed to kludge together my solution by cribbing from people like learnr that have blogged their own approach to putting together a plot I thought I’d share the result here – if you like (or just use) R you might find the whole thing interesting, if you like graphs then you might want to look at the pictures and if you are a sensible person then please, feel free to move along.
Step 1: the data
Let’s use Fisher’s iris dataset (a dataset so famous it has a wikipedia page!) as our example. We can make a correlation matrix for the dataset with ‘pure’ R and use the your favourite function from this thread to make a similar matrix for the p-value of each correlation (or calculate them one correlation at a time if you really want). Finally we’ll convert the p-values to traditional asterisk representations and put everything into a new data frame.
library(ggplot2) data(iris) #make the correlation matrix form the first four columns (5th is species names) (c <- cor(iris[1:4])) # Sepal.Length Sepal.Width Petal.Length Petal.Width #Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411 #Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259 # ... # ... # same with p-values, then use symnum() to represent the values as asterisks p <- cor.pval(iris[1:4] # Sepal.Length Sepal.Width Petal.Length Petal.Width #Sepal.Length 0.0000000 1.518983e-01 0.000000e+00 0.000000e+00 #Sepal.Width 0.1518983 #0.000000e+00 4.513314e-08 4.073229e-06 # ... # ... stars <- as.character(symnum(p, cutpoints=c(0,0.001,0.01,0.05,1), symbols=c('***', '**', '*', '' ), legend=F)) #now put them alltogether (with melt() to reshape the correlation matrix) molten.iris <- cbind(melt(c), stars) names(molten.iris) <- c("M1", "M2", "corr", "pvalue") molten.iris # M1 M2 corr pvalue #1 Sepal.Length Sepal.Length 1.0000000 *** #2 Sepal.Width Sepal.Length -0.1175698 #3 Petal.Length Sepal.Length 0.8717538 *** # ... # ...
Step2: Get plotting
In ggplot2 plots are built up layer on layer by adding 'geoms' - mappings of your data to a particular shape (bar, point, line...). After a few experiments with trying to use a matrix of points to represent the correlations in the iris dataset I settled on geom_tile. Here's how it looks by default
(p <- ggplot(molten.iris, aes(M1, M2, fill=corr)) = theme_bw() + geom_tile())
Done! Well perhaps not quite yet. Correlation matrices are symmetrical so the bottom right triangle of the plot replicates the top left - it seems a waste to say the same thing twice. Similarly, the diagonal running from bottom left to top right is revealing the unsurprising fact that each measure is perfectly correlated to itself. Add to that the default scale for the fill of each cell doesn't make it obvious that correlation coefficients close to zero are weaker than those near 1 or -1 and we have a little tweaking to do.
Step 3: More data
In ggplot each new layer can have its own data frame, so if we make one with only data from the lower triangle of the original correlation matrix we can plot on those values. (We can use a similar trick to make the diagonal of the plot show each variable's name).
#define each triangle of the plot matric and the diagonal (mi.ids) mi.ids <- subset(molten.iris, M1 == M2) mi.lower <- subset(molten.iris[lower.tri(c),], M1 != M2) mi.upper <- subset(molten.iris[upper.tri(c),], M1 != M2) # now plot just these values, adding labels (geom_text) for the names and th values (p1 <- p + geom_tile(data=mi.lower) + geom_text(data=mi.lower, aes(label=paste(round(corr,3), pvalue))) + geom_text(data=mi.ids, aes(label=M2, colour="grey40")) )
Hmmm, adding the labels to each cell worked fine but cells themselves are all over the place because ggplot reorders the variables before it makes the axes of the plot.
Step 4: Kludge then Polish
Geoms aren't the only parts of a ggplot layer, we can also add scales. By specifying each axis has each measurement in the same order as the correlation matrix our lower-triangle really is plotted in the bottom corner, thankfully the melt function we used to make our original data frame keeps the variables in order - we can use them to set the x and y axes.
. At the same time scale_colour_identity() will make the labels pick up the specified colours and the gradient for the scale_fill is specified.
meas <- as.character(unique(molten.iris$M2)) (p2 <- p1 + scale_colour_identity() + scale_fill_gradientn(colours= c("red", "white", "blue"), limits=c(1,-1)) + scale_x_discrete(limits=meas[length(meas):1]) + #flip the x axis scale_y_discrete(limits=meas) )
Pretty good. A little more polishing to suit my rather minimalist tastes:
none <- theme_blank() p2 + xlab(NULL) + ylab(NULL) + opts(axis.text.x=none) + opts(axis.text.y=none) + opts(axis.ticks=none) + opts(panel.border=none) + opts(legend.position='none') + opts(title="Correlations in the iris dataset")
I'm pretty happy with the results, here is bigger matrix with a few measurements from my snails:
It should be pretty easy to tinker with the graphs above to add more or different information to the plots. Perhaps colored cells in one triangle and the correlation coefficient/p-value above. It doesn't even need to represent a correlation matrix, you could have the genetic distance between samples in a phylogenetic study below and physical distance above.