Now let us apply my Rule #1 and visualise the data (see previous Chris Martin post).

Plot A is a histogram in which I have grouped each of the two sets of data (the partnership scores when Chris was Out and the scores when he was Not Out) into bins. Each bin is 5 runs wide except for the first. That is, the first bin is from 0 to 2.5 (really 0 to 2), the second from 2.5 to 7.5, and so on. What can be seen from this is that there appear to be more very low partnerships when Chris was Out than when the other batsman was Out. However, don’t be fooled by histograms like this. Remember, the number of innings in which he was Out (52) was not the same as the number in which the other batsman was Out (49). This may distort the graph.
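One way to stop unequal group sizes distorting a histogram is to plot the share of scores in each bin rather than the raw count. Here is a minimal sketch of that idea using the bin edges described above. The function names and the scores are made up for illustration; they are not Chris Martin's partnership data.

```python
from collections import Counter

def bin_index(score):
    """Bin 0 holds scores 0-2; later bins are 5 runs wide (3-7, 8-12, ...)."""
    return 0 if score < 2.5 else int((score - 2.5) // 5) + 1

def bin_proportions(scores):
    """Return {bin index: share of all scores that fall in that bin}."""
    counts = Counter(bin_index(s) for s in scores)
    return {b: counts[b] / len(scores) for b in sorted(counts)}

# Invented scores, for illustration only.
out_scores = [0, 0, 1, 2, 4, 9, 15, 30]
not_out_scores = [0, 3, 6, 8, 11, 20]

print(bin_proportions(out_scores))
print(bin_proportions(not_out_scores))
```

Because each set of proportions adds up to 1, the two groups can be compared bin by bin even though one group has more innings than the other.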

Plot B is better, but harder to read. Each black or red dot is a score. The coloured boxes show the range called the “interquartile range”. That is, 25% of the scores are below the box and 25% are above. The line in the middle of the box is the median – that is, 50% of scores are below and 50% are above. The “whiskers” (lines above and below the box) show the range of scores.

Plot C is less often used in the medical literature (at least), but is really very useful. For each of the two sets of data it plots the cumulative percentage of scores at or below a particular score. For example, we can read off the graph that about 27% of the partnership scores for when Chris Martin was out were zero. If we look at the dashed line at 50% and where it intersects the blue line, then we see that 50% of the scores for when Chris Martin was out were 2 or below. This is a bit more informative than plot B.
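The numbers behind a plot like C are simple to calculate: for any score, count what percentage of the data is at or below it. A minimal sketch follows, with a hypothetical function name and invented scores (not the real partnership data).

```python
def percent_at_or_below(scores, x):
    """Percentage of scores that are less than or equal to x."""
    return 100 * sum(s <= x for s in scores) / len(scores)

# Invented scores, for illustration only.
scores = [0, 0, 1, 2, 4, 9, 15, 30]

for x in (0, 2, 10):
    print(x, percent_at_or_below(scores, x))
```

Plotting this percentage against every possible score gives the stepped cumulative curve, and the median is the score at which the curve first reaches 50%.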

What all the plots show is that the distribution of scores in both data sets is highly skewed. That is, there are many more scores at one end of plot A than the other, or the lines in plot C are not straight lines. This is very important because it tells us what tests we cannot use and how we should not present data. Quite often when I referee papers, and in papers I read, I see the averages (means) presented for data like this. This is wrong. They are presented like:

Chris Out: 8.4±13.9

Chris Not Out: 10.8±11.8

The first number is the mean (ie add all the scores and divide by the number of innings). The second number after the “plus-minus” symbol is called the standard deviation. It is a measure of the spread of the numbers around the mean. In this case the standard deviation is large compared to the mean. Indeed, a standard deviation more than half the size of the mean is a bit of a giveaway that the distribution is highly skewed and that presenting the numbers this way is totally meaningless. If the scores followed the familiar bell-shaped (normal) distribution, we should be able to look at the mean and standard deviation and conclude that about 95% of the scores are between two standard deviations below the mean and two above. However, two below (8.4 – 2*13.9) is a negative score! Not possible.
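The arithmetic is quick to check for both groups. This small sketch uses the means and standard deviations quoted above; the function name is my own.

```python
def two_sd_range(mean, sd):
    """Range that would hold about 95% of values IF the data were roughly normal."""
    return mean - 2 * sd, mean + 2 * sd

for label, mean, sd in [("Chris Out", 8.4, 13.9), ("Chris Not Out", 10.8, 11.8)]:
    low, high = two_sd_range(mean, sd)
    print(f"{label}: {low:.1f} to {high:.1f}")
```

Both lower limits come out negative (-19.4 and -12.8), which no partnership score can be.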

What should be presented is the median with the interquartile range (ie the range from the score that 25% of scores are below to the score that 75% are below).

Chris Out: 2.0 (0-12.8)

Chris Not Out: 8 (1-16.5)
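For anyone wanting to produce a summary like this themselves, here is a minimal sketch using Python's standard `statistics` module. The scores are invented for illustration, and the choice of `method="inclusive"` is an assumption on my part: statistical packages use slightly different rules for quartiles, so the limits may differ a little from one package to another.

```python
import statistics

def median_iqr(scores):
    """Return (median, lower quartile, upper quartile) for a list of scores."""
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q2, q1, q3

# Invented scores, for illustration only.
scores = [0, 0, 1, 2, 4, 9, 15, 30]

median, q1, q3 = median_iqr(scores)
print(f"{median} ({q1}-{q3})")
```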

We are now ready to apply a statistical test found in most statistical packages to see if Chris being out or the other batsman being out was better for the partnership. The test we apply is called the Mann-Whitney U test (or the Kruskal-Wallis test if we were comparing 3 or more data sets). Some people say this is comparing the medians – it is not, it is comparing the whole of the two data sets. If you don’t believe me, see http://udel.edu/~mcdonald/statkruskalwallis.html.

So, I apply the test and it gives me the number p=0.12. What does this mean? It means that if there were really no difference between the Out and Not Out partnerships, and Chris Martin were to bat in another 104 innings, and another, and another etc, then 12% of the time we would see a difference as large as (or larger than) the one we actually see (see significantly p’d for more explanation of p). 12% for a statistician is quite large, and so we would say there is no good evidence of an overall difference in partnerships whether Chris Martin was Out or Not Out.

Alas, Chris Martin’s playing days are over and we have the entire “population” of his scores to assess his batting prowess. The kind of statistical test I’ve presented is only really useful when we are looking at a sample from a much greater population. However, in the hope that Chris may make a return to Test cricket one day, what is presented here should give pause for thought to the next batsman who goes out to bat with him… perhaps there is not a lot to gain by swinging wildly and thereby increasing their chances of getting out; they are probably not improving the chances of the team.
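For the curious, here is a rough sketch of what a Mann-Whitney U test does under the hood: pool the two data sets, rank every score (tied scores share the average rank), and ask whether one group's ranks add up to more than chance would suggest. This version uses the normal approximation with a tie correction and no continuity correction, which is an assumption of mine; a statistical package (for example `scipy.stats.mannwhitneyu`) may use an exact method and give a slightly different p. The scores are invented, so this does not reproduce the p=0.12 above.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U for the first group, p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted(list(x) + list(y))

    # Give every distinct value the average of the ranks it occupies.
    ranks, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i                           # size of this tie group
        tie_term += t**3 - t
        i = j

    u1 = sum(ranks[v] for v in x) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every score identical: no evidence of any difference
        return u1, 1.0
    z = (u1 - mean_u) / math.sqrt(var_u)
    return u1, math.erfc(abs(z) / math.sqrt(2))

# Invented partnership scores, for illustration only.
out_scores = [0, 0, 1, 2, 4, 9, 15, 30]
not_out_scores = [0, 3, 6, 8, 11, 20]
print(mann_whitney_u(out_scores, not_out_scores))
```

Notice that the calculation uses the rank of every score, not just the middle one, which is why the test compares the whole of the two data sets rather than only the medians.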

Tagged: Average, Chris Martin, cricinfo, Cricket, ESPN CricInfo, Mann-Whitney, Mean, Median, Statistics

]]>The best place to begin any quest is with a graph. Here is a graph showing all 104 of Chris’s innings in chronological order. On it are the scores when he was Out (red lines) and the scores when he was Not Out (blue lines). Funnily enough, he was Out and Not Out exactly 52 times each. We can see immediately that the peak of his batting performance was a score of 12 Not Out, which occurred approximately half-way through his career. His best form seems to be innings 30 to 34, where he went undefeated in 5 successive innings scoring 17 runs. On the other hand, he had several bad runs where he was Out for zero (red marks below the zero line).

One of the interesting things is that his first 4 innings may have given a false impression of his batting prowess. In his first innings he scored 7, well above his eventual average of 2.36. In his 2nd and 4th innings he was 0 Not Out. In between he was 5 Not Out. This coincided with his peak average ever, 12 (orange triangles). This allows us to note an important feature of statistics. Let us pretend for a moment the average of 2.36 was “built-in” to Chris Martin from the beginning. This means that it was inevitable that after many innings he would end up with that average. But it is not inevitable that any one innings taken at random is equal to that mean. Importantly, with only a few samples (ie the first few innings) the average at that point can be a long way from the “real” average. This is a phenomenon caused by sampling from a larger population. It is why we have to be very cautious with conclusions drawn from a small sample.

For example, if General Practitioners throughout the country see on average 5 new leukemia cases a year, but we sample only three General Practitioners from Christchurch who saw 8, 9 and 14, then we would be quite wrong to conclude that Christchurch has a higher average leukemia rate than other regions. We need a much larger sample from Christchurch to get a reasonable estimate of Christchurch’s average. There are statistical techniques for deciding what proportion of General Practitioners should be sampled and what the uncertainty is in the average we arrive at. Graphs also help… we can see with Chris that after only 10% of his innings he is within 1 of his average and stays that way throughout the rest of his career (orange triangles).
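The running average behind the orange triangles can be sketched in a few lines. I am assuming the usual cricket batting average (total runs divided by the number of times Out), and that he was Out in his first innings; both assumptions are consistent with the average of 12 after four innings described above. The function name is my own.

```python
def running_average(innings):
    """innings: list of (runs, was_out) pairs in chronological order.

    Returns the batting average after each innings
    (None until the first dismissal, when the average is undefined).
    """
    runs = outs = 0
    averages = []
    for score, was_out in innings:
        runs += score
        outs += was_out
        averages.append(runs / outs if outs else None)
    return averages

# First four innings as described in the text: 7, 0 Not Out, 5 Not Out, 0 Not Out.
first_four = [(7, True), (0, False), (5, False), (0, False)]
print(running_average(first_four))  # [7.0, 7.0, 12.0, 12.0]
```

With only one dismissal in the denominator, a handful of Not Out runs is enough to push the early average to 12, a long way from where it eventually settles.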

That’s it for today. More on the legend of Chris Martin in the weeks ahead.

Tagged: batting, Chris Martin, cricinfo, Cricket, ESPN CricInfo, graphs, Statistics

]]>How did I do it?

- I chose Australia winning a one-day international when chasing runs as an outcome (Win or Loss).
- Using data available from Cricinfo I determined which of the following on its own predicts if Australia will win (ie which predicts the outcome better than just flipping a coin): (1) Who won the toss, (2) whether it is a day or night match, (3) whether it is a home or away match, (4) how many runs the opposition scored.
- As it turned out, if Australia lost the toss they were more likely to win (!), and, not surprisingly, the fewer runs the opposition scored the more likely Australia were to win. I then built a mathematical model. All this means is that I came up with an equation where the inputs were the winning or losing of the toss and the number of runs, and the output was the probability of winning. This is called a “reference model.”
- I added to this model Ricky Ponting’s last innings score and recalculated the probability of Australia winning.
- I then could calculate some numbers which told me that by adding Ricky Ponting’s last innings to the model I improved the model’s ability to predict a win and to predict a loss. Below is a graph which I came up with to illustrate this. I call this a Risk Assessment Plot.
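To make the idea of a reference model and an extended model concrete, here is a minimal sketch. It assumes the equation is a logistic regression, which is a common choice for a Win or Loss outcome but is my assumption, not something stated above. Every coefficient below is invented purely to show the mechanics; none is a fitted value, and the function names are hypothetical.

```python
import math

def logistic(z):
    """Squash any number into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# All coefficients are INVENTED for illustration; they are not fitted values.
def reference_model(lost_toss, opposition_runs):
    """Probability of a win from the toss result and the target only."""
    return logistic(3.0 + 0.4 * lost_toss - 0.012 * opposition_runs)

def extended_model(lost_toss, opposition_runs, ponting_last_score):
    """Same inputs plus Ricky Ponting's score in his previous innings."""
    return logistic(2.8 + 0.4 * lost_toss - 0.012 * opposition_runs
                    + 0.01 * ponting_last_score)

# Chasing 250 after losing the toss, with Ponting having scored 5 or 80 last time.
print(reference_model(1, 250))
print(extended_model(1, 250, 5), extended_model(1, 250, 80))
```

Comparing the two models match by match, the question is whether the extended model moves the probabilities up for the matches that were won and down for the matches that were lost.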

So, when the shrimp hit the barbie, the beers are in the esky, and your mate sends down a flipper you can smack him over the fence for you now know that when Ricky Ponting scored well in his last innings, Australia are more likely to win.

- Pickering JW, Endre ZH. New Metrics for Assessing Diagnostic Potential of Candidate Biomarkers. Clin J Am Soc Nephrol 2012;7:1355–64.

Tagged: Australia, australian cricket, Biomarkers, cricinfo, Cricket, innings, last match, one-day, Ricky Ponting, Risk Assessment Plot, risk stratification, sports, Statistics

]]>