Tag Archives: statistics

Two for one

I wrote this short annual report in anticipation of being on vacation this week. However, as my editor commented, it is ‘a bit of a non-blog’ and so I have written a second post for today that will be published a few minutes later.

The painting in the thumbnail is by Peter Curran and shows a view of Liverpool’s Anglican Cathedral that is almost the same as the view from the seat at which I usually sit to write this blog. The blog is read world-wide as shown by the distribution of visitors to the blog during 2018 in the temperature map in the graphic below. The weekly readership dropped by 60% at the beginning of April 2018 after I deleted my Facebook page and cut the link between Facebook and this blog (see ‘Some changes to Realize Engineering’ on March 28th, 2018). However, I am pleased to say that the visitor numbers have recovered, and last month’s visitor numbers were only 4% lower than the corresponding month in 2017. So many thanks to those readers who stayed with me, or found the blog again without using Facebook. While I enjoy writing ‘to make life more fruitful’, to quote Sylvain Tesson (see ‘Thinking more clearly by writing weekly’), it is also encouraging to know that people are reading the blog.

For those of you that enjoy reading it as much as I enjoy writing, there have been more than 330 posts since the first one in July 2012 – that’s a huge archive for you to browse, if you have nothing else to do. Happy New Year!

Sylvain Tesson, Consolations of the forest: alone in a cabin in the middle Taiga, London: Penguin Books, 2014.

How many repeats do we need?

This is a question that both my undergraduate students and a group of taught post-graduates have struggled with this month. In thermodynamics, my undergraduate students were estimating absolute zero in degrees Celsius using a simple manometer and a digital thermometer (this is an experiment from my MOOC: Energy – Thermodynamics in Everyday Life). They needed to know how many times to repeat the experiment in order to determine whether their result was significantly different to the theoretical value: -273 degrees Celsius [see my post entitled ‘Arbitrary zero‘ on February 13th, 2013 and ‘Beyond zero‘ the following week]. Meanwhile, the post-graduate students were measuring the strain distribution in a metal plate with a central hole that was loaded in tension. They needed to know how many times to repeat the experiment to obtain meaningful results that would allow a decision to be made about the validity of their computer simulation of the experiment [see my post entitled ‘Getting smarter‘ on June 21st, 2017].

The simple answer is six repeats are needed if you want 98% confidence in the conclusion and you are happy to accept that the margin of error and the standard deviation of your sample are equal. The latter implies that error bars of the mean plus and minus one standard deviation are also 98% confidence limits, which is often convenient. Not surprisingly, only a few undergraduate students figured that out and repeated their experiment six times; and the post-graduates pooled their data to give them a large enough sample size.

The justification for this answer lies in an equation that relates the number in a sample, n, to the margin of error, MOE, the standard deviation of the sample, σ, and the shape of the normal distribution described by the z-score or z-statistic, z*:

n ≥ (z*σ / MOE)²

The margin of error, MOE, is the maximum expected difference between the true value of a parameter and the sample estimate of the parameter, which is usually the mean of the sample. The standard deviation, σ, describes the difference between the data values in the sample and the mean value of the sample, μ. If we don’t know one of these quantities then we can simplify the equation by assuming that they are equal; and then n ≥ (z*)².

The z-statistic is the number of standard deviations from the mean that a data value lies, i.e., the distance from the mean in a Normal distribution, as shown in the graphic [for more on the Normal distribution, see my post entitled ‘Uncertainty about Bayesian methods’ on June 7th, 2017]. We can specify its value so that the interval defined by its positive and negative value contains 98% of the distribution. The values of z* for 90%, 95%, 98% and 99% are shown in the table in the graphic with corresponding values of (z*)², which are equivalent to minimum values of the sample size, n (the number of repeats).
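For readers who would like to reproduce the numbers in the table, here is a minimal sketch in Python. It assumes the two-sided definition of z* described above and uses the standard library's NormalDist; the function name min_repeats and its arguments are my own labels for illustration, not part of the original experiment.

```python
import math
from statistics import NormalDist


def min_repeats(confidence, sigma=1.0, moe=1.0):
    """Smallest whole n satisfying n >= (z* * sigma / MOE)^2.

    With the defaults (sigma equal to MOE) this reduces to n >= (z*)^2.
    """
    # Two-sided interval: the central `confidence` fraction of the
    # standard normal distribution lies between -z* and +z*.
    z_star = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    return math.ceil((z_star * sigma / moe) ** 2)


if __name__ == "__main__":
    for level in (0.90, 0.95, 0.98, 0.99):
        z_star = NormalDist().inv_cdf((1.0 + level) / 2.0)
        print(f"{level:.0%}: z* = {z_star:.3f}, (z*)^2 = {z_star**2:.2f}, "
              f"n >= {min_repeats(level)}")
```

Under these assumptions the 98% level gives a minimum of six repeats, in line with the simple answer above. Halving the acceptable margin of error relative to the standard deviation (moe=0.5, sigma=1.0) multiplies the required sample size by about four.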

Confidence limits are defined as:

μ ± z*σ / √n

but when n = (z*)², this simplifies to μ ± σ. So, with a sample size of six (6 = n ≥ (z*)² for 98% confidence) we can state with 98% confidence that there is no significant difference between our mean estimate and the theoretical value of absolute zero when that difference is less than the standard deviation of our six estimates.
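The check described above can be sketched in a few lines of Python. The six estimates of absolute zero below are invented purely for illustration and are not measurements from my students; the function name confidence_limits is also my own. The sketch uses the sample standard deviation as σ, as in the discussion above.

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_limits(sample, confidence):
    """Return (lower, upper) limits: mean +/- z* * s / sqrt(n)."""
    z_star = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    half_width = z_star * stdev(sample) / math.sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width


# Hypothetical estimates of absolute zero in degrees Celsius (made up).
estimates = [-268.0, -275.0, -271.0, -279.0, -270.0, -274.0]
theoretical = -273.0

lower, upper = confidence_limits(estimates, 0.98)
consistent = lower <= theoretical <= upper

if __name__ == "__main__":
    print(f"98% confidence limits: {lower:.1f} to {upper:.1f} degrees Celsius")
    print("No significant difference from theory" if consistent
          else "Significantly different from theory")
```

Because six is slightly larger than (z*)² for 98% confidence, the half-width calculated here is a little smaller than one standard deviation, so error bars of plus and minus one standard deviation are slightly conservative.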

BTW – the apparatus for the thermodynamics experiments costs less than £10. The instruction sheet is available here – it is not quite an Everyday Engineering Example but the experiment is designed to be performed in your kitchen rather than a laboratory.

Uncertainty about Bayesian methods

I have written before about why people find thermodynamics so hard [see my post entitled ‘Why is thermodynamics so hard?’ on February 11th, 2015] so I think it is time to mention another subject that causes difficulty: statistics. I am worried that just mentioning the word ‘statistics’ will cause people to stop reading, such is its reputation. Statistics is used to describe phenomena that do not have single values, like the height or weight of my readers. I would expect the weights of my readers to be a normal distribution, that is they form a bell-shaped graph when the number of readers at each value of weight is plotted as a vertical bar from a horizontal axis representing weight. In other words, plotting weight along the x-axis and frequency on the y-axis as in the diagram.

The normal distribution has dominated statistical practice and theory since its equation was first published by De Moivre in 1733. The mean or average value corresponds to the peak in the bell-shaped curve and the standard deviation describes the shape of the bell, basically how fat the bell is. That’s why we learn to calculate the mean and standard deviation in elementary statistics classes, although often no one tells us this or we quickly forget it.

If all of you told me your weight then I could plot the frequency distribution described above. And, if I divided the y-axis, the frequency values, by the total number of readers who sent me weight information then the graph would become a probability density distribution [see my post entitled ‘Wind power‘ on August 7th, 2013]. It would tell me the probability that the reader I met last week had a weight of 70.2kg – the probability would be the height of the bell-shaped curve at 70.2kg. The most likely weight would correspond to the peak value in the curve.
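As a rough sketch of that step, the Python below groups some weights into bins and divides each count by the total. The ten weights are invented for illustration, and the 5 kg bin width and the function name are my own assumptions.

```python
from collections import Counter


def relative_frequencies(weights_kg, bin_width_kg=5.0):
    """Map each bin's lower edge (kg) to the fraction of weights in that bin."""
    counts = Counter(int(w // bin_width_kg) for w in weights_kg)
    total = len(weights_kg)
    return {index * bin_width_kg: count / total
            for index, count in sorted(counts.items())}


# Hypothetical reader weights in kg (made up for this example).
weights = [58.4, 62.1, 66.0, 68.3, 69.9, 70.2, 71.5, 73.0, 77.8, 84.6]
fractions = relative_frequencies(weights)

if __name__ == "__main__":
    for lower_edge, fraction in fractions.items():
        print(f"{lower_edge:.0f} to {lower_edge + 5.0:.0f} kg: {fraction:.2f}")
```

The fractions add up to one, so each can be read as the probability that a reader picked at random falls in that bin. Strictly speaking, dividing each fraction by the bin width as well gives the probability density, whose area rather than height sums to one.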

However, I don’t need any of you to send me your weights to be reasonably confident that the weight of the reader I talked to last week was 70.2kg! I cannot be certain about it but the probability is high. The reader was female and lived in the UK and according to the Office for National Statistics (ONS) the average weight of women in the UK is 70.2kg – so it is reasonable to assume that the peak in the bell-shaped curve for my female UK readers will coincide with the national distribution, which makes 70.2kg the most probable weight of the reader I met last week.

However, guessing the weight of a reader becomes more difficult if I don’t know where they live or I can’t access national statistics. The Reverend Thomas Bayes (1701-1761) comes to the rescue with the rule named after him. In Bayesian statistics, I can assume that the probability density distribution of readers’ weight is the same as for the UK population and when I receive some information about your weights then I can update this probability distribution to better describe the true distribution. I can update as often as I like and use information about the quality of the new data to control its influence on the updated distribution. If you have got this far then we have both done well; and I am not going to lose you now by expressing Bayes’ law in terms of probability, or talking about prior (that’s my initial guess using national statistics) or posterior (that’s the updated one) distributions, because I think the opaque language is one of the reasons that the use of Bayesian statistics has not become widespread.
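For readers who prefer code to probability notation, here is a minimal sketch of the updating idea in Python. It makes a strong simplifying assumption that is mine rather than part of the argument above: both the initial guess and the spread of the new data are treated as normal distributions with known standard deviations, which gives the standard closed-form update for the mean. Only the 70.2kg starting value comes from the text; the standard deviations and the four reported weights are invented for illustration.

```python
def update_normal(prior_mean, prior_sd, observations, obs_sd):
    """Update a normal belief about the mean weight using new observations.

    A small obs_sd means the data are trusted and pull the estimate
    strongly; a large obs_sd means they have little influence.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = len(observations) / obs_sd ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_mean * prior_precision
                 + sum(observations) / obs_sd ** 2) / post_precision
    post_sd = post_precision ** -0.5
    return post_mean, post_sd


# Initial guess: national average from the text; the spread is hypothetical.
prior_mean_kg, prior_sd_kg = 70.2, 15.0
# Hypothetical weights reported by four readers (made up).
reported_kg = [64.0, 81.5, 72.3, 68.9]

post_mean_kg, post_sd_kg = update_normal(prior_mean_kg, prior_sd_kg,
                                         reported_kg, obs_sd=15.0)

if __name__ == "__main__":
    print(f"Updated mean: {post_mean_kg:.1f} kg, "
          f"updated spread of belief: {post_sd_kg:.1f} kg")
```

The updated mean lies between the initial guess and the average of the reported weights, and the updated spread is narrower than the initial one. The function can be called again with the updated values as the new starting point each time more data arrive.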

By the way, I can never be certain about your weight, even if you tell me directly, because I don’t know whether your scales are accurate and whether you are telling the truth! But that’s a whole different issue!

Six NYC subway trains

Distribution of blog visitors in 2014 (from WordPress.com)

It would take six New York City subway trains to hold the number of visitors to this blog last year, according to the Annual Report sent to me by WordPress.com. That’s more than double the number of visitors in 2013, which is quite an impressive increase. The visitors came from more than 100 countries, which makes it a truly global blog, unless I have some globe-trotting readers who visited all of those countries between them during 2014.

The blog is also being published on Tumblr now, which my youngest daughter told me would be a waste of time because users of Tumblr are not interested in the sort of things I write about. However, an original objective of the blog was to increase public understanding of engineering and so this is a small step to reach a wider public.

I wrote 54 posts last year so that there are more than 120 posts in the archive now, of which the five most frequently read are, in descending order:

Closed systems in nature? published on December 21st, 2012

100 Everyday engineering examples published on April 23rd, 2014

Small is beautiful published on October 10th, 2012

Benford’s law published on August 15th, 2014

Zen and entropy published on December 11th, 2013

If you only started reading the blog recently or you are visiting for the first time then you might enjoy some of these old favourites.