Statistics in Engineering: Building Distributions from a Data Capture (Part 1 of 8)

Preface
This series is aimed at providing tools for an electrical engineer to gain confidence in the performance and reliability of their design. The focus is on applying statistical analysis to empirical results (i.e. measurements, data sets).

Introduction
This article shows how to build and describe a Gaussian distribution from a data capture. Much measurement data is Gaussian, or close to it. Counting pass/fail against a mask is binomial, but the same analysis applies; it is just a matter of interpretation. Once you have a distribution, you can form an educated opinion on the likelihood that your circuit will violate spec in the field.

If you are not familiar with statistics or need a brush up I recommend Schaum's Statistics. It provides a good overview of material and lots of examples without a lot of time spent on proofs.

Concepts
Mean: The mean value (called mu) is simply the average: the sum of all samples divided by the number of samples.

Standard Deviation: The standard deviation (called sigma) is a measure of the dispersion. If the mean sets the center point of a Gaussian distribution, the sigma sets the shape of the curve (tall and peaky to wide and flat).

z-value: The z-value is simply the number of standard deviations a measurement is from the mean. It is (measurement-mean)/sigma.

Skew: A measure of asymmetry of a distribution.

Kurtosis: A measure of the peakedness of a distribution, that is, how sharp its peak and how heavy its tails are compared with a Gaussian.

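Confidence Interval: A range around a calculated value (such as the mean) that is expected to contain the true value at a stated confidence level. It estimates the reliability of the calculated value.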
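As a minimal sketch of the first three definitions, the mean, standard deviation and z-values can be computed by hand and compared with R's built-in mean() and sd(). The sample values below are made up for illustration and are not taken from the article's capture.

```r
# Hypothetical sample values in volts (illustration only)
x <- c(0.0101, 0.0098, 0.0103, 0.0099, 0.0100, 0.0102, 0.0097)

n     <- length(x)
mu    <- sum(x) / n                        # mean: sum of samples / number of samples
sigma <- sqrt(sum((x - mu)^2) / (n - 1))   # sample standard deviation, as sd() computes it
z     <- (x - mu) / sigma                  # z-value of every sample

# The hand calculations agree with the built-in functions
all.equal(mu, mean(x))
all.equal(sigma, sd(x))
```

By construction the z-values have a mean of 0 and a standard deviation of 1, which is what makes them comparable across data sets.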

Importing Your Data Set
I will use the R software package for statistical analysis. It is cross-platform, free, and open source. There are several good Excel plugins, and if you have access to SAS, by all means use it.

The first row of your data set should be the titles for each column. Each column can contain anything but for building a distribution we can assume a single column with a row for each measurement.

NOTE: The following assumes the column name is 'voltage'.

> vmm_voltage<-read.csv(file.choose())
> dim(vmm_voltage)
[1] 98973 1
> tail(vmm_voltage)
98970 0.017400
98971 0.017400
98972 0.017400
98973 0.017700
> str(vmm_voltage)
'data.frame': 98973 obs. of 1 variable:
 $ voltage: num 0.00347 0.0035 0.00374 0.00329 0.0033 0.00295 0.0184 0.0185 0.0183 0.0181 ...
> attach(vmm_voltage)
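If you want to script the import rather than pick the file interactively, something like the following sketch may help. It writes a tiny stand-in CSV to a temporary file (the path and the four values are placeholders, not the article's data) and reads it back by explicit path. It also pulls the column out with `$` instead of attach(), which is a matter of preference.

```r
# Write a tiny stand-in capture to a temporary file (values are made up)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("voltage", "0.00347", "0.00350", "0.01840", "0.01850"), csv_path)

# Read it back by explicit path instead of file.choose()
capture <- read.csv(csv_path)

dim(capture)                 # number of rows, number of columns
str(capture)                 # column names and types
voltage <- capture$voltage   # extract the column without attach()
```

In a real session csv_path would simply be the path to your own capture file.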

Evaluating The Data
Now we calculate the mean and standard deviation. (This is apparently quite a small voltage.)
> mean(voltage)
[1] 0.008929438
> sd(voltage)
[1] 0.006329213

Test For Normality
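For skew and kurtosis, use skewness and kurtosis functions. These are not part of base R but are provided by add-on packages (for example, skewness() and kurtosis() in the moments package). They can give an indication that the data is not normal. You can also use the Shapiro-Wilk test to evaluate normality. For this test, a high p-value means there is no evidence against normality, while a low p-value (for example below 0.05) indicates the data is not normal. Note that R's shapiro.test() accepts only 3 to 5000 values, so a larger capture has to be tested on a random subset.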
> shapiro.test(voltage)

Shapiro-Wilk normality test

data: voltage
W = 0.9938, p-value = 0.5659
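The checks above can be sketched in base R without any add-on package. The data here is simulated with rnorm() as a stand-in (it is not the article's capture), and the skew and kurtosis formulas are the simple moment-based definitions, which is an assumption on my part; packages differ slightly in the exact definition they use.

```r
set.seed(1)                                     # reproducible stand-in data
v <- rnorm(20000, mean = 0.0089, sd = 0.0063)   # simulated, NOT the article's capture

# Moment-based skew and excess kurtosis; both are near 0 for Gaussian data
d      <- v - mean(v)
m2     <- mean(d^2)
skew_v <- mean(d^3) / m2^1.5
kurt_v <- mean(d^4) / m2^2 - 3

# shapiro.test() accepts only 3 to 5000 values, so test a random subset
sub <- sample(v, 5000)
sw  <- shapiro.test(sub)
sw$p.value
```

With very large samples, even tiny departures from normality produce small p-values, so it is worth looking at skew, kurtosis and a plot alongside the test result.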

Making Statistical Inferences
Now that we have quantified our data in terms of mu and sigma, we can make statistical inferences based on that data.

R has two handy functions for this: pnorm and qnorm. pnorm computes the cumulative probability from -inf to x, in other words everything to the left of a particular value on a normal distribution. With the option lower.tail=FALSE it returns 1 minus that probability, which is everything to the right of the value.

For example, let's calculate the likelihood of getting 0.020 V or lower, then 0.020 V or higher, then the likelihood of a deviation at least that large in either direction (both tails):
> leftt=pnorm(0.020, mean=0.008929438, sd=0.006329213); leftt
[1] 0.959865
> rightt=pnorm(0.020, mean=0.008929438, sd=0.006329213, lower.tail=FALSE); rightt
[1] 0.04013502
> 2*rightt
[1] 0.08027003
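To tie this back to the z-value from the Concepts section, here is a small sketch showing that pnorm on your own distribution gives the same answer as pnorm of the z-value on the standard normal. The mean and sigma are the ones calculated above; the variable names are my own.

```r
mu    <- 0.008929438   # mean calculated above
sigma <- 0.006329213   # standard deviation calculated above
limit <- 0.020         # threshold in volts

z <- (limit - mu) / sigma   # threshold expressed in standard deviations

left_direct <- pnorm(limit, mean = mu, sd = sigma)
left_via_z  <- pnorm(z)          # same probability on the standard normal
right_tail  <- 1 - left_direct   # equivalent to lower.tail = FALSE

c(z = z, left = left_direct, right = right_tail)
```

Here z comes out at roughly 1.75, so 0.020 V sits about one and three quarter sigma above the mean.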

qnorm is the inverse of pnorm: given a cumulative probability, it returns the value below which that fraction of the distribution lies. It can be used in two ways. With the default mean of 0 and standard deviation of 1, it returns the number of standard deviations (z-value) for a given probability. For a handy list of z-values see the Wikipedia page on Standard Deviation. z=1.96 (95%) and z=2.58 (99%) are common two-tailed values. If you plug in the mu and sigma of your data set, it returns the value on your distribution for a given probability.

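Suppose we want the limits that contain the central 99% of a normal distribution. Because this is two-tailed and we are excluding the extreme 1%, half of that 1% is in each tail, so the cut points are at cumulative probabilities 0.005 and 0.995. You can calculate z and the corresponding voltages by using qnorm like this: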
> qnorm(.995)
[1] 2.575829
> qnorm(.005)
[1] -2.575829
> 0.008929438+qnorm(.995)*0.006329213
[1] 0.02523241
> 0.008929438-qnorm(.995)*0.006329213
[1] -0.007373534

Or alternatively:
> qnorm(.995, mean=0.008929438, sd=0.006329213)
[1] 0.02523241
> qnorm(c(.005,.995), mean=0.008929438, sd=0.006329213)
[1] -0.007373534 0.025232410
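A quick way to convince yourself that qnorm and pnorm are inverses is to feed the limits back into pnorm. This is only a sanity-check sketch using the mean and sigma from above; the variable names are mine.

```r
mu    <- 0.008929438   # mean calculated above
sigma <- 0.006329213   # standard deviation calculated above

p      <- c(0.005, 0.995)                   # cumulative probabilities of the cut points
limits <- qnorm(p, mean = mu, sd = sigma)   # voltages at those cut points

back     <- pnorm(limits, mean = mu, sd = sigma)   # recovers 0.005 and 0.995
coverage <- diff(back)                             # probability between the limits

limits
coverage
```

The coverage is 0.99, confirming that 99% of the fitted distribution lies between the two limits.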

Confidence Intervals
Because we are estimating a population from a sample, there will be an error associated with the estimate. It is possible to calculate how large that error can be for a particular level of confidence. So let's calculate the error in our mean. In this case we will choose a confidence level of 99%, or 0.99.

We calculate the margin of error (the half-width of the confidence interval) as z*sigma/sqrt(n), with z=2.58 for 99%:
> 2.58*sd(voltage)/sqrt(98973)
[1] 5.190522e-05

This means that I am 99% confident the true mean lies within +/- 5.19e-05 V of the measured mean.
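The same calculation can be sketched with qnorm supplying the z-value instead of the rounded 2.58, which also makes it easy to change the confidence level. The summary numbers are the ones from the capture above; conf, margin and ci are names I have chosen for illustration.

```r
xbar <- 0.008929438   # sample mean calculated above
s    <- 0.006329213   # sample standard deviation calculated above
n    <- 98973         # number of samples in the capture

conf   <- 0.99
z      <- qnorm(1 - (1 - conf) / 2)   # two-tailed z-value, about 2.576 for 99%
margin <- z * s / sqrt(n)             # half-width of the interval

ci <- c(lower = xbar - margin, upper = xbar + margin)
ci
```

Using the exact z-value gives a margin very slightly smaller than the one obtained with 2.58.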

We can also estimate the confidence interval for sigma; however, I could not figure out how to calculate it in R. Here is the Excel code for the lower and upper bounds, respectively:
=sigma*SQRT((n-1)/CHIINV((.01/2), n-1))
=sigma*SQRT((n-1)/CHIINV(1-(.01/2), n-1))
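For what it is worth, the same bounds can probably be reproduced in R with qchisq. Excel's CHIINV takes a right-tail probability, whereas qchisq takes a left-tail probability by default, so the arguments swap. The sketch below assumes that mapping and reuses the sigma and n from above; treat it as a suggestion rather than a verified port of the spreadsheet.

```r
s     <- 0.006329213   # sample standard deviation calculated above
n     <- 98973         # number of samples in the capture
alpha <- 0.01          # 1 - confidence level (99%)
df    <- n - 1         # degrees of freedom

# CHIINV(alpha/2, df) in Excel corresponds to qchisq(1 - alpha/2, df) in R
sigma_lower <- s * sqrt(df / qchisq(1 - alpha / 2, df))
sigma_upper <- s * sqrt(df / qchisq(alpha / 2, df))

c(lower = sigma_lower, upper = sigma_upper)
```

With nearly 100,000 samples the interval is very tight around the measured sigma.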

Plotting The Raw Data
There are a couple of ways to plot the raw data:

If you just want to know min and max, use the range function:
> range(voltage)

For box and histogram plots:
> boxplot(voltage)
> hist(voltage)

If you have (x,y) data use the plot function. Use help(plot) for its many options:
> plot(xcol,ycol,options)
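One plot I find useful, sketched here on simulated data rather than the article's capture, is a density-scaled histogram with the fitted Gaussian drawn on top, followed by a normal Q-Q plot. The number of bins, titles and axis labels are arbitrary choices.

```r
set.seed(42)                                   # reproducible stand-in data
v <- rnorm(1000, mean = 0.0089, sd = 0.0063)   # simulated, NOT the article's capture

# Histogram scaled to density so the fitted curve shares the same axis
h <- hist(v, breaks = 40, freq = FALSE,
          main = "Voltage capture", xlab = "Voltage (V)")

# Overlay the Gaussian defined by the sample mean and sigma
curve(dnorm(x, mean = mean(v), sd = sd(v)), add = TRUE)

# Normal Q-Q plot: points close to the line suggest Gaussian data
qqnorm(v)
qqline(v)
```

If the bars follow the curve and the Q-Q points hug the line, describing the data with mu and sigma alone is reasonable.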

Next Up
The next article will show how to use linear regression to analyze your data set and predict values. It also introduces hypothesis testing, which will be fully explained in later articles.
