IPython: Math Processor
A tool that manipulates mathematics as a word processor manipulates words

Copyright © 2014, Paul Lutus

Normal Distribution | Probability Density Function | Cumulative Distribution Function


Normal Distribution

An easily understood application of calculus lies in statistics, in the form of the normal or Gaussian distribution. Plotted in the IPython workbook interface, the normal distribution takes the familiar shape of a "bell curve".

The plotted function, $ f(x) = e^{-\frac{x^2}{2}}$, describes the distribution of certain naturally occurring events. This function is the focus of much attention in statistics and the natural sciences because of its ability to predict statistical distributions based on sparse data.
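As a rough sketch of the curve's shape, the function above can be sampled with nothing more than Python's standard library. The names `bell` and `trapezoid_area` are my own, chosen for illustration, and the trapezoid rule is just one simple way to estimate the area. The estimate comes out near $\sqrt{2\pi}$ rather than 1, which hints at why a scaling factor is needed before the curve can represent probabilities.

```python
import math

def bell(x):
    """Unnormalized bell curve f(x) = exp(-x^2 / 2)."""
    return math.exp(-x * x / 2.0)

# Sample the curve at a few points to see its symmetric, peaked shape.
for x in (-3, -2, -1, 0, 1, 2, 3):
    print(f"{x:+d}  {bell(x):.4f}")

def trapezoid_area(f, lo, hi, steps=10_000):
    """Composite trapezoid rule for the area under f between lo and hi."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

# The tails beyond |x| = 10 are negligible, so this approximates the full area.
area = trapezoid_area(bell, -10.0, 10.0)
print(area, math.sqrt(2 * math.pi))
```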

Probability Density Function

A more useful form of the above function, named the Probability Density Function (PDF), accepts arguments for the mean ($\mu$) and standard deviation ($\sigma$). It looks like this:

(1) $ \displaystyle pdf(x,\mu,\sigma) = \frac{1}{ \sigma \sqrt{2 \pi}} e^{\left(-\frac{{\left(\mu - x\right)}^{2}}{2 \, \sigma^{2}}\right)} $

Because it accepts argument values common in statistics, the PDF is better suited to statistical work than the simplified form shown earlier, and it can be used to characterize naturally occurring distributions like IQ (although this application is understandably a matter of much debate).
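Equation (1) translates almost directly into plain Python. This is a minimal sketch using only the standard library; the function name `normal_pdf` is hypothetical, and the mean of 100 and standard deviation of 15 are the commonly quoted IQ figures, used here only as sample inputs.

```python
import math

def normal_pdf(x, mu, sigma):
    """Equation (1): normal probability density at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((mu - x) ** 2) / (2.0 * sigma ** 2))

# The density peaks at the mean and is symmetric around it.
peak = normal_pdf(100, 100, 15)
print(peak)
print(normal_pdf(85, 100, 15), normal_pdf(115, 100, 15))
```

Note that the returned value is a density, not a probability. A probability only appears once the density is integrated over a range.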

Cumulative Distribution Function

For many purposes we need to integrate the Probability Density Function in order to quantify and characterize natural distributions. In spite of its great importance to the field of statistics, the PDF has no antiderivative that can be written in terms of elementary functions, so its integral is expressed through a special function and evaluated numerically. Practical values of the much-used Cumulative Distribution Function (CDF) therefore come from carefully designed numerical methods:

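Pay close attention to the code in the IPython notebook for this example. Notice that we define the PDF explicitly:
# imports and symbols used by the definitions in this article
from sympy import Lambda, symbols, sqrt, pi, exp, integrate, N
x, mu, sigma, a, b = symbols('x mu sigma a b')
pdf = Lambda((x,mu,sigma),
  (1/(sigma * sqrt(2*pi)) * exp(-(mu-x)**2 / (2*sigma**2)))
)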
            

Then, rather than writing out the CDF, we rely on the SymPy symbolic library to produce an integral for us, based on our definition of the pdf:

cdf = Lambda((a,b,mu,sigma),
  integrate(
    pdf(x,mu,sigma),(x,a,b)
  )
)
            

Notice that this definition of the CDF accepts two boundary arguments (a and b) rather than one, as well as arguments for mean and standard deviation. I find this form most useful, but many authors define the CDF in a simpler, less flexible way, with a single argument x and an integral that runs from $-\infty$ to x.

Because the PDF has no elementary antiderivative, the SymPy library expresses the integral in terms of a special function named the Error Function (erf), with this definition:

(2) $ \displaystyle erf(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} dt $
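To make equation (2) concrete, here is a small sketch that evaluates the integral directly and compares it with the standard library's `math.erf`. I chose composite Simpson's rule only because it is short and easy to read. Production implementations of erf use more refined approximations, and the name `erf_simpson` is mine.

```python
import math

def erf_simpson(x, steps=1000):
    """Approximate equation (2) with composite Simpson's rule (steps must be even)."""
    h = x / steps
    total = 1.0 + math.exp(-x * x)  # integrand at the endpoints t = 0 and t = x
    for i in range(1, steps):
        t = i * h
        # Simpson weights: 4 for odd interior points, 2 for even ones.
        total += (4 if i % 2 else 2) * math.exp(-t * t)
    return (2.0 / math.sqrt(math.pi)) * total * h / 3.0

for x in (0.5, 1.0, 2.0):
    print(x, erf_simpson(x), math.erf(x))
```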

I emphasize that, as explained above, practical implementations of the error function are optimized numerical algorithms. A two-argument CDF that relies on the error function has this formal definition:

(3) $ \displaystyle cdf(a,b,\mu,\sigma) = \int_a^b \frac{1}{ \sigma \sqrt{2 \pi}} e^{\left(-\frac{{\left(\mu - x\right)}^{2}}{2 \, \sigma^{2}}\right)} dx $

In practical implementations, it takes this form:

(4) $ \displaystyle cdf(a,b,\mu,\sigma) = - \frac{1}{2} \, \text{erf}\left(\frac{\sqrt{2} (a-\mu)}{2 \, \sigma}\right) + \frac{1}{2} \, \text{erf}\left(\frac{\sqrt{2} (b -\mu)}{2 \, \sigma}\right) $
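Equation (4) can also be written without any symbolic library, since Python's standard library provides `math.erf`. The sketch below is my own plain-Python rendering, not the notebook's code, and `normal_cdf` is a hypothetical name. It uses the fact that $\frac{\sqrt{2}(a-\mu)}{2\sigma}$ is the same quantity as $\frac{a-\mu}{\sigma\sqrt{2}}$.

```python
import math

def normal_cdf(a, b, mu, sigma):
    """Equation (4): area under the normal PDF between a and b."""
    scale = sigma * math.sqrt(2.0)
    return 0.5 * (math.erf((b - mu) / scale) - math.erf((a - mu) / scale))

# Sanity checks: the whole curve has area 1, and each half has area 1/2.
whole = normal_cdf(-math.inf, math.inf, 100, 15)
upper_half = normal_cdf(100, math.inf, 100, 15)
print(whole, upper_half)
```

Passing `math.inf` as a boundary works because `math.erf` returns exactly 1.0 or -1.0 for infinite arguments, so no stand-in for infinity is needed.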

The CDF as shown is configured to accept two boundary arguments and return a definite integral result (an "area under the curve") useful in statistical work. Let's test this assumption by producing a table of classical statistical results (as described in the "68–95–99.7 rule"):
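One way to produce such a table is sketched below, using the standard library rather than the notebook's symbolic definitions. Setting $a = \mu - k\sigma$ and $b = \mu + k\sigma$ in equation (4) collapses the two erf terms into $\text{erf}(k/\sqrt{2})$, so the result does not depend on the particular mean or standard deviation. The helper name `within_k_sigma` is mine.

```python
import math

def within_k_sigma(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

table = {k: within_k_sigma(k) for k in (1, 2, 3)}
for k, frac in table.items():
    print(f"within {k} sigma: {100 * frac:.2f}%")
```

The printed percentages are 68.27%, 95.45% and 99.73%, in agreement with the rule's rounded figures.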

Using these definitions, we can acquire answers to everyday statistical questions that rely on these functions. For example, if we accept the commonly stated premise that the population average IQ is 100 and the standard deviation is 15, what proportion of the population has an IQ at or above 135?

In[1]: N(cdf(135,1e99,100,15))
Out[1]: 0.00981532862864534
            

About 1%. I should add that the SymPy library has predefined forms for the PDF and CDF, as well as many other statistical distributions, in its sympy.stats module (in particular, see sympy.stats.Normal).
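For comparison, here is a short sketch of the same IQ question posed with the predefined forms in `sympy.stats`. It assumes SymPy is installed. Note that `cdf` here is the library's single-argument form (the integral from $-\infty$ to x), so the upper tail is obtained by subtracting from 1. The variable names are mine.

```python
from sympy.stats import Normal, cdf

# A normally distributed random variable with mean 100 and standard deviation 15.
X = Normal("X", 100, 15)

# cdf(X) returns a function of one argument; 1 - cdf gives the share at or above 135.
upper_tail = 1 - cdf(X)(135)
print(upper_tail)          # exact symbolic result in terms of erf
print(float(upper_tail))   # numerical value, about 0.0098
```

This form also avoids the need for a very large number such as 1e99 to stand in for infinity.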

