> restart;

Densities and Distribution Functions

IMPORTANT NOTE! In this notebook we will use a procedure to draw histograms that corresponds more closely to the presentation in the text, as well as some simulated data we will use in our examples. Rather than clutter the worksheet with these, we have placed them in a hideable subsection. In order to activate the procedures and make the data available for your Maple session, first do the following:

>

Click on the + and execute the commands by clicking again on the first command line. It is sufficient to click on ONLY the FIRST line. Then click - to hide.

As usual, you should execute all the commands below as you read through the worksheet and before working on homework problems in your new worksheet.

Describing data with histograms and densities

Example 1. In an effort to understand the weight variation of a certain species of small animal, a biologist has

collected laboratory data on 100 animals raised in a controlled environment. She believes that this is a representative sample of the species. The data is collected in a Maple list, giving weight in kilograms, which we call data1. Here it is:

> data1;

[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]

In order to get a better feeling for the data, we first sort it using Maple's sort command:

> sort(data1);

[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]

By looking at the sorted list, we see that the weights are between 8 and 16 kilograms. To get a visual sense of the data

we can create a histogram of the data set. We select a list of ranges, find the fraction of our 100 values that lie in each range, and then plot a bar graph, where each bar stands over one of our ranges and the area of the bar is the fraction

(no. of data values in the range/ 100). For example, we might start with the ranges 8..10, 10..12, 12..14, 14..16. We will ignore the unlikely possibility of a data value falling exactly at an endpoint of one of the ranges, so we specify the ranges by a single list: [8,10,12,14,16]. We have defined a Maple procedure, histo, which does all the work:

> histo(data1,[8,10,12,14,16]);

[Maple Plot]

>

We can get a more refined picture by making the ranges finer:

> histo(data1,[8,9,10,11,12,13,14,15,16]);

[Maple Plot]

>

If we want to refine the ranges even more, it will be convenient to have Maple construct the range list for us. Also, we will want to combine the histogram plot with other plots later, so we give the histogram plot a name (terminating with a colon : ) and then use the ' display ' command. We will do this for ranges of width 0.5 and then 0.25

> ranges1:= [seq(8+.5*i,i=0..16)];

[Maple Math]

> hist1data1:= histo(data1,ranges1):
display(hist1data1);

[Maple Plot]

> ranges2:= [seq(8+.25*i,i=0..32)];

[Maple Math]
[Maple Math]

> hist2data1:= histo(data1, ranges2): display(hist2data1);

[Maple Plot]

>

Notice that as we refine the ranges the histogram becomes more jagged, even though the total area of the bars is always 1. We would like to find a smooth looking graph which represents these histograms in the sense that areas under the smooth graph should be close to areas under the histogram graphs. For this purpose, mathematicians have invented various families of density functions, which form parametric families. The most common and important of these is called the family of normal densities. There are two parameters, m and s and the formula is given by normalp.

> normalp:=(x,m,s)->1/(sqrt(2*Pi)*s)*exp(-((x-m)/s)^2/2);

[Maple Math]

>

As you've seen in the text, these normal densities have a bell shape. The parameter m gives the location of the center of the bell and the parameter s measures the width of the bell. The total area under the graph is 1, of course, and almost all the area is between m-3s and m+3s.

>

Now we will try to find a normal density which smooths our histograms by plotting one of the histograms together with one of our normal densities. More specifically, we want to find a normal density for which the areas under the density over intervals on the line match the histogram areas. Thus we should pay more attention to the larger bars. We might start with m=11, s=2.

> display([hist1data1,plot(normalp(x,11,2),x=8..16)]);

[Maple Plot]

>

It looks like we might be able to do better. Let's try m = 11.5, s=1 and then m=12, s=1.5.

> display([hist1data1,plot(normalp(x,11.5,1),x=8..16)]);

[Maple Plot]

> display([hist1data1,plot(normalp(x,12,1.5),x=8..16)]);

[Maple Plot]

>

It turns out that the best values are about m=12 and s=1.5. Replace hist1data1 in the command above with hist2data1 and compare.

NOTE: If we had many thousands of measurements instead of just 100, we would find that for any one of the histogram ranges we used above, the normal density with m=12 and s=1.5 would match the histogram very closely, as in the graph below.

[Maple Plot]

The biologist might now use the density normalp(x,12,1.5) to describe the variation of the animal's weight. For example, if she wanted to estimate the probability that a randomly selected animal of this species will weigh between 9 and 13 kilograms, she might calculate the area under the density function normalp(x,12,1.5) between 9 and 13. You might recognize that the function giving this density does not have an elementary antiderivative so we had better resort to numerical integration. (Note: In actual statistical practice we would estimate m and s arithmetically from the data.)

> evalf(Int(normalp(x,12,1.5),x=9..13));

[Maple Math]

Thus the probability is about .725 .

It is often useful to use the cumulative distribution function (c.d.f.) instead of the density function to present information about the variability of our data. We find the probability of a value being between two numbers, a and b, by integrating the density over the interval [a,b]. The cumulative distribution function F(x) is defined as the integral of the density from [Maple Math] to x, so that F(x) gives the probability of value being less than x. For example, the c.d.f. corresponding to the density normalp(x,12,1.5) is defined below, and its graph is plotted. (Remember also that the density function is the derivative of the corresponding c.d.f by the Fundamental Theorem of Calculus.)

> Fnormal:=(x,m,s)->int(normalp(t,m,s),t=-infinity..x);

[Maple Math]

> plot( Fnormal(x,12,1.5), x=7..16);

[Maple Plot]

>

If, for example, we wanted the probability of a randomly chosen animal weighing less than 10 kilograms, we might estimate this probability by

> Fnormal(10,12,1.5);

[Maple Math]

>

Measuring central tendency: Means, Medians and Percentiles

We often try to summarize distributions with a few numbers. Recall that the mean corresponding to a density p(x) is given by [Maple Math] . For a normal density, the parameter m turns out to be the mean as in our example above:

> int(x*normalp(x,12,1.5),x=-infinity..infinity);

[Maple Math]

(You can even do this integration by hand, using a substitution!) The p-th percentile is the number, which we will call xp for which the probability of a randomly chosen value being less than or equal to xp is p. Thus, xp is the solution of the equation F(xp) = p where p is a number between 0 and 1, expressed as a percentage, and F is the c.d.f. The 50-th percentile, x50, is what you already know as the median.

To find the 25-th, 50-th and 75-th percentiles, x25, x50 and x75, for the animal population discussed above, we solve three equations using fsolve. We give fsolve a little help by specifying a range of x values where it should look for a solution.

> x25=fsolve(Fnormal(x,12,1.5) = .25,x=7..16); x50=fsolve(Fnormal(x,12,1.5) = .50,x=7..16), x75=fsolve(Fnormal(x,12,1.5)=.75,x=7..16);

[Maple Math]

[Maple Math]

>

Note that the median and mean coincide for normal densities but not for many other kinds of densities.

Problem 1. The list data2 contains laboratory measurements of the lengths, in centimeters, of 80 earthworms.

a) Look at the data to decide on appropriate histogram and plotting ranges and then plot at least two histograms.

b) Find a normal density which gives good smoothing approximations to your histograms. (Determine suitable values for m and s by trial and error.) Plot your density on the same axes as one of your histograms.

c) Plot the corresponding c.d.f.

d) Assuming that your density and c.d.f. accurately reflect the population of all earthworms, determine the following

i) The probability that a randomly selected earthworm has length between 4 and 5 centimeters.

ii) The probability that a randomly selected earthworm has length more than 4.5 centimeters.

iii) The 60th percentile, x60, of earthworm lengths.

Example 2. Another important family of density functions is the family of exponential densities. These are specified by a single parameter, a, and have value 0 when x is less than 0. For x greater than 0 the exponential density with parameter a is given by

> exponp:=(x,a)-> a*exp(-a*x);

[Maple Math]

It's very important to remember that this holds ONLY WHEN x is POSITIVE. Otherwise the density is 0. Here's a plot, when a=0.5

> plot(exponp(x,0.5),x =0..7);

[Maple Plot]

>

Since the density is 0 from [Maple Math] to 0, the c.d.f. Fexpon(x) can be found by integrating from 0 to x instead of [Maple Math] to x. You should be able to do this by hand, but we'll have Maple do it here, and then we'll plot the c.d.f. when for a=0.5.

> Fexpon:=(x,a)-> int(exponp(t,a), t=0..x);

[Maple Math]

> value(Fexpon(x,a));plot(Fexpon(x,0.5), x=0..7);

[Maple Math]

[Maple Plot]

>

The list data3 , which we display below, consists of 80 measurements of the time elapsed, in minutes, between the arrival of phone calls at a certain 800 phone number. There are theoretical reasons to assume that the distribution of such times is given by an exponential density.

> data3;

[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]
[Maple Math]

>

Problem 2 . a) Plot some histograms, starting with ranges [0,2,4,...,30] and find an exponential density that appears to smooth the histograms. Give your value for the parameter a. (Hint: Start with a=.5. Plot over the interval [0,20].)

b) Assuming that the density you found in a) is reflective of the typical times between phone calls, answer the

following questions.

i) What is the probability that at least 8 minutes elapse between the most recent phone call and the next?

ii) What is the probability that the time between two successive phone calls is between 2.5 and 6.5 minutes?

iii) What is the mean time between calls?

iv) What is the median time between calls?

v) What is the 75th percentile of the elapsed time between calls?

>

Problem 3. Consider the function, DEFINED WHEN x > 0, by

> pnotadensity:=x->x^(5/2)*exp(-x);

[Maple Math]

>

and defined to be 0 when x < 0. a) Determine a value for k so that the function DEFINED WHEN x > 0 by

> p:=x->k*pnotadensity(x);

[Maple Math]

and defined to be 0 when x < 0, IS a density. (Remember that densities must have integral 1.) Put the value you find for k into the definition of p and use it for the remainder of this problem.

b) Plot the density function p(x) and the corresponding c.d.f., for x>0 of course, on the same axes if you can.

c) Use Maple to find the corresponding mean and median.

d) If a random value has this density, find the probability that the value lies between 1.5 and 3.2.

Homework Problems

Your homework for this worksheet consists of Problems 1,2,3 above.

MTH 142 Maple Worksheets written by B. Kaskosz and L. Pakula, Copyright 1999.

Last modified August 1999.

>