> restart;
Densities and Distribution Functions
IMPORTANT NOTE! In this notebook we will use a procedure to draw histograms that corresponds more closely to the presentation in the text, as well as some simulated data we will use in our examples. Rather than clutter the worksheet with these, we have placed them in a hideable subsection. In order to activate the procedures and make the data available for your Maple session, first do the following:
>
Click on the + and execute the commands by clicking again on the first command line. It is sufficient to click on ONLY the FIRST line. Then click - to hide.
As usual, you should execute all the commands below as you read through the worksheet and before working on homework problems in your new worksheet.
Describing data with histograms and densities
Example 1. In an effort to understand the weight variation of a certain species of small animal, a biologist has
collected laboratory data on 100 animals raised in a controlled environment. She believes that this is a representative sample of the species. The data is collected in a Maple list, giving weight in kilograms, which we call data1. Here it is:
> data1;
In order to get a better feeling for the data, we first sort it using Maple's sort command:
> sort(data1);
By looking at the sorted list, we see that the weights are between 8 and 16 kilograms. To get a visual sense of the data
we can create a histogram of the data set. We select a list of ranges, find the fraction of our 100 values that lie in each range, and then plot a bar graph, where each bar stands over one of our ranges and the area of the bar is the fraction
(no. of data values in the range/ 100). For example, we might start with the ranges 8..10, 10..12, 12..14, 14..16. We will ignore the unlikely possibility of a data value falling exactly at an endpoint of one of the ranges, so we specify the ranges by a single list: [8,10,12,14,16]. We have defined a Maple procedure, histo, which does all the work:
> histo(data1,[8,10,12,14,16]);
>
We can get a more refined picture by making the ranges finer:
> histo(data1,[8,9,10,11,12,13,14,15,16]);
>
If we want to refine the ranges even more, it will be convenient to have Maple construct the range list for us. Also, we will want to combine the histogram plot with other plots later, so we give the histogram plot a name (terminating with a colon : ) and then use the ' display ' command. We will do this for ranges of width 0.5 and then 0.25
> ranges1:= [seq(8+.5*i,i=0..16)];
>
hist1data1:= histo(data1,ranges1):
display(hist1data1);
> ranges2:= [seq(8+.25*i,i=0..32)];
> hist2data1:= histo(data1, ranges2): display(hist2data1);
>
Notice that as we refine the ranges the histogram becomes more jagged, even though the total area of the bars is always 1. We would like to find a smooth looking graph which represents these histograms in the sense that areas under the smooth graph should be close to areas under the histogram graphs. For this purpose, mathematicians have invented various families of density functions, which form parametric families. The most common and important of these is called the family of normal densities. There are two parameters, m and s and the formula is given by normalp.
> normalp:=(x,m,s)->1/(sqrt(2*Pi)*s)*exp(-((x-m)/s)^2/2);
>
As you've seen in the text, these normal densities have a bell shape. The parameter m gives the location of the center of the bell and the parameter s measures the width of the bell. The total area under the graph is 1, of course, and almost all the area is between m-3s and m+3s.
>
Now we will try to find a normal density which smooths our histograms by plotting one of the histograms together with one of our normal densities. More specifically, we want to find a normal density for which the areas under the density over intervals on the line match the histogram areas. Thus we should pay more attention to the larger bars. We might start with m=11, s=2.
>
display([hist1data1,plot(normalp(x,11,2),x=8..16)]);
>
It looks like we might be able to do better. Let's try m = 11.5, s=1 and then m=12, s=1.5.
> display([hist1data1,plot(normalp(x,11.5,1),x=8..16)]);
> display([hist1data1,plot(normalp(x,12,1.5),x=8..16)]);
>
It turns out that the best values are about m=12 and s=1.5. Replace hist1data1 in the command above with hist2data1 and compare.
NOTE: If we had many thousands of measurements instead of just 100, we would find that for any one of the histogram ranges we used above, the normal density with m=12 and s=1.5 would match the histogram very closely, as in the graph below.
The biologist might now use the density normalp(x,12,1.5) to describe the variation of the animal's weight. For example, if she wanted to estimate the probability that a randomly selected animal of this species will weigh between 9 and 13 kilograms, she might calculate the area under the density function normalp(x,12,1.5) between 9 and 13. You might recognize that the function giving this density does not have an elementary antiderivative so we had better resort to numerical integration. (Note: In actual statistical practice we would estimate m and s arithmetically from the data.)
> evalf(Int(normalp(x,12,1.5),x=9..13));
Thus the probability is about .725 .
It is often useful to use the
cumulative distribution function (c.d.f.)
instead of the density function to present information about the variability of our data. We find the probability of a value being between two numbers, a and b, by integrating the density over the interval [a,b]. The cumulative distribution function F(x) is defined as the integral of the density from
to x, so that F(x) gives the probability of value being
less than x.
For example, the c.d.f. corresponding to the density normalp(x,12,1.5) is defined below, and its graph is plotted. (Remember also that the density function is the derivative of the corresponding c.d.f by the Fundamental Theorem of Calculus.)
> Fnormal:=(x,m,s)->int(normalp(t,m,s),t=-infinity..x);
> plot( Fnormal(x,12,1.5), x=7..16);
>
If, for example, we wanted the probability of a randomly chosen animal weighing less than 10 kilograms, we might estimate this probability by
> Fnormal(10,12,1.5);
>
Measuring central tendency: Means, Medians and Percentiles
We often try to summarize distributions with a few numbers. Recall that the mean corresponding to a density p(x) is given by
. For a normal density, the parameter m turns out to be the mean as in our example above:
> int(x*normalp(x,12,1.5),x=-infinity..infinity);
(You can even do this integration by hand, using a substitution!) The p-th percentile is the number, which we will call xp for which the probability of a randomly chosen value being less than or equal to xp is p. Thus, xp is the solution of the equation F(xp) = p where p is a number between 0 and 1, expressed as a percentage, and F is the c.d.f. The 50-th percentile, x50, is what you already know as the median.
To find the 25-th, 50-th and 75-th percentiles, x25, x50 and x75, for the animal population discussed above, we solve three equations using fsolve. We give fsolve a little help by specifying a range of x values where it should look for a solution.
> x25=fsolve(Fnormal(x,12,1.5) = .25,x=7..16); x50=fsolve(Fnormal(x,12,1.5) = .50,x=7..16), x75=fsolve(Fnormal(x,12,1.5)=.75,x=7..16);
>
Note that the median and mean coincide for normal densities but not for many other kinds of densities.
Problem 1. The list data2 contains laboratory measurements of the lengths, in centimeters, of 80 earthworms.
a) Look at the data to decide on appropriate histogram and plotting ranges and then plot at least two histograms.
b) Find a normal density which gives good smoothing approximations to your histograms. (Determine suitable values for m and s by trial and error.) Plot your density on the same axes as one of your histograms.
c) Plot the corresponding c.d.f.
d) Assuming that your density and c.d.f. accurately reflect the population of all earthworms, determine the following
i) The probability that a randomly selected earthworm has length between 4 and 5 centimeters.
ii) The probability that a randomly selected earthworm has length more than 4.5 centimeters.
iii) The 60th percentile, x60, of earthworm lengths.
Example 2. Another important family of density functions is the family of exponential densities. These are specified by a single parameter, a, and have value 0 when x is less than 0. For x greater than 0 the exponential density with parameter a is given by
> exponp:=(x,a)-> a*exp(-a*x);
It's very important to remember that this holds ONLY WHEN x is POSITIVE. Otherwise the density is 0. Here's a plot, when a=0.5
> plot(exponp(x,0.5),x =0..7);
>
Since the density is 0 from
to 0, the c.d.f. Fexpon(x) can be found by integrating from 0 to x instead of
to x. You should be able to do this by hand, but we'll have Maple do it here, and then we'll plot the c.d.f. when for a=0.5.
> Fexpon:=(x,a)-> int(exponp(t,a), t=0..x);
> value(Fexpon(x,a));plot(Fexpon(x,0.5), x=0..7);
>
The list data3 , which we display below, consists of 80 measurements of the time elapsed, in minutes, between the arrival of phone calls at a certain 800 phone number. There are theoretical reasons to assume that the distribution of such times is given by an exponential density.
> data3;
>
Problem 2 . a) Plot some histograms, starting with ranges [0,2,4,...,30] and find an exponential density that appears to smooth the histograms. Give your value for the parameter a. (Hint: Start with a=.5. Plot over the interval [0,20].)
b) Assuming that the density you found in a) is reflective of the typical times between phone calls, answer the
following questions.
i) What is the probability that at least 8 minutes elapse between the most recent phone call and the next?
ii) What is the probability that the time between two successive phone calls is between 2.5 and 6.5 minutes?
iii) What is the mean time between calls?
iv) What is the median time between calls?
v) What is the 75th percentile of the elapsed time between calls?
>
Problem 3. Consider the function, DEFINED WHEN x > 0, by
> pnotadensity:=x->x^(5/2)*exp(-x);
>
and defined to be 0 when x < 0. a) Determine a value for k so that the function DEFINED WHEN x > 0 by
> p:=x->k*pnotadensity(x);
and defined to be 0 when x < 0, IS a density. (Remember that densities must have integral 1.) Put the value you find for k into the definition of p and use it for the remainder of this problem.
b) Plot the density function p(x) and the corresponding c.d.f., for x>0 of course, on the same axes if you can.
c) Use Maple to find the corresponding mean and median.
d) If a random value has this density, find the probability that the value lies between 1.5 and 3.2.
Homework Problems
Your homework for this worksheet consists of Problems 1,2,3 above.
MTH 142 Maple Worksheets written by B. Kaskosz and L. Pakula, Copyright 1999.
Last modified August 1999.
>