Introduction to the practice of statistics solutions manual pdf

Retaining the same accessible format as the popular first edition, SAS and R: Data Management, Statistical Analysis, and Graphics, Second Edition explains how to easily perform an analytical task in both SAS and R, without having to navigate through the extensive, idiosyncratic, and sometimes unwieldy software documentation. The book covers many common tasks, such as data management, descriptive summaries, inferential procedures, regression analysis, and graphics, along with more complex applications. This edition now covers RStudio, a powerful and easy-to-use interface for R. It incorporates a number of additional topics, including application program interfaces (APIs), database management systems, reproducible analysis tools, Markov chain Monte Carlo (MCMC) methods, and finite mixture models. It also includes extended examples of simulations and many new examples. Through the extensive indexing and cross-referencing, users can directly find and implement the material they need. SAS users can look up tasks in the SAS index and then find the associated R code while R users can benefit from the R index in a similar manner. Numerous example analyses demonstrate the code in action and facilitate further exploration. Features • Presents parallel examples in SAS and R to demonstrate how to use the software and derive identical answers regardless of software choice • Takes users through the process of statistical coding from beginning to end • Contains worked examples of basic and complex tasks, offering solutions to stumbling blocks often encountered by new users • Includes an index for each software, allowing users to easily locate procedures • Shows how RStudio can be used as a powerful, straightforward interface for R • Covers APIs, reproducible analysis, database management systems, MCMC methods, and finite mixture models • Incorporates extensive examples of simulations • Provides the SAS and R example code, datasets, and more online



Chapter 1 Solutions

1.1. Most students will prefer to work in seconds, to avoid having to work with decimals or fractions.

1.2. Who? The individuals in the data set are students in a statistics class. What? There are eight variables: ID (a label, with no units); Exam1, Exam2, Homework, Final, and Project (in units of “points,” scaled from 0 to 100); TotalPoints (in points, computed from the other scores, on a scale of 0 to 900); and Grade (A, B, C, D, and E). Why? The primary purpose of the data is to assign grades to the students in this class, and (presumably) the variables are appropriate for this purpose. (The data might also be useful for other purposes.)

1.3. Exam1 = 79, Exam2 = 88, Final = 88.

1.4. For this student, TotalPoints = 2 · 86 + 2 · 82 + 3 · 77 + 2 · 90 + 80 = 827, so the grade is B.

1.5. The cases are apartments. There are five variables: rent (quantitative), cable (categorical), pets (categorical), bedrooms (quantitative), distance to campus (quantitative).

1.6. (a) To find injuries per worker, divide the rates in Example 1.6 by 100,000 (or, redo the computations without multiplying by 100,000). For wage and salary workers, there are 0.000034 fatal injuries per worker. For self-employed workers, there are 0.000099 fatal injuries per worker. (b) These rates are 1/10 the size of those in Example 1.6, or 10,000 times larger than those in part (a): 0.34 fatal injuries per 10,000 wage/salary workers, and 0.99 fatal injuries per 10,000 self-employed workers. (c) The rates in Example 1.6 would probably be more easily understood by most people, because numbers like 3.4 and 9.9 feel more “familiar.” (It might be even better to give rates per million workers: 34 and 99.)

1.7. Shown are two possible stemplots; the first uses split stems (described on page 11 of the text). The scores are slightly left-skewed; most range from 70 to the low 90s.

Split stems:

5 | 58
6 | 0
6 | 58
7 | 0023
7 | 5558
8 | 00003
8 | 5557
9 | 0002233
9 | 8

Unsplit stems:

5 | 58
6 | 058
7 | 00235558
8 | 000035557
9 | 00022338
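The two stemplots for Exercise 1.7 can be reproduced with a few lines of code. The following is a minimal sketch, not the method used in the text: the function name `stemplot` and its `split` flag are illustrative, and the 30 scores are read back from the leaves shown above. Blank stems between the minimum and maximum are kept, in line with the advice in Exercise 1.9(b).

```python
def stemplot(values, split=False):
    """Return stemplot lines for non-negative integers (stem = tens digit, leaf = ones digit).

    With split=True each stem is split in two: leaves 0-4 and leaves 5-9.
    """
    width = 5 if split else 10          # range of values covered by one row
    rows = {}
    for v in sorted(values):
        rows.setdefault(v // width, []).append(str(v % 10))
    lines = []
    # Walk every row from the lowest to the highest so blank stems are retained.
    for row in range(min(rows), max(rows) + 1):
        stem = row * width // 10
        leaves = "".join(rows.get(row, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


# First exam scores, read off the stemplot leaves for Exercise 1.7.
scores = [55, 58, 60, 65, 68, 70, 70, 72, 73, 75, 75, 75, 78,
          80, 80, 80, 80, 83, 85, 85, 85, 87,
          90, 90, 90, 92, 92, 93, 93, 98]

print("\n".join(stemplot(scores, split=True)))
print()
print("\n".join(stemplot(scores)))
```

Calling `stemplot(scores, split=True)` gives the nine-row version and `stemplot(scores)` the five-row version.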

1.8. Preferences will vary. However, the stemplot in Figure 1.8 shows a bit more detail, which is useful for comparing the two distributions. 1.9. (a) The stemplot of the altered data is shown on the right. (b) Blank stems should always be retained (except at the beginning or end of the stemplot), because the gap in the distribution is an important piece of information about the data.

1 | 6
2 |
2 | 5568
3 | 34
3 | 55678
4 | 012233
4 | 8
5 | 1

1.10. Student preferences will vary. The stemplot has the advantage of showing each individual score. Note that this histogram has the same shape as the second stemplot in the solution to Exercise 1.7.

[Figure: histogram of first exam scores; vertical axis: frequency.]

1.11. Student preferences may vary, but the larger classes in this histogram hide a lot of detail.

Looking at Data—Distributions

[Figure: histogram of first exam scores; vertical axis: frequency.]

1.12. This histogram shows more details about the distribution (perhaps more detail than is useful). Note that this histogram has the same shape as the first stemplot in the solution to Exercise 1.7.

[Figure: histogram of first exam scores with narrower classes; vertical axis: frequency.]

1.13. Using either a stemplot or histogram, we see that the distribution is left-skewed, centered near 80, and spread from 55 to 98. (Of course, a histogram would not show the exact values of the maximum and minimum.) 1.14. (a) The cases are the individual employees. (b) The first four (employee identification number, last name, first name, and middle initial) are labels. Department and education level are categorical variables; number of years with the company, salary, and age are quantitative variables. (c) Column headings in student spreadsheets will vary, as will sample cases. 1.15. A Web search for “city rankings” or “best cities” will yield lots of ideas, such as crime rates, income, cost of living, entertainment and cultural activities, taxes, climate, and school system quality. (Students should be encouraged to think carefully about how some of these might be quantitatively measured.)

1.16. Recall that categorical variables place individuals into groups or categories, while quantitative variables “take numerical values for which arithmetic operations ... make sense.” Variables (a), (d), and (e) (age, amount spent on food, and height) are quantitative. The answers to the other three questions (about dancing, musical instruments, and broccoli) are categorical variables.

1.18. Student answers will vary. A Web search for “college ranking methodology” gives some ideas; in recent years, U.S. News and World Report used “16 measures of academic excellence,” including academic reputation (measured by surveying college and university administrators), retention rate, graduation rate, class sizes, faculty salaries, student-faculty ratio, percentage of faculty with highest degree in their fields, quality of entering students (ACT/SAT scores, high school class rank, enrollment-to-admission ratio), financial resources, and the percentage of alumni who give to the school.

1.19. For example, blue is by far the most popular choice; 70% of respondents chose 3 of the 10 options (blue, green, and purple).

[Figure: bar graph of favorite color; vertical axis: percent.]

1.20. For example, opinions about least-favorite color are somewhat more varied than favorite colors. Interestingly, purple is liked and disliked by about the same fractions of people.

[Figure: bar graph of least favorite color; vertical axis: percent.]

1.21. (a) There were 232 total respondents. The table that follows gives the percents; for example, $10/232 \approx 4.31\%$. (b) The bar graph follows the table. (c) For example, 87.5% of the group were between 19 and 50. (d) The age-group classes do not have equal width: The first is 18 years wide, the second is 6 years wide, the third is 11 years wide, etc. Note: In order to produce a histogram from the given data, the bar for the first age group would have to be three times as wide as the second bar, the third bar would have to be wider than the second bar by a factor of 11/6, etc. Additionally, if we change a bar’s width by a factor of x, we would need to change that bar’s height by a factor of 1/x.

Age group (years)   Percent
1 to 18               4.31%
19 to 24             41.81%
25 to 35             30.17%
36 to 50             15.52%
51 to 69              6.03%
70 and over           2.16%

[Figure: bar graph of percent by age group (years).]
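The note in 1.21(d) says that widening a bar by a factor of x requires shrinking its height by a factor of 1/x, so that the bar's area still represents its percent. A minimal sketch of that rescaling follows. It assumes ages are whole years with inclusive class bounds (so the widths come out as 18, 6, 11, ... as stated above) and it leaves out the open-ended "70 and over" class, which has no defined width; the function name `density_heights` is illustrative.

```python
# Closed age classes from the table for Exercise 1.21: (lower, upper, percent).
# "70 and over" is open-ended, so it is omitted from this sketch.
classes = [
    (1, 18, 4.31),
    (19, 24, 41.81),
    (25, 35, 30.17),
    (36, 50, 15.52),
    (51, 69, 6.03),
]


def density_heights(classes):
    """Return (label, width in years, percent per year) for each class.

    Height = percent / width, so that width * height = percent (the bar's area).
    """
    rows = []
    for lower, upper, percent in classes:
        width = upper - lower + 1       # inclusive bounds: 1 to 18 spans 18 years
        rows.append((f"{lower} to {upper}", width, percent / width))
    return rows


rows = density_heights(classes)
for label, width, height in rows:
    print(f"{label:>8}: width {width:2d} years, height {height:.3f} percent per year")
```

On this scale the 19 to 24 group stands out even more than in the bar graph, because its large percent is concentrated in the narrowest class.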

1.22. (a) & (b) The bar graph and pie charts are shown below. (c) A clear majority (76%) agree or strongly agree that they browse more with the iPhone than with their previous phone. (d) Student preferences will vary. Some might prefer the pie chart because it is more familiar.

[Figures: bar graph and pie chart of response percent for the responses strongly agree, mildly agree, mildly disagree, and strongly disagree.]

1.23. Ordering bars by decreasing height shows the models most affected by iPhone sales. However, because “other phone” and “replaced nothing” are different from the other categories, it makes sense to place those two bars last (in any order).

[Figure: bar graph of replacement percent by previous phone model.]

1.24. (a) The weights add to 254.2 million tons, and the percents add to 99.9. (b) & (c) The bar graph and pie chart are shown below.

[Figures: bar graph and pie chart of percent of total waste by source (paper and paperboard; yard trimmings; food scraps; plastics; metals; rubber, leather, and textiles; wood; glass; other).]

1.25. (a) & (b) Both bar graphs are shown below. (c) The ordered bars in the graph from (b) make it easier to identify those materials that are frequently recycled and those that are not. (d) Each percent represents part of a different whole. (For example, 2.6% of food scraps are recycled; 23.7% of glass is recycled, etc.)

[Figures: two bar graphs of percent recycled by material, the second with bars ordered by height.]

1.26. (a) The bar graph is shown below. (b) The graph clearly illustrates the dominance of Google; its bar dwarfs those of the other search engines.

[Figure: bar graph of market share (%) by search engine (Google, Yahoo, MSN, AOL, Microsoft Live, Ask, Other).]

1.27. The two bar graphs are shown below.

[Figures: two bar graphs of percent of all spam by type of spam (adult, financial, health, leisure, products, scams), the second with bars ordered by height.]

1.28. (a) The bar graph is below. (b) The number of Facebook users trails off rapidly after the top seven or so. (Of course, this is due in part to the variation in the populations of these countries. For example, that Norway has nearly half as many Facebook users as France is remarkable, because the 2008 populations of France and Norway were about 62.3 million and 4.8 million, respectively.)

[Figure: bar graph of Facebook users (millions) by country.]

1.29. (a) Most countries had moderate (single- or double-digit) increases in Facebook usage. Chile (2197%) is an extreme outlier, as are (maybe) Venezuela (683%) and Colombia (246%). (b) In the stemplot below, Chile and Venezuela have been omitted, and stems are split five ways. (c) One observation is that, even without the outliers, the distribution is right-skewed. (d) The stemplot can show some of the detail of the low part of the distribution, if the outliers are omitted.

0 | 000
0 | 2333
0 | 4444
0 | 6
0 | 99
1 |
1 | 33
1 |
1 |
1 |
2 |
2 |
2 | 4

1.30. (a) The given percentages refer to nine distinct groups (all M.B.A. degrees, all M.Ed. degrees, and so on) rather than one single group. (b) The bar graph is shown below. Bars are ordered by height, as suggested by the text; students may forget to do this or might arrange in the opposite order (smallest to largest).

[Figure: bar graph of degrees earned by women (%) by graduate degree.]

1.31. (a) The luxury car bar graph is below on the left; bars are in decreasing order of size (the order given in the table). (b) The intermediate car bar graph is below on the right. For this stand-alone graph, it seemed appropriate to re-order the bars by decreasing size. Students may leave the bars in the order given in the table; this (admittedly) might make comparison of the two graphs simpler. (c) The graph on the right is one possible choice for comparing the two types of cars: for each color, we have one bar for each car type.

[Figures: bar graph of percent of luxury cars by color; bar graph of percent of intermediate cars by color; combined bar graph of percent by color with one bar per car type.]

1.32. This distribution is skewed to the right, meaning that Shakespeare’s plays contain many short words (up to six letters) and fewer very long words. We would probably expect most authors to have skewed distributions, although the exact shape and spread will vary.

1.33. Shown is the stemplot; as the text suggests, we have trimmed numbers (dropped the last digit) and split stems. 359 mg/dl appears to be an outlier. Overall, glucose levels are not under control: Only 4 of the 18 had levels in the desired range.

0 | 799
1 | 0134444
1 | 5577
2 | 0
2 | 57
3 |
3 | 5

1.34. The back-to-back stemplot below suggests that the individual-instruction group was more consistent (their numbers have less spread) but not more successful (only two had numbers in the desired range).

Individual |   | Class
           | 0 | 799
        22 | 1 | 0134444
  99866655 | 1 | 5577
     22222 | 2 | 0
         8 | 2 | 57
           | 3 |
           | 3 | 5

1.35. The distribution is roughly symmetric, centered near 7 (or “between 6 and 7”), and spread from 2 to 13.

1.36. (a) Total emissions would almost certainly be higher for very large countries; for example, we would expect that even with great attempts to control emissions, China (with over 1 billion people) would have higher total emissions than the smallest countries in the data set. (b) A stemplot is shown; a histogram would also be appropriate. We see a strong right skew with a peak from 0 to 0.2 metric tons per person and a smaller peak from 0.8 to 1. The three highest countries (the United States, Canada, and Australia) appear to be outliers; apart from those countries, the distribution is spread from 0 to 11 metric tons per person.

0 | 00000000000000011111
0 | 222233333
0 | 445
0 | 6677
0 | 888999
1 | 001
1 |
1 |
1 | 67
1 | 9

1.37. To display the distribution, use either a stemplot or a histogram. DT scores are skewed to the right, centered near 5 or 6, spread from 0 to 18. There are no outliers. We might also note that only 11 of these 264 women (about 4%) scored 15 or higher.

0 | 000000000000000000000000000000000000011111111111111111111
0 | 2222222222222222233333333333333333333333
0 | 444444444444444444445555555555555555555
0 | 666666666666666666667777777777777
0 | 888888888888888999999999999999999
1 | 000000000000111111111
1 | 22222222222233333333333
1 | 444444455
1 | 66666777
1 | 8

1.38. (a) The first histogram shows two modes: 5–5.2 and 5.6–5.8. (b) The second histogram has peaks in locations close to those of the first, but these peaks are much less pronounced, so they would usually not be viewed as distinct modes. (c) The results will vary with the software used.

[Figures: two histograms of rainwater pH (vertical axis: frequency), one with classes starting at 4.2 and one with classes starting at 4.14.]

1.39. Graph (a) is studying time (Question 4); it is reasonable to expect this to be right-skewed (many students study little or not at all; a few study longer). Graph (d) is the histogram of student heights (Question 3): One would expect a fair amount of variation but no particular skewness to such a distribution. The other two graphs are (b) handedness and (c) gender, unless this was a particularly unusual class! We would expect that right-handed students should outnumber lefties substantially. (Roughly 10 to 15% of the population as a whole is left-handed.)

1.40. Sketches will vary. The distribution of coin years would be left-skewed because newer coins are more common than older coins.

1.41. (a) Not only are most responses multiples of 10; many are multiples of 30 and 60. Most people will “round” their answers when asked to give an estimate like this; in fact, the most striking answers are ones such as 115, 170, or 230. The students who claimed 360 minutes (6 hours) and 300 minutes (5 hours) may have been exaggerating. (Some students might also “consider suspicious” the student who claimed to study 0 minutes per night. As a teacher, I can easily believe that such students exist, and I suspect that some of your students might easily accept that claim as well.) (b) The stemplots suggest that women (claim to) study more than men. The approximate centers are 175 minutes for women and 120 minutes for men.

          Women |   | Men
                | 0 | 033334
             96 | 0 | 66679999
       22222221 | 1 | 2222222
888888888875555 | 1 | 558
           4440 | 2 | 00344
                | 2 |
                | 3 | 0
              6 | 3 |

1.42. The stemplot gives more information than a histogram (since all the original numbers can be read off the stemplot), but both give the same impression. The distribution is roughly symmetric with one value (4.88) that is somewhat low. The center of the distribution is between 5.4 and 5.5 (the median is 5.46, the mean is 5.448); if asked to give a single estimate for the “true” density of the earth, something in that range would be the best answer.

48 | 8
49 |
50 | 7
51 | 0
52 | 6799
53 | 04469
54 | 2467
55 | 03578
56 | 12358
57 | 59
58 | 5

1.43. (a) There are four variables: GPA, IQ, and self-concept are quantitative, while gender is categorical. (OBS is not a variable, since it is not really a “characteristic” of a student.) (b) Below. (c) The distribution is skewed to the left, with center (median) around 7.8. GPAs are spread from 0.5 to 10.8, with only 15 below 6. (d) There is more variability among the boys; in fact, there seems to be a subset of boys with GPAs from 0.5 to 4.9. Ignoring that group, the two distributions have similar shapes.

All students:

 0 | 5
 1 | 8
 2 | 4
 3 | 4689
 4 | 0679
 5 | 1259
 6 | 0112249
 7 | 22333556666666788899
 8 | 0000222223347899
 9 | 002223344556668
10 | 01678

By gender:

  Female |    | Male
         |  0 | 5
         |  1 | 8
         |  2 | 4
       4 |  3 | 689
       7 |  4 | 069
     952 |  5 | 1
    4210 |  6 | 129
98866533 |  7 | 223566666789
  997320 |  8 | 0002222348
   65300 |  9 | 2223445668
     710 | 10 | 68

1.44. Stemplot below, with split stems. The distribution is fairly symmetric (perhaps slightly left-skewed) with center around 110 (clearly above 100). IQs range from the low 70s to the high 130s, with a “gap” in the low 80s.

 7 | 24
 7 | 79
 8 |
 8 | 69
 9 | 0133
 9 | 6778
10 | 0022333344
10 | 555666777789
11 | 0000111122223334444
11 | 55688999
12 | 003344
12 | 677888
13 | 02
13 | 6

1.45. Stemplot below, with split stems. The distribution is skewed to the left, with center around 59.5. Most self-concept scores are between 35 and 73, with a few below that, and one high score of 80 (but not really high enough to be an outlier).

2 | 01
2 | 8
3 | 0
3 | 5679
4 | 02344
4 | 6799
5 | 1111223344444
5 | 556668899
6 | 00001233344444
6 | 55666677777899
7 | 0000111223
7 |
8 | 0

1.46. The time plot shows that women’s times decreased quite rapidly from 1972 until the mid-1980s. Since that time, they have been fairly consistent: Almost all times since 1986 are between 141 and 147 minutes.

[Figure: time plot of winning time (minutes) against year, 1970 to 2005.]

1.47. The total for the 24 countries was 897 days, so with Suriname, it is 897 + 694 = 1591 days, and the mean is $\bar{x} = 1591/25 = 63.64$ days.

1.48. The mean score is $\bar{x} = 821/10 = 82.1$.

1.49. To find the ordered list of times, start with the 24 times in Example 1.23, and add 694 to the end of the list. The ordered times (with the median in brackets) are

4, 11, 14, 23, 23, 23, 23, 24, 27, 29, 31, 33, [40], 42, 44, 44, 44, 46, 47, 60, 61, 62, 65, 77, 694

The outlier increases the median from 36.5 to 40 days, but the change is much less than the outlier’s effect on the mean.

1.50. The median of the service times is 103.5 seconds. (This is the average of the 40th and 41st numbers in the sorted list, but for a set of 80 numbers, we assume that most students will compute the median using software, which does not require that the data be sorted.)

1.51. In order, the scores are:

55, 73, 75, 80, [80], [85], 90, 92, 93, 98

The middle two scores (in brackets) are 80 and 85, so the median is $M = (80 + 85)/2 = 82.5$.
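The mean in 1.48 and the median in 1.51 are easy to check in software. A minimal sketch using Python's standard `statistics` module follows; the helper `median_by_hand` is an illustrative name that spells out the "average the two middle values" rule for an even number of observations.

```python
import statistics

# The ten exam scores listed in the solution to Exercise 1.51.
scores = [55, 73, 75, 80, 80, 85, 90, 92, 93, 98]


def median_by_hand(values):
    """Median via the textbook rule: middle value, or average of the two middle values."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2


mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

print("sum    =", sum(scores))
print("mean   =", mean_score)
print("median =", median_score, "(by hand:", median_by_hand(scores), ")")
```

As 1.50 notes, the library call does not require the data to be sorted first.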

1.52. See the ordered list given in the previous solution. The first quartile is $Q_1 = 75$, the median of the first five numbers: 55, 73, [75], 80, 80. Similarly, $Q_3 = 92$, the median of the last five numbers: 85, 90, [92], 93, 98.

1.53. The maximum and minimum can be found by inspecting the list. The sorted list follows; read down each column. The quartile and median locations fall between 54 and 55, between 103 and 104, and between 199 and 201.

   1   19   55   75  104  140  201   372
   2   25   56   76  106  141  203   386
   2   30   57   76  115  143  211   438
   3   35   59   77  116  148  225   465
   4   40   64   80  118  148  274   479
   9   44   67   88  121  157  277   700
   9   48   68   89  126  178  289   700
   9   51   73   90  128  179  290   951
  11   52   73  102  137  182  325  1148
  19   54   75  103  138  199  367  2631

This confirms the five-number summary (1, 54.5, 103.5, 200, and 2631 seconds) given in Example 1.26. The sum of the 80 numbers is 15,726 seconds, so the mean is $\bar{x} = 15{,}726/80 = 196.575$ seconds (the value 197 in the text was rounded).

Note: The most tedious part of this process is sorting the numbers and adding them all up. Unless you really want to confirm that your students can sort a list of 80 numbers, consider giving the students the sorted list of times, and checking their ability to identify the locations of the quartiles.

1.54. The median and quartiles were found earlier; the minimum and maximum are easy to locate in the ordered list of scores (see the solutions to Exercises 1.51 and 1.52), so the five-number summary is Min = 55, $Q_1 = 75$, $M = 82.5$, $Q_3 = 92$, Max = 98.
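The five-number summaries in 1.53 and 1.54 can be checked with a short function. This is a sketch under one stated assumption: quartiles are taken as the medians of the lower and upper halves of the sorted data, with the overall median left out of both halves when the count is odd, which is the rule these solutions apply. Statistical software often uses other quartile conventions and may give slightly different values. The function names are illustrative.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max) using the median-of-halves quartile rule."""
    xs = sorted(values)
    n = len(xs)
    lower = xs[: n // 2]            # observations below the median's location
    upper = xs[(n + 1) // 2:]       # observations above the median's location
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])


# The 80 service times (seconds) from the sorted list in the solution to 1.53.
service_times = [
    1, 2, 2, 3, 4, 9, 9, 9, 11, 19,
    19, 25, 30, 35, 40, 44, 48, 51, 52, 54,
    55, 56, 57, 59, 64, 67, 68, 73, 73, 75,
    75, 76, 76, 77, 80, 88, 89, 90, 102, 103,
    104, 106, 115, 116, 118, 121, 126, 128, 137, 138,
    140, 141, 143, 148, 148, 157, 178, 179, 182, 199,
    201, 203, 211, 225, 274, 277, 289, 290, 325, 367,
    372, 386, 438, 465, 479, 700, 700, 951, 1148, 2631,
]

# The ten exam scores from the solution to 1.51.
exam_scores = [55, 73, 75, 80, 80, 85, 90, 92, 93, 98]

print("service times:", five_number_summary(service_times))
print("mean service time:", sum(service_times) / len(service_times))
print("exam scores:", five_number_summary(exam_scores))
```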

1.55. Use the five-number summary from the solution to Exercise 1.54:

[Figure: boxplot of scores on the first exam.]

1.56. The interquartile range is $\mathrm{IQR} = Q_3 - Q_1 = 92 - 75 = 17$, so the 1.5 × IQR rule would consider as outliers scores outside the range $Q_1 - 25.5 = 49.5$ to $Q_3 + 25.5 = 117.5$. According to this rule, there are no outliers.

1.57. The variance can be computed from the formula $s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2$; for example, the first term in the sum would be $(80 - 82.1)^2 = 4.41$. However, in practice, software or a calculator is the preferred approach; this yields $s^2 = 1416.9/9 \approx 157.43$ and $s = \sqrt{s^2} \approx 12.5472$.
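The arithmetic in 1.56 and 1.57 can be traced step by step. The sketch below computes the variance directly from its definition and then applies the 1.5 × IQR rule; it takes the quartiles $Q_1 = 75$ and $Q_3 = 92$ from the solution to 1.52 rather than recomputing them, and all variable names are illustrative.

```python
import math

# The ten exam scores from the solution to 1.51.
scores = [55, 73, 75, 80, 80, 85, 90, 92, 93, 98]

n = len(scores)
mean = sum(scores) / n

# Sum of squared deviations from the mean, then divide by n - 1.
sum_sq = sum((x - mean) ** 2 for x in scores)
variance = sum_sq / (n - 1)
std_dev = math.sqrt(variance)

# 1.5 x IQR rule, using the quartiles found in the solution to 1.52.
q1, q3 = 75, 92
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < low_fence or x > high_fence]

print(f"sum of squares = {sum_sq:.1f}")
print(f"s^2 = {variance:.2f}, s = {std_dev:.4f}")
print(f"fences: {low_fence} to {high_fence}; outliers: {outliers}")
```

The same fence calculation applies to any five-number summary; only `q1` and `q3` change.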

1.58. In order to have s = 0, all 5 cases must be equal; for example, 1, 1, 1, 1, 1, or 12.5, 12.5, 12.5, 12.5, 12.5. (If any two numbers are different, then $x_i - \bar{x}$ would be nonzero for some i, so the sum of squared differences would be positive, so $s^2 > 0$, so $s > 0$.)

1.59. Without Suriname, the quartiles are 23 and 46.5 days; with Suriname included, they are 23 and 53.5 days. Therefore, the IQR increases from 23.5 to 30.5 days, a much less drastic change than the change in s (18.6 to 132.6 days).

1.60. Divide total score by 4: $950/4 = 237.5$ points.
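Solutions 1.47, 1.49, and 1.59 all describe how one extreme value (Suriname, 694 days) affects different summaries. The sketch below recomputes those comparisons from the ordered list in 1.49. The quartile rule (medians of the lower and upper halves, excluding the overall median when the count is odd) is an assumption chosen to match these solutions, and the function names are illustrative.

```python
import statistics

# The 24 times (days) from the ordered list in the solution to 1.49, without Suriname.
times = [4, 11, 14, 23, 23, 23, 23, 24, 27, 29, 31, 33,
         40, 42, 44, 44, 44, 46, 47, 60, 61, 62, 65, 77]
times_with_outlier = times + [694]


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the sorted data."""
    xs = sorted(values)
    n = len(xs)
    return statistics.median(xs[: n // 2]), statistics.median(xs[(n + 1) // 2:])


def summarize(values):
    """Collect the summaries compared in 1.47, 1.49, and 1.59."""
    q1, q3 = quartiles(values)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "s": statistics.stdev(values),      # sample standard deviation (n - 1 divisor)
        "IQR": q3 - q1,
    }


without = summarize(times)
with_suriname = summarize(times_with_outlier)

for name in ("mean", "median", "s", "IQR"):
    print(f"{name:>6}: {without[name]:8.2f} -> {with_suriname[name]:8.2f}")
```

The mean and standard deviation move a great deal, while the median and IQR change only modestly, which is the point of these three exercises.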

1.61. (a) Use a stemplot or histogram. (b) Because the distribution is skewed, the five-number summary is the best choice; in millions of dollars, it is

Min     $Q_1$   M        $Q_3$    Max
3338    4589    7558.5   13,416   66,667

Some students might choose the less-appropriate summary: $\bar{x} \approx 12{,}144$ and $s \approx 12{,}421$ million dollars. (c) For example, the distribution is sharply right-skewed. (This is not surprising given that we are looking at the top 100 companies; the top fraction of most distributions will tend to be skewed to the right.)

0 | 333333333333333333444444444444
0 | 55555555566666677777777778888889
1 | 00001112223333333
1 | 79
2 | 01111233
2 | 559
3 | 114
3 | 5

[The split stems 4, 5, and 6 hold the remaining leaves 3, 99, and 6; the maximum, 66,667, is 6 | 6, but the rows of the leaves 3 and 99 cannot be recovered.]

1.62. (a) Either a stemplot or histogram can be used to display the distribution. Two stemplots are shown below: one with all points, and one with the outlier mentioned in part (b) excluded. In the table are the mean and standard deviation, as well as the five-number summary, both with and without the outlier (all values are percents). The latter is preferable because of the outlier; in particular, note the outlier’s effect on the standard deviation. (See also the solution to the next exercise.) (b) O’Doul’s is marketed as “non-alcoholic” beer.

              $\bar{x}$   s        Min   $Q_1$   M     $Q_3$   Max
All points    4.7593      0.7523   0.4   4.30    4.7   5       6.5
No O’Doul’s   4.8106      0.5864   3.8   4.35    4.7   5       6.5

Note: In federal regulations, part of the definition of beer is that it has at least 0.5% alcohol. By that standard, O’Doul’s is a low-alcohol beverage, but it is not beer.

All points:

0 | 4
0 |
1 |
1 |
2 |
2 |
3 |
3 | 88
4 | 11111122222222223334444
4 | 555555666667777777778889999999999
5 | 000000011224
5 | 5666688999999
6 | 1
6 | 5

Without O’Doul’s:

3 | 88
4 | 111111
4 | 2222222222333
4 | 4444555555
4 | 66666777777777
4 | 8889999999999
5 | 000000011
5 | 22
5 | 45
5 | 6666
5 | 88999999
6 | 1
6 |
6 | 5

1.63. All of these numbers are given in the table in the solution to the previous exercise. (a) $\bar{x}$ changes from 4.76% (with) to 4.81% (without); the median (4.7%) does not change. (b) s changes from 0.7523% to 0.5864%; $Q_1$ changes from 4.3% to 4.35%, while $Q_3 = 5\%$ does not change. (c) A low outlier decreases $\bar{x}$; any kind of outlier increases s. Outliers have little or no effect on the median and quartiles.

1.64. (a) A stemplot or histogram can be used to display the distribution. Students may report either mean/standard deviation or the five-number summary (in units of calories):

$\bar{x}$   s       Min   $Q_1$   M       $Q_3$   Max
141.06      27.79   70    113     145.5   157     210

 7 | 0
 8 |
 9 | 4556889
10 | 2458
11 | 00000000334
12 | 08
13 | 0235558
14 | 22333444555666788899
15 | 0012233356777
16 | 00012336669
17 | 01459
18 | 8
19 | 5
20 | 00
21 | 0

(b) O’Doul’s has the fewest calories (70) of these 86 beers. (c) Nearly all the beers with fewer than 120 calories are marketed as light beers (and most have “light” in their names). Of the other beers, only one (Weinhard’s Amber Light) is called “light.”

Note: If we apply the 1.5 × IQR rule to all 86 beers, O’Doul’s does not qualify as an outlier (the cutoff is 47). However, if we restrict our attention to the light beers (fewer than 120 calories), any beer below 80 calories is an outlier.

1.65. Use a small data set with an odd number of points, so that the median is the middle number. After deleting the lowest observation, the median will be the average of that middle number and the next number after it; if that latter number is much larger, the median will change substantially. For example, start with 0, 1, [2], 998, 1000 (median in brackets); after removing 0, the median changes from 2 to 500.

1.66. Salary distributions (especially in professional sports) tend to be skewed to the right. This skew makes the mean higher than the median.
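The example in 1.65 is small enough to verify directly. A minimal sketch using the standard `statistics` module, with illustrative variable names:

```python
import statistics

# The five-point example from the solution to 1.65.
data = [0, 1, 2, 998, 1000]

median_before = statistics.median(data)        # middle value of five points
median_after = statistics.median(data[1:])     # lowest observation (0) removed

print("median before:", median_before)
print("median after: ", median_after)
```

With four points left, the median becomes the average of 2 and 998, which is why deleting a single low value moves it so far.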

1.67. (a) The distribution is left-skewed. While the skew makes the five-number summary preferable, some students might give the mean/standard deviation. In ounces, these statistics are:

$\bar{x}$   s       Min   $Q_1$   M     $Q_3$   Max
6.456       1.425   3.7   4.95    6.7   7.85    8.2

3 | 7
4 | 3
4 | 7777
5 | 23
5 |
6 | 0033
6 | 7
7 | 03
7 | 668899999
8 | 2

(b) The numerical summary does not reveal the two weight clusters (visible in a stemplot or histogram). (c) For small potatoes (less than 6 oz), n = 8, $\bar{x} = 4.662$ oz, and s = 0.501 oz. For large potatoes, n = 17, $\bar{x} = 7.300$ oz, and s = 0.755 oz. Because there are clearly two groups, it seems appropriate to treat them separately.

1.68. (a) The five-number summary is Min = 2.2 cm, $Q_1 = 10.95$ cm, M = 28.5 cm, $Q_3 = 41.9$ cm, Max = 69.3 cm. (b) & (c) The boxplot and histogram are shown below. (Students might choose different interval widths for the histogram.) (d) Preferences will vary. Both plots reveal the right-skew of this distribution, but the boxplot does not show the two peaks visible in the histogram.

[Figures: boxplot and histogram of diameter at breast height (cm); histogram vertical axis: frequency.]

30

60

25

50 40 30

Frequency

CRP (mg/l)

1.69. (a) The five-number summary is Min = 0 mg/l, Q 1 = 0 mg/l, M = 5.085 mg/l, Q 3 = 9.47 mg/l, Max = 73.2 mg/l. (b) & (c) The boxplot and histogram are shown below. (Students might choose different interval widths for the histogram.) (d) Preferences will vary. Both plots reveal the sharp right-skew of this distribution, but because Min = Q 1 , the boxplot looks somewhat strange. The histogram seems to convey the distribution better.

20 15 10

20

5

10

0

0

0

10

20

30

40 50 60 CRP (mg/l)

70

80

90

1.70. Answers depend on whether natural (base-e) or common (base-10) logarithms are used. Both sets of answers are shown here. If this exercise is assigned, it would probably be best for the sanity of both instructor and students to specify which logarithm to use. (a) The five-number summary is:

Logarithm   Min   $Q_1$   M        $Q_3$    Max
Natural     0     0       1.8048   2.3485   4.3068
Common      0     0       0.7838   1.0199   1.8704

(The ratio between these answers is roughly $\ln 10 \approx 2.3$.) (b) & (c) The boxplots and histograms are shown below. (Students might choose different interval widths for the histograms.) (d) As for Exercise 1.69, preferences will vary.

[Figures: boxplots and histograms of the natural log of (1+CRP) and of the base-10 log of (1+CRP); histogram vertical axes: frequency.]

1.71. (a) The five-number summary (in units of µmol/l) is Min = 0.24, $Q_1 = 0.355$, M = 0.76, $Q_3 = 1.03$, Max = 1.9. (b) & (c) The boxplot and histogram are shown below. (Students might choose different interval widths for the histogram.) (d) The distribution is right-skewed. A histogram (or stemplot) is preferable because it reveals an important feature not evident from a boxplot: This distribution has two peaks.

[Figures: boxplot and histogram of retinol level (µmol/l); histogram vertical axis: frequency.]

Solutions

69

1.72. The mean and standard deviation for these ratings are x̄ ≈ 5.9 and s ≈ 3.7719; the five-number summary is Min = Q1 = 1, M = 6.5, Q3 = Max = 10. For a graphical presentation, a stemplot (or histogram) is better than a boxplot because the latter obscures details about the distribution. (With a little thought, one might realize that Min = Q1 = 1 and Q3 = Max = 10 means that there are lots of 1's and lots of 10's, but this is much more evident in a stemplot or histogram.)

 1 | 0000000000000000
 2 | 0000
 3 | 0
 4 | 0
 5 | 00000
 6 | 000
 7 | 0
 8 | 000000
 9 | 00000
10 | 000000000000000000

1.73. The distribution of household net worth would almost surely be strongly skewed to the right: Most families would generally have accumulated little or modest wealth, but a few would have become rich. This strong skew pulls the mean to be higher than the median.

1.74. See also the solution to Exercise 1.36. (a) The five-number summary (in units of metric tons per person) is: Min = 0, Q1 = 0.75, M = 3.2, Q3 = 7.8, Max = 19.9. The evidence for the skew is in the large gaps between the higher numbers; that is, the differences Q3 − M and Max − Q3 are large compared to Q1 − Min and M − Q1. (b) The IQR is Q3 − Q1 = 7.05, so outliers would be less than 0.75 − 1.5(7.05) = −9.825 or greater than 7.8 + 1.5(7.05) = 18.375. According to this rule, only the United States qualifies as an outlier, but Canada and Australia seem high enough to also include them.

Stemplot (stems are tens and leaves are ones of metric tons per person; each stem is split into five lines):

0 | 00000000000000011111
0 | 222233333
0 | 445
0 | 6677
0 | 888999
1 | 001
1 |
1 |
1 | 67
1 | 9
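The 1.5 × IQR rule used in Exercise 1.74(b) can be sketched in a few lines. This is only an illustration: the function name `iqr_fences` is hypothetical, and the quartiles are the ones quoted in the solution rather than recomputed from the data file.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) cut-offs for the k x IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles quoted in the solution to Exercise 1.74 (metric tons per person).
lower, upper = iqr_fences(q1=0.75, q3=7.8)
print(f"Suspected outliers lie below {lower:.3f} or above {upper:.3f}")

# The maximum (19.9) is the only quoted value that can be checked here.
print("Maximum flagged as an outlier:", 19.9 > upper)
```

Because the lower fence is negative and no value can be below 0, only high outliers are possible for these data.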

1.75. The total salary is $690,000, so the mean is x̄ = $690,000/9 ≈ $76,667. Six of the nine employees earn less than the mean. The median is M = $35,000.

1.76. If three individuals earn $0, $0, and $20,000, the reported median is $20,000. If the two individuals with no income take jobs at $14,000 each, the median decreases to $14,000. The same thing can happen to the mean: In this example, the mean drops from $20,000 to $16,000.

1.77. The total salary is now $825,000, so the new mean is x̄ = $825,000/9 ≈ $91,667. The median is unchanged.

1.78. Details are in the table below: x̄ = 11,200/7 = 1600, s² = 214,872/6 = 35,812, and s = √35,812 ≈ 189.24.

         x_i    x_i − x̄   (x_i − x̄)²
        1792      192        36864
        1666       66         4356
        1362     −238        56644
        1614       14          196
        1460     −140        19600
        1867      267        71289
        1439     −161        25921
Total  11200        0       214872


1.79. The quote describes a distribution with a strong right skew: Lots of years with no losses to hurricanes ($0), but very high numbers when they do occur. For example, if there is one hurricane in a 10-year period causing $1 million in damages, the "average annual loss" for that period would be $100,000, but that does not adequately represent the cost for the year of the hurricane. Means are not the appropriate measure of center for skewed distributions.

1.80. (a) x̄ and s are appropriate for symmetric distributions with no outliers. (b) Both high numbers are flagged as outliers. For women, IQR = 60, so the upper 1.5 × IQR limit is 300 minutes. For men, IQR = 90, so the upper 1.5 × IQR limit is 285 minutes. The table below shows the effect of removing these outliers.

              Women             Men
           x̄       s        x̄       s
Before   165.2    56.5     117.2    74.2
After    158.4    43.7     110.9    66.9

1.81. (a) & (b) See the table below. In both cases, the mean and median are quite similar.

           x̄        s        M
pH        5.4256   0.5379   5.44
Density   5.4479   0.2209   5.46

1.82. See also the solution to Exercise 1.43. (a) The mean of this distribution appears to be higher than 100. (There is no substantial difference between the standard deviations.) (b) The mean and median are quite similar; the mean is slightly smaller due to the slight left skew of the data. (c) In addition to the mean and median, the standard deviation is shown for reference (the exercise did not ask for it).

        x̄       s       M
IQ    108.9   13.17   110
GPA   7.447   (2.1)   7.829

Note: Students may be somewhat puzzled by the statement in (b) that the median is "close to the mean" (when they differ by 1.1), followed by (c), where they "differ a bit" (when M − x̄ = 0.382). It may be useful to emphasize that we judge the size of such differences relative to the spread of the distribution. For example, we can note that 1.1/13.17 ≈ 0.08 for (b), and 0.382/2.1 ≈ 0.18 for (c).

1.83. With only two observations, the mean and median are always equal because the median is halfway between the middle two (in this case, the only two) numbers.

1.84. (a) The mean (green arrow) moves along with the moving point (in fact, it moves in the same direction as the moving point, at one-third the speed). At the same time, as long as the moving point remains to the right of the other two, the median (red arrow) points to the middle point (the rightmost nonmoving point). (b) The mean follows the moving point as before. When the moving point passes the rightmost fixed point, the median slides along with it until the moving point passes the leftmost fixed point, then the median stays there.

1.85. (a) There are several different answers, depending on the configuration of the first five points. Most students will likely assume that the first five points should be distinct (no repeats), in which case the sixth point must be placed at the median. This is because the median of 5 (sorted) points is the third, while the median of 6 points is the average of the third and fourth. If these are to be the same, the third and fourth points of the set of six must both equal the third point of the set of five. The diagram below illustrates all of the possibilities; in each case, the arrow shows the location of the median of the initial five points, and the shaded region (or dot) on the line indicates where the sixth point can be placed without changing the median. Notice that there are four cases where the median does not change, regardless of the location of the sixth point. (The points need not be equally spaced; these diagrams were drawn that way for convenience.)

(b) Regardless of the configuration of the first five points, if the sixth point is added so as to leave the median unchanged, then in that (sorted) set of six, the third and fourth points must be equal. One of these two points will be the middle (fourth) point of the (sorted) set of seven, no matter where the seventh point is placed. Note: If you have a student who illustrates all possible cases above, then it is likely that the student either (1) obtained a copy of this solutions manual, (2) should consider a career in writing solutions manuals, (3) has too much time on his or her hands, or (4) both 2 and 3 (and perhaps 1) are true.

1.86. The five-number summaries (all in millimeters) are:

Variety   Min     Q1      M       Q3       Max
bihai     46.34   46.71   47.12   48.245   50.26
red       37.40   38.07   39.16   41.69    43.09
yellow    34.57   35.45   36.11   36.82    38.13

H. bihai is clearly the tallest variety: the shortest bihai was over 3 mm taller than the tallest red. Red is generally taller than yellow, with a few exceptions. Another noteworthy fact: The red variety is more variable than either of the other varieties.

[Figure: side-by-side boxplots of length (mm) by Heliconia variety (bihai, red, yellow).]

1.87. (a) The means and standard deviations (all in millimeters) are:

Variety   x̄         s
bihai     47.5975   1.2129
red       39.7113   1.7988
yellow    36.1800   0.9753

Stemplots (stems are millimeters, leaves are tenths):

bihai
46 | 3466789
47 | 114
48 | 0133
49 |
50 | 12

red
37 | 4789
38 | 0012278
39 | 167
40 | 56
41 | 4699
42 | 01
43 | 0

yellow
34 | 56
35 | 146
36 | 0015678
37 | 01
38 | 1

(b) Bihai and red appear to be right-skewed (although it is difficult to tell with such small samples). Skewness would make these distributions unsuitable for x̄ and s.

1.88. (a) The mean is x̄ = 15, and the standard deviation is s ≈ 5.4365. (b) The mean is still 15; the new standard deviation is 3.7417. (c) Using the mean as a substitute for missing data will not change the mean, but it decreases the standard deviation.

1.89. The minimum and maximum are easily determined to be 1 and 12 letters, and the quartiles and median can be found by adding up the bar heights. For example, the first two bars have total height 22.3% (less than 25%), and adding the third bar brings the total to 45%, so Q1 must equal 3 letters. Continuing this way, we find that the five-number summary, in units of letters, is: Min = 1, Q1 = 3, M = 4, Q3 = 5, Max = 12. Note that even without the frequency table given in the data file, we could draw the same conclusion by estimating the heights of the bars in the histogram.

1.90. Because the mean is to be 7, the five numbers must add up to 35. Also, the third number (in order from smallest to largest) must be 10 because that is the median. Beyond that, there is some freedom in how the numbers are chosen. Note: It is likely that many students will interpret "positive numbers" as meaning positive integers only, which leads to eight possible solutions, shown below.

1  1  10  10  13      1  3  10  10  11
1  1  10  11  12      1  4  10  10  10
1  2  10  10  12      2  2  10  10  11
1  2  10  11  11      2  3  10  10  10
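The count of eight integer solutions in Exercise 1.90 can be confirmed by brute force. The sketch below assumes the "positive integers only" reading mentioned in the note; the function name and its parameters are hypothetical.

```python
from itertools import combinations_with_replacement


def five_number_sets(total=35, median=10):
    """List sorted 5-tuples of positive integers with the given sum and median."""
    solutions = []
    # Each value is at most `total`, and the tuples come out in sorted order,
    # so position 2 (the third value) is the median.
    for combo in combinations_with_replacement(range(1, total + 1), 5):
        if sum(combo) == total and combo[2] == median:
            solutions.append(combo)
    return solutions


solutions = five_number_sets()
for s in solutions:
    print(s)
print("Number of solutions:", len(solutions))
```

Exhaustive search is affordable here because there are only a few hundred thousand candidate tuples; a larger problem would call for a smarter enumeration.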

1.91. The simplest approach is to take (at least) six numbers, say a, b, c, d, e, f in increasing order. For this set, Q3 = e; we can cause the mean to be larger than e by simply choosing f to be much larger than e. For example, if all numbers are nonnegative, f > 5e would accomplish the goal because then

x̄ = (a + b + c + d + e + f)/6 ≥ (e + f)/6 > (e + 5e)/6 = e.

1.92. The algebra might be a bit of a stretch for some students:

(x_1 − x̄) + (x_2 − x̄) + (x_3 − x̄) + · · · + (x_{n−1} − x̄) + (x_n − x̄)
   = x_1 − x̄ + x_2 − x̄ + x_3 − x̄ + · · · + x_{n−1} − x̄ + x_n − x̄        (drop all the parentheses)
   = x_1 + x_2 + x_3 + · · · + x_{n−1} + x_n − x̄ − x̄ − x̄ − · · · − x̄ − x̄   (rearrange the terms)
   = x_1 + x_2 + x_3 + · · · + x_{n−1} + x_n − n · x̄

Next, simply observe that n · x̄ = x_1 + x_2 + x_3 + · · · + x_{n−1} + x_n, so the sum of the deviations is 0.

1.93. (a) One possible answer is 1, 1, 1, 1. (b) 0, 0, 20, 20. (c) For (a), any set of four identical numbers will have s = 0. For (b), the answer is unique; here is a rough description of why. We want to maximize the "spread-out"-ness of the numbers (which is what standard deviation measures), so 0 and 20 seem to be reasonable choices based on that idea. We also want to make each individual squared deviation, (x_1 − x̄)², (x_2 − x̄)², (x_3 − x̄)², and (x_4 − x̄)², as large as possible. If we choose 0, 20, 20, 20 (or 20, 0, 0, 0), we make the first squared deviation 15², but the other three are only 5². Our best choice is two at each extreme, which makes all four squared deviations equal to 10².

1.94. Answers will vary. Typical calculators will carry only about 12 to 15 digits; for example, a TI-83 fails (gives s = 0) for 14-digit numbers. Excel (at least the version I checked) also fails for 14-digit numbers, but it gives s = 262,144 rather than 0. The (very old) version of Minitab used to prepare these answers fails at 20,000,001 (eight digits), giving s = 2.

1.95. The table below reproduces the means and standard deviations from the solution to Exercise 1.87 and shows those values expressed in inches. For each conversion, multiply by 39.37/1000 = 0.03937 (or divide by 25.4; an inch is defined as 25.4 millimeters). For example, for the bihai variety, x̄ = (47.5975 mm)(0.03937 in/mm) = (47.5975 mm) ÷ (25.4 mm/in) = 1.874 in.

              (in mm)              (in inches)
Variety    x̄         s          x̄       s
bihai      47.5975   1.2129     1.874   0.04775
red        39.7113   1.7988     1.563   0.07082
yellow     36.1800   0.9753     1.424   0.03840

1.96. (a) x̄ = 5.4479 and s = 0.2209. (b) The first measurement corresponds to 5.50 × 62.43 = 343.365 pounds per cubic foot. To find x̄_new and s_new, we similarly multiply by 62.43: x̄_new ≈ 340.11 and s_new ≈ 13.79. Note: The conversion from cm to feet is included in the multiplication by 62.43; the step-by-step process of this conversion looks like this: (1 g/cm³)(0.001 kg/g)(2.2046 lb/kg)(30.48³ cm³/ft³) = 62.43 lb/ft³.

1.97. Convert from kilograms to pounds by multiplying by 2.2: x̄ = (2.42 kg)(2.2 lb/kg) ≈ 5.32 lb and s = (1.18 kg)(2.2 lb/kg) ≈ 2.60 lb.

1.98. Variance is changed by a factor of 2.54² = 6.4516; generally, for a transformation x_new = a + bx, the new variance is b² times the old variance.

1.99. There are 80 service times, so to find the 10% trimmed mean, remove the highest and lowest eight values (leaving 64). Remove the highest and lowest 16 values (leaving 48) for the 20% trimmed mean. The mean and median for the full data set are x̄ = 196.575 and M = 103.5 minutes. The 10% trimmed mean is x̄* ≈ 127.734, and the 20% trimmed mean is x̄** ≈ 111.917 minutes. Because the distribution is right-skewed, removing the extremes lowers the mean.
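The trimming procedure described in Exercise 1.99 can be sketched as a small function. The service-time data are not reproduced in this manual, so the example below uses a short made-up right-skewed list; both the function name and the data are hypothetical.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from EACH end."""
    data = sorted(values)
    # Number of values to drop from each end; the tiny offset guards
    # against floating-point results such as 2.9999999999999996.
    k = int(len(data) * proportion + 1e-9)
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


# Hypothetical right-skewed data (NOT the service times from the exercise).
data = [2, 3, 5, 8, 13, 21, 34, 55, 89, 500]

print("mean:            ", trimmed_mean(data, 0.0))
print("10% trimmed mean:", trimmed_mean(data, 0.1))
print("20% trimmed mean:", trimmed_mean(data, 0.2))
```

As in the exercise, each round of trimming removes the largest values of a right-skewed list and so pulls the mean down. With 80 values, proportions of 0.1 and 0.2 drop 8 and 16 values from each end, matching the counts given above.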

1.100. After changing the scale from centimeters to inches, the five-number summary values change by the same ratio (that is, they are multiplied by 0.39). The shape of the histogram might change slightly because of the change in class intervals. (a) The five-number summary (in inches) is Min = 0.858, Q1 = 4.2705, M = 11.115, Q3 = 16.341, Max = 27.027. (b) & (c) The boxplot and histogram are shown below. (Students might choose different interval widths for the histogram.) (d) As in Exercise 1.56, the histogram reveals more detail about the shape of the distribution.

[Figure: boxplot and histogram of diameter at breast height (in); the histogram's vertical axis is Frequency.]

1.101. Take the mean plus or minus two standard deviations: 572 ± 2(51) = 470 to 674.

1.102. Take the mean plus or minus three standard deviations: 572 ± 3(51) = 419 to 725.

1.103. The z-score is z = (620 − 572)/51 ≈ 0.94.

1.104. The z-score is z = (510 − 572)/51 ≈ −1.22. This is negative because an ISTEP score of 510 is below average; specifically, it is 1.22 standard deviations below the mean.
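The calculations in Exercises 1.101 to 1.104 all follow from two small formulas, sketched here with the mean (572) and standard deviation (51) used in these solutions. The function names are illustrative only.

```python
MEAN = 572  # ISTEP mean used in these solutions
SD = 51     # ISTEP standard deviation used in these solutions


def z_score(x, mean=MEAN, sd=SD):
    """Standardize x: the number of standard deviations x lies from the mean."""
    return (x - mean) / sd


def k_sd_range(k, mean=MEAN, sd=SD):
    """Return the interval from mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd


print("Within 2 SDs:", k_sd_range(2))
print("Within 3 SDs:", k_sd_range(3))
print("z for 620:", round(z_score(620), 2))
print("z for 510:", round(z_score(510), 2))
```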

1.105. Using Table A, the proportion below 620 (z ≈ 0.94) is 0.8264 and the proportion at or above is 0.1736; these two proportions add to 1. The graph on the right illustrates this with a single curve; it conveys essentially the same idea as the "graphical subtraction" picture shown in Example 1.36.

[Figure: Normal curve with the axis marked at 419, 470, 521, 572, 623, 674, and 725; the area below 620 is 0.8264 and the area above is 0.1736.]

1.106. Using Table A, the proportion below 620 (z ≈ 0.94) is 0.8264, and the proportion below 660 (z ≈ 1.73) is 0.9582. Therefore:

area between 620 and 660 = (area left of 660) − (area left of 620) = 0.9582 − 0.8264 = 0.1318

The graph on the right illustrates this with a single curve; it conveys essentially the same idea as the "graphical subtraction" picture shown in Example 1.37.

[Figure: Normal curve with the axis marked at 419, 470, 521, 572, 623, 674, and 725; the areas below 620 (0.8264) and below 660 (0.9582) are marked.]

1.107. Using Table A, this ISTEP score should correspond to a standard score of z ≈ 0.67 (software gives 0.6745), so the ISTEP score (unstandardized) is 572 + 0.67(51) ≈ 606.2 (software: 606.4).

1.108. Using Table A, x should correspond to a standard score of z ≈ −0.84 (software gives −0.8416), so the ISTEP score (unstandardized) is x = 572 − 0.84(51) ≈ 529.2 (software: 529.1).

1.109. Of course, student sketches will not be as neat as the curves on the right, but they should have roughly the correct shape. (a) It is easiest to draw the curve first, and then mark the scale on the axis. (b) Draw a copy of the first curve, with the peak over 20. (c) The curve has the same shape, but is translated left or right.

[Figure: two Normal curves of the same shape on an axis marked 1, 4, 7, 10, 13, 16, 19, 22, 25, 28.]

1.110. (a) As in the previous exercise, draw the curve first, and then mark the scale on the axis. (b) In order to have a standard deviation of 1, the curve should be 1/3 as wide, and three times taller. (c) The curve is centered at the same place (the mean), but its height and width change. Specifically, increasing the standard deviation makes the curve wider and shorter; decreasing the standard deviation makes the curve narrower and taller.

[Figure: Normal curves on an axis marked 1, 4, 7, 10, 13, 16, 19.]
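The "software" values quoted in Exercises 1.107 and 1.108 can be reproduced with Python's standard library. This sketch assumes, from the z-values in the solutions (0.6745 and −0.8416), that the exercises ask for the 75th and 20th percentiles of the N(572, 51) distribution.

```python
from statistics import NormalDist

# ISTEP score distribution used in these solutions: N(572, 51).
istep = NormalDist(mu=572, sigma=51)
standard = NormalDist()  # N(0, 1)

# Assumed percentiles, inferred from the z-values quoted in the solutions.
for label, p in [("75th percentile", 0.75), ("20th percentile", 0.20)]:
    z = standard.inv_cdf(p)   # standard score with area p to its left
    x = istep.inv_cdf(p)      # same percentile on the ISTEP scale
    print(f"{label}: z = {z:.4f}, score = {x:.1f}")
```

Table A rounds z to two decimal places, which is why the table-based answers (606.2 and 529.2) differ slightly from the software answers.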

1.111. Sketches will vary.

1.112. (a) The table below gives the ranges for women; for example, about 68% of women speak between 7856 and 20,738 words per day. (b) Negative numbers do not make sense for this situation. The 68–95–99.7 rule is reasonable for a distribution that is close to Normal, but by constructing a stemplot or histogram, it is easily confirmed that this distribution is slightly right-skewed. (c) These ranges are also in the table; the men's distribution is more skewed than the women's distribution, so the 68–95–99.7 rule is even less appropriate. (d) This does not support the conventional wisdom: The ranges from parts (a) and (c) overlap quite a bit. Additionally, the difference in the means is quite small relative to the large standard deviations.

         Women               Men
68%      7856 to 20,738      4995 to 23,125
95%      1415 to 27,179      −4070 to 32,190
99.7%    −5026 to 33,620     −13,135 to 41,255

1.113. (a) Ranges are given in the table below. In both cases, some of the lower limits are negative, which does not make sense; this happens because the women's distribution is skewed, and the men's distribution has an outlier. Contrary to the conventional wisdom, the men's mean is slightly higher, although the outlier is at least partly responsible for that. (b) The means suggest that Mexican men and women tend to speak more than people of the same gender from the United States.

         Women               Men
68%      8489 to 20,919      7158 to 22,886
95%      2274 to 27,134      −706 to 30,750
99.7%    −3941 to 33,349     −8570 to 38,614

1.114. (a) For example, (68 − 70)/10 = −0.2. The complete list is given below. (b) The cut-off for an A is the 85th percentile for the N(0, 1) distribution. From Table A, this is approximately 1.04; software gives 1.0364. (c) The top two students (with scores of 92 and 98) received A's.

Score            68     54     92    75    73    98    64     55     80   70
Standard score   −0.2   −1.6   2.2   0.5   0.3   2.8   −0.6   −1.5   1    0
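The grading scheme in Exercise 1.114 can be sketched as follows, using the ten scores listed above together with the mean of 70 and standard deviation of 10 implied by part (a). The variable names are illustrative.

```python
from statistics import NormalDist

scores = [68, 54, 92, 75, 73, 98, 64, 55, 80, 70]
MEAN, SD = 70, 10  # values implied by (68 - 70)/10 = -0.2 in part (a)

# (a) Standardize every score.
z_scores = [(x - MEAN) / SD for x in scores]

# (b) The A cut-off is the 85th percentile of the standard Normal distribution.
a_cutoff = NormalDist().inv_cdf(0.85)

# (c) Students whose standard score exceeds the cut-off receive an A.
a_students = [x for x, z in zip(scores, z_scores) if z > a_cutoff]

print("Standard scores:", z_scores)
print(f"A cut-off: z = {a_cutoff:.4f}")
print("Scores earning an A:", a_students)
```

Note that the score of 80 (z = 1) falls just short of the cut-off, so only two students earn an A.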

1.115. (a) We need the 5th, 15th, 55th, and 85th percentiles for a N(0, 1) distribution. These are given in the table below. (b) To convert to actual scores, take the standard-score cut-off z and compute 10z + 70. (c) Opinions will vary. Note: The cut-off for an A given in the previous solution is the lowest score that gets an A, that is, the point where one's grade drops from an A to a B. These cut-offs are the points where one's grade jumps up. In practice, this is only an issue for a score that falls exactly on the border between two grades.

          Table A               Software
      Standard   Actual     Standard   Actual
F     −1.64      53.6       −1.6449    53.55
D     −1.04      59.6       −1.0364    59.64
C     0.13       71.3       0.1257     71.26
B     1.04       80.4       1.0364     80.36

1.116. (a) The curve forms a 1 × 1 square, which has area 1. (b) P(X < 0.35) = 0.35. (c) P(0.35 < X < 0.65) = 0.3.

1.117. (a) The height should be 1/4 since the area under the curve must be 1. The density curve is on the right. (b) P(X ≤ 1) = 1/4 = 0.25. (c) P(0.5 < X < 2.5) = 0.5.

[Figure: uniform density curves, on 0 to 1 for Exercise 1.116 (with 0.35 and 0.65 marked) and on 0 to 4 for Exercise 1.117.]

1.118. The mean and median both equal 0.5; the quartiles are Q1 = 0.25 and Q3 = 0.75.

1.119. (a) Mean is C, median is B (the right skew pulls the mean to the right). (b) Mean A, median A. (c) Mean A, median B (the left skew pulls the mean to the left).

1.120. Hint: It is best to draw the curve first, then place the numbers below it. Students may at first make mistakes like drawing a half-circle instead of the correct "bell-shaped" curve, or being careless about locating the standard deviation.

[Figure: Normal curve with the axis marked at 218, 234, 250, 266, 282, 298, and 314.]

1.121. (a) The applet shows an area of 0.6826 between −1.000 and 1.000, while the 68–95–99.7 rule rounds this to 0.68. (b) Between −2.000 and 2.000, the applet reports 0.9544 (compared to the rounded 0.95 from the 68–95–99.7 rule). Between −3.000 and 3.000, the applet reports 0.9974 (compared to the rounded 0.997).

1.122. See the sketch of the curve in the solution to Exercise 1.120. (a) The middle 95% fall within two standard deviations of the mean: 266 ± 2(16), or 234 to 298 days. (b) The shortest 2.5% of pregnancies are shorter than 234 days (more than two standard deviations below the mean).

1.123. (a) 99.7% of horse pregnancies fall within three standard deviations of the mean: 336 ± 3(3), or 327 to 345 days. (b) About 16% are longer than 339 days since 339 days or more corresponds to at least one standard deviation above the mean. Note: This exercise did not ask for a sketch of the Normal curve, but students should be encouraged to make such sketches anyway.

[Figure: Normal curve with the axis marked at 327, 330, 333, 336, 339, 342, and 345.]

1.124. Because the quartiles of any distribution have 50% of observations between them, we seek to place the flags so that the reported area is 0.5. The closest the applet gets is an area of 0.5034, between −0.680 and 0.680. Thus, the quartiles of any Normal distribution are about 0.68 standard deviations above and below the mean. Note: Table A places the quartiles at about ±0.67; other statistical software gives ±0.6745.

1.125. The mean and standard deviation are x̄ = 5.4256 and s = 0.5379. About 67.62% (71/105 ≈ 0.6762) of the pH measurements are in the range x̄ ± s = 4.89 to 5.96. About 95.24% (100/105) are in the range x̄ ± 2s = 4.35 to 6.50. All (100%) are in the range x̄ ± 3s = 3.81 to 7.04.

1.126. Using values from Table A: (a) Z > 1.65: 0.0495. (b) Z < 1.65: 0.9505. (c) Z > −0.76: 0.7764. (d) −0.76 < Z < 1.65: 0.9505 − 0.2236 = 0.7269.

1.127. Using values from Table A: (a) Z ≤ −1.8: 0.0359. (b) Z ≥ −1.8: 0.9641. (c) Z > 1.6: 0.0548. (d) −1.8 < Z < 1.6: 0.9452 − 0.0359 = 0.9093.

1.128. (a) 22% of the observations fall below −0.7722. (This is the 22nd percentile of the standard Normal distribution.) (b) 40% of the observations fall above 0.2533 (the 60th percentile of the standard Normal distribution).

1.129. (a) z = 0.3853 has cumulative proportion 0.65 (that is, 0.3853 is the 65th percentile of the standard Normal distribution). (b) If z = 0.1257, then Z > z has proportion 0.45 (0.1257 is the 55th percentile).

[Figures: standard Normal curves on an axis from −3 to 3, with the region for each part of Exercises 1.126 to 1.129 shaded.]

1.130. 70 is two standard deviations below the mean (that is, it has standard score z = −2), so about 2.5% (half of the outer 5%) of adults would have WAIS scores below 70.

1.131. 130 is two standard deviations above the mean (that is, it has standard score z = 2), so about 2.5% of adults would score at least 130.

1.132. Tonya's score standardizes to z = (1820 − 1509)/321 ≈ 0.9688, while Jermaine's score corresponds to z = (29 − 21.5)/5.4 ≈ 1.3889. Jermaine's score is higher.

1.133. Jacob's score standardizes to z = (16 − 21.5)/5.4 ≈ −1.0185, while Emily's score corresponds to z = (1020 − 1509)/321 ≈ −1.5234. Jacob's score is higher.

1.134. Jose's score standardizes to z = (2080 − 1509)/321 ≈ 1.7788, so an equivalent ACT score is 21.5 + 1.7788 × 5.4 ≈ 31.1. (Of course, ACT scores are reported as whole numbers, so this would presumably be a score of 31.)

1.135. Maria's score standardizes to z = (30 − 21.5)/5.4 ≈ 1.5741, so an equivalent SAT score is 1509 + 1.5741 × 321 ≈ 2014.

1.136. Maria's score standardizes to z = (2090 − 1509)/321 ≈ 1.81, for which Table A gives 0.9649. Her score is the 96.5 percentile.

1.137. Jacob's score standardizes to z = (19 − 21.5)/5.4 ≈ −0.4630, for which Table A gives 0.3228. His score is the 32.3 percentile.

1.138. 1920 and above: The top 10% corresponds to a standard score of z = 1.2816, which in turn corresponds to a score of 1509 + 1.2816 × 321 ≈ 1920 on the SAT.

1.139. 1239 and below: The bottom 20% corresponds to a standard score of z = −0.8416, which in turn corresponds to a score of 1509 − 0.8416 × 321 ≈ 1239 on the SAT.

1.140. The quartiles of a Normal distribution are ±0.6745 standard deviations from the mean, so for ACT scores, they are 21.5 ± 0.6745 × 5.4 ≈ 17.9 to 25.1.

1.141. The quintiles of the SAT score distribution are 1509 − 0.8416 × 321 = 1239, 1509 − 0.2533 × 321 = 1428, 1509 + 0.2533 × 321 = 1590, and 1509 + 0.8416 × 321 = 1779.

1.142. For a Normal distribution with mean 55 mg/dl and standard deviation 15.5 mg/dl: (a) 40 mg/dl standardizes to z = (40 − 55)/15.5 ≈ −0.9677. Using Table A, 16.60% of women fall below this level (software: 16.66%). (b) 60 mg/dl standardizes to z = (60 − 55)/15.5 ≈ 0.3226. Using Table A, 37.45% of women fall above this level. (c) Subtract the answers from (a) and (b) from 100%: Table A gives 45.95% (software: 45.99%), so about 46% of women fall in the intermediate range.

1.143. For a Normal distribution with mean 46 mg/dl and standard deviation 13.6 mg/dl: (a) 40 mg/dl standardizes to z = (40 − 46)/13.6 ≈ −0.4412. Using Table A, 33% of men fall below this level (software: 32.95%). (b) 60 mg/dl standardizes to z = (60 − 46)/13.6 ≈ 1.0294. Using Table A, 15.15% of men fall above this level. (c) Subtract the answers from (a) and (b) from 100%: Table A gives 51.85% (software: 51.88%), so about 52% of men fall in the intermediate range.

1.144. (a) About 0.6% of healthy young adults have osteoporosis (the cumulative probability below a standard score of −2.5 is 0.0062). (b) About 31% of this population of older women has osteoporosis: The BMD level which is 2.5 standard deviations below the young adult mean would standardize to −0.5 for these older women, and the cumulative probability for this standard score is 0.3085.

1.145. (a) About 5.2%: x < 240 corresponds to z < −1.625. Table A gives 5.16% for −1.63 and 5.26% for −1.62. Software (or averaging the two table values) gives 5.21%. (b) About 54.7%: 240 < x < 270 corresponds to −1.625 < z < 0.25. The area to the left of 0.25 is 0.5987; subtracting the answer from part (a) leaves about 54.7%. (c) About 279 days or longer: Searching Table A for 0.80 leads to z > 0.84, which corresponds to x > 266 + 0.84(16) = 279.44. (Using the software value z > 0.8416 gives x > 279.47.)
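The three parts of Exercise 1.145 can be checked with the standard library's Normal distribution, using the N(266, 16) model for pregnancy length from these solutions. This is a sketch of the "software" route rather than the Table A route; the variable names are illustrative.

```python
from statistics import NormalDist

# Length of human pregnancies (days), as modeled in these solutions.
pregnancy = NormalDist(mu=266, sigma=16)

# (a) Proportion of pregnancies shorter than 240 days.
below_240 = pregnancy.cdf(240)

# (b) Proportion between 240 and 270 days: subtract two cumulative areas.
between_240_270 = pregnancy.cdf(270) - pregnancy.cdf(240)

# (c) The longest 20% lie above the 80th percentile.
longest_20_cutoff = pregnancy.inv_cdf(0.80)

print(f"(a) P(X < 240)       = {below_240:.4f}")
print(f"(b) P(240 < X < 270) = {between_240_270:.4f}")
print(f"(c) 80th percentile  = {longest_20_cutoff:.2f} days")
```

Part (b) uses the same "area left of the upper value minus area left of the lower value" subtraction shown in Exercise 1.106.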


1.146. (a) The quartiles for a standard Normal distribution are ±0.6745. (b) For a N(µ, σ) distribution, Q1 = µ − 0.6745σ and Q3 = µ + 0.6745σ. (c) For human pregnancies, Q1 = 266 − 0.6745 × 16 ≈ 255.2 and Q3 = 266 + 0.6745 × 16 ≈ 276.8 days.

1.147. (a) As the quartiles for a standard Normal distribution are ±0.6745, we have IQR = 1.3490. (b) c = 1.3490: For a N(µ, σ) distribution, the quartiles are Q1 = µ − 0.6745σ and Q3 = µ + 0.6745σ.

1.148. In the previous two exercises, we found that for a N(µ, σ) distribution, Q1 = µ − 0.6745σ, Q3 = µ + 0.6745σ, and IQR = 1.3490σ. Therefore, 1.5 × IQR = 2.0235σ, and the suspected outliers are below Q1 − 1.5 × IQR = µ − 2.698σ, and above Q3 + 1.5 × IQR = µ + 2.698σ. The percentage outside of this range is 2 × 0.0035 = 0.70%.

1.149. (a) The first and last deciles for a standard Normal distribution are ±1.2816. (b) For a N(9.12, 0.15) distribution, the first and last deciles are µ − 1.2816σ ≈ 8.93 and µ + 1.2816σ ≈ 9.31 ounces.

1.150. The shape of the quantile plot suggests that the data are right-skewed (as was observed in Exercises 1.36 and 1.74). This can be seen in the flat section in the lower left (these numbers were less spread out than they should be for Normal data) and the three apparent outliers (the United States, Canada, and Australia) that deviate from the line in the upper right; these were much larger than they would be for a Normal distribution.

1.151. (a) The plot is reasonably linear except for the point in the upper right, so this distribution is roughly Normal, but with a high outlier. (b) The plot is fairly linear, so the distribution is roughly Normal. (c) The plot curves up to the right (that is, the large values of this distribution are larger than they would be in a Normal distribution), so the distribution is skewed to the right.
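The chain of calculations in Exercises 1.146 to 1.148 can be reproduced for the standard Normal distribution. Because every quantity scales with σ, working with N(0, 1) gives the multipliers directly. This is a sketch using the standard library; the variable names are illustrative.

```python
from statistics import NormalDist

std = NormalDist()  # standard Normal, N(0, 1)

q3 = std.inv_cdf(0.75)        # upper quartile; the lower quartile is -q3
iqr = 2 * q3                  # IQR in units of sigma
upper_fence = q3 + 1.5 * iqr  # the 1.5 x IQR cut-off, in units of sigma

# By symmetry, the area beyond both fences is twice the lower-tail area.
outside = 2 * std.cdf(-upper_fence)

print(f"Q3          = {q3:.4f} sigma")
print(f"IQR         = {iqr:.4f} sigma")
print(f"upper fence = {upper_fence:.3f} sigma")
print(f"outside     = {100 * outside:.2f}%")
```

So even for perfectly Normal data, roughly 7 observations in 1000 would be flagged as suspected outliers by the 1.5 × IQR rule.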

1.152. See also the solution to Exercise 1.42. The plot suggests no major deviations from Normality, although the three lowest measurements do not quite fall in line with the other points.

[Figure: Normal quantile plot of density against Normal score.]

1.153. (a) All three quantile plots are below; the yellow variety is the nearest to a straight line. (b) The other two distributions are slightly right-skewed (the lower-left portion of the graph is somewhat flat); additionally, the bihai variety appears to have a couple of high outliers.

[Figure: Normal quantile plots of flower length (mm) against Normal score for H. bihai, H. caribaea red, and H. caribaea yellow.]

1.154. Shown are a histogram and quantile plot for one sample of 200 simulated N(0, 1) points. Histograms will vary slightly but should suggest a bell curve. The Normal quantile plot shows something fairly close to a line but illustrates that, even for actual Normal data, the tails may deviate slightly from a line.

[Figure: histogram (Frequency against simulated values) and Normal quantile plot (simulated values against Normal score) for the simulated N(0, 1) sample.]

1.155. Shown are a histogram and quantile plot for one sample of 200 simulated uniform data points. Histograms will vary slightly but should suggest the density curve of Figure 1.34 (but with more variation than students might expect). The Normal quantile plot shows that, compared to a Normal distribution, the uniform distribution does not extend as low or as high (not surprising, since all observations are between 0 and 1).

[Figure: histogram (Frequency against simulated values) and Normal quantile plot (simulated values against Normal score) for the simulated uniform sample.]

1.156. Shown is a back-to-back stemplot; the distributions could also be compared with histograms or boxplots. Either mean/standard deviation or the five-number summary could be used; both are given below. Both the graphical and numerical descriptions reveal that hatchbacks generally have higher fuel efficiency (and also are more variable).

              x̄        s       Min   Q1   M      Q3   Max
Hatchback     22.548   3.423   16    20   21.5   25   30
Large sedan   16.571   1.425   13    16   17.0   17   19

[Figure: back-to-back stemplot for hatchbacks and large sedans, with stems from 13 to 30.]

1.157. (a) The distribution appears to be roughly Normal. (b) One could justify using either the mean and standard deviation or the five-number summary:

x̄        s        Min    Q1    M       Q3      Max
15.27%   3.118%   8.2%   13%   15.5%   17.6%   22.8%

(c) For example, binge drinking rates are typically 10% to 20%. Which states are high, and which are low? One might also note the geographical distribution of states with high binge-drinking rates: The top six states (Wisconsin, North Dakota, Iowa, Minnesota, Illinois, and Nebraska) are all adjacent to one another.

[Figure: stemplot of the state binge-drinking rates, with stems from 8 to 22.]

1.158. (a) The stemplot on the right suggests that there are two groups of states: the under-23% and over-23% groups. Additionally, while they do not qualify as outliers, Oklahoma (16.3%) and Vermont (30%) stand out as notably low and high. (b) One could justify using either the mean and standard deviation or the five-number summary:

x̄        s        Min     Q1      M       Q3      Max
23.71%   3.517%   16.3%   20.8%   24.3%   26.4%   30%

Neither summary reveals the two groups of states visible in the stemplot. (c) One could explore the connections (geographical, socioeconomic, etc.) between the states in the two groups; for example, the top group includes many northeastern states, while the bottom group includes quite a few southern states.

[Figure: stemplot of the state percentages, with stems from 16 to 30.]

Solutions

83

Percent

1.159. Students might compare 100 Silver color preferences using 90 80 a stacked bar graph like White 70 that shown on the right, Gray 60 or side-by-side bars like 50 Black those below. (They could 40 also make six pie charts, Blue 30 but comparing slices across 20 Red pies is difficult.) Possible 10 Brown observations: white is con0 North South Europe China South Japan siderably less popular in Other America America Korea Europe, and gray is less common in China. Note: The orders of countries and colors is as given in the text, which is more-or-less arbitrary. (Colors are ordered by decreasing popularity in North America.)

[Side-by-side bar graph: Percent (0 to 25) by Color (Silver, White, Gray, Black, Blue, Red, Brown, Other), with one bar per region: North America, South America, Europe, China, South Korea, Japan.]

1.162. Using either a histogram or stemplot, we see that this distribution is sharply right-skewed. For this reason, the five-number summary is preferred.

Min   Q1   M      Q3   Max
0     3    12.5   34   86

Some students might report the less appropriate x̄ ≈ 21.62 and s ≈ 22.76. From the histogram and five-number summary, we can observe, for example, that many countries have fewer than 10 Internet users per 100 people. In 75% of countries, less than 1/3 of the population uses the Internet.

[Histogram: Frequency (0 to 80) versus Internet users per hundred people (0 to 100).]


Chapter 1

Looking at Data—Distributions

1.163. The distribution is somewhat right-skewed (although considerably less than the distribution with all countries) with only one country (Bosnia and Herzegovina) in the 20’s. Because of the irregular shape, students might choose either the mean/standard deviation or the five-number summary:

x̄       s       Min    Q1      M        Q3      Max
39.85   22.05   1.32   18.68   43.185   54.94   85.65

0 | 145789
1 | 23488889
2 | 5
3 | 0134467
4 | 124666669
5 | 022345688
6 | 223
7 | 026
8 | 15

1.164. (a) & (b) The graphs are below. Bars are shown in alphabetical order by city name (as the data were given in the table). (c) For Baltimore, for example, this rate is 5091/651 ≈ 7.82. The complete table is shown below. (d) & (e) Graphs below. Note that the text does not specify whether the bars should be ordered by increasing or decreasing rate. (f) Preferences may vary, but the ordered bars make comparisons easier.

City               Acres of open space per 1000 people
Baltimore           7.82
Boston              8.26
Chicago             4.02
Long Beach          6.25
Los Angeles         8.07
Miami               3.67
Minneapolis        14.87
New York            6.23
Oakland             9.30
Philadelphia        7.04
San Francisco       7.61
Washington, D.C.   13.12

[Bar graphs, bars in alphabetical order by city: Open space (acres) and Population (thousands).]

[Bar graphs of acres of open space per 1000 people: one with bars in alphabetical order by city, one with bars ordered by decreasing rate (Minneapolis highest, Miami lowest).]
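Parts (c) through (e) of Exercise 1.164 can be sketched in a few lines of Python. Only Baltimore's raw figures (5091 acres, 651 thousand people) appear in this solution, so the other cities use the tabulated rates directly; the variable names are illustrative, not from the text.

```python
# Acres of open space per 1000 people, as tabulated in the solution to 1.164.
rates = {
    "Baltimore": 7.82, "Boston": 8.26, "Chicago": 4.02, "Long Beach": 6.25,
    "Los Angeles": 8.07, "Miami": 3.67, "Minneapolis": 14.87, "New York": 6.23,
    "Oakland": 9.30, "Philadelphia": 7.04, "San Francisco": 7.61,
    "Washington, D.C.": 13.12,
}

# Part (c): rate = acres of open space / population in thousands.
baltimore_rate = round(5091 / 651, 2)

# Parts (d) and (e): order the bars by decreasing rate.
ordered = sorted(rates.items(), key=lambda item: item[1], reverse=True)

for city, rate in ordered:
    print(f"{city:<18}{rate:6.2f}")
```

Sorting before plotting is what makes the ordered bar graph easier to read: neighboring bars then differ only slightly, so the ranking is visible at a glance.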


1.165. The given description is true on the average, but the curves (and a few calculations) give a more complete picture. For example, a score of about 675 is about the 97.5th percentile for both genders, so the top boys and girls have very similar scores.

1.166. (a) & (b) Answers will vary. Definitions might be as simple as “free time,” or “time spent doing something other than studying.” For part (b), it might be good to encourage students to discuss practical difficulties; for example, if we ask Sally to keep a log of her activities, the time she spends filling it out presumably reduces her available “leisure time.”

1.167. Shown is a stemplot; a histogram should look similar to this. This distribution is relatively symmetric apart from one high outlier. Because of the outlier, the five-number summary (in hours) is preferred:

Min   Q1       M       Q3       Max
22    23.735   24.31   24.845   28.55

Alternatively, the mean and standard deviation are x̄ = 24.339 and s = 0.9239 hours.

[Stemplot of the data (in hours), with split stems from 22 to 28.]

1.168. Gender and automobile preference are categorical; age and household income are quantitative.

1.169. Either a bar graph or a pie chart could be used. The given numbers sum to 66.7, so the “Other” category presumably includes the remaining 29.3 million subscribers.

[Bar graph: Subscribers (millions), 0 to 25, by provider: Comcast, AT&T, Road Runner, Verizon, America Online, EarthLink, Charter, Qwest, Cablevision, United Online, Other.]

1.170. Women’s weights are skewed to the right: This makes the mean higher than the median, and it is also revealed in the differences M − Q1 = 14.9 lb and Q3 − M = 24.1 lb.

1.171. (a) For car makes (a categorical variable), use either a bar graph or pie chart. For car age (a quantitative variable), use a histogram, stemplot, or boxplot. (b) Study time is quantitative, so use a histogram, stemplot, or boxplot. To show change over time, use a time plot (average hours studied against time). (c) Use a bar graph or pie chart to show radio station preferences. (d) Use a Normal quantile plot to see whether the measurements follow a Normal distribution.


1.172. The counts given add to 6067, so the others received 626 spam messages. Either a bar graph or a pie chart would be appropriate. What students learn from this graph will vary; one observation might be that AA and BB (and perhaps some others) might need some advice on how to reduce the amount of spam they receive.

[Bar graph: Spam count (0 to 1800) by Account ID (AA, BB, CC, DD, EE, FF, GG, HH, II, JJ, KK, LL, other).]

1.173. No, and no: It is easy to imagine examples of many different data sets with mean 0 and standard deviation 1, for example, {−1, 0, 1} and {−2, 0, 0, 0, 0, 0, 0, 0, 2}. Likewise, for any given five numbers a ≤ b ≤ c ≤ d ≤ e (not all the same), we can create many data sets with that five-number summary, simply by taking those five numbers and adding some additional numbers in between them, for example (in increasing order): 10, __, 20, __, __, 30, __, __, 40, __, 50. As long as the number in the first blank is between 10 and 20, and so on, the five-number summary will be 10, 20, 30, 40, 50.

1.174. The time plot is shown below; because of the great detail in this plot, it is larger than other plots. Ruth’s and McGwire’s league-leading years are marked with different symbols. (a) During World War II (when many baseball players joined the military), the best home run numbers decline sharply and steadily. (b) Ruth seemed to set a new standard for other players; after his first league-leading year, he had 10 seasons much higher than anything that had come before, and home run production has remained near that same level ever since (even the worst post-Ruth year, 1945, had more home runs than the best pre-Ruth season). While some might argue that McGwire’s numbers also raised the standard, the change is not nearly as striking, nor did McGwire maintain it for as long as Ruth did. (This is not necessarily a criticism of McGwire; it instead reflects that in baseball, as in many other endeavors, rates of improvement tend to decrease over time as we reach the limits of human ability.)

[Time plot: League-leading HRs in season (0 to 70) versus Year (1880 to 2000).]
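The claims in Exercise 1.173 are easy to check numerically. The sketch below uses Python's standard `statistics` module. The helper `five_number_summary` is a hypothetical name; it takes the quartiles to be the medians of the lower and upper halves (leaving out the overall median when the count is odd), and the values placed in the blanks are arbitrary choices.

```python
from statistics import mean, median, stdev

# Two different data sets that share mean 0 and standard deviation 1.
set_a = [-1, 0, 1]
set_b = [-2, 0, 0, 0, 0, 0, 0, 0, 2]
print(mean(set_a), stdev(set_a))
print(mean(set_b), stdev(set_b))


def five_number_summary(data):
    """Return (Min, Q1, M, Q3, Max), with quartiles as medians of each half."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]
    # Skip the middle observation when the number of observations is odd.
    upper = xs[half + 1:] if len(xs) % 2 else xs[half:]
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])


# Two ways of filling the blanks in 10, __, 20, __, __, 30, __, __, 40, __, 50.
filled_1 = [10, 12, 20, 22, 28, 30, 31, 39, 40, 45, 50]
filled_2 = [10, 19, 20, 25, 26, 30, 35, 36, 40, 41, 50]
print(five_number_summary(filled_1))
print(five_number_summary(filled_2))
```

Both filled-in lists give the same five-number summary even though six of their eleven values differ, which is the point of the exercise.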

1.175. Bonds’s mean changes from 36.56 to 34.41 home runs (a drop of 2.15), while his median changes from 35.5 to 34 home runs (a drop of 1.5). This illustrates that outliers affect the mean more than the median.

1 | 69
2 | 4
2 | 55
3 | 3344
3 | 77
4 | 02
4 | 5669
5 |
5 |
6 |
6 |
7 | 3
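The comparison in Exercise 1.175 can be reproduced as a short sketch. The 18 season totals below are read off the stemplot for this exercise (stems are tens, leaves are ones); the variable names are illustrative.

```python
from statistics import mean, median

# Bonds's season home run totals, read from the stemplot for this exercise.
home_runs = [16, 19, 24, 25, 25, 33, 33, 34, 34,
             37, 37, 40, 42, 45, 46, 46, 49, 73]

# Drop the outlier (the 73-home-run season) and recompute both summaries.
without_outlier = [hr for hr in home_runs if hr != 73]

mean_with, mean_without = mean(home_runs), mean(without_outlier)
median_with, median_without = median(home_runs), median(without_outlier)

print(f"mean:   {mean_with:.2f} -> {mean_without:.2f}")
print(f"median: {median_with} -> {median_without}")
```

The drop of 2.15 quoted in the solution is the difference of the two rounded means; computed from the unrounded means, the drop is about 2.14. Either way the mean moves more than the median, which moves by 1.5.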

1.176. Recall the text’s description of the effects of a linear transformation $x_{\text{new}} = a + bx$: The mean and standard deviation are each multiplied by $b$ (technically, the standard deviation is multiplied by $|b|$, but this problem specifies that $b > 0$). Additionally, we add $a$ to the (new) mean, but $a$ does not affect the standard deviation.

(a) The desired transformation is $x_{\text{new}} = -40 + 2x$; that is, $a = -40$ and $b = 2$. (We need $b = 2$ to double the standard deviation; as this also doubles the mean, we then subtract 40 to make the new mean 100.)

(b) $x_{\text{new}} = -45.4545 + 1.8182x$; that is, $a = -45\tfrac{5}{11} \approx -45.4545$ and $b = \tfrac{20}{11} \approx 1.8182$. (This choice of $b$ makes the new standard deviation 20 and the new mean $145\tfrac{5}{11}$; we then subtract 45.4545 to make the new mean 100.)

(c) David’s score, $2 \cdot 72 - 40 = 104$, is higher within his class than Nancy’s score, $1.8182 \cdot 78 - 45.4545 \approx 96.4$, is within her class.

(d) A third-grade score of 75 corresponds to a score of 110 from the $N(100, 20)$ distribution, which has a standard score of $z = \frac{110 - 100}{20} = 0.5$. (Alternatively, $z = \frac{75 - 70}{10} = 0.5$.) A sixth-grade score of 75 corresponds to about 90.9 on the transformed scale, which has standard score $z = \frac{90.9 - 100}{20} = \frac{75 - 80}{11} \approx -0.45$. Therefore, about 69% of third graders and 32% of sixth graders score below 75.
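The arithmetic in Exercise 1.176 can be checked with a minimal sketch. It assumes the original score distributions are N(70, 10) for third graders and N(80, 11) for sixth graders, as implied by the standard-score calculations in part (d); `linear_rescale` is a hypothetical helper name.

```python
from statistics import NormalDist


def linear_rescale(mean, sd, target_mean=100.0, target_sd=20.0):
    """Return (a, b), b > 0, so that a + b*x has the target mean and sd."""
    b = target_sd / sd              # scales the standard deviation
    a = target_mean - b * mean      # shifts the (already scaled) mean
    return a, b


a3, b3 = linear_rescale(70, 10)     # part (a): third grade
a6, b6 = linear_rescale(80, 11)     # part (b): sixth grade

david = a3 + b3 * 72                # part (c)
nancy = a6 + b6 * 78

# Part (d): proportion scoring below 75 on the original scales.
below_75_third = NormalDist(70, 10).cdf(75)
below_75_sixth = NormalDist(80, 11).cdf(75)

print(a3, b3, round(a6, 4), round(b6, 4))
print(david, round(nancy, 1))
print(round(below_75_third, 2), round(below_75_sixth, 2))
```

Because the transformation is linear with $b > 0$, standard scores are unchanged by it, so part (d) can be computed on either the original or the transformed scale.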

1.177. Results will vary. One set of 20 samples gave the results shown below (Normal quantile plots are not shown).

Means          Standard deviations
22 | 568        5 | 6
23 |            6 |
23 | 89         6 | 66899
24 | 02         7 | 3
24 | 89         7 |
25 | 3          8 | 113
25 | 6799       8 | 789
26 | 124        9 | 000
26 | 59         9 | 556
27 | 4         10 | 2

Theoretically, $\bar{x}$ will have a Normal distribution with mean 25 and standard deviation $8/\sqrt{30} \approx 1.46$, so that about 99.7% of the time, one should find $\bar{x}$ between 20.6 and 29.4. Meanwhile, the theoretical distribution of $s$ is nearly Normal (slightly skewed) with mean $\approx 7.9313$ and standard deviation $\approx 1.0458$; about 99.7% of the time, $s$ will be between 4.8 and 11.1.

Note: If we take a sample of size $n$ from a Normal distribution and compute the sample standard deviation $S$, then $(S/\sigma)\sqrt{n-1}$ has a “chi” distribution with $n - 1$ degrees of freedom (which looks like a Normal distribution when $n$ is reasonably large). You can learn all you would want to know (and more) about this distribution on the Web (for example, at Wikipedia). One implication of this is that “on the average,” $s$ underestimates $\sigma$; specifically, the mean of $S$ is $\sigma \left( \frac{\sqrt{2}\,\Gamma(n/2)}{\sqrt{n-1}\,\Gamma(n/2 - 1/2)} \right)$. The factor in parentheses is always less than 1, but approaches 1 as $n$ approaches infinity. The proof of this fact is left as an exercise for the instructor, not for the average student!
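The theoretical values quoted in Exercise 1.177 can be reproduced with the standard library. This is a sketch under the exercise's setup (samples of size 30 from a Normal distribution with mean 25 and standard deviation 8). The name `c4` for the factor in parentheses is a common convention rather than the text's notation, the standard deviation of S uses the fact that the mean of S² is σ², and the random seed is arbitrary.

```python
import math
import random
from statistics import mean, stdev

mu, sigma, n = 25.0, 8.0, 30

# Standard deviation of the sample mean.
sd_xbar = sigma / math.sqrt(n)

# Factor in parentheses: sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2).
# lgamma is used so the ratio stays stable for larger n.
c4 = math.sqrt(2 / (n - 1)) * math.exp(
    math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
)
mean_s = sigma * c4
# Since the mean of S^2 is sigma^2, Var(S) = sigma^2 * (1 - c4^2).
sd_s = sigma * math.sqrt(1 - c4 ** 2)

print(round(sd_xbar, 2), round(mean_s, 4), round(sd_s, 4))

# One simulated set of 20 samples; results will vary with the seed.
rng = random.Random(1)
samples = [[rng.gauss(mu, sigma) for _ in range(n)] for _ in range(20)]
xbars = [mean(sample) for sample in samples]
sds = [stdev(sample) for sample in samples]
print(sorted(round(x, 1) for x in xbars))
print(sorted(round(s, 1) for s in sds))
```

Stemplots of `xbars` and `sds` from a run like this can be compared with the 99.7% ranges given in the solution.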

Chapter 2 Solutions

2.1. The cases are students.

2.2. When students are classified like this, PSQI is being used as a categorical variable, because each student is categorized by the group he/she falls in. One advantage is that it might simplify the analysis, or at least it might simplify the process of describing the results. (Saying that someone fell into the “poor” category is easier to interpret than saying that person had a PSQI score of 12.) A more subtle issue is that it is not clear whether finding an average is appropriate for these numbers; technically, averages are not appropriate for a quantitative measurement unless the variable is measured on an “interval” scale, meaning (for example) that the difference between PSQI scores of 1 and 2 is the same as the difference between PSQI scores of 10 and 11.

2.3. With this change, the cases are cups of Mocha Frappuccino (as before). The variables (both quantitative) are size and price.

2.4. One could make the argument that being subjected to stress makes it more difficult to sleep, so that SUDS (stress level) is explanatory and PSQI (sleep quality) is the response.

2.5. (a) The spreadsheet should look like the image on the right (especially if students use the data file from the companion CD). (b) There are 10 cases. (c) The image on the right shows the column headings used on the companion CD; some students may create their own spreadsheets and use slightly different headings. (The values of the variables should be the same.) (d) The variables in the second and third columns (“Bots” and “SpamsPerDay”) are quantitative.

2.6. Stemplots are shown; histograms would be equivalent. Students may choose different ways to summarize the data, such as bar graphs (one bar for each botnet). Note that summarizing each variable separately does not reveal the relationship between the two variables; that is done using a scatterplot in the next exercise. Because both distributions are skewed, we prefer five-number summaries to the mean and standard deviation.

                       x̄      s      Min    Q1   M      Q3    Max
Bots (thousands)       99.7   96.6   12     20   67.5   150   315
Spams/day (billions)   13.6   18.6   0.35   2    7.0    16    60

Bots           Spams/day
0 | 1223       0 | 002359
0 | 58         1 | 06
1 | 2          2 |
1 | 58         3 | 0
2 |            4 |
2 |            5 |
3 | 1          6 | 0


2.7. (a) The scatterplot is on the right. (b) Bobax is the second point from the right. (Bobax has the second-highest bot count with 185 thousand, but is relatively low in spam messages at 9 billion per day.)

[Scatterplot: Spams per day (billions), 0 to 60, versus Bots (thousands), 0 to 350, with the Bobax point labeled.]

2.8. (a) The resulting spreadsheet is not shown. (b) Scatterplot on the right. (c) The points are arranged exactly as before, but the large numbers on the axes are distracting.

[Scatterplot: Spams per day (0 to 60,000,000,000) versus Bots (0 to 300,000).]

2.9. Size seems to be the most reasonable choice for explanatory variable because it seems nearly certain that Starbucks first decided which sizes to offer, then determined the appropriate price for each size (rather than vice versa). The scatterplot shows a positive association between size and price.

[Scatterplot: Cost ($), 3.25 to 4.50, versus Size (ounces), 10 to 24.]


Chapter 2

Looking at Data—Relationships

2.10. Two good choices are the change in debt from 2006 to 2007 (subtract the two numbers for each country) or the ratio of the two debts (divide one number by the other). Students may think of other new variables, but these have the most direct bearing on the question.

Shown are stemplots of the increase (2007 debt minus 2006 debt, measured in US$ billions), and the debt ratio (2007 debt divided by 2006 debt; these numbers have no units). From either variable, we can see that debt increased for all but two of the 24 countries. This can be summarized using either the mean and standard deviation or the five-number summary (the latter is preferred for increase, because of the skew).

Increase        Ratio
−0 | 21          9 | 88
 0 | 0044       10 | 4
 0 | 56667      10 | 668
 1 | 034        11 | 0223334
 1 |            11 | 66778
 2 | 13         12 | 013
 2 | 7          12 | 589
 3 |
 3 | 77
 4 | 122
 4 |
 5 | 34

           x̄       s       Min     Q1      M        Q3      Max
Increase   19.07   18.38   −2.89   5.25    12.205   37.66   54.87
Ratio      1.145   0.082   0.984   1.098   1.143    1.193   1.298

Note: In looking at increases, one notes that the size of the debt and the size of the change are related (countries with smaller debts typically changed less than countries with large debts). Debt ratio does not have this relationship with debt size (or at least it is less apparent); for this reason, it might be considered a better choice for answering this question.

2.11. The new points (marked with a different symbol) are far away from the others, but fall roughly in the same line, so the relationship is essentially unchanged: It is still strong, linear, and positive.

[Scatterplot: Debt 2007 (0 to 5000) versus Debt 2006 (0 to 4000), with the new points marked by a different symbol.]

2.12. Student choices of symbols will vary; the plot on the right uses +, o, −, rather than the more obvious H, M, L. (The latter symbols are harder to distinguish when overlapping.) This graph reinforces the observation in Example 2.14 that GDP ties in closely with rankings; generally, high GDP goes with high rank, middle GDP with middle rank, and low GDP with low rank. As before, the relationship (if any) between unemployment and rankings is not so clear.

[Scatterplot: LGDP per cap (6 to 11) versus L Unemployment (0 to 4), with symbols for Forbes rank: + top third, o middle third, − bottom third.]


2.13. (a) A boxplot summarizes the distribution of one variable. (Two [or more] boxplots can be used to compare two [or more] distributions, but that does not allow us to examine the relationship between those variables.) (b) This is only correct if there is an explanatory/response relationship. Otherwise, the choice of which variable goes on which axis might be somewhat arbitrary. (c) High values go with high values, and low values go with low values. (Of course, those statements are generalizations; there can be exceptions.)

2.14. (a) The points should all fall close to a negatively sloped line. (b) Look for a “cloud” of points with no discernible pattern. Watch for students who mistakenly consider “no relationship” as meaning “no linear relationship.” For example, points that suggest a circle, triangle, or curve may indicate a non-linear relationship. (c) The points should be widely scattered around a positively sloped line. (d) Sketches might be curved, angular, or something more imaginative.
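For instructors who want example point sets for parts (a) through (c) of Exercise 2.14, the following sketch generates three synthetic patterns. Everything here is an assumption made for illustration: the slopes, noise levels, sample size, and seed are arbitrary, and the `correlation` helper is included only as a rough numeric check of each pattern, not as part of the exercise.

```python
import math
import random

rng = random.Random(0)  # arbitrary seed so the sketch is repeatable
xs = [rng.uniform(0, 10) for _ in range(200)]

# (a) Strong negative: points fall close to a negatively sloped line.
strong_negative = [(x, 20 - 2 * x + rng.gauss(0, 0.5)) for x in xs]
# (b) No relationship: y is generated independently of x.
no_relationship = [(x, rng.uniform(0, 10)) for x in xs]
# (c) Weak positive: wide scatter around a positively sloped line.
weak_positive = [(x, x + rng.gauss(0, 6)) for x in xs]


def correlation(points):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


for name, pts in [("strong negative", strong_negative),
                  ("no relationship", no_relationship),
                  ("weak positive", weak_positive)]:
    print(f"{name}: r = {correlation(pts):.2f}")
```

Each list of pairs can be passed to any plotting tool to produce the corresponding sketch.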
