
A quick Assessment Question from a colleague

Hi John,

I have a question for you. There are T-scores, standard scores, scaled scores, stanines, etc. Why? Why are there different means of measurements? Do they measure something different each time?

John O. Willis wrote:

To: xxxxxxxxxxxxxxxxxx

Dear xxxxxxxxxx,

 

score type        mean   standard deviation   middle 1/2 or probable error   middle 2/3 (68%) or +/- 1 s.d.
z scores          0      1                    -0.67 -- +0.67                 -1.00 -- +1.00
NCE               50     21.06                36 -- 64                       29 -- 71
T scores          50     10                   43 -- 57                       40 -- 60
standard scores   100    15                   90 -- 110                      85 -- 115
scaled scores     10     3                    8 -- 12                        7 -- 13
v-scale scores    15     3                    13 -- 17                       12 -- 18

These are all standard scores.  They are defined by their mean and standard deviation.  The basic, lingua franca, Rosetta Stone one is the z score with a mean set at 0.00 and a standard deviation of 1.00.
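Because every one of these scores is defined only by its mean and standard deviation, any of them can be converted to any other by passing through the z score. The following is a minimal Python sketch of that idea, not anything taken from a test manual. The means and standard deviations come from the table above; the names SCALES, to_z, from_z, and convert are made up for illustration.

```python
# Mean and standard deviation for each scale, taken from the table above.
SCALES = {
    "z": (0.0, 1.0),
    "NCE": (50.0, 21.06),
    "T": (50.0, 10.0),
    "standard": (100.0, 15.0),
    "scaled": (10.0, 3.0),
    "v-scale": (15.0, 3.0),
}


def to_z(score, scale):
    """Express a score as a number of standard deviations from the mean."""
    mean, sd = SCALES[scale]
    return (score - mean) / sd


def from_z(z, scale):
    """Express a z score on the named scale."""
    mean, sd = SCALES[scale]
    return mean + z * sd


def convert(score, source, target):
    """Convert between two scales by way of the z score."""
    return from_z(to_z(score, source), target)


print(convert(85, "standard", "T"))       # 40.0 (one s.d. below the mean)
print(convert(13, "scaled", "standard"))  # 115.0 (one s.d. above the mean)
```

Published tests usually round converted scores to whole numbers and may use their own norm tables, so a sketch like this only shows the arithmetic relationship between the scales.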

Aside from the desire to immortalize themselves by creating and popularizing a new statistic, test authors usually try to pick a score that is appropriately fine-grained for the purpose. 

Similarly, you might measure insects in millimeters, small mammals in centimeters, larger mammals in inches, and wicked big mammals in feet (if they'll stand still for it). 

Units that are too small and fine are cumbersome and inaccurate and give a false impression of more precision than the test really offers.

Units that are too large don't discriminate finely enough and waste data.

Therefore, for example, Wechsler used the fine-grained standard score (also called a composite or a quotient by other authors) with a mean of 100 and standard deviation of 15 for his total IQ scores, but the coarser scaled score (called a standard score by some authors) with a mean of 10 and standard deviation of 3 for subtests, which are much less reliable than IQ and factor scores.

[Actually, Wechsler picked a "probable error" of 10 to get nice, round numbers.  A probable error of 10 is, for practical purposes, the same as a standard deviation of 15.  See the table above.]
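The link between a probable error and a standard deviation can be checked with a few lines of Python. This is only a sketch under the assumption of a normal distribution, in which the middle half of the scores lies within about 0.6745 standard deviations of the mean. The function names are hypothetical.

```python
from statistics import NormalDist

# z value at the 75th percentile: the middle 50% lies within +/- this many s.d.
PE_PER_SD = NormalDist().inv_cdf(0.75)  # about 0.6745


def sd_from_probable_error(pe):
    """Standard deviation implied by a given probable error."""
    return pe / PE_PER_SD


def probable_error_from_sd(sd):
    """Probable error implied by a given standard deviation."""
    return sd * PE_PER_SD


print(round(sd_from_probable_error(10), 2))  # 14.83
print(round(probable_error_from_sd(15), 2))  # 10.12
```

So a probable error of 10 and a standard deviation of 15 agree once the numbers are rounded, which is why the middle half of standard scores is listed as 90 to 110.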

Colin Elliott uses the Wechsler-style, fine-grained standard score with a mean of 100 and standard deviation of 15 for his total cluster scores, but the coarser T score with a mean of 50 and standard deviation of 10 for subtests.  Elliott thought that his subtests were more statistically reliable and had more specificity than subtests on other, similar instruments, so he wanted a scale that was coarser than a standard score (m = 100, sd = 15), but finer-grained than a scaled score (m = 10, sd = 3).  The T score (m = 50, sd = 10) was a good compromise.

Fads, attitudes, and traditions are also a part of the choice.  Lewis Terman, Maud Merrill, and Quinn McNemar picked a mean of 100 and standard deviation of 16, not 15, for the Stanford-Binet Intelligence Scale.  They did this partly because those statistics were pretty close to the average actual empirical finding using the old-fashioned quotient-type IQs [(mental age/chronological age) x 100] and partly, I suspect, just to be different from Wechsler.  Later Binet editions maintained this annoying difference, and the fourth edition even introduced a subtest score with a mean of 50 and sd of 8.  [It was confusingly similar to a T score, but moderately handy because you could simply double it to get the standard-score equivalent.]  The new, 5th ed. of the Stanford-Binet finally threw in the towel and went with m = 100, sd = 15.

Tests and questionnaires assessing attitudes, traits, and behaviors have traditionally used T scores, so new ones often continue to.

Kirk, McCarthy, Kirk, and Paraskevopoulos chose a bizarre statistic with a mean of 36 and standard deviation of 6 for their 1968 Illinois Test of Psycholinguistic Abilities (ITPA, very different from the new 3rd ed. of the ITPA).  They did not want ITPA scores confused with IQs.

The new edition of the Vineland Adaptive Behavior Scales (VABS) has a v-scale score, which is just like a scaled score pumped up 5 points.  The mean is 15, not 10, but the standard deviation is still 3, so a scaled score of 1 is a v-scale score of 6, a scaled score of 10 is a v-scale score of 15, and so on.  They did this because the VABS is often used with persons who score extremely low on part or all of the scale, and they wanted to be able to distinguish between various degrees of low scores.  Annoying for the practitioner, but actually a good idea.

See?  It all makes sense.

Percentile ranks are different.  They are not based on the mean and standard deviation, but on a nose count from the bottom up.  The lowest 1% are in the first percentile, the next lowest 1% are in the second percentile, and the highest 1% are in the 99th percentile. 
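The "nose count" can be sketched in a few lines of Python. Test publishers use slightly different conventions for tied scores; this sketch assumes the simplest one (the percentage of people who scored lower), and both the function name and the raw scores are made up for illustration.

```python
def percentile_rank_by_count(score, all_scores):
    """Percentage of scores in all_scores that fall below the given score."""
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)


# Ten made-up raw scores from a hypothetical norm group.
raw_scores = [3, 5, 5, 6, 7, 8, 8, 9, 10, 12]

print(percentile_rank_by_count(8, raw_scores))  # 50.0 (five of ten scored lower)
```

Nothing in the count refers to the mean or the standard deviation, which is the whole difference between a percentile rank and a standard score.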

In a perfectly normal distribution (or one that has been jiggered to make it normal) there is a predictable relationship between the various kinds of standard scores and percentile ranks (see attached files).  If the distribution is not perfectly normal and has been left alone, the relationships are slightly different and may vary from subtest to subtest and from age to age.  This is true of the Woodcock-Johnson and Woodcock Reading Mastery Tests, on which a standard score of, say, 90 might have a percentile rank of 22, 23, 24, 25, 26, or 27.  A few tests, including one popular phonology test and some tests used by some occupational therapists, have notably skewed distributions, resulting in very inconsistent relationships between standard scores and percentile ranks.
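For the perfectly normal case, the predictable relationship is just the cumulative normal curve. The sketch below assumes a normal distribution and a standard score with a mean of 100 and standard deviation of 15. The function name and the choice to clamp results to the range 1 to 99 are assumptions made for illustration.

```python
from statistics import NormalDist


def normal_percentile_rank(score, mean=100.0, sd=15.0):
    """Percentile rank of a score in a perfectly normal distribution."""
    pr = 100.0 * NormalDist(mean, sd).cdf(score)
    # Percentile ranks are conventionally reported as whole numbers from 1 to 99.
    return min(99, max(1, round(pr)))


for standard_score in (70, 85, 90, 100, 115, 130):
    print(standard_score, normal_percentile_rank(standard_score))
# 70 2
# 85 16
# 90 25
# 100 50
# 115 84
# 130 98
```

A standard score of 90 comes out at the 25th percentile here, which sits in the middle of the 22 to 27 range mentioned above for tests whose distributions were not forced to be normal.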

Percentile ranks are not equal units.  They are all bunched up in the middle and spread way out at the extremes.  You may find a median if you wish, but you must not add, subtract, multiply, or divide them.  It is meaningless to write of a "gain of x percentiles" or "x percentile points."  Well, not quite meaningless.  It was a gain.  You just have no idea how much of a gain.
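To see how unequal the units are, compare two "gains of 10 percentile points" after translating them back into standard scores. This sketch assumes a normal distribution and the usual standard score with a mean of 100 and standard deviation of 15. The function name is hypothetical.

```python
from statistics import NormalDist


def standard_score_at(percentile, mean=100.0, sd=15.0):
    """Standard score that sits at a given percentile of a normal distribution."""
    return NormalDist(mean, sd).inv_cdf(percentile / 100.0)


# Two gains that look identical when counted in percentile points.
for low, high in ((50, 60), (89, 99)):
    gain = standard_score_at(high) - standard_score_at(low)
    print(f"{low}th to {high}th percentile: {gain:.1f} standard-score points")
```

Under these assumptions the move from the 50th to the 60th percentile is worth roughly 4 standard-score points, while the move from the 89th to the 99th is worth roughly 16.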

Normal Curve Equivalents (NCE) were invented by our very own federal government to create a statistic that looked like a percentile rank, but had equal units so that it could be manipulated statistically to measure progress in federally funded programs.  This must be the only time our government ever tried to create something that was not really what it appeared to be.  I cannot imagine their ever doing that again.  NCEs match up precisely with percentile ranks at 1, 50, and 99, but nowhere else.  Kinda like taking a flexible meter stick and nailing it to a rigid yardstick at each end and in the middle.
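The three nails in the meter stick can be checked numerically. The sketch below assumes a normal distribution and uses the NCE mean of 50 and standard deviation of 21.06 from the table above. The function name is made up for illustration.

```python
from statistics import NormalDist


def nce_from_percentile(percentile_rank):
    """NCE corresponding to a percentile rank in a normal distribution."""
    z = NormalDist().inv_cdf(percentile_rank / 100.0)
    return 50.0 + 21.06 * z


# Percentile rank next to its NCE, rounded to one decimal place.
for pr in (1, 10, 25, 50, 75, 90, 99):
    print(pr, round(nce_from_percentile(pr), 1))
```

The two scales agree at 1, 50, and 99 and drift apart in between: the 25th and 75th percentiles, for example, land near NCEs of 36 and 64, matching the "middle 1/2" column of the table.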

Stanines can be defined by either a mean and standard deviation (m = 5, s.d. = 1.96), which makes them a standard score, or by percentages (the middle 20% is stanine 5; the next 17% on each side are stanines 4 and 6, the next 12% on each side are stanines 3 and 7, the next 7% on each side are stanines 2 and 8, and the most extreme 4% at each end are stanines 1 and 9), which makes them more like percentile ranks.  If the distribution is normal, it doesn't matter practically.
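The percentage definition translates directly into a lookup from percentile rank to stanine. This is a sketch only: the percentages are the ones given above, but the function name and the handling of scores that fall exactly on a boundary (assigned to the higher stanine) are assumptions, and individual tests may draw the lines slightly differently.

```python
from bisect import bisect_right
from itertools import accumulate

# Percentage of cases in stanines 1 through 9, as given in the text.
STANINE_PERCENTS = [4, 7, 12, 17, 20, 17, 12, 7, 4]

# Cumulative cut points between stanines: 4, 11, 23, 40, 60, 77, 89, 96.
CUT_POINTS = list(accumulate(STANINE_PERCENTS))[:-1]


def stanine_from_percentile(percentile_rank):
    """Stanine (1-9) for a percentile rank; boundary values go to the higher stanine."""
    return bisect_right(CUT_POINTS, percentile_rank) + 1


for pr in (2, 10, 25, 50, 75, 90, 98):
    print(pr, stanine_from_percentile(pr))
# 2 1
# 10 2
# 25 4
# 50 5
# 75 6
# 90 8
# 98 9
```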

John

 

Content on these pages is copyrighted by Dumont/Willis © (2001) unless otherwise noted.