A quick Assessment Question from a colleague

Hi John,

I have a question for you. There are T-scores, standard
scores, scaled scores, stanines, etc. Why? Why are there different means of
measurements? Do they measure something different each time?

John O. Willis wrote:

To: xxxxxxxxxxxxxxxxxx

Dear xxxxxxxxxx,
These are all standard scores.
They are defined by their mean and standard deviation.
The basic, lingua franca, Rosetta Stone one is the z score, with a mean
set at 0.00 and a standard deviation of 1.00.

Aside from the desire to immortalize themselves by creating
and popularizing a new statistic, test authors usually try to pick a score that
is appropriately fine-grained for the purpose.
Similarly, you might measure insects in millimeters, small
mammals in centimeters, larger mammals in inches, and wicked big mammals in feet
(if they'll stand still for it).

Units that are too small and fine are cumbersome and
inaccurate and give a false impression of more precision than the test really
offers.

Units that are too large don't discriminate finely enough
and waste data.

Therefore, for example, Wechsler used the fine-grained
standard score (also called a composite or a quotient by other authors) with a
mean of 100 and standard deviation of 15 for his total IQ scores, but the
coarser scaled score (called a standard score by some authors) with a mean of 10
and standard deviation of 3 for subtests, which are much less reliable than IQ
and factor scores.

[Actually, Wechsler picked a "probable error" of
10 to get nice, round numbers. A probable error is roughly two-thirds
(0.6745) of a standard deviation, so a
probable error of 10 is practically the same as a standard deviation of 15.
See above.]

Colin Elliott uses the Wechsler-style, fine-grained
standard score with a mean of 100 and standard deviation of 15 for his total
cluster scores, but the coarser T score with a mean of 50 and standard deviation
of 10 for subtests. Elliott thought
that his subtests were more statistically reliable and had more specificity than
subtests on other, similar instruments, so he wanted a scale that was coarser
than a standard score (m = 100, sd = 15), but more fine-grained than a scaled
score (m = 10, sd = 3). The T score
(m = 50, sd = 10) was a good compromise.

Fads, attitudes, and traditions are also a part of the
choice. Lewis Terman, Maud Merrill,
and Quinn McNemar picked a mean of 100 and standard deviation of 16, not 15, for
the Stanford-Binet Intelligence Scale. They
did this partly because those statistics were pretty close to the average actual
empirical finding using the old-fashioned quotient-type IQs [(mental
age/chronological age) x 100] and partly, I suspect, just to be different from
Wechsler. Later Binet editions
maintained this annoying difference, and the fourth edition even introduced a
subtest score with a mean of 50 and sd of 8.
[It was confusingly similar to a T score, but moderately handy because you
could simply double it to get the standard-score equivalent.]
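
Because every one of these scores is defined by a mean and a standard deviation, any of them can be re-expressed on another scale by passing through the z score. The following is only a rough sketch of that arithmetic, not anything taken from a test manual: the means and standard deviations are the ones quoted in this letter, while the dictionary keys and the function name `convert` are hypothetical labels made up for the illustration.

```python
# Means and standard deviations as quoted in the letter.
# The key names are hypothetical labels, not official terminology.
SCALES = {
    "z": (0.0, 1.0),
    "standard": (100.0, 15.0),       # Wechsler-style standard score
    "scaled": (10.0, 3.0),           # Wechsler-style subtest scaled score
    "T": (50.0, 10.0),               # T score
    "binet_old": (100.0, 16.0),      # earlier Stanford-Binet totals
    "binet4_subtest": (50.0, 8.0),   # Stanford-Binet, fourth edition subtests
}


def convert(score, source, target):
    """Re-express a score from one standard-score scale on another."""
    src_mean, src_sd = SCALES[source]
    tgt_mean, tgt_sd = SCALES[target]
    z = (score - src_mean) / src_sd  # distance from the mean in SD units
    return tgt_mean + z * tgt_sd


if __name__ == "__main__":
    # One standard deviation below the mean, shown on every scale.
    for name in SCALES:
        print(name, convert(-1.0, "z", name))

    # Doubling a fourth-edition subtest score gives the old Binet total scale.
    for subtest in (34, 42, 50, 58, 66):
        print(subtest, convert(subtest, "binet4_subtest", "binet_old"))
```

The conversion only changes the units. It cannot add precision that the original, coarser score never had.
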
The new, 5th ed. of the Stanford-Binet finally threw in the towel and
went with m = 100, sd = 15.

Tests and questionnaires assessing attitudes, traits, and
behaviors have traditionally used T scores, so new ones often continue to.

Kirk, McCarthy, Kirk, and Paraskevopoulos chose a bizarre
statistic with a mean of 36 and standard deviation of 6 for their 1968 Illinois
Test of Psycholinguistic Abilities (ITPA, very different from the new 3rd ed. of
the ITPA). They did not want ITPA
scores confused with IQs.

The new edition of the Vineland Adaptive Behavior Scales (VABS)
has a v-scale score, which is just like a scaled score pumped up 5 points.
The mean is 15, not 10, but the standard deviation is still 3, so a
scaled score of 1 is a v-scale score of 6, a scaled score of 10 is a v-scale
score of 15, and so on. They did
this because the VABS is often used with persons who score extremely low on part
or all of the scale, and they wanted to be able to distinguish between various
degrees of low scores. Annoying for
the practitioner, but actually a good idea.

See? It all makes sense.

Percentile ranks are different.
They are not based on the mean and standard deviation, but on a nose
count from the bottom up. The lowest
1% are in the first percentile, the next lowest 1% are in the second percentile,
and the highest 1% are in the 99th percentile.
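
The "nose count" can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `percentile_rank` is hypothetical, the norm sample is made-up data, and counting tied scores as half below is one common convention, not necessarily the one any particular test publisher uses.

```python
def percentile_rank(score, norm_sample):
    """Percent of the norm sample scoring below `score`.

    Tied scores are counted as half below (an assumed convention).
    The mean and standard deviation are never used, only a count.
    """
    below = sum(1 for s in norm_sample if s < score)
    tied = sum(1 for s in norm_sample if s == score)
    return 100.0 * (below + 0.5 * tied) / len(norm_sample)


if __name__ == "__main__":
    # Made-up norm sample: 100 people with raw scores 1 through 100.
    sample = list(range(1, 101))
    for raw in (1, 25, 50, 75, 100):
        print(raw, percentile_rank(raw, sample))
```

Published tables normally round such values and report them from 1 to 99.
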
In a perfectly normal distribution (or one that has been
jiggered to make it normal) there is a predictable relationship between the
various kinds of standard scores and percentile ranks
(see attached files). If the
distribution is not perfectly normal and has been left alone, the relationships
are slightly different and may vary from subtest to subtest and from age to age.
This is true of the Woodcock-Johnson and Woodcock Reading Mastery Tests,
on which a standard score of, say, 90 might have a percentile rank of 22, 23,
24, 25, 26, or 27. A few tests,
including one popular phonology test and some tests used by some occupational
therapists, have notably skewed distributions, resulting in very inconsistent
relationships between standard scores and percentile ranks.

Percentile ranks are not equal units.
They are all bunched up in the middle and spread way out at the extremes.
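
The bunching is easy to see by computing percentile ranks from standard scores under a perfectly normal curve. The sketch below assumes the m = 100, sd = 15 scale and uses `NormalDist` from Python's standard `statistics` module. The name `normal_percentile_rank` is hypothetical, and real tests with non-normal norms will not follow it exactly.

```python
from statistics import NormalDist

# Assumes a perfectly normal distribution on the m = 100, sd = 15 scale.
IQ_STYLE = NormalDist(mu=100, sigma=15)


def normal_percentile_rank(standard_score, dist=IQ_STYLE):
    """Percentile rank of a standard score under a perfectly normal curve."""
    return 100 * dist.cdf(standard_score)


if __name__ == "__main__":
    # The same 5-point gain in standard score, at three places on the curve.
    for low, high in ((100, 105), (115, 120), (130, 135)):
        pr_low = normal_percentile_rank(low)
        pr_high = normal_percentile_rank(high)
        print(f"{low} -> {high}: percentile rank {pr_low:.1f} -> {pr_high:.1f} "
              f"(gain of {pr_high - pr_low:.1f})")
```

Under these assumptions a standard score of 90 falls at about the 25th percentile, in the middle of the 22-to-27 range quoted above. The same 5-point gain is worth about 13 percentile ranks near the mean but only a little more than 1 at the upper extreme.
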
You may find a median if you wish, but you must not add, subtract,
multiply, or divide them. It is
meaningless to write of a "gain of x percentiles" or "x
percentile points." Well, not
quite meaningless. It was a gain.
You just have no idea how much of a gain.

Normal Curve Equivalents (NCEs) were invented by our very
own federal government to create a statistic that looked like a percentile rank,
but had equal units so it could be manipulated statistically to measure
progress in federally funded programs. This
must be the only time our government ever tried to create something that was not
really what it appeared to be. I
cannot imagine their ever doing that again.
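
One way to see how an equal-unit score can be dressed up to resemble a percentile rank is to compute both from the same z scores. The sketch below assumes the commonly cited definition of the NCE as a standard score with a mean of 50 and a standard deviation of about 21.06. That figure does not come from this letter, and the function name `nce_from_percentile` is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # z scores: mean 0, standard deviation 1
NCE_MEAN = 50.0
NCE_SD = 21.06             # commonly cited value; an assumption, not from the letter


def nce_from_percentile(percentile_rank):
    """NCE for a percentile rank (1 to 99), assuming a normal distribution."""
    z = STD_NORMAL.inv_cdf(percentile_rank / 100)
    return NCE_MEAN + NCE_SD * z


if __name__ == "__main__":
    for pr in (1, 10, 25, 50, 75, 90, 99):
        print(f"percentile rank {pr:2d} -> NCE {nce_from_percentile(pr):.1f}")
```

Under these assumptions a percentile rank of 25 comes out near an NCE of 36, not 25.
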
NCEs match up precisely with percentile ranks at 1, 50, and 99, but
nowhere else. Kinda like taking a
flexible meter stick and nailing it to a rigid yardstick at each end and in the
middle.

Stanines can be defined by either a mean and standard
deviation (m = 5, s.d. = 1.96), which makes them a standard score, or by
percentages (the middle 20% is stanine 5; the next 17% on each side are
stanines 4 and 6; the next 12% are stanines 3 and 7; the next 7% are stanines
2 and 8; and the most extreme 4% at each end are stanines 1 and 9), which makes
them more like percentile ranks. If the
distribution is normal, it doesn't matter practically.

John

Content on these pages is copyrighted by Dumont/Willis © (2001) unless otherwise noted.