Examiners often want to know whether the difference
between two test scores is significant. This is usually determined by applying
one of several discrepancy analysis procedures. Examiners can analyze score or cluster
differences obtained on particular tests (e.g., they can compare the WJ-R COG
Memory for Words subtest score to the Memory for Sentences subtest score or
compare the Fluid Ability Cluster score to the Crystallized Ability Cluster
score). Some procedures for comparing scores within tests are described here.
Confidence Bands
Test scores are never perfectly accurate. Lucky or unlucky guesses,
lapses of attention, and other factors mean that the same person would almost
never get exactly the same score on a test twice in a row. A confidence band
around a score tells how scores on that test are likely to vary by pure chance.
If the confidence bands on two scores overlap, there probably is not a
significant difference between the two scores. On another day the higher and
lower scores might have been reversed.
If the confidence bands on two scores do not overlap, and if both scores are
probably valid, there probably is a significant difference between the two
scores. On another day, the higher and lower scores would probably have still
been the higher and lower scores, respectively.

In the example above, there is a triumph of Hope over Experience, but neither
is significantly different from Dumb Luck.
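The overlap check described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the band is centered on the obtained score, its half-width is z times the standard error of measurement (SEM = SD * Sqrt[1 - r]), a 95% band (z = 1.96) is used, and the scores and reliabilities are made up rather than taken from any manual.

```python
import math

def confidence_band(score, reliability, sd=15.0, z=1.96):
    """Return (low, high) for a band of score +/- z * SEM.

    Assumes SEM = sd * sqrt(1 - reliability) and a band centered
    on the obtained score.
    """
    sem = sd * math.sqrt(1.0 - reliability)  # standard error of measurement
    return (score - z * sem, score + z * sem)

def bands_overlap(band_a, band_b):
    """True when two (low, high) intervals share at least one point."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]

# Hypothetical scores and reliabilities, for illustration only.
band_1 = confidence_band(score=110, reliability=0.91)
band_2 = confidence_band(score=92, reliability=0.84)
print(band_1, band_2, bands_overlap(band_1, band_2))
```

With these invented values the two bands overlap, so the 18-point gap would not be treated as a dependable difference.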
Base-rate
Base-rate refers to the prevalence or frequency of a
particular occurrence or event within a population. Awareness of relevant
base-rate data allows an evaluator to determine the diagnostic utility of a
particular sign. Although a particular relevant comparison may reach some level
of statistical significance, it is always necessary to determine if the
statistical difference is a usual or an unusual one. Base-rate information
provides just such data.
Testing the Difference of Scores within the Same Test
One can test the differences between any 2 of the 21 WJ-R COG
subtests and/or between any of the 7 WJ-R COG composites associated with the
McGrew, Flanagan, and Ortiz integrated Carroll/Cattell-Horn Gf-Gc
model. The first step in conducting a discrepancy analysis between two WJ-R COG
subtest scores is to calculate the actual difference between the scores in
question. This is computed by subtracting the lower test standard score from the
higher test standard score. The next step is to determine whether the point
difference is large enough to be of any interest. We describe two methods that
can be used to examine within-test difference scores for importance. The first
examines whether the difference between two test scores is statistically significant; the
second examines whether or not the difference is large enough to be considered
clinically useful.
Statistical Significance
The first step in examining difference between scores is to
see if the difference is beyond that which would be expected by chance alone.
Anastasi and Urbina (1997) provide a formula to help determine how large a
Difference Score must be in order to be statistically significant.
This formula has been adapted to read:
Significant Difference Score = SD*Z*Sqrt[2-(r1+r2)]
where SD = the standard deviation of the scale on which both scores are
expressed, Z = the z value corresponding to the chosen statistical significance
level, r1 = reliability of the first score, and r2 = reliability of the second score.
All subtests and composites of the WJ-R have a standard
deviation of 15. For our purposes, the .05 significance level was employed,
which corresponds to a two-tailed value of 1.96 on the z-distribution table. Table 7.1 of the
Woodcock manual (Woodcock, R. W., & Mather, N. (1989). WJ-R Tests of Cognitive Ability -- Standard and Supplemental Batteries:
Examiner's Manual. In R. W. Woodcock & M. B. Johnson, Woodcock-Johnson Psycho-Educational Battery--Revised. Chicago: Riverside Publishing Co.)
(p. 117) provides the median internal consistency reliability
coefficients for the WJ-R COG subtests and composites across the standardization
sample ages. Thus, we can use the formula to determine the minimal Difference
Score required for significance for all subtest and composite combinations.
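As a rough worked example, the formula above can be computed directly. The sketch below uses SD = 15 and Z = 1.96 as stated in the text; the two reliability coefficients and the two standard scores are hypothetical and are not values from Table 7.1 of the manual.

```python
import math

def significant_difference(r1, r2, sd=15.0, z=1.96):
    """Minimal difference needed for significance: SD*Z*Sqrt[2-(r1+r2)]."""
    return sd * z * math.sqrt(2.0 - (r1 + r2))

# Hypothetical reliabilities for two subtests (not from the manual).
r_first, r_second = 0.90, 0.85
critical = significant_difference(r_first, r_second)

# Hypothetical standard scores: subtract the lower from the higher.
observed = abs(108 - 95)
print(round(critical, 1), observed >= critical)
```

With these made-up reliabilities the required difference works out to about 14.7 points, so a 13-point difference would not reach significance at the .05 level.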
When considering score differences, one should keep in mind
what it means when the difference between two test scores is not
significant at the desired level. If the difference can be attributed to chance, then for all
practical purposes, the difference should be thought of as being zero. There is
no real meaning to saying something is "almost significant."
Therefore, when making comparisons among WJ-R COG subtests and composites,
differences that are not significant at the .05 level should be interpreted to
mean that the examinee demonstrated equal levels of the abilities measured by
the subtests or composites.
If two test scores are significantly different from one
another, one still cannot assume that the differences are unusual enough to be
clinically useful (i.e., that the differences are rare enough to be of value). To
help determine how severe the discrepancy must be to be considered clinically
useful, frequency tables were created from the standardization sample.
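The idea behind such a frequency table can be illustrated with a short sketch: for a given observed difference, count what percentage of a sample showed a difference at least that large. The function name and the difference scores below are hypothetical; they are not drawn from the WJ-R standardization sample, and the actual tables may have been built differently.

```python
def percent_at_or_above(differences, observed):
    """Percentage of the sample whose absolute difference is >= observed."""
    if not differences:
        raise ValueError("differences must not be empty")
    count = sum(1 for d in differences if abs(d) >= abs(observed))
    return 100.0 * count / len(differences)

# Made-up difference scores; NOT taken from the WJ-R standardization sample.
sample_differences = [2, -5, 8, 12, -15, 3, 20, -7, 1, 25]
print(percent_at_or_above(sample_differences, 15))
```

A small percentage would suggest that a difference of that size is unusual in the population, while a large percentage would suggest it is common even if it is statistically significant.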
Method
To create the tables for significant differences, the following formulas were
utilized:
Significance level for multiple comparisons: The Davis (1959) formula was used to
compute the deviations from the average that are significant at the desired level
of significance. That formula is:
Sqrt[(∑SEmT^2)/n^2 + ((n-2)/n)*SEmI^2] * Bonferroni Correction
- ∑SEmT^2 = the sum of the squared Standard Errors of Measurement for all
subtests included in the comparison.
- n = the total number of subtests in the comparison.
- SEmI^2 = the squared Standard Error of Measurement for the individual
subtest in question.
- Bonferroni Correction = the adjustment made for alpha slippage due to
multiple comparisons, set at 95% confidence.
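A minimal sketch of this computation follows. The standard-error part follows the formula as given above. How the Bonferroni correction was actually applied is not spelled out in the text, so the sketch assumes a two-tailed z value at alpha divided by the number of subtests; the helper names and the SEm values are hypothetical.

```python
import math
from statistics import NormalDist

def davis_standard_error(sems, index):
    """Standard error of the difference between one subtest and the
    mean of all n subtests (the term inside Sqrt[...] above)."""
    n = len(sems)
    if n < 2:
        raise ValueError("need at least two subtests")
    sum_sq = sum(s ** 2 for s in sems)  # sum of squared SEms
    return math.sqrt(sum_sq / n ** 2 + ((n - 2) / n) * sems[index] ** 2)

def bonferroni_z(n_comparisons, alpha=0.05):
    """Assumed correction: two-tailed z at alpha / n_comparisons."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_comparisons))

def critical_deviation(sems, index, alpha=0.05):
    """Deviation from the mean required for significance."""
    return davis_standard_error(sems, index) * bonferroni_z(len(sems), alpha)

# Hypothetical SEms for 7 subtests (not values from the manual).
hypothetical_sems = [3.0] * 7
print(round(critical_deviation(hypothetical_sems, 0), 2))
```

Because the corrected z value grows with the number of subtests, the required deviation is larger when 14 or 21 tests are compared than when 7 are.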
Links to the tables and descriptions of how to use them are below:
Between tests: To identify an individual test as a strength or weakness
compared to all other tests, determine the mean of the 7, 14, or 21 tests
administered. Subtract the obtained score of the desired test from that
mean. If the absolute value of the resulting number is greater than the
"Significance level" value
in the appropriate column, the test may be considered a strength or a weakness.
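The procedure just described can be sketched as follows. The test names, scores, and "Significance level" values are all invented for illustration; real values would come from the tables. The sketch reports each deviation as score minus mean so that a positive number indicates a strength, which is only a sign convention and does not change the absolute value being compared.

```python
def flag_tests(scores, critical_values):
    """Label each test as a strength, a weakness, or not significant."""
    mean = sum(scores.values()) / len(scores)
    results = {}
    for name, score in scores.items():
        deviation = score - mean  # positive: above the examinee's own mean
        if abs(deviation) > critical_values[name]:
            label = "strength" if deviation > 0 else "weakness"
        else:
            label = "not significant"
        results[name] = (round(deviation, 1), label)
    return results

# Hypothetical standard scores for 7 tests (mean is 100).
scores = {"Test 1": 100, "Test 2": 95, "Test 3": 118, "Test 4": 102,
          "Test 5": 84, "Test 6": 99, "Test 7": 102}
# Hypothetical "Significance level" values, one per test.
critical_values = {name: 10.0 for name in scores}

results = flag_tests(scores, critical_values)
for name, (deviation, label) in results.items():
    print(name, deviation, label)
```

With these made-up numbers, only Test 3 (18 points above the mean) and Test 5 (16 points below) exceed the threshold.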