Double Standards in Testing, or Do We Need a Gun Control Law for Testers?

Cisco (Ron Dumont, Ed.D., NCSP) and Eggbert (John Willis, Ed.D.)

Eggbert and I want a Brady Bill for testers. In the real world, we require people who want to use a gun to get a permit. We do it because we recognize that guns are dangerous weapons in the hands of those ill prepared to use them. We suggest that tests can be the same: they are dangerous weapons in the hands of the unqualified, and we need stricter laws to protect us from unfortunate hunting accidents. Some will argue that there are already laws in place to protect us, but are they doing the job they should? Minimum protection is what the law guarantees. If making sure that tests meet standards for reliability and validity is a minimum standard, we must follow it. We believe that most school psychologists do follow this minimum standard, but we also believe there exists a double standard. Evaluators outside the school setting are often not held accountable for the type of evaluation they do. Batteries of tests are given in nonstandardized fashion with little or no mention of it in the report. We have watched the faces of evaluators turn red with anger when we have dared to question them about the tests, the test procedures, and the results. God forbid a question about reliability and validity should ever come up. [If a test can't predict itself (reliability), can we expect it to predict anything else (learning disabilities)?]
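
That bracketed question has a precise form in classical test theory: a test's correlation with any criterion (its validity) can never exceed the square root of its reliability. A minimal sketch in Python, with the reliabilities chosen purely for illustration:

```python
import math

# Classical test theory: a test's correlation with any criterion
# (its validity, r_xy) is capped by the square root of its
# reliability:  r_xy <= sqrt(r_xx * r_yy) <= sqrt(r_xx).
def max_validity(reliability: float) -> float:
    """Upper bound on the validity coefficient of a test."""
    return math.sqrt(reliability)

# Hypothetical reliabilities, chosen for illustration only.
for r_xx in (0.95, 0.80, 0.60, 0.40):
    print(f"reliability {r_xx:.2f} -> validity at most {max_validity(r_xx):.2f}")
```

An unreliable test cannot be rescued by clinical skill: if it can't predict itself, the arithmetic guarantees it predicts everything else even worse.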

Why is it that school districts are often held to a standard of testing that others are not expected to maintain? The New York state regulations (not to mention the Federal guidelines) clearly state the requirements for the testing we do. (Part 200 regulations: "School districts shall ensure that:") The law does not quibble. It states that we must ensure that minimum protections are afforded all students. Would parents, parent advocates, and attorneys let us evaluate a child with nonstandardized tests, in a nonstandardized way, with extremely small norming samples, and decide based upon them that the child was not handicapped? Is it only when these test procedures confirm what someone else wants that "we can get away with it"? (Lest anyone get the wrong idea, we don't advocate that we try to "get away with it." Eggbert and I want us never to do it in the first place, and to hold those that do accountable for not following the law.)

As part of our work as educational consultants to school districts and to lawyers involved in educational due process cases, we have reviewed quite a few reports and evaluations done by others. These reports range from excellent to highly questionable. For example, we reviewed a test report so large it started with a table of contents. It contained over 45 pages of test explanations. (The child had been administered 64 separate tests.) At the end of the report came the four recommendations, only one of which concerned the child: "Further testing to clarify the issues raised by this evaluation is needed"!!! To steal a phrase directly from Eggbert, "the integration of this 45-page report was done with the staple." Here the issue may have been too much testing with too little integration.

In another case, we had not been allowed to review the extensive report before the team meeting. When the report was finally handed out to the team, we simply held it to our foreheads and did a bad "Carnac the Magnificent" imitation: "And the answer is..." Without ever reading it, we were both able to repeat verbatim the recommendations made (something akin to the performance of Tweedledee and Tweedledum). The parents were amazed, but still believed everything presented in the report. Maybe here the issue was too little interpretation.

Finally, a child, tested by her school district, was found not to have any serious learning disabilities and therefore was not identified as handicapped. (She was, though, offered non-special-education services to address some of the issues highlighted by her evaluation.) Her parents were encouraged by a child advocate to take her for further testing "by the experts," who could surely find the problem that the school had overlooked. When she arrived at the evaluator's office, the evaluator was ready. Having reviewed all the pertinent background data, the evaluator knew exactly what the referral problem was and what areas needed to be assessed. Parent interviews had gleaned a lot of information about the child, and the testing done by the school had also been reviewed. Over the next 6 hours, the child was assessed in the areas of intellect, neurodevelopment, academic achievement, and social-emotional stability. Tests included the WISC-III (her second in 3 weeks), parts of the Stanford-Binet 4th Ed., subtests of the DTLA-III, the TOLD, the Bender, the Rey-Osterrieth, parts of the WRAML, the CAVLT, the WMS-R, the Menyuk Sentences, Trails A and B, the Stroop, cancellation tests, the Hooper, the Money Road Map Test, sentence completions, the TAT, the Rorschach, the Woodcock-Johnson Psychoeducational Battery (Cognitive and Achievement), the PEERAMID, and the WRAT-R. After this evaluation, the report developed by the evaluator stated that there were "significant learning disabilities" and that "these undiagnosed disabilities created an 'assault' on the child's emotional well-being."

How does one respond to such a thorough evaluation by a "competent expert"? First, we must accept that the expert may in fact be right and that the evaluation done by the school may have missed the boat completely. When opinions diverge this sharply, one way to resolve the dilemma is to evaluate the evaluation. This is necessary, but it unfortunately leads to "dueling psychologists" and seldom ends in agreement between the parties.

There are a number of ways to review a report, but this essay focuses on minimum protection. Looking only at the third example above, we believe that minimum protection was not offered to the child and that the results of the evaluation are therefore questionable from both legal and psychometric points of view.

Obvious issues to raise would include the effects of test-retest. Did the expert, who knew what tests had been given by the school, repeat them after a short period of time without accounting for the typical change in scores? Beyond the obvious fact that any child taking this many tests in such a short period will be fatigued and probably overwhelmed, what is the need for such an extended battery? What is the effect on subsequent test scores of adding more and more tests to an evaluation? Does adding test after test increase the reliability of discovering the problem? (Cisco once challenged an opposing lawyer to let him test him for just 10 minutes. Cisco would guarantee to "find" something wrong; it's really just a matter of choosing the right test. The lawyer declined.) Is there such a thing as incremental validity, and is it something we should strive for? Since an individual's ability to synthesize and understand all the small incremental pieces of knowledge is extremely limited, what is gained by giving such a "complete" battery? What may be gained is tester/testee bias, confirmatory bias, illusory correlations, and the like. How often do we receive a request to evaluate and then find nothing wrong? We are the experts; we're supposed to be able to find and explain the difficulty. Do we find difficulty, or simply confirm what others told us to find? Aren't testing and evaluation supposed to be objective? Shouldn't every student who comes to us be presumed normal, with the evaluator setting out to prove it?
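
Two of these questions can be given rough numbers. Under the simplifying assumption that each test in a battery has an independent 5% chance of flagging a perfectly normal child as "abnormal," the odds that a long battery "finds" something approach certainty; and classical test theory predicts that a retest score drifts toward the mean even before practice effects are added. The figures below (the 5% rate, the 0.90 reliability) are illustrative assumptions, not values taken from any test named above:

```python
# If each of k independent tests had a 5% chance of flagging a normal
# child as "abnormal," the chance the battery flags *something* is
# 1 - 0.95**k. (Real subtests are correlated, which softens this, but
# the direction of the problem is the same.)
for k in (1, 5, 10, 64):          # 64 tests, as in the report above
    print(f"{k:3d} tests -> {1 - 0.95 ** k:.0%} chance of at least one 'finding'")

# Test-retest: with reliability r, the score expected on retesting
# regresses toward the mean, and practice effects then push it up.
# Two WISC-IIIs given three weeks apart are not independent measurements.
def expected_retest(obtained: float, mean: float = 100.0, r: float = 0.90) -> float:
    return mean + r * (obtained - mean)

print(expected_retest(85.0))      # 86.5, before any practice-effect gain
```

With 64 tests, a "finding" is close to guaranteed for anyone, which is exactly the point of Cisco's 10-minute wager.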

"Tests and other assessment procedures: (b) have been validated for the specific purpose for which they are used; "

Review the list of tests given. We would characterize some of the tests as esoteric. (That's to choose a nice word for what we really think of some of them.) What are the psychometric properties of each? Take the Stroop, for example: created in the 1930s and renormed in the 1970s by Charles Golden. One evaluator gave this test to a 7-year-old and diagnosed "diffuse bilateral brain damage." Serious stuff for a test with extremely questionable statistics. The data used to create the age-based norms were lost when Dr. Golden moved from California (personal conversation, 10/89). To his credit, the manual states that the norms are provisional and need to be updated. Unfortunately, this hasn't happened. Does it matter? Cisco's study of the Stroop found the age-corrected norms to be so far off that a normal person will receive scores placing them from .3 to .9 standard deviation units below the mean. How stable and reliable are the norms for some of the other tests? Remember, the standards demand that tests be reliable and valid for the purpose for which they are given. Although the standards don't say that the responsibility for choosing reliable and valid tests ultimately lies with the evaluator, who else's responsibility should it be? It must also be noted that this does not mean, as some have suggested, that the evaluator creates these properties. We once sat open-mouthed at a special education hearing as we listened to an evaluator explain that because s/he was an expert tester, the tests and results were therefore valid and reliable. This person seemed to think that tests become reliable just by being given properly. Tests don't become reliable; they either are or aren't, based upon the statistics. How many people are included in the norming samples of tests like the cancellation tests, the Trail Making Test, or the Rey-Osterrieth Complex Figure task? Often the answer depends upon whose norms you wish to use, and choosing norms can make a big difference in interpretation. A score on the Rey using the original norms will place an 8-year-old well below average, yet the same score using the Whishaw-Kolb norms places the same child at dead average.
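
The Rey example is just arithmetic on z-scores, and the Stroop claim translates into percentiles via the normal curve. A brief sketch; the norm means and standard deviations here are invented for illustration and are not the published values of either norm set:

```python
import math

def z_score(raw: float, norm_mean: float, norm_sd: float) -> float:
    return (raw - norm_mean) / norm_sd

def percentile(z: float) -> float:
    """Percentile rank of a z-score under the normal curve."""
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Same raw score, two norm sets. Means and SDs are hypothetical;
# they are NOT the published Rey-Osterrieth values.
raw = 18.0
for name, (m, sd) in {"norm set A": (26.0, 5.0), "norm set B": (18.5, 6.0)}.items():
    z = z_score(raw, m, sd)
    print(f"{name}: z = {z:+.2f} ({percentile(z):.0f}th percentile)")
# norm set A: z = -1.60 (~5th percentile)  -> 'well below average'
# norm set B: z = -0.08 (~47th percentile) -> 'dead average'

# The Stroop claim: a normal person pushed 0.3 to 0.9 SD below the
# mean lands at roughly the 38th down to the 18th percentile.
print(f"{percentile(-0.3):.0f}th to {percentile(-0.9):.0f}th percentile")
```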

"and (c) are administered by trained personnel in accordance with the instructions provided by those who developed such tests or procedures."

Looking at the list of tests given, one notes that in a number of cases only selected subtests of larger tests were administered. Is this minimum protection? What happens when tests are mutilated this way? How can an evaluator claim reliable results when the test is given in an unstandardized way? What other nonstandardized practices can be allowed? Can we disregard starting and stopping rules and still compute standard scores? Can we raise or lower children's scores simply by asking probing questions? As soon as we change from standardized to nonstandardized administration, we corrupt the resulting score. If we don't account for this change in the report, we have not met legal and ethical standards. Even if we do report the changes, can we still say anything about the reliability of the subtest?
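
To make the start/stop point concrete, here is a hypothetical sketch of a discontinue rule. The item responses and the three-consecutive-failures rule are invented; the point is only that the raw score, and therefore any standard score looked up from it, depends on the rule being followed:

```python
# Hypothetical item results for one subtest (1 = pass, 0 = fail).
responses = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0]

def raw_score(responses, discontinue_after=3):
    """Score per a (hypothetical) rule: stop after 3 consecutive failures."""
    score, run_of_fails = 0, 0
    for r in responses:
        if run_of_fails >= discontinue_after:
            break                 # items past this point are never given
        score += r
        run_of_fails = 0 if r else run_of_fails + 1
    return score

print(raw_score(responses))       # 4 -- the rule is followed
print(sum(responses))             # 7 -- testing past the limit inflates it
# The norm table that converts a raw score to a standard score was built
# from children tested WITH the rule; 7 is a score the norms never saw.
```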

Use of certain tests and their results with a suspected learning disabled child can raise another whole set of problems. Does a language disability or perceptual difficulty affect the way a child might respond to the projective stimuli of the Rorschach or other projective tests? For the learning disabled child, will perception alter the output and thus affect the scoring? A search of the PsycLIT database for Rorschach and learning disabilities found a total of only 2 articles addressing the problem. Some would suggest that we simply not score the test protocols and focus instead on the "process." That's fine to a degree. The process is certainly important when we evaluate children, but what if the process, no matter how difficult or revealing, results in "average" scores? Is a child disabled because the process is difficult but the results are normal?

Are evaluations done with tests that don't meet minimum protection standards harmful to children? Why should we (and everyone else) be held accountable for the testing we do? What's wrong with giving esoteric tests, or tests with no norms, or norms so abysmal that a child will be compared to 5 others to determine abnormality? Why not use a test that has no reliability or validity studies? What harm is done? Why not pick and choose subtests from larger tests, give them, and consider them valid measures of what they say they measure? Why not administer tests in a nonstandardized manner? Is the only reason not to that the law says so?
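
What "compared to 5 others" means statistically can be shown with a short simulation: re-draw a five-child norm sample many times from a population whose true mean and SD are known, and watch how wildly an "abnormality" cutoff computed from each sample swings. All of the numbers here are hypothetical:

```python
import random
import statistics

random.seed(1)
TRUE_MEAN, TRUE_SD = 100.0, 15.0   # hypothetical population parameters

# Re-collect a 5-child "norm sample" many times and recompute a common
# deficiency cutoff (1.5 SD below the sample mean) each time.
cutoffs = []
for _ in range(10_000):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(5)]
    cutoffs.append(statistics.mean(sample) - 1.5 * statistics.stdev(sample))

cutoffs.sort()
print(f"true cutoff: {TRUE_MEAN - 1.5 * TRUE_SD:.0f}")
print(f"middle 90% of 5-child cutoffs: {cutoffs[500]:.0f} to {cutoffs[9500]:.0f}")
# A cutoff that wanders by tens of points is deciding who is 'abnormal.'
```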

The answer to these questions is fairly simple. Yes, it is the law, and that is one reason to be held accountable, but beyond that, it's the child. Is it fair to evaluate a child using poor assessment practices or poor assessment tools and identify strengths and/or weaknesses that don't exist? We say no. It is fair and ethical to identify strengths and weaknesses using highly valid and reliable tools; from these and only these can real educational remediations be made. Don't we harm a child by trying to remediate a weakness that doesn't exist, or by trying to use a strength as compensation when the strength isn't there in the first place?

Eggbert and I have developed a list of questions that we feel any examiner who evaluates children (and anyone who might need to protect those children) should be able to ask and answer. Here are just a few:

• What is reliability? How do you know if it's high enough to be useful? Is it created by the test or the tester? Why is it important? (One concrete way to answer the "high enough" question is sketched just after this list.)

• What is validity? How do you determine it?

• Can a test be reliable and not valid, or valid but not reliable?

• What was the standardization group? How many, and from where? Was it a stratified random sample or something else? Was enough data compiled to support reasonable conclusions based upon SES and geographic location?

• If a cognitive battery has been given, can the person who gave it define the construct it measures? (What is intelligence?)

• What is subtest specificity? Does the subtest you are interpreting have adequate specificity?

• What is factor analysis? Is the test being used factorially valid?

• Whose norms were used? How old are the norms? Into what discrete reference groups are the norms broken down?
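
As promised after the first question above, one concrete way to decide whether reliability is "high enough to be useful" is the standard error of measurement, SD × √(1 − r), and the confidence interval it puts around an obtained score. A sketch with hypothetical values on an IQ-style scale, using the common simplification of centering the interval on the obtained score:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1.0 - reliability)

def ci95(score: float, sd: float, reliability: float) -> tuple[float, float]:
    margin = 1.96 * sem(sd, reliability)
    return score - margin, score + margin

# Hypothetical: the same obtained score of 90 on a mean-100, SD-15
# scale, at two different reliabilities.
for r in (0.95, 0.70):
    lo, hi = ci95(90.0, 15.0, r)
    print(f"r = {r:.2f}: true score likely between {lo:.0f} and {hi:.0f}")
# r = 0.95 gives a usable band (about 83-97); r = 0.70 spans roughly
# 74-106 -- everything from 'deficient' to 'above average' at once.
```

A reliability that sounds respectable in the abstract can still produce a confidence band too wide to support any decision about a child.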

We present two opposing proposals about the assessment of children. Here they are:

Proposal 1: Accountability. We would propose that we all be held accountable for the testing we do. If we evaluate with tools that don't meet minimum requirements, let's say so, or do something about it! Let's be sure the parents understand how we arrived at our decisions. Let's require the presentation of norms if asked, and even if not asked. If we can't produce reliability and validity statements about the tests we use, let's be forthcoming with that knowledge. Remember the quote: "In God we trust, all others bring the data." Let's not be shy about sharing protocols or discussing second opinions. If we test properly, why not share the protocol? If someone else interprets it differently, so be it, but the protocol itself ought to be fair game. If we make a statement about a child based upon some response or group of scores, we should be willing to share the exact response and/or scores with anyone qualified and willing to interpret them. How often do we see Rorschach interpretations with no scoring sheet attached? We are willing to give WISC-III subtest scores; why not Rorschach responses and scores? We must be held accountable for our evaluations, and they should and must meet state standards. These are minimal standards. Why can't the minimum be met? If we must meet them, so too must all other evaluators.

Proposal 2: NONSENSE. A way of testing that takes a simple approach. This approach does not meet federal or state standards, but others are using it, so why can't we! Following are the eight rules of NONSENSE: