Q.2 Explain different types of scores used to interpret test results.
ANS:
Three of the fundamental purposes for testing are (1) to describe each student's developmental level within a test area, (2) to identify a student's areas of relative strength and weakness in subject areas, and (3) to monitor year-to-year growth in the basic skills. To accomplish any one of these purposes, it is important to select the type of score from among those reported that will permit the proper interpretation. Scores such as percentile ranks, grade equivalents, and standard scores differ from one another in the purposes they can serve, the precision with which they describe achievement, and the kind of information they provide. A closer look at these types of scores will help differentiate the functions they can serve and the meanings they can convey. Additional detail can be found in the Interpretive Guide for Teachers and Counselors.
In Iowa, school districts can obtain scores that are reported using national norms or Iowa norms. On some reports, both kinds of scores are reported. The difference is simply in the group with which comparisons are made to obtain score meaning. A student's Iowa percentile rank (IPR) compares the student's score with those of others in his/her grade in Iowa. The student's national percentile rank (NPR) compares that same score with those of others in his/her grade in the nation. For other types of scores described below, there are both Iowa and national scores available to Iowa schools.
Interpreting Test Results: Test results are generally presented in terms of numerical scores, such as raw scores, standard scores, and percentile scores. In order to interpret test scores effectively, the scoring system used needs to be understood. The types of scores used are as follows:
I) Raw Scores: These refer to the unadjusted scores on the test. Usually the raw score is the number of questions answered correctly, as in mental ability or achievement tests. Some types of assessment tools and personality tests have no "right" or "wrong" answers. In such cases, the raw score may represent the number of positive responses for a particular trait. On their own, raw scores provide little useful information. For example, consider a candidate who gets 25 out of 50 questions correct on a test. It is hard to know whether 25 is a good score or a poor one. Only when the result is compared with those of the other individuals who took the same test might you discover that it was the highest score on the test.
The number of questions a student gets right on a test is the student's raw score (assuming each question is worth one point). By itself, a raw score has little or no meaning. The meaning depends on how many questions are on the test and how hard or easy the questions are. For example, if Kati got 10 right on both a math test and a science test, it would not be reasonable to conclude that her level of achievement in the two areas is the same. This illustrates why raw scores are usually converted to other types of scores for interpretation purposes.
II) Standard Scores: Standard scores are converted from raw scores. A standard score indicates where a candidate's score lies in comparison to a group, typically by expressing its distance from the group mean in standard deviation units. For example, if the mean score for the group on a test is 50, then a candidate who scores above 50 is above average, and a candidate who scores below 50 is below average.
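To make the conversion concrete, here is a minimal sketch, not tied to any particular published test: a z-score expresses the distance from the group mean in standard deviation units, and a common rescaling (the T-score convention, mean 50 and standard deviation 10) puts it on a friendlier scale. All the numeric values below are hypothetical illustration values.

```python
# Minimal sketch: converting a raw score to standard scores.
# The group mean, SD, and example raw score are invented.

def z_score(raw, group_mean, group_sd):
    """Distance of a raw score from the group mean, in SD units."""
    return (raw - group_mean) / group_sd

def t_score(raw, group_mean, group_sd):
    """A common rescaled standard score: mean 50, SD 10."""
    return 50 + 10 * z_score(raw, group_mean, group_sd)

# A candidate scoring 35 in a group with mean 25 and SD 5:
print(z_score(35, 25, 5))  # 2.0  -> two SDs above the group mean
print(t_score(35, 25, 5))  # 70.0 -> well above the scale average of 50
```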
III) Percentile Score: A percentile score is another kind of converted score. A candidate's raw score is converted into a number indicating the percent of test takers in the group who scored below the test taker. For example, a score at the 70th percentile means that the candidate's score is the same as or higher than the scores of 70% of those who took the test. The 50th percentile is known as the median and represents the middle score of the distribution.
When the raw score is divided by the total number of questions and the result is multiplied by 100, the percent-correct score is obtained. Like raw scores, percent-correct scores have little meaning by themselves. They tell what percent of the questions a student got right on a test, but unless we know something about the overall difficulty of the test, this information is not very helpful. Percent-correct scores are sometimes incorrectly interpreted as percentile ranks, which are described below. The two are quite different.
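The sketch below, using an invented group of scores, shows why the two must not be confused: the same raw score of 25 on a 50-item test is a percent-correct of 50 but a percentile rank of 70 within this group. The strictly-below counting rule follows the percentile rank definition used later in this answer; some tests handle ties differently.

```python
# Sketch: percent-correct vs. percentile rank for one raw score.
# The group's scores are invented for illustration.

def percent_correct(raw, total_questions):
    """Percent of the test's questions answered correctly."""
    return 100.0 * raw / total_questions

def percentile_rank(raw, group_scores):
    """Percent of the comparison group scoring below this raw score."""
    below = sum(1 for s in group_scores if s < raw)
    return 100.0 * below / len(group_scores)

group = [12, 15, 18, 20, 21, 22, 23, 25, 30, 41]  # raw scores, 50-item test
print(percent_correct(25, 50))     # 50.0 -> only half the items right
print(percentile_rank(25, group))  # 70.0 -> yet above 70% of this group
```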
Grade Equivalent (GE)
The grade equivalent is a number that describes a student's location on an achievement continuum. The continuum is a number line that describes the lowest level of knowledge or skill on one end (lowest numbers) and the highest level of development on the other end (highest numbers). The GE is a decimal number that describes performance in terms of grade level and months. For example, if a sixth-grade student obtains a GE of 8.4 on the Vocabulary test, his score is like the one a typical student finishing the fourth month of eighth grade would likely get on the Vocabulary test. The GE of a given raw score on any test indicates the grade level at which the typical student makes this raw score. The digits to the left of the decimal point represent the grade and those to the right represent the month within that grade.
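A minimal sketch of how the GE notation decomposes, assuming only the convention just described (grade to the left of the decimal point, month 0 to 9 to the right):

```python
# Sketch: splitting a grade-equivalent score into grade and month.

def parse_ge(ge):
    """Split a GE such as 8.4 into (grade, month)."""
    grade = int(ge)
    month = round((ge - grade) * 10)  # round() guards float error
    return grade, month

grade, month = parse_ge(8.4)
print(f"Performs like a typical student in grade {grade}, month {month}")
# -> Performs like a typical student in grade 8, month 4
```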
Grade equivalents are particularly useful and convenient for measuring individual growth from one year to the next and for estimating a student's developmental status in terms of grade level. But GEs have been criticized because they are sometimes misused or are thought to be easily misinterpreted. One point of confusion involves the issue of whether the GE indicates the grade level in which a student should be placed. For example, if a fourth-grade student earns a GE of 6.2 on a fourth-grade reading test, should she be moved to the sixth grade? Obviously the student's developmental level in reading is high relative to her fourth-grade peers, but the test results supply no information about how she would handle the material normally read by students in the early months of sixth grade. Thus, the GE only estimates a student's developmental level; it does not provide a prescription for grade placement. A GE that is much higher or lower than the student's grade level is mainly a sign of exceptionally high or low performance relative to grade-level peers, not a basis for placement decisions.
In sum, all test scores, no matter which type they are or which test they are from, are subject to misinterpretation and misuse. All have limitations or weaknesses that are exaggerated through improper score use. The key is to choose the type of score that will most appropriately allow you to accomplish your purposes for testing. Grade equivalents are particularly suited to estimating a student's developmental status or year-to-year growth. They are particularly ill-suited to identifying a student's standing within a group or to diagnosing areas of relative strength and weakness.
Developmental Standard Score (SS)
Like the grade equivalent (GE), the developmental standard score is also a number that describes a student's location on an achievement continuum. The scale used with the ITBS and ITED was established by assigning a score of 200 to the median performance of students in the spring of grade 4 and 250 to the median performance of students in the spring of grade 8.
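As a hedged sketch of how such reference points might be consulted: only the two anchor values above (200 at grade 4, 250 at grade 8) come from the scale description; a real interpretation would use the publisher's full norms tables, since the scale is not linear between grades.

```python
# Sketch: relating a developmental standard score to the nearest
# grade-level reference point. Only the grade 4 and grade 8 medians
# come from the text; real use requires the full norms tables.

GRADE_MEDIANS = {4: 200, 8: 250}  # spring-of-grade medians (from text)

def nearest_reference(ss):
    """Return the reference grade whose median is closest to ss."""
    grade = min(GRADE_MEDIANS, key=lambda g: abs(GRADE_MEDIANS[g] - ss))
    return grade, GRADE_MEDIANS[grade]

grade, median = nearest_reference(215)
print(f"SS 215 is nearest the grade {grade} median of {median}")
```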
The main drawback to interpreting developmental standard scores is that they have no built-in meaning. Unlike grade equivalents, for example, which build grade level into the score, developmental standard scores are unfamiliar to most educators, parents, and students. To interpret the SS, the values associated with typical performance in each grade must be used as reference points.
The main advantage of the developmental standard score scale is that it mirrors reality better than the grade-equivalent scale. That is, it shows that year-to-year growth is usually not as great at the upper grades as it is at the lower grades. (Recall that the grade-equivalent scale shows equal average annual growth -- 10 months -- between any pair of grades.) Despite this advantage, the developmental standard scores are much more difficult to interpret than grade equivalents. Consequently, when teachers and counselors wish to estimate a student's annual growth or current developmental level, grade equivalents are the scores of choice.
The potential for confusion and misinterpretation described in the previous subsection for the GE applies to the SS as well. Relative to the GE, the SS is not as easy to use in describing growth, but it is equally inappropriate for identifying relative strengths and weaknesses of students or for describing a student's standing in a group.
Percentile Rank (PR)
A student's percentile rank is a score that tells the percent of students in a particular group that got lower raw scores on a test than the student did. It shows the student's relative position or rank in a group of students who are in the same grade and who were tested at the same time of year (fall, midyear, or spring) as the student. Thus, for example, if Toni earned a percentile rank of 72 on the Language test, it means that she scored higher than 72 percent of the students in the group with which she is being compared. Of course, it also means that 28 percent of the group scored higher than Toni. Percentile ranks range from 1 to 99.
A student's percentile rank can vary depending on which group is used to determine the ranking. A student is simultaneously a member of many different groups: all students in her classroom, her building, her school district, her state, and the nation. Different sets of percentile ranks are available with the Iowa Tests of Basic Skills to permit schools to make the most relevant comparisons involving their students.
Psychometric Tests
Despite the controversy surrounding psychometric personality tests, their use has increased greatly over the past decade. One reason often given for this increase is the need for a selection process that can withstand legal challenges; increased test use can thus be seen, in part, as a defensive strategy adopted in response to regulation and legislation.
There are no defined good profiles or bad profiles; everything depends on the personality characteristics the job position requires.
Online Test
Online testing is popular because of the ease of conducting such tests. This approach has clear advantages over paper-and-pencil tests:
There is no need to print questionnaires or distribute printed material, which lowers the cost of test administration.
Results are processed immediately with no human input, and the test administration software produces detailed, impressive-looking reports.
Personality testing is widely accepted among candidates; many happily complete online tests in their own time.
There are more suppliers now producing a greater variety of tests. This has brought costs down even further and increased the choice of tests available to recruiting organizations.
Importance of Tests
Different people have different characteristics, which may be psychological or physical. In measurement terms, such characteristics are called constructs. People with strong verbal and mathematical reasoning skills may be considered high on mental ability; people with little stamina and strength may be considered low on endurance and physical strength. Mental ability, endurance, and physical strength are all constructs. Constructs are used to identify characteristics and to grade candidates on those characteristics.
Constructs may not be seen or heard, but their effect on other variables can be observed. For example, we can observe that some people can work on complex numerical calculations without using paper and pen, while some struggle even with a pen and paper.
Types of Score Interpretation
An achievement test is built to help determine how much skill or knowledge students have in a certain area. We use such tests to find out whether students know as much as we expect they should, or whether they know particular things we regard as important. By itself, the raw score from an achievement test does not indicate how much a student knows or how much skill she or he has. More information is needed to decide "how much." The test score must be compared or referenced to something in order to bring meaning to it. That "something" typically is (a) the scores other students have obtained on the test or (b) a series of detailed descriptions that tell what students at each score point know or which skills they have successfully demonstrated. These two ways of referencing a score to obtain meaning are commonly called norm-referenced and criterion-referenced score interpretations.
Norm-Referenced Interpretation
A norm-referenced interpretation involves comparing a student's score with the scores other students obtained on the same test. How much a student knows is determined by the student's standing or rank within the reference group. High standing is interpreted to mean the student knows a lot or is highly skilled, and low standing means the opposite. Obviously, the overall competence of the norm group affects the interpretation significantly. Ranking high in an unskilled group may represent lower absolute achievement than ranking low in an exceptionally high-performing group.
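A small sketch with two invented norm groups makes this concrete: the same raw score of 23 ranks near the top of a weak group but near the bottom of a strong one, so the percentile rank alone says nothing about absolute achievement.

```python
# Sketch: one raw score, two different norm groups.
# Both groups' score lists are invented for illustration.

def percentile_rank(raw, group_scores):
    """Percent of the group scoring below this raw score."""
    below = sum(1 for s in group_scores if s < raw)
    return 100.0 * below / len(group_scores)

weak_group   = [10, 12, 14, 15, 16, 18, 19, 20, 22, 24]
strong_group = [20, 30, 33, 35, 36, 38, 40, 42, 44, 45]

print(percentile_rank(23, weak_group))    # 90.0 -> near the top
print(percentile_rank(23, strong_group))  # 10.0 -> near the bottom
```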
Criterion-Referenced Interpretation
A criterion-referenced interpretation involves comparing a student's score with a subjective standard of performance rather than with the performance of a norm group. Deciding whether a student has mastered a skill or demonstrated minimum acceptable performance involves a criterion-referenced interpretation. Usually percent-correct scores are used and the teacher determines the score needed for mastery or for passing.
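A minimal sketch of such a mastery decision, with a hypothetical teacher-chosen cutoff of 80 percent correct:

```python
# Sketch: a criterion-referenced (mastery) decision.
# The 80% cutoff is a hypothetical teacher-chosen standard.

def mastery_decision(raw, total_questions, cutoff_percent=80.0):
    """Compare percent-correct against the chosen cutoff."""
    pct = 100.0 * raw / total_questions
    status = "mastered" if pct >= cutoff_percent else "not yet mastered"
    return status, pct

status, pct = mastery_decision(17, 20)
print(f"{pct:.0f}% correct -> {status}")  # 85% correct -> mastered
```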
Interpreting Scores from Special Test Administrations
A testing accommodation is a change in the procedures for administering the test that is intended to neutralize, as much as possible, the effect of the student's disability on the assessment process. The intent is to remove the effect of the disability(ies), to the extent possible, so that the student is assessed on equal footing with all other students. In other words, the score reflects what the student knows, not merely what the student's disabilities allow him/her to show.
The expectation is that the accommodation will cancel the disadvantage associated with the student's disability. This is the basis for choosing the type and amount of accommodation to be given to a student. Sometimes the accommodation won't help quite enough, sometimes it might help a little too much, and sometimes it will be just right. We never can be sure, but we operate as though we have made a good judgment about how extensive a student's disability is and how much it will interfere with obtaining a good measure of what the student knows. Therefore, the use of an accommodation should help the student experience the same conditions as those in the norm group. Thus, the norms still offer a useful comparison; the scores can be interpreted in the same way as the scores of a student who needs no accommodations.
A test modification involves changing the assessment itself so that the tasks or questions presented are different from those used in the regular assessment. A Braille version of a test modifies the questions just like a translation to another language might. Helping students with word meanings, translating words to a native language, or eliminating parts of a test from scoring are further examples of modifications. In such cases, the published test norms are not appropriate to use. These are not accommodations. With modifications, the percentile ranks or grade equivalents should not be interpreted in the same way as they would be had no modifications been made.
Certain other kinds of changes in the tests or their presentation may result in measuring a different trait than was originally intended. For example, when a reading test is read to the student, we obtain a measure of how well the student listens rather than how well he/she reads. Or if the student is allowed to use a calculator on a math estimation test, you obtain a measure of computation ability with a calculator rather than a measure of the student's ability to do mental arithmetic. Obviously in these situations, there are no norms available and the scores are quite limited in value. Consequently, these particular changes should not be made.