Establishing Comparable Benchmark Data
General Principles Involved In Establishing Nationally Comparable Benchmark Data
The process of establishing National Benchmarks from jurisdictions' monitoring-test programs essentially involves three steps:
- Establishing the relative difficulties of items from all jurisdictions' tests. This enables all items from the relevant strand (e.g. reading) to be located on a common scale. This can be done by a process of 'intact testing', in which students in one jurisdiction sit tests from other jurisdictions.
This process is possible because all jurisdictions currently use the Rasch Model to analyse their students' results. The Rasch Model enables item (question) difficulties and student abilities to be 'mapped' onto the same scale. Because the Rasch Model has certain properties, it can be used to equate test results between jurisdictions and from year to year.
The process of equating tests essentially involves placing item difficulties from different tests onto the same scale. For example, in the intact testing program, item difficulties from all jurisdictions' tests, in Year 3 reading (say), were located on the same scale.
- Locating the benchmark on the common scale. Each jurisdiction sends three experts to a central location (at present Sydney), where the benchmark standard is described and discussed. The judges then estimate, for every item in a strand (reading, say), the probability that a student of benchmark-standard ability would answer the item correctly. The process is mentally taxing and time consuming, and the judges' estimates are then subjected to a fairly rigorous statistical analysis to estimate the judges' consistency.
- Mapping the benchmark 'cut score' (obtained from the second step) onto jurisdictions' own scales. From this, the proportion of students who have 'reached' the benchmark within a jurisdiction can be estimated.
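The second step can be illustrated with a minimal sketch. It uses the standard Rasch formula, in which the probability of a correct answer depends only on the difference between student ability and item difficulty (both in logits). The way a judge's probability estimates are turned into a cut score here (invert the formula for each item, then average) is an assumption made for illustration; the source does not describe the exact calculation, and the function names are hypothetical.

```python
import math
from statistics import mean


def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the Rasch model (logit units)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def implied_ability(difficulty, judged_probability):
    """Ability at which the Rasch model yields the judged probability.

    The judged probability must lie strictly between 0 and 1.
    """
    return difficulty + math.log(judged_probability / (1.0 - judged_probability))


def judge_cut_score(difficulties, judged_probabilities):
    """One judge's benchmark location: the mean implied ability over all items.

    Assumption: simple averaging; the actual analysis may differ.
    """
    return mean(
        [implied_ability(b, p) for b, p in zip(difficulties, judged_probabilities)]
    )


# A student whose ability equals the item difficulty has an even chance.
print(rasch_probability(0.0, 0.0))  # 0.5
```

If a judge's estimates were perfectly consistent with the Rasch model, every item would imply the same ability; the spread of the implied abilities across items is one possible indicator of a judge's consistency.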

Margins of Error
Because there was not 100% consensus by the benchmark judges on the exact location of the benchmark cut score, a measure of the variability of the judges' estimation was incorporated into the benchmark cut score.
For each jurisdiction, three benchmark cut scores are calculated: the actual benchmark cut score (obtained by calculating the mean cut score of the judges) and an 'upper' and 'lower' cut score (the cut scores one standard deviation above and below the mean cut score, respectively). This margin of error represents uncertainty about the precise location of the benchmark.
When reporting benchmark performances, each jurisdiction reports the percentage of students above the cut score, and above the 'lower' and 'upper' cut scores. Thus a figure of 85% plus or minus 2% means that 85% of students scored above the (mean) cut score, 87% (85% + 2%) of students scored above the lower cut score, and 83% (85% - 2%) scored above the upper cut score.
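The calculation described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the use of the sample standard deviation is an assumption (the source does not say which form is used), and students exactly on a cut score are counted as having reached it.

```python
from statistics import mean, stdev


def benchmark_cut_scores(judge_cuts):
    """Return (lower, mean, upper) cut scores from the judges' cut scores.

    Assumption: the spread is the sample standard deviation.
    """
    centre = mean(judge_cuts)
    spread = stdev(judge_cuts)
    return centre - spread, centre, centre + spread


def percent_at_or_above(abilities, cut):
    """Percentage of students whose ability is at or above the cut score."""
    return 100.0 * sum(1 for a in abilities if a >= cut) / len(abilities)


def benchmark_report(abilities, judge_cuts):
    """Percentages of students above the mean, lower and upper cut scores."""
    lower, centre, upper = benchmark_cut_scores(judge_cuts)
    return {
        "mean": percent_at_or_above(abilities, centre),
        "lower": percent_at_or_above(abilities, lower),
        "upper": percent_at_or_above(abilities, upper),
    }
```

The percentage above the lower cut score is always at least as large as the percentage above the mean cut score, which is why the 'plus' figure corresponds to the lower cut score and the 'minus' figure to the upper one.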
In some circumstances, errors due to sampling have to be calculated, but in most cases these are virtually negligible.
Interpolation
Usually, the benchmark cut scores are not 'exact' scores that could have been achieved by students. For example, the benchmark cut score, as a raw score, might be, say, 10.5, but students could score only 0, 1, 2, 3, ..., 10, 11, ..., 27, 28, and not 10.5. In these circumstances, a method of interpolation is used to estimate the percentage of students who, theoretically, scored 10.5 or more. This can be justified because some students who scored 10, say, might have been able to score 10.1, 10.2, 10.3 and so on had the measurement scale been sufficiently 'fine'.
In practice, raw scores are not used to calculate benchmark figures; instead Rasch-derived units called logits are used.
Several methods of interpolation were proposed. The most recent (used in estimating the 1999 Reading figures) involves a sophisticated Rasch analysis.
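The raw-score example above (a cut score of 10.5) can be illustrated with a simple linear interpolation. This is not the Rasch-based method used for the 1999 Reading figures, which the source does not detail; it is a sketch that assumes students with a whole-number score of k are spread evenly between k and k + 1 on a finer scale. The function name is hypothetical.

```python
import math
from collections import Counter


def interpolated_percent_at_or_above(raw_scores, cut):
    """Estimate the percentage of students at or above a fractional cut score.

    Assumption: students who scored k are spread evenly over [k, k + 1),
    so a cut of k + f counts the share (1 - f) of them as reaching it.
    """
    counts = Counter(raw_scores)
    whole = math.floor(cut)
    fraction = cut - whole
    clearly_above = sum(n for score, n in counts.items() if score > whole)
    on_boundary = counts.get(whole, 0)
    reached = clearly_above + (1.0 - fraction) * on_boundary
    return 100.0 * reached / len(raw_scores)


scores = [9, 10, 10, 11, 12]
print(interpolated_percent_at_or_above(scores, 10.5))  # 60.0
```

In this small example 80% of students scored 10 or more and 40% scored 11 or more, so a cut score halfway between the two gives 60%.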
Equating Tests From Year To Year Within Jurisdictions
Once a benchmark cut score has been calculated for a jurisdiction for a strand and a year (e.g. reading for 1999, for Tasmania), each jurisdiction equates successive test results using the Rasch Model. Thus the 1999 Year 3 reading benchmark cut score calculated in 1999 for Tasmania can be mapped onto Tasmania's Year 3 reading scales in 2000, 2001, 2002 etc.
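One common way to carry a cut score from one year's scale to the next is to link the two scales through items that appear in both tests. The sketch below assumes such common items exist and uses a simple mean shift of their difficulties; the source does not state which equating design jurisdictions use, so the method and the function names are assumptions for illustration.

```python
from statistics import mean


def linking_shift(base_difficulties, new_difficulties):
    """Mean difference in difficulty (new minus base) over common items.

    Both arguments map an item identifier to its difficulty in logits,
    each estimated on its own year's scale.
    """
    common = sorted(set(base_difficulties) & set(new_difficulties))
    if not common:
        raise ValueError("no common items to link the two scales")
    return mean([new_difficulties[i] - base_difficulties[i] for i in common])


def map_cut_score(cut_on_base_scale, shift):
    """Express a base-year cut score on the new year's scale."""
    return cut_on_base_scale + shift


# Hypothetical item difficulties for two successive years.
year_1999 = {"item_a": -1.0, "item_b": 0.0, "item_c": 1.0}
year_2000 = {"item_b": 0.5, "item_c": 1.5, "item_d": 2.0}
shift = linking_shift(year_1999, year_2000)
cut_2000 = map_cut_score(0.2, shift)
```

Because the Rasch Model places items and students on the same scale, the same shift that aligns the item difficulties also aligns the cut score.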