NLBCT Statistical Report

FR-08-99

September 29, 2008

 

NFB NLBCT Braille Test: Pilot Test Results

Gordon W. Waugh

Prepared for:
National Federation of the Blind
1800 Johnson Street
Baltimore, MD 21230

 

*** Note: Tables and appendices have been removed from this version of the report in order to protect the integrity of the test ***

 

Introduction

In the Fall of 2005, the National Federation of the Blind (NFB) took over development of the new National Literary Braille Competency Test (NLBCT) from the National Library Service within the Library of Congress (LOC/NLS). As designed by the LOC/NLS, the NLBCT consisted of four sections: slate & stylus, braillewriter, proofreading, and multiple choice. Each section had a separate passing score and to pass the test, candidates were required to pass each section. Each section also had a somewhat generous time limit. Three of the sections, the slate & stylus, braillewriter, and proofreading, were open-book; candidates were provided with the current version of English Braille American Edition. The multiple-choice section was closed book. The test is available in three formats: print, grade 1 Braille, and audio tape.

 

Most of the development had already been completed. LOC/NLS stopped development during the pilot testing stage. If the test, in its current form, met the goals of the NFB then little or no test development would be required. Therefore, a kickoff meeting was held to discuss the purpose and goals of the NFB’s testing program and how the NLBCT might need to be modified accordingly. The two-day meeting was held at NFB headquarters in Baltimore in November of 2005.

 

Modifications

At the kickoff meeting, a purpose statement for the test was drafted. The NFB agreed to keep the original format of four sections with separate passing scores. In addition to keeping the proofreading and braille writing sections open-book, they also made the multiple-choice section open-book. Plans were also made to review the following materials and make the appropriate modifications:


* test blueprint
* scoring protocol
* passages for the slate, braillewriter, and proofreading sections
* multiple-choice items
* test instructions for each section
* candidate guide
* test administrator guide

 

HumRRO held a teleconference with several subject matter experts (SMEs) to discuss how the test blueprint should be modified. A handful of minor changes were made. Some specific braille rules covered by the original blueprint were dropped because they were considered either irrelevant or too advanced for NFB’s goals. In addition, one blueprint dimension (braille technology) was dropped. After the teleconference, SMEs reviewed the existing test content to make sure that it still reflected the test blueprint. No major content changes were required. The revised blueprint is provided in Appendix A.

 

A second teleconference was held to discuss how the scoring protocol for the passages should be modified. The scoring protocol consists of the rules for scoring the passages. For example, it states that only one point should be deducted when a candidate incorrectly brailles the same word several times in the same passage. Most of the discussion of the scoring protocol involved duplicate errors. The participants also discussed how candidates should mark errors in the proofreading passages. The final scoring protocol is in Appendix B.

 

A third teleconference was held to discuss how the test instructions should be modified. No major changes were made. The revisions consisted of making the instructions clearer. They included the following main topics: (a) marking errors in proofreading passages, (b) hyphenation, (c) line length, (d) paragraph formatting, and (e) how errors will be counted. The candidates’ instructions for the braillewriter and proofreading sections are in Appendices C and D, respectively.

 

Pilot Test Administration

Test forms were put together for an operational pilot test. The purpose of the pilot test was to evaluate the test content, supplemental materials, and testing procedures. In an operational pilot test, the scores count. This helps to make the pilot test similar to actual testing conditions and ensures that the pilot test participants are motivated to study for the test and do their best during the test administration—just like real candidates.

 

Each section of the test (slate, braillewriter, proofreading, and multiple-choice) had three different forms (A, B, and C). Each candidate completed one form for each of the first three sections, but because each multiple-choice form was brief (38 items) and each item appeared on only one form, each candidate completed two multiple-choice forms. The combination of forms was counterbalanced as much as possible. The counterbalancing design is in Appendix E.

 

NFB administered the pilot test to 48 people between February 26 and March 3, 2008. The test was administered in three locations: Anaheim, Minneapolis, and Elkins Park, Pennsylvania. Participants took the sections in the following order: Braillewriter, Slate, Proofreading, and Multiple Choice. The Braillewriter and Slate were administered in the morning; the Proofreading and Multiple-Choice were administered in the afternoon. The time limits for the morning session were 1 hour for the slate and stylus section and 2 hours for the braillewriter; the afternoon time limits were two hours each for the proofreading and multiple-choice sections. The morning session was followed by a one-hour lunch break. This was an open-book exam. The test administrator provided candidates with a copy of the 1994 edition of the official braille handbook published by the Library of Congress: Literary Braille English Edition. Participants were asked to complete two forms of the multiple-choice test so that we could obtain statistics on as many items as possible. Before coming to the test administration, participants completed a survey of their braille experience, education, training, and teaching.

 

Cut Score Workshop

Seven SMEs participated in the cut-score workshop held at NFB headquarters Aug 7-8, 2006. All were very experienced as either braille users or braille teachers. Two different cut score procedures were used. For the multiple-choice items, the Angoff method was used. For the other test sections, the bookmark method was used.

 

Before the workshop, test developers evaluated the item statistics of the multiple-choice items. Two item statistics of interest are the difficulty level (percent of candidates who answer the item correctly) and the item-total correlation. The difficulty level should be reasonable; candidates have a 25% chance of guessing correctly, so an item with a difficulty level near 25% is too hard. Items that are so easy that all candidates respond correctly do not provide reliable information about candidates’ ability. The item-total correlation is the strength of the relationship between whether or not the candidate got the item correct and his or her total score. A correlation value can range from -1 to 1. A good item has a positive item-total correlation; that is, people who responded correctly tended to do well overall. Items with very poor statistics were dropped. These were mostly items that were much too easy: all pilot participants, or all but one participant, got these easy items correct. Several items with negative or near-zero item-total correlations were also dropped. A zero item-total correlation indicates that high-scoring (on the multiple-choice form as a whole) and low-scoring pilot participants performed equally well on the item.
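The two item statistics described above can be sketched in a few lines of Python. This is only an illustration, not the analysis code used for the report: the function names and the tiny response matrix are hypothetical, and the item-total correlation is computed against the full total score (including the item itself), as the text describes.

```python
from statistics import mean, pstdev

def item_difficulty(item_scores):
    """Proportion of candidates who answered the item correctly (scores are 0/1)."""
    return mean(item_scores)

def pearson(x, y):
    """Pearson correlation; returns None when either variable has no variation."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return None  # e.g. an item that every candidate answered correctly
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)

def item_total_correlation(responses, item_index):
    """Correlation between one item's 0/1 scores and candidates' total scores."""
    item = [row[item_index] for row in responses]
    totals = [sum(row) for row in responses]
    return pearson(item, totals)

# Hypothetical data: 4 candidates (rows) by 3 items (columns), 1 = correct.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
]
for j in range(3):
    column = [row[j] for row in responses]
    print(j, item_difficulty(column), item_total_correlation(responses, j))
```

In this made-up example the first item is answered correctly by everyone, so its difficulty is 1.0 and its item-total correlation is undefined, which is the situation the report describes for items that are too easy to be informative.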

 

Before the Angoff procedure was done, the SMEs discussed the minimal competency statement to ensure that everyone would be using the same standard of competence. Then HumRRO staff trained the SMEs on using the Angoff method. The Angoff method requires SMEs to make ratings for each item. The steps below were followed:

  1. Discussion of minimal competency.
  2. Description of the Angoff method.
  3. SMEs read the first multiple-choice test item.
  4. SMEs made their first Angoff rating by answering the question, “If you had a room full of minimally competent examinees, how many of them would get this item correct?”
  5. SMEs wrote down their rating.
  6. Each SME announced their rating to the group.
  7. SMEs with relatively low or relatively high ratings explained why they gave the rating they did.
  8. HumRRO provided the SMEs with pilot test statistics for that item. The purpose of this step is to give the SMEs a reality check. For example, if only 50% of the pilot participants got the item correct but the SMEs want to give it a rating of 90%, then the SMEs will likely conclude that their ratings are too high.
  9. The SMEs further discussed their ratings.
  10. The SMEs made a second rating. There was no further discussion. This was the final rating for each SME.
  11. The Angoff rating for the item is the average of the seven SMEs’ ratings. The Angoff ratings were then stored in the item database.
  12. The cut score for any test form is simply the sum of Angoff ratings (where each rating is expressed as a proportion out of 1).
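The arithmetic in the last two steps can be illustrated with a short Python sketch. The ratings below are invented for illustration only; the function names are hypothetical and do not come from the report.

```python
from statistics import mean

def angoff_item_rating(sme_ratings):
    """Average of the SMEs' final ratings for one item (each a proportion, 0 to 1)."""
    return mean(sme_ratings)

def angoff_cut_score(form_ratings):
    """Cut score for a form: the sum of its items' Angoff ratings."""
    return sum(angoff_item_rating(ratings) for ratings in form_ratings)

# Hypothetical final ratings from seven SMEs for a three-item form.
form_ratings = [
    [0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50],
    [0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75],
    [1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
]
print(angoff_cut_score(form_ratings))  # 0.50 + 0.75 + 1.00 = 2.25
```

Because each rating is the expected proportion of minimally competent examinees answering correctly, the sum is the expected number-correct score of a minimally competent examinee on that form.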

 

During this Angoff procedure, two other activities were completed for each item. First, the SMEs reviewed the statistics for several items. The item statistics showed that some items were clearly unacceptable and some items were clearly good items. Thus, the review was limited to the remaining items. Second, SMEs performed content validity ratings. For each retained item, they rated the importance of the knowledge tapped by the item. This was rated on a four-point scale from “Not important” to “Extremely important.” All items, except one, had a mean rating (across the seven SMEs) of at least 3 which represents “moderately important.” Item 6027025 had a mean importance rating of 2.9. No items were dropped because of low content validity ratings.

 

The cut scores for the other three sections of the test were set using the bookmark method. A separate cut score was set for each of these sections for each of the three test forms. Thus, the bookmark process was done nine times (3 forms x 3 sections per form). In the bookmark method, the SME is given a pile of candidates’ completed test sections. The pile is sorted by section score. Because the score on these three sections is the number of errors, lower scores are better than higher scores. The SMEs’ task was to place a bookmark in the pile at the score that separates the competent candidates from the not competent candidates.

 

The SMEs were given not only the completed sections but also the scorer’s list of errors for each candidate. That simplified the SMEs’ judgment task. When all SMEs had placed their first bookmark, they announced the location of their bookmark to the other SMEs. During the discussion, the SMEs explained their reasoning for their bookmark placement. After the discussion, the SMEs were given the opportunity to move their bookmarks.

 

The SMEs’ final bookmarks were used to determine the cut score for a section. Each SME’s final bookmark was placed between two candidates. The median bookmark placement, among all the SMEs, was determined. A candidate’s score was on either side of this bookmark. The official passing score was the average of these two candidates’ scores. This score was rounded down to the nearest whole number. This procedure was followed for each section within each test form. The final passing scores are shown in Table 1 below. The passing scores were similar across test forms. The one exception was for the proofreading section. The SMEs judged Form A to be more difficult than the other two forms.
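The passing-score rule described above (median bookmark, average of the two adjacent candidates' scores, rounded down) can be sketched as follows. This is an illustration under stated assumptions, not the procedure's actual implementation: a bookmark is represented here as a position p meaning "between the candidates at 0-based positions p - 1 and p in the sorted pile", `median_low` is used so that an even number of SMEs would still yield an actual position, and the error counts and bookmark positions are invented.

```python
import math
from statistics import median_low

def bookmark_cut_score(sorted_error_counts, bookmark_positions):
    """Passing score from SMEs' final bookmarks.

    sorted_error_counts: candidates' error counts, sorted from best (fewest) to worst.
    bookmark_positions: one position per SME; p means the bookmark sits between
    the candidates at indexes p - 1 and p (an assumed encoding).
    """
    p = median_low(bookmark_positions)
    if not 1 <= p <= len(sorted_error_counts) - 1:
        raise ValueError("median bookmark must fall between two candidates")
    before = sorted_error_counts[p - 1]
    after = sorted_error_counts[p]
    # Average the two adjacent scores, then round down to a whole number.
    return math.floor((before + after) / 2)

# Hypothetical pile of eight candidates and seven SME bookmarks.
errors = [0, 1, 2, 4, 5, 7, 9, 12]
bookmarks = [3, 4, 4, 5, 4, 3, 5]
print(bookmark_cut_score(errors, bookmarks))  # median position 4: floor((4 + 5) / 2) = 4
```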

 

Pilot Test Results

Forty-eight candidates completed the pilot test. However, four candidates had extremely poor scores, and they were much worse than the other candidates. Therefore, all analyses were based on the remaining 44 candidates. Appendix F shows the summary statistics for the participants’ background survey. The average completion times, in minutes, were 94, 50, 75, 50 for the slate, braillewriter, proofreading, and multiple-choice sections, respectively. The mean duration for the entire test was 266 minutes with a minimum of 150 and a maximum of 390 (i.e., 6 hours and 30 minutes). These times are somewhat misleading because some candidates likely took the entire allotted time by checking their work thoroughly after finishing early. The descriptive statistics for each section’s scores are shown in Table 2 below.


Table 1. Pilot Test Passing Scores
*** Table has been removed ***

 

The main purpose of the analyses was to evaluate the reliability and construct validity of the four sections of the test. A test is reliable when it measures something consistently. A test is valid when it measures what you want it to measure. For all sections of the test, evidence of reliability and construct validity was assessed by examining the correlations between the section scores. For the multiple-choice section, further evidence of reliability was assessed by computing an estimate of internal consistency reliability. Finally, the quality of the multiple-choice items was evaluated by computing item difficulty (i.e., proportion of candidates answering the item correctly) and the item-total correlation (i.e., the strength of the relationship between the score on the item and the total score).

 

Multiple-Choice Items

For each multiple-choice item, the item difficulty and item-total correlation were computed. These statistics are shown in Appendix G. Several option statistics were computed as well. These are not shown in the report but are contained in the MS Access item database and are available upon request.

 

Overall, the items were very easy for the candidates. The average item difficulty was .89 (which means that, on average, 89% of the candidates answered an item correctly). Half of the items had a difficulty above .93. We decided to drop the 45 items with a difficulty above .95 because items this easy cannot reliably measure candidates’ ability. We dropped an additional 23 items because their low item-total correlations were evidence that these items were not measuring the same thing as the other items. Thus, the numbers of scored items in Forms A, B, and C were 9, 21, and 25, respectively. The candidates’ total multiple-choice scores were based only on these items.
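The screening rule described above can be sketched as a simple filter. The .95 difficulty ceiling comes from the text; the item-total correlation floor of 0.10 is purely an assumption for illustration, because the report does not state the threshold that was used. The item IDs and statistics are invented.

```python
def screen_items(item_stats, max_difficulty=0.95, min_item_total=0.10):
    """Split items into kept and dropped lists.

    item_stats maps item ID -> (difficulty, item_total_correlation).
    min_item_total is a hypothetical threshold, not a value from the report.
    """
    kept, dropped = [], []
    for item_id, (difficulty, item_total) in item_stats.items():
        if difficulty > max_difficulty:
            dropped.append((item_id, "too easy"))
        elif item_total is None or item_total < min_item_total:
            dropped.append((item_id, "low item-total correlation"))
        else:
            kept.append(item_id)
    return kept, dropped

# Hypothetical item statistics.
stats = {
    "item_1": (0.98, 0.20),
    "item_2": (0.80, 0.02),
    "item_3": (0.85, 0.35),
}
kept, dropped = screen_items(stats)
print(kept)
print(dropped)
```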

 

Test Sections

The correlations among the test sections are shown in Table 3 below. The correlation between the multiple-choice and proofreading sections is relatively high at r = .60. All other correlations are moderate. The next highest correlation is between the slate and braillewriter sections at r = .46. The braillewriter and slate sections involve writing braille whereas the other two sections involve reading braille. Thus, these two correlations make sense. The between-section correlations are small enough to conclude that the four sections are not redundant. They are very likely measuring different aspects of braille ability. Overall, the moderate correlations are good evidence of the construct validity of all four sections of the test.

 

Table 2. Section Score Statistics
*** Table has been removed ***
Table 3. Correlations Between Sections
*** Table has been removed ***

 

Test Forms

Because candidates took only one form of each section (except for multiple-choice), we cannot directly compute correlations between forms. However, we can compute the correlation between each form within a section and the scores on the other sections. Table 4 shows these correlations. If all three forms are measuring the same thing, then their correlations with the other sections should be about the same. Because of the small sample sizes, these correlations should be considered suggestive rather than definitive, particularly when you consider that different correlations involve different candidates.

 

It looks like Slate Form B correlates less with the other sections than do Slate A or C. Thus, Form B might be measuring something slightly different. Because of the small sample size, it’s difficult to draw any conclusions about the other sections.

 

Table 4. Correlations With Forms
*** Table has been removed ***

 

Internal consistency reliability was estimated using coefficient alpha for each multiple-choice form. The reliability estimates were .71, .91, and .73 for Forms A, B, and C, respectively. We also computed Livingston’s index of criterion-referenced reliability, which estimates the reliability of the pass/fail decision. Livingston’s index was .79, .91, and .74 for Forms A, B, and C, respectively.
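For readers unfamiliar with these two indices, the sketch below shows one common way to compute them. It uses the standard formulas, alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores) and Livingston's index = (alpha x variance + (mean - cut)^2) / (variance + (mean - cut)^2), with population variances throughout. Whether the report's analysis used exactly these variance conventions is an assumption, and the response data and function names are invented.

```python
from statistics import mean, pvariance

def coefficient_alpha(responses):
    """Coefficient alpha for a candidates-by-items matrix of item scores."""
    k = len(responses[0])
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def livingston_index(responses, cut_score):
    """Livingston's criterion-referenced reliability relative to a cut score."""
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)
    squared_distance = (mean(totals) - cut_score) ** 2
    alpha = coefficient_alpha(responses)
    return (alpha * total_var + squared_distance) / (total_var + squared_distance)

# Hypothetical data: 4 candidates by 3 items, 1 = correct.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(coefficient_alpha(responses))        # 0.75 for this made-up matrix
print(livingston_index(responses, 1.5))    # equals alpha when the cut score is the mean
print(livingston_index(responses, 2.5))    # larger when the cut score is far from the mean
```

The design point worth noting is that Livingston's index equals alpha when the cut score sits at the mean total score and rises as the cut score moves away from the mean, which is why it can exceed alpha for the same form.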

 

Conclusions and Recommendations

Overall, all of the forms of the slate, braillewriter, and proofreading sections of the test appear to have acceptable reliability and construct validity. They correlate with each other and with the multiple-choice section. It would be useful to collect data as the test is administered to obtain more stable correlations between forms. In particular, the correlations involving Slate Form B should be observed to see if it might be measuring something slightly different from the other Slate forms.

 

The bulk of the items in the multiple-choice forms are much too easy. When these items were developed for the NLBCT, it was assumed that the test would be closed-book. Thus, the high mean score is not surprising. We suggest that NFB consider making the multiple-choice section of the test closed-book. This would make it more difficult and more reliable. In contrast, the difficulty of the other sections appears to be at the appropriate level.

 

The scoring protocol (see Appendix B) for the slate, braillewriter, and proofreading sections of the test is fairly complex. Therefore, we suggest that the scorers be carefully trained. We also suggest that two scorers be used to increase the reliability of the scoring. Data should be collected and analyzed to (a) help improve the scoring protocol, (b) assess the reliability of the human scoring, and (c) determine which scorers might need additional training.

 

*** All appendices have been removed ***