The Journal of Bone and Joint Surgery 81:1209-16 (1999)
© 1999 The Journal of Bone and Joint Surgery, Inc.

Stulberg Classification System for Evaluation of Legg-Calvé-Perthes Disease: Intra-Rater and Inter-Rater Reliability*

JEROEN G. NEYT, M.D.†, STUART L. WEINSTEIN, M.D.†, KEVIN F. SPRATT, PH.D.†, LORI DOLAN, R.N., M.A.†, JOSÉ MORCUENDE, M.D., PH.D.†, FREDERICK R. DIETZ, M.D.†, GREG GUYTON, M.D.†, ROBERT HART, M.D.†, MICHELLE STEVENS KRAUT, M.D.†, GREGORY LERVICK, M.D.†, PETER PARDUBSKY, M.D.† and ANDREA SATERBAK, M.D.†, IOWA CITY, IOWA

Investigation performed at the University of Iowa Hospitals and Clinics, Iowa City


    Abstract
Background: Researchers and clinicians commonly use the classification system of Stulberg et al. as a basis for treatment decisions during the active phase of Legg-Calvé-Perthes disease because of its putative utility as a predictor of long-term outcome. It is generally assumed that this system has an acceptable degree of reliability. This assumption, however, is not convincingly supported by the literature.
Methods: The purpose of the present study was to assess the inter-rater and intra-rater reliability of the classification system of Stulberg et al. with use of a pre-test, post-test design. During the pre-test phase, nine raters independently used the system to evaluate the radiographs of skeletally mature patients who had been managed for Legg-Calvé-Perthes disease. The intervention between the pre-test and post-test phases consisted of a consensus-building session during which all raters jointly arrived at standardized definitions of the various joint structures that are assessed with use of the classification system. The effect of these definitions on reliability then was assessed by reevaluating the radiographs during the post-test phase.
Results: The pre-test intra-rater reliability coefficients ranged from 0.709 to 0.915, and the post-test coefficients ranged from 0.568 to 0.874. The pre-test inter-rater reliability coefficients ranged from 0.603 to 0.732, and the post-test coefficients ranged from 0.648 to 0.744. Contributing to the variance was a lack of agreement concerning the assessment of joint structures and the way in which the raters translated these evaluations into a classification according to the system of Stulberg et al.
Conclusions: Although intra-rater reliability was marginally acceptable, the degree of variability between the classifications assigned by different raters—even after the intervention—calls into question the reliability of the system of Stulberg et al.; consequently, the validity of any treatment decisions, outcome evaluations, or epidemiological studies based on this system is also in question.


    Introduction
The classification system of Stulberg et al.17 is used to predict the onset of degenerative joint disease following Legg-Calvé-Perthes disease9 and has been used to validate the lateral pillar classification7,8. Stulberg et al., who first described the system in 1981, proposed that the radiographic appearance of the hip at maturity predicts the potential for, and the onset of, degenerative joint disease. In this system, several radiographic parameters (the sphericity of the femoral head, the length of the femoral neck, the slope of the acetabulum, and the presence of coxa magna) are evaluated and an algorithm is used to classify the hip into one of five categories. These categories, in turn, represent three types of congruency between the femoral head and the acetabulum: spherical congruency (classes I and II), aspherical congruency (classes III and IV), and aspherical incongruency (class V). Stulberg et al. presented evidence regarding the validity of the classification system (as reflected by the correspondence between the assigned ratings and the degree of osteoarthritis, the age at onset, the Mose rating12, epiphyseal involvement, subluxation, and functional limitations) and suggested that each class was associated with a "predictable future clinical and radiographic course," but they offered no discussion or evidence of the reproducibility of evaluations based on the system. Neither the definitions of the joint structures nor the algorithm is well defined. For example, no definition is given for a flat femoral head, an abnormally steep acetabulum, or a short femoral neck. Nevertheless, the system of Stulberg et al. has been used to determine the long-term prognosis for patients who have Legg-Calvé-Perthes disease and to evaluate the results of a number of alternative treatments1,3,9-11,13,18.

Despite the widespread use of the system of Stulberg et al.17, we are aware of only two reports concerning its reliability. Herring et al.7 estimated the intra-rater reliability to be 0.32 and the reliability of the average classification assigned by fifteen raters to be 0.87. Those authors advised that at least nine raters would be needed in order to achieve a consensus rating with a reliability of 0.80. However, they did not detail the types of coefficients that were calculated or the design that was used to generate the data for the estimations. That same work was cited by Herring et al. four years later8, but at that time the reliability was reported as 0.98. Martinez et al.11 calculated intraclass correlation coefficients and kappa statistics to estimate the reliability of several radiographic classification systems, including that of Stulberg et al., but stated only that the agreement was better than that expected by chance alone (p < 0.006). The magnitudes of the coefficients were not provided. Farsetti et al.4 reported the intraobserver agreement of five raters as 100 percent and the interobserver agreement as 98 percent. Again, no details were provided concerning these estimations. Those studies provided little evidence to bolster confidence in decisions based on this classification system. This lack of evidence calls into question the comparability of research results as well as the validity of treatment decisions, outcome evaluations, and epidemiological studies that are based on the system of Stulberg et al.
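The three figures quoted from Herring et al. (0.32, 0.87 for fifteen raters, and nine raters for 0.80) fit together if one assumes that they are related by the Spearman-Brown prophecy formula. That assumption is ours, since the cited report did not state how the coefficients were derived. The following minimal sketch treats 0.32 as the reliability of a single rating; the function names are hypothetical.

```python
import math


def spearman_brown(single_rating_reliability: float, n_raters: int) -> float:
    """Reliability of the average of n_raters ratings (Spearman-Brown)."""
    r = single_rating_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


def raters_needed(single_rating_reliability: float, target: float) -> int:
    """Smallest number of raters whose averaged rating reaches the target."""
    r = single_rating_reliability
    return math.ceil(target * (1 - r) / (r * (1 - target)))


# Assumption: 0.32 is a single-rating coefficient, as discussed above.
pooled_15 = spearman_brown(0.32, 15)   # about 0.876, near the reported 0.87
needed = raters_needed(0.32, 0.80)     # 9, matching the advice quoted above
```

Under this reading, the high pooled value says little about the clinical situation, in which one physician classifies a radiograph once.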

Because of these concerns, we believed that it was necessary to evaluate the reliability of the system of Stulberg et al.17 before using it as part of an ongoing long-term follow-up study initiated by McAndrew and the senior one of us (S. L. W.)10. The goals of the present study were (1) to estimate the inter-rater and intra-rater reliability of the system of Stulberg et al. before and after educational intervention and (2) to identify and quantify sources of nonrandom error.


    Materials and Methods

Experimental Design
A pre-test, post-test design was used. In the pre-test phase, the inter-rater and intra-rater reliability of the ratings suggested by Stulberg et al.17 and of the radiographic evaluations of the femoral head, femoral neck, and acetabulum were estimated. Given the lack of explicit criteria in the original article, we hypothesized that a consensus regarding the definitions and the algorithm would have a positive effect on reliability. Therefore, all of the raters attended a consensus-building session during which they jointly arrived at standardized definitions of the various joint structures used in the algorithm. This intervention was followed by two additional readings of the radiographs and calculation of the same reliability estimates. The design of the study allowed for the quantification and evaluation of several potential sources of inter-rater and intra-rater error, including differences in the experience levels of the raters; errors in the assessment of the head, neck, and acetabulum; and errors due to the assignment of an incorrect classification on the basis of the radiographic observations. In order to assess the generalizability of the definitions and the algorithm from one sample to another as well as to determine the validity of the reliability estimates, we compared the estimates for a set of radiographs that was evaluated during both the pre-test and the post-test phase with those for a second set that was evaluated during only the post-test phase.

Pre-Test Evaluation (Readings 1 and 2)
Before the pre-test evaluation, each rater was given a copy of the classification system to study and memorize. The raters understood that they would be asked to evaluate a set of radiographs with use of the system and that they would not be allowed to consult the reference material during the evaluation. Each rater assessed all hips independent of the other raters during a single session. A single femoral head gauge was used for all readings. After all of the raters had evaluated the radiographs once, the radiographs were reordered and renumbered in a random fashion. Seven to ten days later, the raters reevaluated the radiographs under the same conditions. These two readings (Readings 1 and 2) constituted the pre-test evaluation.
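The reordering and renumbering step between readings can be sketched as follows. This is an illustration only: the article does not describe how the random order was generated, and the function name, the seeds, and the use of twenty-three series (the size of Set A) are assumptions made for the example.

```python
import random


def renumber_radiographs(series_ids, seed=None):
    """Return a key mapping each new display number to an original series id.

    The key would be held by a coordinator who does not rate, so that raters
    see only the new numbers.
    """
    rng = random.Random(seed)          # seed only makes the example repeatable
    shuffled = list(series_ids)
    rng.shuffle(shuffled)
    return {new_number: series_id
            for new_number, series_id in enumerate(shuffled, start=1)}


# Hypothetical use: a fresh order for each of the two pre-test readings.
key_reading_1 = renumber_radiographs(range(1, 24), seed=1)
key_reading_2 = renumber_radiographs(range(1, 24), seed=2)
```

Keeping the key separate from the rating sheets is what allows the two readings of the same hip to be matched afterward for the intra-rater estimates.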

Intervention: Consensus-Building Session
All of the raters met for approximately ninety minutes. Using radiographs that were associated with a particularly low level of agreement during the pre-test evaluation, all participants discussed and agreed to specific definitions for abnormalities of the head, neck, and acetabulum and arrived at a consensus regarding the proper use of the algorithm (Fig. 1).



Fig. 1 Diagram illustrating the consensus algorithm used in the present study.

 

Consensus Definitions and Algorithm
The definitions provided by Stulberg et al.17 and those derived during the consensus-building session were summarized (Table I). It should be noted that the definitions and the algorithm that were derived during the consensus-building session were designed to allow for the evaluation of participants in our ongoing long-term study, which includes patients who have conditions (such as bilateral disease and minimum degenerative changes) that were not considered in the original article by Stulberg et al. It also should be noted that we included patients who had radiographs other than anteroposterior and frog-leg lateral views.


TABLE I DEFINITIONS FOR THE JOINT STRUCTURES INVOLVED IN THE CLASSIFICATION SYSTEM OF STULBERG ET AL.17

 

Post-Test Evaluation (Readings 3 and 4)
During the post-test evaluation, the same standardized rating protocol was used to assess the radiographs of the same twenty-three hips that had been classified during the pre-test phase (Set A) as well as the radiographs of twenty-two additional hips that had not yet been classified by the raters (Set B). The raters were instructed to memorize the consensus definitions (Table I) and the algorithm before participating in the post-test evaluation.

Rating Protocol
Data were collected with use of a standard rater-response sheet that required each rater to evaluate several items with use of published terminology17: (1) the sphericity of the femoral head (spherical, ovoid, mushroom-shaped, umbrella-shaped, or flat), (2) the length of the femoral neck (normal or abnormally short), (3) the steepness of the acetabulum (normal or abnormally steep), (4) the presence of coxa magna (yes or no), and (5) the classification according to the system of Stulberg et al.17 (class I, II, III, IV, or V).
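One way to picture the rater-response sheet is as a small validated record. The sketch below is not the form that was used in the study; the class name, field names, and identifiers are hypothetical, and only the allowed answer values are taken from the description above.

```python
from dataclasses import dataclass

HEAD_SHAPES = ("spherical", "ovoid", "mushroom-shaped", "umbrella-shaped", "flat")
NECK_LENGTHS = ("normal", "abnormally short")
ACETABULAR_SLOPES = ("normal", "abnormally steep")
STULBERG_CLASSES = ("I", "II", "III", "IV", "V")


@dataclass(frozen=True)
class RaterResponse:
    """One rater's answers for one hip during one reading (hypothetical layout)."""
    rater_id: str
    radiograph_number: int
    head_shape: str
    neck_length: str
    acetabular_slope: str
    coxa_magna: bool
    stulberg_class: str

    def __post_init__(self):
        # Reject any answer that is not one of the published terms.
        checks = (
            ("head_shape", self.head_shape, HEAD_SHAPES),
            ("neck_length", self.neck_length, NECK_LENGTHS),
            ("acetabular_slope", self.acetabular_slope, ACETABULAR_SLOPES),
            ("stulberg_class", self.stulberg_class, STULBERG_CLASSES),
        )
        for field_name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"{field_name}={value!r} is not one of {allowed}")


example = RaterResponse("R1", 7, "ovoid", "normal", "abnormally steep", False, "III")
```

Recording the four structure evaluations alongside the assigned class is what makes it possible to check later whether the class follows from the evaluations.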

Radiographs
From the sample described previously10, the senior one of us selected twenty-three patients for whom radiographs had been made at or after skeletal maturity. The average age of the selected patients was forty-one years (range, sixteen to sixty-four years) at the time that the radiographs were made. Each patient's series of radiographs consisted of an anteroposterior radiograph of the pelvis or an anteroposterior radiograph of the hip with or without a frog-leg or true lateral radiograph of the hip. When possible, the classification was based on an anteroposterior radiograph and a lateral radiograph; however, when both of these views were not available, the classification was based solely on an anteroposterior radiograph. In order to evaluate the effects of patient and radiographic characteristics on the reliability estimates, the radiographs also were classified as standard or nonstandard. Patients who had standard radiographs mirrored the patients in the sample described by Stulberg et al.17 in that they had no signs of osteoarthritis, had unilateral involvement, and had anteroposterior and frog-leg lateral radiographs available for analysis. Of the twenty-three series of radiographs in Set A, nine were classified as standard and fourteen were classified as nonstandard. Of the twenty-two series of radiographs in Set B, eight were classified as standard and fourteen were classified as nonstandard. The average age of the patients who had standard radiographs was forty years (range, twenty-eight to fifty-five years), and the average age of those who had nonstandard radiographs was forty-two years (range, sixteen to sixty-four years).

Identifying information was concealed, and the radiographs were randomly ordered and numbered within each of the two sets. The hip joint to be evaluated was identified with an X. All radiographs were handled by one of us (L. D.) who did not participate in the classifications.

Raters
Nine individuals representing three levels of experience (three staff pediatric orthopaedic surgeons, three senior residents, and three junior residents) participated in the classifications. None of the raters had participated in any aspect of the patients' clinical care.

Statistical Analyses
Reliability was estimated with use of generalizability coefficients5,16 that were generated with use of the GENOVA software package (version 2.2; American College Testing Program, Iowa City, Iowa). The mean square estimates used in the GENOVA program were computed with use of the GLM procedure in the SAS system15 (version 6.12 for OS/2; SAS Institute, Cary, North Carolina, 1995).
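For readers unfamiliar with generalizability coefficients, the sketch below shows a simplified one-facet version (hips crossed with raters) computed from mean squares. It is an assumption-laden illustration, not a reproduction of the GENOVA analysis: the actual design, the facets, and the choice between relative and absolute coefficients are not specified here, and the example scores are invented, with classes I to V coded 1 to 5.

```python
def g_coefficient(scores, n_raters_used=1):
    """Relative G coefficient for a hips-by-raters table of numeric scores.

    scores[i][j] is the score given to hip i by rater j. n_raters_used is the
    number of raters whose average would be relied on (1 = a single clinician).
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    row_means = [sum(row) / n_r for row in scores]
    col_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_hips = n_r * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n_p * sum((m - grand) ** 2 for m in col_means)
    ss_resid = max(ss_total - ss_hips - ss_raters, 0.0)

    ms_hips = ss_hips / (n_p - 1)
    ms_resid = ss_resid / ((n_p - 1) * (n_r - 1))

    var_hips = max((ms_hips - ms_resid) / n_r, 0.0)   # true between-hip variance
    error = ms_resid / n_raters_used                  # relative error variance
    return var_hips / (var_hips + error) if var_hips + error > 0 else 0.0


# Invented ratings: five hips, three raters.
toy_scores = [[1, 1, 2], [3, 3, 3], [4, 5, 4], [2, 2, 3], [5, 5, 5]]
single_rater = g_coefficient(toy_scores, n_raters_used=1)
three_raters = g_coefficient(toy_scores, n_raters_used=3)
```

Setting `n_raters_used=1` corresponds to the situation in which one clinician classifies a radiograph once; averaging over more raters raises the coefficient.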

Inter-rater and intra-rater reliability coefficients were calculated to simulate a clinical situation in which one clinician evaluates a radiograph once and then assigns a classification. The pre-test phase involved the calculation of inter-rater and intra-rater reliability coefficients for the overall scale as well as generalized unweighted kappa statistics2,6 for the evaluations of the head, neck, acetabulum, and coxa magna. As an alternative method of describing the degree of reproducibility, the adjusted percentage agreement was calculated for the overall scale as well as for the evaluations of the joint structures. The same estimates were calculated during the post-test phase.
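A multi-rater kappa for a nominal item such as neck length can be sketched as follows. We assume here that the generalized unweighted kappa takes the familiar Fleiss form; the article does not give the formula, and the ratings in the example are invented.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Chance-corrected agreement among several raters on a nominal item.

    ratings holds one list per hip, containing each rater's category label.
    Every hip must be rated by the same number of raters (at least two).
    Undefined (division by zero) if every rating falls in a single category.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    counts = [[Counter(row)[c] for c in categories] for row in ratings]

    # Overall proportion of ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_subjects * n_raters)
             for j in range(len(categories))]
    # Proportion of agreeing rater pairs for each hip.
    p_subj = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]

    p_observed = sum(p_subj) / n_subjects
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Invented example: three raters judging the femoral neck of four hips.
NECK = ("normal", "abnormally short")
toy = [["normal", "normal", "normal"],
       ["abnormally short", "abnormally short", "normal"],
       ["abnormally short", "abnormally short", "abnormally short"],
       ["normal", "abnormally short", "normal"]]
kappa_neck = fleiss_kappa(toy, NECK)
```

Because the statistic is unweighted, a disagreement between adjacent categories counts the same as one between distant categories, which suits two-level items such as neck length but discards ordering for the five head shapes.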

The number of misapplications of the algorithm (that is, the number of times that the algorithm dictated a classification that differed from the one that was actually assigned) was calculated as a percentage of the total number of readings during the pre-test and post-test phases, and these percentages were compared with use of repeated-measures analysis of variance for categorical variables. The observed classification distributions were calculated for each reading during the pre-test and post-test phases.


    Results

Classifications According to System of Stulberg et al.17

Pre-Test Evaluation
When the classifications that were assigned by all nine raters were considered, the inter-rater reliability coefficient was 0.843 (adjusted percentage agreement, 69 percent). The inter-rater coefficient was 0.665 for the group with a high level of experience, 0.732 for the group with a moderate level of experience, and 0.603 for the group with a low level of experience (Table II). The intra-rater coefficients within each subgroup (range, 0.709 to 0.915) were generally higher than the inter-rater coefficients, indicating that raters were more consistent with themselves than they were with the other individuals in their subgroup.


TABLE II SUMMARY OF INTER-RATER AND INTRA-RATER RELIABILITIES FOR THE PRE-TEST AND POST-TEST EVALUATIONS

 

Post-Test Evaluation
The effect of the consensus-building session was evaluated by comparing the pattern and magnitude of the coefficients calculated during the pre-test and post-test evaluations. In general, there was no notable difference in the pattern or magnitude of the coefficients: the differences in reliability between the two tests were not large in either direction, suggesting that the intervention was not particularly effective.

The reliability estimates associated with the nonstandard radiographs generally were lower than those associated with the standard radiographs. When the data from all nine raters were considered, the inter-rater reliability coefficient was 0.783 for the standard set compared with 0.602 for the nonstandard set. The intra-rater reliability ranged from 0.659 to 0.952 for the standard set and from 0.378 to 0.921 for the nonstandard set.

The inter-rater reliability coefficients for the Set-A radiographs (0.744, 0.648, and 0.669 for the groups with a high, moderate, and low level of experience, respectively) generally were higher than those for the Set-B radiographs (0.733, 0.646, and 0.619 for the groups with a high, moderate, and low level of experience, respectively). The intra-rater coefficients for the Set-A radiographs also tended to be higher than those for the Set-B radiographs. Again, these differences were small.

Radiographic Evaluations
Generalized unweighted kappa coefficients were used to estimate the consistency of the raters' evaluations of the head, the neck, the acetabulum, and the presence of coxa magna (Table III). When the data for all raters were considered, the pre-test reliability coefficients ranged from 0.570 (length of neck) to 0.787 (coxa magna). The post-test coefficients ranged from 0.559 (length of neck) to 0.758 (acetabular slope). No substantial change was noted between the pre-test and post-test evaluations of the femoral neck. An improvement in reliability was observed in association with the evaluations of the acetabulum in the groups with moderate and low levels of experience as well as in the overall group. A substantial decrease in reliability was noted in association with the evaluations of the femoral head in the group with a low level of experience as well as in association with the evaluations of coxa magna in the group with a high level of experience.


TABLE III GENERALIZED KAPPA ESTIMATES OF PRE-TEST AND POST-TEST INTER-RATER AGREEMENT FOR RADIOGRAPHIC EVALUATIONS (SET-A RADIOGRAPHS)

 
In only four instances were the kappa coefficients that were associated with the standard radiographs substantially higher than those that were associated with the nonstandard radiographs. Specifically, the coefficients that were associated with the standard radiographs were higher for the evaluations of the femoral head in the group with a low level of experience, for the evaluations of the femoral neck in the group with a moderate level of experience, and for the evaluations of coxa magna in the groups with moderate and low levels of experience.

Implementation of Algorithm
In order to provide an index of the raters' accuracy in applying the algorithm, the classification that was actually assigned was compared with the classification that should have been assigned on the basis of the evaluations of the joint structures. During the pre-test phase, thirty-six (9 percent) of the 414 assigned classifications did not match the classification that was dictated by the algorithm suggested by Stulberg et al.17. Nineteen (53 percent) of the thirty-six misapplications were instances in which the rater assigned class V but the algorithm dictated class IV. Thirty misapplications (83 percent) were due to an incorrect integration of the assessment of the acetabulum, the femoral neck, or the presence of coxa magna. Twenty-three misapplications (64 percent) were attributable to two raters who had a moderate level of experience. The differences ranged from less than 1 percent (class I, observed less often than predicted) to 4 percent (class IV, observed more often than predicted).

During the post-test phase, twenty-one (5 percent) of the 414 assigned classifications did not match the classification that was dictated by the consensus algorithm. Although the prevalence of misapplications during the post-test phase (5 percent) was lower than that during the pre-test phase (9 percent), this difference could not be shown to be significant, with the numbers available (p < 0.09). Fourteen (67 percent) of the twenty-one misapplications were instances in which the rater assigned class IV when the algorithm dictated class V, which was the reverse of the problem that was associated with these two classes during the pre-test phase. Seventeen misapplications (81 percent) were due to an incorrect integration of the assessment of the acetabulum, the femoral neck, or the presence of coxa magna. Eight misapplications (38 percent) were attributable to one rater who had a low level of experience. The differences ranged from 0 percent (class I) to 4 percent (class IV, observed more often than predicted).
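The counts and percentages reported for the misapplications can be checked with simple arithmetic. The sketch below assumes that the denominator of 414 is the product of the twenty-three Set-A hips, nine raters, and two readings per phase, which matches the design described earlier; the dictionary labels are ours.

```python
n_hips, n_raters, n_readings = 23, 9, 2
total_classifications = n_hips * n_raters * n_readings   # 414 per phase


def pct(part, whole):
    """Percentage rounded to the nearest whole number, as in the text."""
    return round(100 * part / whole)


pre_test = {
    "all misapplications": (36, total_classifications),
    "class V assigned, class IV dictated": (19, 36),
    "incorrect integration of structures": (30, 36),
    "two moderately experienced raters": (23, 36),
}
post_test = {
    "all misapplications": (21, total_classifications),
    "class IV assigned, class V dictated": (14, 21),
    "incorrect integration of structures": (17, 21),
    "one less experienced rater": (8, 21),
}

pre_pct = {label: pct(*pair) for label, pair in pre_test.items()}
post_pct = {label: pct(*pair) for label, pair in post_test.items()}
```

The subcategories overlap (a single misapplication can be counted under more than one heading), so the percentages within a phase are not expected to sum to 100.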

The largest differences in percentages across the four readings of the Set-A radiographs were for class III (52 percent for Reading 2 compared with 36 percent for Reading 3) and class IV (28 percent for Reading 2 compared with 39 percent for Reading 3) (Table IV). After the consensus-building session, raters tended to classify more hips as classes II and IV but fewer hips as classes I, III, and V.


TABLE IV DISTRIBUTION OF CLASSIFICATIONS17 ASSIGNED DURING EACH READING OF THE SET-A RADIOGRAPHS*

 


    Discussion
The results of the present study suggest that the system of Stulberg et al.17 is not a highly reliable tool for the evaluation of the radiographs of patients who are managed for Legg-Calvé-Perthes disease at or after skeletal maturity. There was marked inter-rater variance among the nine physicians regardless of their level of experience. Although the intra-rater reliability coefficients were higher, they still did not reach the magnitude expected for most of the parameters that were assessed. The intra-rater reliability coefficients that were estimated in the present study were higher than the coefficient of 0.32 that was reported by Herring et al.7 but, because of a lack of information, they cannot be compared with those estimated by Martinez et al.11. Furthermore, neither of the previous reports addressed reliability in the clinical situation, in which one physician makes an inference on the basis of a single evaluation of a radiograph. This criticism pertains not only to the studies just mentioned but also to most reliability studies that have been reported in the literature to date.

Sources of Error

Expertise
One might expect that experience in the treatment of Legg-Calvé-Perthes disease would have a positive effect on the reliability of this classification system. However, the staff physicians misapplied the algorithm a total of seventeen times, whereas the junior residents did so only fourteen times. A crude look at the reliability estimates shows that, although the staff physicians tended to be more consistent with themselves and within their subgroup, experience had no uniform effect on reliability. In many cases, the junior residents performed as well as or better than the more experienced raters. Moreover, the consensus-building session was not uniformly effective in improving reliability in any of the three subgroups of raters. There was evidence that knowledge obtained outside of the study may have come into play, as the staff physicians assigned classes I and V less frequently than did the residents during both the pre-test and the post-test phase.

Standard and Nonstandard Radiographs
Another source of variation stems from the characteristics of the patients and the radiographs. We evaluated and compared both standard and nonstandard radiographs in order to determine the usefulness of the classification system in the clinical setting, in which patients and radiographs are not standardized. We believed that the presence of bilateral disease or osteoarthritis or the lack of a second radiograph would not diminish the raters' ability to evaluate the joint structures reliably (except with regard to the presence of coxa magna). Contrary to these notions, the reliability estimates were higher for the radiographs of patients who met the criteria described by Stulberg et al.17. However, the overall inter-rater reliability coefficient of 0.783 and the adjusted percentage agreement of 68 percent for the standard radiographs indicate a marked variability in the assessments even under standard conditions. This was true not only for the assignment of the classifications but also for the evaluations of the joint structures.

Definitions and Algorithms
A primary source of disagreement in the current study stemmed from the ambiguity of the definitions of normality for the joint structures. This was reflected by the low reliability coefficients that were calculated during the pre-test phase and also was evident during the consensus-building session. The raters noted the ambiguity of the terms that were used in the original article, such as mushroom-shaped, abnormally steep, and abnormally short. It seems obvious that if no definition has been published (for example, for a mushroom or ovoid-shaped femoral head or for the range of acetabular angles that qualify as steep) then each researcher will make these judgments independently, with no guarantee of consistency among studies. Even the intervention described in the present study had little effect, as reflected by the fact that there was little appreciable change between the reliability coefficients or the percentage agreement observed during the pre-test phase, when raters used their own interpretations of the system of Stulberg et al.17, and those observed during the post-test phase, after a consensus had been reached. Specifically, there was no change in the level of inter-rater agreement for the evaluations of the femoral neck, and there was a decrease in the level of agreement for the evaluations of coxa magna in the group with a high level of experience as well as for the evaluations of the femoral head in the group with a low level of experience. Although there was an increase in the level of agreement for the evaluations of the acetabulum in the overall group as well as in the groups with moderate and low levels of experience, the pre-test coefficients for the latter two groups were relatively low. These findings suggest that either the consensus definitions were still not entirely clear or did not provide mutually exclusive categories, or the nature of the classification system itself does not allow for exclusive categorization.

As another method of gaining insight into the discrepancies that were observed, we assessed the agreement between the classifications that were assigned by the raters and those that were dictated by the algorithm. During the pre-test phase, there were thirty-six instances in which the classification that was assigned by the rater did not match the one that was dictated by the algorithm. There were fewer misapplications (twenty-one) during the post-test phase, but this difference could not be shown to be significant (p < 0.09), indicating that the time that was spent to reach a consensus and to diagram the algorithm was not beneficial in terms of the proper use of the algorithm. However, a constant finding across all tests was the discrepancy between the expected and observed ratings for class-IV and V hips. Twenty-three (64 percent) of the thirty-six misapplications during the pre-test phase and fourteen (67 percent) of the twenty-one misapplications during the post-test phase concerned these two classes.

Although the consensus-building session had little effect on reliability, it did have an effect on the distributions of the classifications. A comparison between the pre-test and post-test distributions showed substantial variation in the frequencies of all classes. Although there was a fair amount of variation during both testing periods, there is evidence that the definitions and the algorithm that were adopted during the consensus-building session had a direct influence on the classifications that were assigned during the post-test phase. These comparisons quantify the theoretically obvious notion that definitions have a great influence on the resulting classifications and also call into question comparisons across studies in which there has been no attempt at standardization.

These three sources of variance—differing definitions, differing algorithms, and inconsistency in the application of the algorithms—may, in and of themselves, be responsible for the varying distributions found in the literature. In published studies regarding the outcome of Legg-Calvé-Perthes disease after various treatment protocols, the prevalence of class-IV hips, for example, ranges from 6 percent (three of forty-eight) to 32 percent (thirty-two of ninety-nine) and the prevalence of class-V hips ranges from 0 percent (zero of ninety-three) to 19 percent (fourteen of seventy-two)1,3,8,9,17,18. Therefore, it is not unreasonable to believe that the differences in reported outcomes may be due not only to treatment but also to the strong influence of varying definitions and algorithms.

External Validity
The conditions that were imposed in the current study probably resulted in overestimations of true reliability and also raise issues concerning the degree to which these results can or should be generalized to other situations. The participants knew that their ratings would be evaluated and compared with those assigned by their peers; this may have produced a Hawthorne effect14, increasing the motivation and commitment of the raters. Conversely, as a large number of radiographs had to be assessed, the raters could have become tired or could have lost enthusiasm, especially during the post-test phase, which involved the evaluation of the radiographs of forty-five patients. In order to simulate a clinical situation and to assess interpretations of the original article17, the article was not available to the raters while they were evaluating the radiographs; nevertheless, because they were asked to review and memorize the system and because they were guided by the rating sheet, they probably were more knowledgeable about the system than the average orthopaedic surgeon is. The consensus definitions and the algorithm agreed to by the raters at our institution may not be the same as those agreed to by another group of physicians. The inconsistent pattern of differences between the coefficients associated with the Set-A and Set-B radiographs during the post-test phase demonstrates two points. First, there was no evidence that recall bias artificially inflated the coefficients associated with the Set-A radiographs. Second, neither set was associated with systematically higher coefficients, thereby ruling out the role of a specific set of radiographs in the overall evaluation of reliability. Therefore, several conditions in the present study may have improved the internal validity of the study at the expense of the external validity, or generalizability, of the results. 
It is our opinion that the reliability of the classification system may be even lower in clinical practice.

Future Directions
Because of their sensitivity to differing definitions and algorithms, the ratings described by Stulberg et al.17 are likely to be incomparable across physicians and studies when used to indicate intermediate outcome or to predict long-term outcome. Given this lack of comparability, we wonder whether the system is useful. If it is, then additional research is needed in order to refine and improve the clinical usefulness of the system or perhaps to construct a more precise system using the coincident concepts of sphericity and congruence.

The lesson of the present study is not necessarily that the classification system of Stulberg et al.17 is unreliable but that years of assuming that it was reliable may have misled investigators when reporting results or making treatment decisions. To avoid being misled in the future, researchers should remember that reliability must be assessed before any measure is recommended for widespread use and must be reevaluated before the measure is used for new applications and new populations.
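As an illustration of what assessing reliability can involve, the sketch below computes the unweighted kappa coefficient of Cohen2 for two raters who each assign a class of I through V (coded 1 through 5) to the same hips. The ratings are hypothetical and are not data from the present study. The unweighted coefficient treats the classes as nominal categories and is shown only as a simple example of a chance-corrected agreement statistic; it should not be assumed to be the coefficient reported in this article.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters who rated the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(ratings_a)

    # Observed proportion of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each rater's marginal class frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical class; agreement is trivially complete.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical classes assigned by two raters to the same ten hips.
rater_1 = [1, 2, 2, 3, 4, 5, 3, 2, 1, 4]
rater_2 = [1, 2, 3, 3, 4, 4, 3, 2, 2, 4]

kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

In this hypothetical example the raters agree on seven of the ten hips, but the coefficient is lower than 0.70 because part of that agreement would be expected by chance alone.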


    Footnotes
 
*No benefits in any form have been received or will be received from a commercial party related directly or indirectly to the subject of this article. No funds were received in support of this study.

{dagger}Department of Orthopaedic Surgery, University of Iowa Hospitals and Clinics, 200 Hawkins Drive, Iowa City, Iowa 52242-1009. E-mail address for Dr. Weinstein: stuart-weinstein@uiowa.edu.


    References

  1. Coates, C. J.; Paterson, J. M. H.; Woods, K. R.; Catterall, A.; and Fixsen, J. A.: Femoral osteotomy in Perthes' disease. Results at maturity. J. Bone and Joint Surg., 72-B(4): 581-585, 1990.
  2. Cohen, J. A.: A coefficient of agreement for nominal scales. Educat. and Psychol. Measure., 20: 37-46, 1960.
  3. Cooperman, D. R., and Stulberg, S. D.: Ambulatory containment treatment in Perthes' disease. Clin. Orthop., 203: 289-300, 1986.
  4. Farsetti, P.; Tudisco, C.; Caterini, R.; Potenza, V.; and Ippolito, E.: The Herring lateral pillar classification for prognosis in Perthes disease. Late results in 49 patients treated conservatively. J. Bone and Joint Surg., 77-B(5): 739-742, 1995.
  5. Feldt, L. S., and Brennan, R. L.: Reliability in generalizability theory. In Educational Measurement. Part I: Theory and General Principles, edited by R. L. Linn. Ed. 3, pp. 127-141. Phoenix, The Oryx Press, 1993.
  6. Fleiss, J. L.: Measuring nominal scale agreement among many raters. Psychol. Bull., 76: 378-382, 1971.
  7. Herring, J. A.; Hair, M.; Short, D.; Browne, R.; and The Legg-Perthes Study Group: Comparison of a computerized system of analysis (Gross-Harry) of Legg-Perthes radiographs with radiographic measurements by physicians. In Behavior of the Growth Plate, pp. 393-400. Edited by H. K. Uhthoff and J. J. Wiley. New York, Raven Press, 1988.
  8. Herring, J. A.; Neustadt, J. B.; Williams, J. J.; Early, J. S.; and Browne, R. H.: The lateral pillar classification of Legg-Calvé-Perthes disease. J. Pediat. Orthop., 12: 143-150, 1992.
  9. Ippolito, E.; Tudisco, C.; and Farsetti, P.: The long-term prognosis of unilateral Perthes' disease. J. Bone and Joint Surg., 69-B(2): 243-250, 1987.
  10. McAndrew, M. P., and Weinstein, S. L.: A long-term follow-up of Legg-Calvé-Perthes disease. J. Bone and Joint Surg., 66-A: 860-869, July 1984.
  11. Martinez, A. G.; Weinstein, S. L.; and Dietz, F. R.: The weight-bearing abduction brace for the treatment of Legg-Perthes disease. J. Bone and Joint Surg., 74-A: 12-21, Jan. 1992.
  12. Mose, K.: Methods of measuring in Legg-Calvé-Perthes disease with special regard to the prognosis. Clin. Orthop., 150: 103-109, 1980.
  13. Ritterbusch, J. F.; Shantharam, S. S.; and Gelinas, C.: Comparison of lateral pillar classification and Catterall classification of Legg-Calvé-Perthes' disease. J. Pediat. Orthop., 13: 200-202, 1993.
  14. Roethlisberger, F. J.; Dickson, W. J.; and Wright, H. A.: Management and the Worker. An Account of a Research Program Conducted by the Western Electric Company, Hawthorne Works, Chicago. Cambridge, Harvard University Press, 1966.
  15. Sharp, I. K.: Acetabular dysplasia. The acetabular angle. J. Bone and Joint Surg., 43-B(2): 268-272, 1961.
  16. Streiner, D. L., and Norman, G. R.: Generalizability Theory. Health Measurement Scales: A Practical Guide to Their Development and Use. Ed. 2, pp. 128-143. New York, Oxford University Press, 1995.
  17. Stulberg, S. D.; Cooperman, D. R.; and Wallensten, R.: The natural history of Legg-Calvé-Perthes disease. J. Bone and Joint Surg., 63-A: 1095-1108, Sept. 1981.
  18. Wang, L.; Bowen, J. R.; Puniak, M. A.; Guille, J. T.; and Glutting, J.: An evaluation of various methods of treatment for Legg-Calvé-Perthes disease. Clin. Orthop., 314: 225-233, 1995.



This article has been cited by other articles:

M. Ramachandran, K. Ward, R. R. Brown, C. F. Munns, C. T. Cowell, and D. G. Little
Intravenous Bisphosphonate Therapy for Traumatic Osteonecrosis of the Femoral Head in Adolescents
J. Bone Joint Surg. Am., August 1, 2007; 89(8): 1727-1734.

L. Barker, J. Anderson, R. Chesnut, G. Nesbit, T. Tjauw, and R. Hart
Reliability and Reproducibility of Dens Fracture Classification with Use of Plain Radiography and Reformatted Computer-Aided Tomography
J. Bone Joint Surg. Am., January 1, 2006; 88(1): 106-112.

J. A. Herring, H. T. Kim, and R. Browne
Legg-Calvé-Perthes Disease. Part I: Classification of Radiographs with Use of the Modified Lateral Pillar and Stulberg Classifications
J. Bone Joint Surg. Am., October 1, 2004; 86(10): 2103-2120.

D. S. Bae, P. M. Waters, and D. Zurakowski
Reliability of Three Classification Systems Measuring Active Motion in Brachial Plexus Birth Palsy
J. Bone Joint Surg. Am., September 1, 2003; 85(9): 1733-1738.

M. J. Berst, L. Dolan, M. M. Bogdanowicz, M. A. Stevens, S. Chow, and E. A. Brandser
Effect of Knowledge of Chronologic Age on the Variability of Pediatric Bone Age Determined Using the Greulich and Pyle Standards
Am. J. Roentgenol., February 1, 2001; 176(2): 507-510.
