Review Article

Statistical Methods: Reliability Assessment and Method Comparison

The Ewha Medical Journal 2017;40(1):9-16. Published online: January 31, 2017

Clinical Trial Center, Ewha Womans University Mokdong Hospital, Seoul, Korea.

Corresponding author: Kyoung Ae Kong. Clinical Trial Center, Ewha Womans University Medical Center, 1071 Anyangcheon-ro, Yangcheon-gu, Seoul 07985, Korea. Tel: 82-2-2650-2069, Fax: 82-2-2650-6141, kkong@ewha.ac.kr
• Received: December 29, 2016   • Accepted: January 4, 2017

Copyright © 2017. Ewha Womans University School of Medicine

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

The reliability of clinical measurements is critical to medical research and clinical practice. Newly proposed methods are assessed in terms of their reliability, which includes repeatability and intra- and interobserver reproducibility. In general, new methods that provide repeatable and reproducible results are then compared with established methods in clinical use. This paper describes common statistical methods for assessing reliability and agreement between methods, including the intraclass correlation coefficient, coefficient of variation, Bland-Altman plot, limits of agreement, percent agreement, and the kappa statistic. These methods are more appropriate for estimating reliability than hypothesis testing or simple correlation. However, some reliability indices, especially unscaled ones, do not clearly define the acceptable level of error in the actual size and units of the measurement. The Bland-Altman plot is more useful for method comparison studies because it assesses the relationship between the differences and the magnitude of paired measurements, the bias (as the mean difference), and the degree of agreement (as the limits of agreement) between two methods or conditions (e.g., observers). Caution is needed when handling heteroscedasticity of the differences between two measurements, when using the means of repeated measurements per method in method comparison studies, and when comparing reliability across studies. Additionally, independence of the measuring processes, the combined use of different types of estimates, clear descriptions of the calculations used to produce indices, and clinical acceptability should be emphasized when assessing reliability and conducting method comparison studies.
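As an illustration of the bias and limits of agreement mentioned above, the following is a minimal sketch rather than the article's own calculation. It assumes the conventional definition (mean difference ± 1.96 × the sample standard deviation of the paired differences) and uses made-up readings; the function name `bland_altman` and the data are hypothetical.

```python
from statistics import mean, stdev

def bland_altman(a, b):
    """Return (bias, lower_loa, upper_loa) for paired measurements a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long sequences with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)   # mean difference between the two methods
    sd = stdev(diffs)    # sample SD of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired readings (illustrative only, not data from the article)
method_a = [10.0, 12.0, 14.0, 16.0, 18.0]
method_b = [9.0, 12.0, 13.0, 17.0, 16.0]

bias, lower, upper = bland_altman(method_a, method_b)
print(f"bias = {bias:.2f}, limits of agreement = ({lower:.2f}, {upper:.2f})")
```

Whether limits of this width are acceptable is a clinical judgment, not a statistical one, which is why the result is reported in the original measurement units.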
Fig. 1

Intraclass correlation coefficient and Pearson's correlation coefficient as indices for intra- or interobserver reliability. ICC, intraclass correlation coefficient; correlation coefficient, Pearson's correlation coefficient.

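The contrast that Fig. 1 draws between the two indices can be sketched numerically. The example below assumes a one-way random-effects ICC, often written ICC(1,1); the article may use a different ICC form, so treat this choice as an assumption. The data are hypothetical: observer B reads exactly 5 units higher than observer A, so Pearson's correlation is perfect while the ICC is low.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    ratings: list of per-subject lists, each holding k measurements.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    # Between-subject and within-subject mean squares from one-way ANOVA
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((v - m) ** 2
              for row, m in zip(ratings, row_means)
              for v in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical data: observer B is systematically 5 units above observer A
obs_a = [10.0, 12.0, 14.0, 16.0, 18.0]
obs_b = [a + 5.0 for a in obs_a]

r = pearson_r(obs_a, obs_b)
icc = icc_oneway([[a, b] for a, b in zip(obs_a, obs_b)])
print(f"Pearson r = {r:.3f}, ICC(1,1) = {icc:.3f}")
```

The systematic offset leaves the correlation untouched but is counted as disagreement by the ICC, which is the reason simple correlation is a poor reliability index.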
Fig. 2

Graphical presentation of agreement. A case where the difference increases with the magnitude of the measurements.

Fig. 3

Graphical presentation of agreement. A case where the variability of the differences increases with the magnitude of the measurements.

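When the spread of the differences grows with the magnitude of the measurements, as in Fig. 3, one commonly used remedy is to compute the limits of agreement on the log scale and back-transform them to ratios. The sketch below assumes that approach, which is not confirmed as the article's own procedure. It requires strictly positive measurements, and the function name and data are hypothetical.

```python
from math import exp, log
from statistics import mean, stdev

def ratio_limits_of_agreement(a, b):
    """Limits of agreement for the ratio a/b, computed from log differences.

    Assumes all measurements are strictly positive.
    """
    log_diffs = [log(x) - log(y) for x, y in zip(a, b)]
    m = mean(log_diffs)
    s = stdev(log_diffs)
    # Back-transform: exp(mean) is the geometric mean ratio of a to b
    return exp(m), exp(m - 1.96 * s), exp(m + 1.96 * s)

# Hypothetical readings whose absolute differences grow with magnitude
method_a = [5.0, 10.0, 20.0, 40.0, 80.0]
method_b = [5.5, 9.0, 22.0, 36.0, 88.0]

gm_ratio, lower, upper = ratio_limits_of_agreement(method_a, method_b)
print(f"geometric mean ratio = {gm_ratio:.3f}, "
      f"ratio limits = ({lower:.3f}, {upper:.3f})")
```

The resulting limits are read as proportions: a lower limit of 0.8 would mean method A may read as much as 20% below method B.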
Fig. 4

Measurements of pulmonary nodule size using two radiological methods (shown is a Bland-Altman plot).

emj-40-9-g004.jpg
Table 1

Agreement between observers A and B on binary measurements

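For a cross-tabulation of the kind Table 1 describes, percent agreement and Cohen's kappa can be computed as in the sketch below. The counts are hypothetical, since the table itself is not reproduced here, and the function name `cohen_kappa` is illustrative.

```python
def cohen_kappa(table):
    """Return (percent_agreement, kappa) for a square agreement table.

    table[i][j] = number of subjects placed in category i by observer A
    and category j by observer B.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    p_obs = sum(table[i][i] for i in range(k)) / n
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return p_obs, (p_obs - p_exp) / (1 - p_exp)

# Hypothetical 2x2 counts (rows: observer A, columns: observer B)
counts = [[40, 10],
          [5, 45]]

p_obs, kappa = cohen_kappa(counts)
print(f"percent agreement = {p_obs:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than the raw percent agreement because it discounts the agreement the two observers would reach by chance alone.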
Table 2

Agreement between methods A and B on measurements with four-category results

Numbers in parentheses indicate the weights used to calculate the weighted kappa.

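A weighted kappa for ordered categories, as in Table 2, gives partial credit to near-misses. The sketch below assumes linear weights, w_ij = 1 - |i - j| / (k - 1); the weights actually shown in Table 2 are not reproduced here and may differ. The counts and the function name are hypothetical.

```python
def weighted_kappa(table):
    """Weighted kappa with linear agreement weights (an assumed scheme).

    table[i][j] = number of subjects placed in ordered category i by
    method A and category j by method B.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Full credit on the diagonal, decreasing linearly with distance
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    p_obs = sum(w[i][j] * table[i][j]
                for i in range(k) for j in range(k)) / n
    p_exp = sum(w[i][j] * row_tot[i] * col_tot[j]
                for i in range(k) for j in range(k)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical four-category counts (rows: method A, columns: method B)
counts4 = [[20, 5, 1, 0],
           [4, 15, 6, 1],
           [1, 5, 18, 4],
           [0, 1, 3, 16]]

kw = weighted_kappa(counts4)
print(f"linearly weighted kappa = {kw:.3f}")
```

Because the value depends on the weighting scheme, the weights should always be reported alongside the statistic.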

