Measuring the Quality of Teacher-Constructed English Test as Final Examination through Item Response Theory

Authors

  • Novri Pahrizal IAIN Kerinci
  • Lian G. Otaya IAIN Sultan Amai Gorontalo

Keywords:

teacher-constructed test, quality of test, item response theory.

Abstract

This study aimed to examine the psychometric quality of a teacher-constructed English final examination for Grade X students at a senior high school in Sungai Penuh using the framework of Item Response Theory (IRT). The analysis focused on evaluating model fit, item difficulty, and item discrimination parameters under the 1-Parameter Logistic (1-PL) and 2-Parameter Logistic (2-PL) models. Data were collected from students’ responses to 40 multiple-choice items and analyzed in RStudio. The goodness-of-fit results revealed that the 2-PL model provided a better representation of the data, with 36 items classified as fitting and only 3 as misfitting, compared to the 1-PL model, under which 32 items fit and 8 did not. Furthermore, the difficulty parameter (b) indicated that all items fell within the acceptable range (–2 ≤ b ≤ +2), with a tendency toward easy to moderate levels. The discrimination parameter (a) showed that most items possessed satisfactory to high discriminating power, although a small number exhibited lower values. These findings confirm that the teacher-constructed test generally meets psychometric standards of validity and reliability, while also highlighting the need to revise the few misfitting and low-discrimination items. The study makes both theoretical and practical contributions by emphasizing the importance of applying IRT in school-based assessment practice to ensure fair, accurate, and effective evaluation of students’ learning outcomes.
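The 1-PL and 2-PL models compared in the abstract differ only in whether the discrimination parameter (a) is estimated per item or held constant. As a minimal illustrative sketch (not part of the article's analysis, which was carried out in RStudio), the standard 2-PL item response function can be written in Python:

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2-PL model:
    P(theta) = 1 / (1 + exp(-a * (theta - b))),
    where theta is examinee ability, a is item discrimination,
    and b is item difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_1pl(theta, b):
    """The 1-PL (Rasch) model is the special case with a fixed at 1."""
    return p_2pl(theta, a=1.0, b=b)

# An examinee of average ability (theta = 0) facing an item of moderate
# difficulty (b = 0) has a 50% chance of success, regardless of a.
print(p_2pl(0.0, a=1.2, b=0.0))            # 0.5

# Higher discrimination makes probability rise more sharply around b,
# which is why low-a items separate examinees poorly.
print(round(p_2pl(1.0, a=0.5, b=0.0), 3))  # 0.622
print(round(p_2pl(1.0, a=2.0, b=0.0), 3))  # 0.881
```

This also clarifies the reported parameter ranges: b is on the same scale as ability (hence the –2 to +2 acceptability band), while a governs how steeply the curve rises at theta = b.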

Author Biography

Lian G. Otaya, IAIN Sultan Amai Gorontalo


References

Arifin, W., & Yusoff, M. (2017). Item response theory for medical educationists. Education in Medicine Journal, 9(3), 69-81. https://doi.org/10.21315/eimj2017.9.3.8

Baker, F. B. (2001). The basics of item response theory (2nd ed.). ERIC Clearinghouse on Assessment and Evaluation.

Basuki, L., & Anggoro, S. (2021). Improving the competency to construct test items for Class VI teachers through workshop. https://doi.org/10.4108/eai.19-7-2021.2312716

Bichi, A., & Talib, R. (2018). Item response theory: An introduction to latent trait models to test and item development. International Journal of Evaluation and Research in Education (IJERE), 7(2), 142. https://doi.org/10.11591/ijere.v7i2.12900

Brown, H. D., & Abeywickrama, P. (2010). Language assessment: Principles and classroom practices (2nd ed.). Pearson.


Cao, Y., Lu, R., & Wei, T. (2014). Effect of item response theory (IRT) model selection on testlet-based test equating. ETS Research Report Series, 2014(2), 1-13. https://doi.org/10.1002/ets2.12017

Cohen, L., Manion, L., & Morrison, K. (2002). Research methods in education. Routledge.

Creswell, J. W., & Creswell, J. D. (2016). Research design: Qualitative, quantitative, and mixed methods approaches. Sage publications.

Curtis, S. M. (2010). BUGS code for item response theory. Journal of Statistical Software, 36(Code Snippet 1). https://doi.org/10.18637/jss.v036.c01

de la Torre, J. (2009). Improving the quality of ability estimates through multidimensional scoring and incorporation of ancillary variables. Applied Psychological Measurement, 33(6), 465-485.

Gavett, B., & Horwitz, J. (2011). Immediate list recall as a measure of short-term episodic memory: Insights from the serial position effect and item response theory. Archives of Clinical Neuropsychology, 27(2), 125-135. https://doi.org/10.1093/arclin/acr104

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Kluwer-Nijhoff Publishing.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. London: Sage Publications.

Jahrami, H. (2025). The validation of the nomophobia questionnaire using a modern psychometric approach: an item response theory analysis of 5087 participants. Brain and Behavior, 15(6). https://doi.org/10.1002/brb3.70622

Kang, T., & Chen, T. (2010). Performance of the generalized S-X2 item fit index for the graded response model. Asia Pacific Education Review, 12(1), 89-96. https://doi.org/10.1007/s12564-010-9082-4

Krisna, I. I., Mardapi, D., & Azwar, S. (2016). Determining standard of academic potential based on the Indonesian Scholastic Aptitude Test (TBS) benchmark. REID (Research and Evaluation in Education), 2(2), 5.


Mardapi, D. (2017). Pengukuran, Penilaian, dan Evaluasi Pendidikan (Edisi 2). Yogyakarta: Parama Publishing.

Maydeu‐Olivares, A. (2013). Goodness-of-fit assessment of item response theory models. Measurement Interdisciplinary Research and Perspectives, 11(3), 71-101. https://doi.org/10.1080/15366367.2013.831680

Mislevy, R. J. (1996). Test theory reconceived. Journal of Educational Measurement, 33(4), 379-416.

Nitko, A. J., & Brookhart, S. M. (2014). Educational assessment of students (6th ed.). Pearson Education, Inc.

Ohiri, S., & Okoye, R. (2023). Application of classical test theory as linear modeling to test item development and analysis. International Research Journal of Modernization in Engineering Technology and Science. https://doi.org/10.56726/irjmets45379

Popham, W. J. (2009). Classroom assessment: What teachers need to know (6th ed.). Pearson.

Price, L. R. (2017). Psychometric methods: Theory into practice. New York: Guilford Publications.

Reckase, M. D. (2009). Multidimensional item response theory models. In Multidimensional item response theory (pp. 79-112). Springer.

Retnawati, H. (2014). Teori respons butir dan penerapannya: Untuk peneliti, praktisi pengukuran dan pengujian, mahasiswa pascasarjana. Nuha Medika

Retnawati, H. (2016). Analisis kuantitatif instrumen penelitian (panduan peneliti, mahasiswa, dan psikometrian). Parama Publishing.

Reynolds, C. R., Livingston, R. B., & Willson, V. (2010). Measurement and assessment in education. Upper Saddle River: Pearson Education International.

Stone, C., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331-352. https://doi.org/10.1111/j.1745-3984.2003.tb01150.x

Sumintono, B. (2018). Rasch model measurements as tools in assessment for learning. https://doi.org/10.2991/icei-17.2018.11

Uyigue, V., & Orheruata, M. (2019). Test length and sample size for item-difficulty parameter estimation in item response theory. Journal of Education and Practice. https://doi.org/10.7176/jep/10-30-08

Wahyuni, L., Sarwanto, S., & Atmojo, I. (2024). Measurement of science literacy skills of elementary school teacher education students: development and validity testing of assessment instruments. International Journal of Current Science Research and Review, 07(11). https://doi.org/10.47191/ijcsrr/v7-i11-27

Widoyoko. (2014). Teknik Penyusunan Instrumen Penelitian. Yogyakarta: Pustaka Pelajar.

Zanon, C., Hutz, C., Yoo, H., & Hambleton, R. (2016). An application of item response theory to psychological test development. Psicologia Reflexão E Crítica, 29(1). https://doi.org/10.1186/s41155-016-0040-x

Published

2025-07-31

How to Cite

Pahrizal, N., & Otaya, L. G. (2025). Measuring the Quality of Teacher-Constructed English Test as Final Examination through Item Response Theory. Journal of English Teaching and Linguistic Issues (JETLI), 4(2), 45–64. Retrieved from https://ejournal.iaingorontalo.ac.id/index.php/JETLI/article/view/2998