Research Review: How Effective Is Utah Compose in Multi-Tiered System of Supports (MTSS) Frameworks?

Research into Automated Writing Evaluation (AWE) has explored the use of Utah Compose as a tool for identifying at-risk students, monitoring progress, and conducting scalable formative assessments that can be administered multiple times throughout the year. The evidence suggests that Utah Compose, either on its own or in combination with other AWE modes, could be as accurate as, or even more accurate than, the hand-scoring methods currently used for writing assessments.

One notable advantage of AWE is its scoring consistency. Utah Compose’s scoring engine, known as Project Essay Grade (PEG), generates dependable data for tracking progress over time, particularly through benchmark assessments. In contrast, manual scoring is not only more time-intensive and costly, but also lacks uniformity in assessment across different evaluators, prompts, and writing genres/purposes.

The research examines which screener designs built on the PEG scoring engine are most effective. The goal is to obtain the most dependable data for making informed decisions within Multi-Tiered System of Supports (MTSS) frameworks. These decisions range from low-stakes to high-stakes.

The Key Findings that follow refer to MI Write. Utah Compose uses the same technology infrastructure as MI Write and includes the same features and functionality.

Key Findings

MI Write enables screening that is as accurate as, and more efficient than, typical hand-scored screeners.

•    Rates of agreement between scores assigned by PEG and those assigned by multiple human raters tend to meet or exceed rates of agreement among only human raters. In a 2012 national competition among vendors, PEG was judged to be the highest-performing system for scoring extended response essays. [2]
•    A study of Grade 3–5 students using a six-prompt screener found acceptable classification accuracy for students at risk of failing the Smarter Balanced English language arts assessment with as little as one prompt administered for Grade 3–4 students and three prompts for Grade 5 students. Classification accuracy using MI Write scoring features has been shown to be consistently higher than that of the Wechsler Individual Achievement Test, Third Edition (WIAT-III), and MI Write has been shown to be a better alternative to human-scored methods such as Curriculum-Based Measurement for Written Expression (CBM-W). [5]
•    MI Write meets the minimum standard for correct identification based on an average score across six prompts, with over 75% of students accurately classified as at risk or not at risk in random selections of average PEG scores. [5]
•    After prior state test performance, the PEG total score was the strongest predictor of which Grade 6 students were at risk of failing a state writing assessment. [4]
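
As a rough illustration of the classification accuracy figures above, the sketch below averages each student's scores across prompts, flags students whose average falls below a cut score, and reports the share of students whose flag matches their actual outcome. The function names, the cut score, and every score shown are hypothetical; none of them come from the cited studies or from MI Write.

```python
from statistics import mean


def flag_at_risk(prompt_scores, cut_score):
    """Return True when the average score across prompts falls below the cut score."""
    return mean(prompt_scores) < cut_score


def classification_accuracy(students, cut_score):
    """Share of students whose at-risk flag matches their actual outcome.

    `students` is a list of (prompt_scores, failed_state_test) pairs.
    """
    correct = sum(
        flag_at_risk(scores, cut_score) == failed
        for scores, failed in students
    )
    return correct / len(students)


# Hypothetical data: six prompt scores per student and whether the student
# later failed the state assessment. All values are made up for illustration.
students = [
    ([10, 11, 9, 12, 10, 11], True),
    ([18, 17, 19, 20, 18, 19], False),
    ([13, 14, 12, 15, 13, 14], False),  # flagged, but did not fail
    ([16, 15, 17, 16, 18, 17], False),
]

accuracy = classification_accuracy(students, cut_score=14)
print(f"Classification accuracy: {accuracy:.0%}")
```

With these made-up values, three of the four students are classified correctly. A real screener would also report sensitivity and specificity, which this sketch leaves out.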


Screening requires different approaches to evaluate struggling and non-struggling writers. 

•    For formative assessment and class-wide assessment, screening struggling writers accurately requires more prompts (e.g., four prompts in each of three purposes/genres). PEG may be less reliable than hand-scoring for scoring struggling writers and making decisions about them. Relying on PEG scores is not recommended for decisions on academic placement or disability status, especially because PEG models in MI Write are trained to score a grade band, whereas human raters may use rubrics to score individual grades. [1]
•    Reliable low-stakes decisions could be made by averaging scores from three prompts across writing purposes/genres for non-struggling writers and six prompts for struggling writers. [3]
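
The averaging rule in the last finding can be sketched as follows, assuming a screener returns one numeric score per prompt. The names `MIN_PROMPTS` and `screening_average` and the example scores are hypothetical, and the sketch does not check that the prompts span different writing purposes/genres.

```python
from statistics import mean

# Minimum number of prompts per the finding above: three for non-struggling
# writers and six for struggling writers. The dictionary keys are hypothetical.
MIN_PROMPTS = {"non_struggling": 3, "struggling": 6}


def screening_average(scores, writer_group):
    """Average the prompt scores, or return None if too few prompts were scored."""
    required = MIN_PROMPTS[writer_group]
    if len(scores) < required:
        return None  # too few prompts to support a low-stakes decision
    return mean(scores)


# Hypothetical scores, for illustration only.
print(screening_average([14, 16, 15], "non_struggling"))
print(screening_average([10, 12, 11], "struggling"))
```

The first call returns an average because three prompts are enough for a non-struggling writer. The second returns None because a struggling writer needs six scored prompts before the average is used.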


1.    Chen, D., Hebert, M., & Wilson, J. (2022). Examining human and automated ratings of elementary students’ writing quality: A multivariate generalizability theory application. American Educational Research Journal, 59(6), 1122–1156.

2.    Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53–76.

3.    Wilson, J., Chen, D., Sandbank, M. P., & Hebert, M. (2019). Generalizability of automated scores of writing quality in grades 3–5. Journal of Educational Psychology, 111, 619–640.

4.    Wilson, J., Olinghouse, N. G., McCoach, D. B., Andrada, G. N., & Santangelo, T. (2016). Comparing the accuracy of different scoring methods for identifying sixth graders at risk of failing a state writing assessment. Assessing Writing, 27, 11–23.

5.    Wilson, J., & Rodrigues, J. (2020). Classification accuracy and efficiency of writing screening using automated essay scoring. Journal of School Psychology, 82, 123–140.