This article previously appeared in the November 1993 issue of The Teaching Professor, where it was excerpted and reprinted with permission from The Center for Teaching Effectiveness Newsletter at the University of Texas at Austin.
Because so much depends upon the evaluation of a student’s learning and the resulting grade, it is in everyone’s interest to try to make the evaluation system as free from irrelevant errors as possible. Borrowing from the evaluation literature, I propose the four R’s of evaluation—Relevant, Reliable, Recognizable, Realistic—as ways to ensure the quality of our evaluation systems.
The first R, relevance, is known in the jargon as the validity of an evaluation method. It means that any activity used to evaluate a student’s learning must be an accurate reflection of the skill or concept being tested. What are the characteristics of a relevant evaluation?
Oddly enough, one characteristic that might seem very mundane is that the evaluation activity must appear related to the course content (known in the jargon as face validity). A common student complaint is that tests are not related to the course content or what was presented in class. Although we know that what we assign is directly related to the course, the students often don’t see the connection. And, student impressions aside, the more obvious the connection, the higher the probability that we really have a valid evaluation activity.
A second characteristic of relevant evaluations is that they are derived directly from the objectives (known in the jargon as content validity). The most obvious way to achieve this is to follow the objectives as closely as possible in selecting activities.
If your objective is that the students will be able to select the appropriate statistic for analyzing a given set of data, the evaluation should provide them with a data set and have them select the analysis. That evaluation could take many forms, from a multiple-choice question to a full written analysis, and all of these alternatives represent relevant tests of that objective.
Another characteristic of a relevant evaluation is how well performance on that evaluation predicts performance on other closely related skills, either at the same time (concurrent validity) or in the future (predictive validity). If the skill you are supposedly testing should be highly correlated with some other skill you are also testing, chart the students’ performances on each and see if they follow the same pattern.
To use a simplified example, we can say that the ability to add two single-digit numbers is a precursor to, and therefore highly correlated with, the ability to add two two-digit numbers. Therefore, students who do poorly on the former should not be able to do well on the latter. If they do, then one of the two tests is not measuring what it is supposed to be measuring and is therefore not a relevant test of the addition skill we are trying to evaluate.
The second R is how reliably, or consistently, an evaluation activity measures whatever it measures without being affected too much by the situation. A student’s grade should not hang on a single performance or on the mood of the person making the judgment. Of course, no system is perfectly reliable; none will produce exactly the same evaluation of a performance every time. The goal is to eliminate as many sources of error as possible.
The three biggest sources of error in reliably evaluating a student are poor communication of expectations, lack of consistent criteria for judgment, and lack of sufficient information.
Poor communication of expectations means that poor student performance may be the result of the student’s failure to correctly interpret the task requirements. In written exams this usually is caused by ambiguous questions, unclear instructions, corrections given verbally during the test, and so on. In each case, a bad grade is the result of the student not understanding the question. The student may in fact know the material.
Lack of consistent criteria for judgment means that, if the same performance were to be judged a second time by the same grader, or if another grader evaluated it, it might not receive the same grade because the basis for judging was unclear. The clearer the criterion for judging a student’s performance, the more reliable the evaluation becomes.
For example, one real strength of multiple-choice tests is that the grading is very reliable. Either the students marked the correct answer or they didn’t; very little is left to the judgment of the grader. On the other hand, essay tests are notoriously unreliable unless the instructor takes pains to make the criteria explicit and keeps checking to make sure he or she is not straying too far from the preset criteria.
Lack of sufficient information is the third source of error in evaluating students, not just in the amount of information but also in the variety of its sources. Not everyone excels in every format. Using only one format may introduce a bias for or against some students and lower the reliability of an evaluation.
Our third R is the need for the evaluation system to be recognizable to the students. By this we mean that students should be aware of how they will be evaluated and their class activities should prepare them for those evaluations. Testing should not be a game of “Guess what I’m going to ask you.”
Students don’t mind “hard” tests as long as there are no surprises and they can recognize the relationship of the test to the course. Some instructors may criticize this as “teaching to the test,” but in reality the test should be the best statement of the course expectations and therefore should mirror the teaching. Furthermore, few courses are taught at such a low level that tests are verbatim transcripts of the class or text; rather, they are interpretations or new examples of the class or text material.
All of the above activities require work on the part of either the students or the teacher. So, to avoid burning out either, the final R is that the evaluation system should be realistic: the amount of information obtained should be balanced by the amount of work required. Too often we forget that our students are taking three or four other courses along with ours.
What is realistic? Unfortunately, no one can give a blanket answer to that question. I can say that several smaller assignments tend to be more valuable than one large assignment. Alternatively, if a large assignment is called for, spreading it out across the semester and requiring components to be handed in periodically is a good technique, both pedagogically and administratively.