What does it mean for educators if an AI chatbot can pass a physics class? Intrigued by the debate around large language models in the academic world, Gerd Kortemeyer, now the director of educational development and technology at ETH Zurich, Switzerland, decided to put ChatGPT to the test—literally. He reports that, based on its responses to actual homework and exams for a calculus-based physics course, GPT-3.5 would indeed have managed a (barely) passing grade (Phys. Rev. Phys. Educ. Res., doi: 10.1103/PhysRevPhysEducRes.19.010132).
For some, this result may cause concern about academic integrity. But Kortemeyer, who taught introductory physics in the United States for many years, isn’t particularly worried about the potential of AI to enable cheating. Instead, he says, it raises questions about how we’re actually teaching and assessing physics students. OPN talked with Kortemeyer about his thoughts on the future of physics education and GPT, and the “inherently human” skills he hopes to impart in his classes.
Was there anything that surprised you in your results?
Gerd Kortemeyer: I shouldn’t have found it surprising, but I did initially, just how close the bot’s mistakes were to the mistakes that real students make. I shouldn’t be surprised because I don’t know exactly what body of text it was trained with—it may have even contained some discussion forums on physics.
But I was surprised that something that is nothing more than a probabilistic autocomplete mechanism, basically pattern matching, would mimic the behavior of students. That makes me question what we have been testing all this time.
Are we actually testing pattern matching? Are we giving physics grades based on pattern matching, if such an algorithm can do okay?
What do you think that says about the emphasis on standardized testing?
OpenAI published a paper showing how GPT does on standardized assessments. For the ACT, SAT, all of those—it finishes in the upper percentiles. So that means that if a student gets drilled toward these standardized assessments, they are kind of turned into a little robot. You’ve basically trained those kids to do as well as a machine. And that scares me.
The grade in introductory physics courses is also based on very standard assessments: exams, homework, programming projects and clicker questions. So I shouldn’t be so surprised that ChatGPT passed, because in the end, solving these introductory physics problems is a very algorithmic kind of thing.
Is it still valuable to test students on these concepts, if they’ll have access to AI that can provide the answers?
Well, students still need to be able to do these problems. If you want to do any kind of advanced physics, all of these basic concepts like Newton’s laws, circuit laws, and so forth—you have to have that working knowledge right at the back of your mind. Because if you can’t pull from that knowledge immediately, you won’t be able to advance in physics. So even though AI can do the basic physics, I need to be able to assess that the students can still do it, too. And the AI tools are pretty much for sure going to fail in anything that is really advanced physics.
So we still need to assess these things, but not solely. As artificial intelligence gets better, we also need to focus on the skills that are inherently human. What is human intelligence? What is human creativity? In my classes, I always try to teach a little bit more than just memorizing facts. I want people to be curious about physics, to think critically, to apply these principles to everyday situations. I want to have all these kinds of metacognitive processes running.
Can you give an example of the application of these human skills?
Say I give a homework problem and in the end, your answer is that a car is moving at 4000 miles per hour. As a human, you look at that and you say, “That’s probably 40 miles an hour. Let’s go back. What did I do here?” Humans have the ability to ask: is this even realistic?
If I get the same crazy answer from an AI, it just goes ahead with that answer. When it’s wrong, it’s wrong by two orders of magnitude and it’s like, “So what? That’s my result.” Humans have that skill to constantly think in the background, can this even be true? It’s a completely different way to evaluate your answer; it’s not the way that got you to the solution. AI currently has none of these processes. It just chugs through and pops out a result.
Looking at a problem, doing calculations, and thinking, huh, that’s interesting—how did that happen? I can’t say never, but artificial intelligence is far from discovering anything new. Because it doesn’t question itself. It doesn’t question nature.
Is there a way to assess things like metacognition and curiosity?
I think the only way to assess them is with a much longer, larger-scale kind of assessment. The ultimate example of that is your doctoral thesis. A person works for years, researching something that is, up to that point, completely unknown. If you don’t have the curiosity, metacognition, foundational knowledge, all these tools—you won’t be able to finish a doctoral thesis in physics.
So how do you scale that down to the lower levels? The only way that I could find in my classes was to assign longer projects. I had students do things like make “MythBusters” videos, where they picked an urban myth to research and put together some explanation of the actual physics behind it. I had one group ask the question, if you are in a bar fight, is it better to get a full or empty bottle smashed over your head?
So they built a little machine with a billiard ball “head” on a spring “neck,” put an acceleration sensor on the billiard ball, made a swing arm that they could attach bottles to, smashed the bottles against the ball, and measured the acceleration. They found out that the worst thing that can happen is that the bottle doesn’t break because the energy doesn’t dissipate. And they documented the physics of it nicely.
That’s where you can separate the people who are just doing pattern matching from the people who are genuinely interested. And sometimes, students who might not be 4.0, perfect students on traditional assessments thrive in this environment. It’s a different dimension of doing science. It just takes a lot more time, and grading it is of course more subjective.
Do you think that in general, there’s too much focus on grades?
I taught a lot of premed students in the United States, and for them, the message was: if your course grade isn’t a 4.0, you’re a failure. And actually, a medical school admissions director who was visiting our college said, “If you only have a 3.5 in physics, you better have a good reason why.” And I thought, you’ve got to be kidding. A 3.5 is a great grade in physics.
The joy is being drained out of the whole thing. I’m a physicist because I enjoy it. The students in class—so many of them did not enjoy the experience. I can make my little jokes, I can try to be entertaining, I can try to make things crash and break. But in the end, they know that what’s going to count is if they have a 4.0 or a 3.5. And medical schools should consider—once an AI can get a 4.0 in the right courses, are we going to admit that thing to medical school? Why not? What’s missing?
If we take all the fun out of it, and make it so grade- and standard-assessment-oriented, then we have reduced students to the level of artificial intelligence. That’s a danger here.
Do you think academic institutions will make policy changes as a result of the wide availability of this kind of AI?
At least at ETH Zurich, we try to really keep ourselves from making quick judgments and immediately implementing regulations. Some universities have immediately jumped to outlaw it and say, “Not a single word that was generated by AI may end up in something that is being assessed. It’s plagiarism; it’s ghost writing.” They’re basically applying terms from the past to this new technology, and then going the next step and saying that’s why it’s forbidden.
We wrote a blog article about AI not being a pandemic. When COVID-19 hit, we immediately came up with rules and regulations because we had to—it was a deadly pandemic. And we just try to caution our university against seeing AI the same way and immediately coming up with rules and regulations before having figured out what it actually is.
If artificial intelligence is available during exams, the real problem is not talking to the artificial intelligence—the real problem is talking to other people. The moment that you make artificial intelligence available as a cloud service, the students could also talk to each other. That would be a much, much more efficient way of cheating than working with an AI. If I can see an authoritative answer from my professor friend, why would I trust a probabilistic answer from an AI? So that is actually the bigger hurdle. With artificial intelligence comes internet connectivity and human communication.
How do we strike a balance between assessing foundational knowledge and asking students to work without resources, like AI and advanced calculators, that they’ll have in the real world?
At ETH, we have these huge assessments that go on for hours, and we’re thinking of having them in two parts. So maybe the first part is completely paper and pencil. No pocket calculators, nothing. And that’s how we assess the foundational knowledge, by taking all of that away.
And then the second part is much more advanced problems, and they can work like they would work in real life. You have all the tools at your disposal, and that’s not just AI, that’s stuff like Wolfram Alpha, statistics tools, all the stuff you have on your laptop.
The only thing that’s still problematic is interpersonal communication. I mean, everything is collaborative—maybe you can have group exams. But your physics professor friend, that’s probably where we need to draw a line.
What do you see as the biggest threats of AI?
The biggest threat I see is that people blindly believe the outcome of artificial intelligence. Critical questioning of what comes out of AI is something that people just haven’t learned. It spews out stuff that sounds oh, so plausible. Everything that it says sounds like just the absolute truth; there are no qualifiers. Even though the whole algorithm is completely probabilistic, it doesn’t give you a probability of being correct.
If people don’t question what comes out of AI, it could literally lead to disaster. There have been airplane crashes because the pilots didn’t really even know how to fly the plane anymore and didn’t question the computer output even when it was wrong.
So that blind trust, amplified by social media, allows anything to be blasted out to the world in no time. And then plausible fiction—which is what AI produces at the moment—becomes fact. And if that same fiction feeds into the next text corpus, the next training data, we are getting further and further away from whatever is truth. So that’s the biggest challenge at the moment.
And what about opportunities?
I see opportunity in people using it as a tool. So for instance, overcoming writer’s block. You tell it to write an essay about whatever topic. Then it produces its nice plausible fiction, which can be a good starting point. Then you start modifying it, correcting it, changing things you don’t agree with—but sometimes modifying is so much easier than starting from scratch.
That, of course, raises the question—are there still little snippets of text that came directly out of ChatGPT? Probably yes. Is this now plagiarism or ghost writing? I honestly can say I made the piece of text my own, but there are probably three or four words in the same order that came out of ChatGPT. Should that be forbidden? I don’t think so. Plagiarism is claiming the work of others as your own. I think this can still be considered my own work. I’m using AI as a tool, in the same way that I would use DeepL or Grammarly to translate or correct a big block of text.
It’s a great way of getting many different viewpoints on a subject, which are, after all, gathered from a large text corpus. So you have a spectrum of opinions and ideas about a topic. You still have to work through them, but no Google search is going to give you that.
It can also answer very specialized questions. For a recent paper, I needed to do a certain kind of plot in Excel. I Googled for half an hour and couldn’t figure out how to make the thing. I put one sentence into ChatGPT, and it gave me the recipe for making the right plot. And it was so efficient; it just answered exactly the question.
So as a tool, it can be great. I use it very regularly in all kinds of ways.
Have we already seen improvements in GPT-4? What are you looking at next?
Definitely—I’ve tried some things with GPT-4, and that will be in the 80% range for its course grade. That’s a fairly decent grade in physics.
The next frontier for me is to play with the multimodal input. Physics problems quite often come with little sketches, so trying to directly feed those images into the system rather than narrating what’s in the picture.
I’ve also been studying it for grading purposes. I just took a whole bunch of derivations of problem solutions and had ChatGPT grade them on a rubric. That will get you an R² of over 0.8. So it’s actually kind of promising. It’s all not quite there, but it’s close to being there.