43 Comments

The wonder is not GPT's performance but that it was done under such a capricious and arbitrary grading scheme. Once again I am grateful never to have faced one of Caplan's tests.


On the contrary, I would have been delighted to be able to get an A with such low scores. An A on the exams I took meant getting 70%, or even 85%.


I thought the same thing, but I think these questions are based on texts that are missing from the blog post. This seems closer to a reading comprehension test: "what does the author (Bryan Caplan) think about x?"


Probably a curve. I had plenty of tests like that.

My favorite was a prof for whom the mean was the B/C borderline and a grade was 1 SD wide. So more than 1 SD above the mean was an A, 0 to 1 SD above was a B, 0 to -1 was a C, -1 to -2 was a D, and below -2 SD was an F.

On the first test in the class I got something like a 92 and it was a B+. On the second test, a 58 was an A-. On the third test the mean was about the same as the second, and I scored an 80-something, so I locked up my A with ease.

A system like that is neither capricious nor arbitrary; it's just non-standard.
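
To make it concrete, here's a minimal sketch of that kind of curve in Python (the z-score cutoffs follow the description above; the sample scores are made up):

```python
# SD-based grading curve: a sketch, assuming the cutoffs described above
# (mean = B/C borderline, each letter grade one SD wide).
from statistics import mean, stdev

def curve_grades(scores):
    mu, sigma = mean(scores), stdev(scores)
    def letter(score):
        z = (score - mu) / sigma  # distance from the mean, in SDs
        if z > 1:
            return "A"
        if z > 0:
            return "B"
        if z > -1:
            return "C"
        if z > -2:
            return "D"
        return "F"
    return {s: letter(s) for s in scores}

# Made-up scores on a hard test: a 58 can land more than 1 SD above the mean.
print(curve_grades([20, 30, 35, 40, 45, 58]))
# {20: 'D', 30: 'C', 35: 'C', 40: 'B', 45: 'B', 58: 'A'}
```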


>It also misses the real point of a UBI experiment: To convince believers that this obviously misguided philanthropic strategy is misguided.

This is just blatantly docking points for disagreeing with your politics. It's easy to justify a UBI experiment as an EA, even if you think UBI wouldn't be the best thing to *personally* fund, if you think UBI has a good chance of being a significant improvement over the status quo and an experiment might be a low-cost way of helping bring about that change. EAs don't have to *just* be about malaria nets for the very poor.

I also concur with the other posters' sentiment that the questions and answers are not that difficult, but are obviously loaded to make a point about your politics.

I wonder if the next generation of AI will seek to do well on tests by predicting how to flatter the graders.


Funnily enough, the success of GPT4 is pushing me in a more pro-UBI direction. I've always maintained that UBI is not, as its supporters generally claim, an efficient use of money in the current economy; either it doesn't give enough money to people who need it, or it wastes vast sums on people who don't need it.

But in an economy with mass unemployment from automation, that calculus changes. If most people really need state support while AI delivers untold riches to a select few, UBI suddenly becomes a very good idea. And that scenario might not be all that far off.


Bryan, is this going to make you update your thoughts on AI safety?

If 'capabilities' is 5 years ahead of what you expected (this should be a huge shock to your model of AI), wouldn't you be worried that AI safety/alignment is 5+ years behind capabilities?


Apart from all the AI stuff, the exam seems weird and biased. First of all, the questions all seem really easy; I feel like I could pass the exam by reading 7 random Caplan blog posts. And they seem so biased, especially number two. Politics is the mind-killer (https://www.lesswrong.com/posts/9weLK2AJ9JEt2Tt8f/politics-is-the-mind-killer): if you want people to think rationally or to learn something, don't use politics unless you have to. Six is also kind of weird, but I do like the mention of EA.

But I have never seen another Public Policy exam, so maybe other ones are worse.


I agree on how easy this test is, especially compared to the one used previously.

How were the scores so low? It seems pretty easy. I am not saying I would do as well as the AI, but just from reading your blog posts, I would have scored pretty well.


I'm incredibly surprised that this is what you use to grade your students. All the questions are incredibly loaded and easy, and basically boil down to "did you read me? paraphrase my beliefs back to me."

Honestly, I'm shocked that this passes for an exam, and I would have been shocked if GPT-4 had failed it, given the things I've seen it accomplish.


So now the question is, what are you going to do about it? How do you give a fair test when GPT-4 makes it so easy to cheat on a test? The way I see it there are three options, none of them good.

Option 1: The chump. Just keep doing things the way they are now. Some students will use GPT-4 to cheat, some will try to do it themselves. The ones who do it themselves will be punished by a lower score, especially with GPT-4 bumping up the grading curve. Eventually, all the top students will be cheaters.

Option 2: The prison ward. All students are forced to take the tests in person, with pencil and paper only. All of them are forced to turn over all cell phones and electronic devices before the test, maybe going through something like airport security. The teacher must sit and watch them the entire time. Everyone will be miserable, but at least it's still fair.

Option 3: Give up. Classes will no longer be graded. This will make it very hard for students to motivate themselves to study, and also very hard for grad schools/employers to tell which students did a good job in the classes. The value of a college degree continues its downward slide.

Like I said, I really don't like any of these options, but what else is there? It seems terrible that we've invented such a powerful cheating device with no defense.


Option 2 is exactly how I had to take ALL exams. It's not really that burdensome or unreasonable. It does feel a bit like prison, but then you are essentially applying for parole from the educational system!


Option 2 is really not that difficult; smartphones have been around for a while, and it's not that hard to prevent their use in person. The problem is no more take-home work.


Even airport-style security might not prevent the pre-positioning or smuggling of devices into the test area. Another layer of security to add to Option 2 would be to use electronic countermeasures to cut off wireless internet access locally, like building a Faraday cage around the test area and/or employing a jamming device during the test. Random seating assignments, unknown to the test-takers beforehand, would also be a prudent move.

Of course, there may be an Option 4 in the not-too-distant future: advanced text-based machine learning tools will likely be developed for detecting cheating on answers to essay-type questions, including collusion among multiple test-takers as well as the use of ChatGPT-like tools or straight copying from online sources. I'm not sure, though, that the fear of an electronic minder flagging one's answers as suspicious would make a student any less miserable than old-fashioned in-person proctoring by humans does.

Since the trend in higher education these days is towards giving up on merit-based standards anyway in favor of "equity" in handing out credentials, I expect that Option 3 will gain traction, and in the long run earning a college degree will cease being associated with one's intellectual skills. The point of collectivism, of course, is to deny individuals any independent capacity to think and act for themselves; a properly-trained machine serves their purposes much better than an educated human does.


I'm sorry, but "the point of collectivism" is emphatically not "to deny people any capacity to think for themselves." That's an argument against collectivism, but it's sure as hell not what collectivists themselves actually believe.


Many collectivists don't grasp what the logical implications of their own beliefs are, or why their particular ideology puts them in the same broad family of belief systems as rival collectivist ideologies that they often believe to be a polar opposite of their own.


Fine, but "these people don't understand the implications of their own beliefs" is a very different and much more defensible criticism than "these people do understand the implications and think those implications are good".


Sure, but I described the denial of the intellectual and moral autonomy of individuals as the point of collectivism (i.e. a defining characteristic of a broad family of ideologies), which isn't necessarily the point that would be made or emphasized by a given collectivist.


You could give people an oral exam instead. There doesn't seem to be any way to cheat at that yet, though it will become possible eventually.


This is honestly an astoundingly vacuous exam.


The response to the second question (asking why liberal Californians moving to Texas is surprising) tells me that GPT-4 is ready to host its own television show.


Stumbled across this due to my interest in the AI content, but more interesting is the content of this "economics" exam. Q1 is really the only objective, non-loaded question on the exam, and I'd wager any real economist who didn't write this exam would score ChatGPT's answers better than Prof. Caplan's "suggested answers." Want to Bet On It?

Q3: "It also misses the real point of a UBI experiment: To convince believers that this obviously misguided philanthropic strategy is misguided." A perfect example of why so many scientists don't consider economics to be a science at all: lack of testable hypotheses, lack of consensus, and inherent political overtones.

Q2: Every year, 60,000-80,000 Californians move to Texas and 35,000-40,000 Texans move to California. What would be surprising is if all those individuals moved for political reasons. In liberal California, 49% of residents identify as Democrats, 30% as Republicans, and 21% lean neither way. In conservative Texas, 40% identify as Democrats, 39% as Republicans, and 21% lean neither way. Should we be surprised that people don't move across the country purely to self-sort by political affiliation, when literally half of each state's population doesn't share its state's lean? There's more to life than party affiliation, and party affiliation is more complex than any single issue such as guns or taxes. https://www.pewresearch.org/religion/religious-landscape-study/compare/party-affiliation/by/state/
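
A quick back-of-the-envelope using those figures (the assumption that movers mirror their home state's party mix, rather than sorting politically, is mine):

```python
# If CA->TX movers simply mirrored California's party mix (an assumption,
# not data), tens of thousands of Democrats would move to Texas every year
# without any political sorting at all.
ca_to_tx_movers = 70_000       # midpoint of the 60,000-80,000 range above
ca_democrat_share = 0.49       # Pew figure cited above

expected_democrat_movers = ca_to_tx_movers * ca_democrat_share
print(f"{expected_democrat_movers:,.0f} Democrats per year")  # 34,300 Democrats per year
```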

Q5: Even Milton Friedman agreed that occupational licensing's benefits can outweigh its costs in some cases where there is a public safety concern. An economist should also consider external costs, not only individual costs. Should we let a budget airline save a few bucks by hiring cheaper unlicensed pilots? Should we allow insurance companies to save a few bucks by approving only cheaper unlicensed surgeons?


By contrast, GPT-4 failed my final exam egregiously: https://www.thebigquestions.com/2023/04/05/gpt-4-fails-economics/


ChatGPT (not sure if it's GPT-4 or not) scores a zero on Steven Landsburg's exam: https://www.thebigquestions.com/2023/04/04/chatgpt-fails-economics/


Does this update any of your priors about other things machine learning programs may be able to do in the near future, and their likely consequences?


For what it's worth, I had trouble replicating the results for Question 1 on Bing Chat, which I thought was powered by GPT-4. I tried all three of "Creative Mode", "Balanced Mode", and "Precise Mode". In Balanced Mode, Bing Chat actually found this Bet On It post and saw Bryan Caplan give the answer as 80M, but went through its own calculation anyway and came up with the wrong answer. In Creative Mode, it just gave an incorrect answer. In Precise Mode, it did the correct calculation.

However, I was suspicious because it clearly already knew about this Bet On It post. So, I decided to ask a similar question in Precise Mode to see whether it could answer. I asked, "Suppose that local authorities are considering banning a harmful technology. They estimate that the ban will reduce quality of life by 10% for a year but save 732 life-years. How high does total population have to be before the ban fails CBA?"

Bing Chat replied: "Cost-benefit analysis (CBA) is a technique used to compare the total costs of a policy or project with its benefits. In this case, the cost of the ban is a 10% reduction in quality of life for one year and the benefit is 732 life-years saved. To determine whether the ban fails CBA, we need to know the value of a statistical life-year (VSLY) and the total population. The ban fails CBA if the total cost of the ban (10% reduction in quality of life for one year for the entire population) is greater than the total benefit (732 life-years saved * VSLY). Without knowing the VSLY and total population, it is not possible to determine whether the ban fails CBA."

This version of the problem is even easier than the original because one doesn't need to subtract two numbers to get the net benefit of 732 life-years. I eliminated the required subtraction step because I didn't want Bing Chat to just be able to use the original solution in this Bet On It post. Using the same solution method as the original problem, the answer is that the ban fails CBA for populations above 732/(0.10) = 7320. (VSLY appears on both the cost and benefit side and, thus, cancels.)
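
For concreteness, here's a minimal sketch of that calculation (the variable names are mine; the VSLY term is included only to show that it cancels):

```python
# Modified CBA question: a 10% quality-of-life reduction for one year for
# the whole population vs. 732 life-years saved. VSLY multiplies both
# sides, so it drops out of the comparison.

def ban_passes_cba(population, qol_reduction=0.10, life_years_saved=732, vsly=1.0):
    cost = qol_reduction * population * vsly   # life-years' worth of QoL lost
    benefit = life_years_saved * vsly          # life-years saved
    return benefit >= cost

threshold = 732 / 0.10  # = 7,320
print(ban_passes_cba(7_000))  # True: below 7,320 the ban passes
print(ban_passes_cba(8_000))  # False: above 7,320 the ban fails CBA
```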

Was there any special prompting that Collin Gray had to use to get GPT to work? It's possible that Bing Chat's implementation is just worse than other GPT-4 implementations. I have found Bing Chat makes many mathematical errors, e.g., algebraic errors, even though others have reported GPT-4 doing very well on SAT-Math and GRE-Quant. It's also possible that I'm not prompting it skillfully.


I didn't do any prompt engineering in giving the test to ChatGPT; it was identical to the original test in content and format.

As a sanity check, I fed your modified question into ChatGPT (GPT-4) and it got the right answer, following the same steps as in its previous answer.

I'm guessing that some of the difference in responses between Bing Chat and ChatGPT comes down to fine-tuning direction. Along the Pareto frontier of giving you a plain answer vs. not over-simplifying, Bing Chat strays more towards not over-simplifying, which causes it to give disclaimers in places where ChatGPT just rolls with the implied simplifying assumptions.


It seems that the AI has been trained to become a teacher's pet and docilely submit to indoctrination, aligning with whatever fuzzy philosophy the teacher is espousing.


But doesn't testing GPT-4 on your exams contradict your free-market principles?

Shouldn't consumers be free to (blindly) trade off quality against price in markets with asymmetric information (the market for lemons)? Didn't Milton Friedman urge us to be “free to AI hallucinate”?


How much time did people have to complete the test? Do students usually get a chance to answer every question? If students had more time to do the test, would they do a lot better?

Consider that it's quite easy to make an exam that humans suck at and ChatGPT sucks at less:

1. Make a very long test of easy questions

2. Give students a short amount of time to take the test—the shorter the better

Predictably, ChatGPT is able to answer all of the questions. The questions are fairly easy, so it does okay on most of them and gets partial credit on all of them. Humans weren't able to answer every question, so they do a lot worse. However, you might expect humans to get perfect scores on more questions than ChatGPT does.
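
A toy illustration of the effect (all the numbers here are made up):

```python
# Long test of easy questions under time pressure: the human answers fewer
# questions but perfectly; ChatGPT attempts everything at partial credit.
NUM_QUESTIONS = 50

def overall_score(questions_attempted, credit_per_question):
    return questions_attempted * credit_per_question / NUM_QUESTIONS

human = overall_score(questions_attempted=30, credit_per_question=1.0)    # ran out of time
chatgpt = overall_score(questions_attempted=50, credit_per_question=0.7)  # partial credit

print(f"human: {human:.0%}, ChatGPT: {chatgpt:.0%}")  # human: 60%, ChatGPT: 70%
```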
