The wonder is not GPT's performance itself but that it was achieved under such a capricious and arbitrary grading scheme. Once again I am grateful never to have faced one of Caplan's tests.
On the contrary, I would have been delighted to get an A with such low scores. An A on the exams I took meant getting 70%, or 85%.
I thought the same thing, but I think these questions are based on texts that are missing from the blog post. This seems closer to a reading comprehension test: "what does the author (Bryan Caplan) think about x?"
Probably a curve. I had plenty of tests like that.
My favorite was a prof whose curve put the mean at the B/C borderline, with each grade one standard deviation wide: more than 1 SD above the mean was an A, 0 to +1 SD was a B, 0 to -1 SD was a C, -1 to -2 SD was a D, and below -2 SD was an F.
On the first test in that class I got something like a 92 and it was a B+. On the second test, a 58 was an A-. On the third test the mean was about the same as the second, and I scored in the 80s, so I locked up my A with ease.
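For concreteness, here is a minimal sketch of that curve in Python; the scores are hypothetical, and the plus/minus refinements are ignored:

```python
import statistics

def curve_grades(scores):
    """Assign letter grades on a curve where the mean sits at the
    B/C borderline and each grade band is one standard deviation wide."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    grades = {}
    for score in scores:
        z = (score - mean) / sd
        if z > 1:
            grades[score] = "A"
        elif z > 0:
            grades[score] = "B"
        elif z > -1:
            grades[score] = "C"
        elif z > -2:
            grades[score] = "D"
        else:
            grades[score] = "F"
    return grades

# Hypothetical class: a 58 earns an A when the mean and spread are low enough.
print(curve_grades([35, 42, 47, 51, 58]))
# {35: 'D', 42: 'C', 47: 'B', 51: 'B', 58: 'A'}
```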
A system like that is neither capricious nor arbitrary; it's just non-standard.
>It also misses the real point of a UBI experiment: To convince believers that this obviously misguided philanthropic strategy is misguided.
This is just blatantly docking points for disagreeing with your politics. It's easy to justify a UBI experiment as an EA, even if you think UBI wouldn't be the best thing to *personally* fund, so long as you think UBI has a good chance of being a significant improvement over the status quo and an experiment might be a low-cost way of helping bring about that change. EAs don't have to *just* be about malaria nets for the very poor.
I also concur with the other posters' sentiment that the questions and answers are not that difficult, but are obviously loaded to make a point about your politics.
I wonder if the next generation of AI will seek to do well on tests by predicting how to flatter the graders.
Funnily enough, the success of GPT-4 is pushing me in a more pro-UBI direction. I've always maintained that UBI is not, as its supporters generally claim, an efficient use of money in the current economy; either it doesn't give enough money to people who need it, or it wastes vast sums on people who don't need it.
But in an economy with mass unemployment from automation, that calculus changes. Suddenly if most people really need state support, while AI is delivering untold riches to a select few, then UBI becomes a very good idea. And that scenario might not be all that far off.
Bryan, is this going to make you update your thoughts on AI safety?
If 'capabilities' is 5 years ahead of what you expected (this should be a huge shock to your model of AI), would you be worried if AI safety/alignment is 5+ years behind capabilities?
Apart from all the AI stuff, the exam seems weird and biased. First of all, the questions all seem really easy; I feel like I could pass the exam by reading 7 random Caplan blog posts. And they seem so biased, especially number two. Politics is the mind-killer (https://www.lesswrong.com/posts/9weLK2AJ9JEt2Tt8f/politics-is-the-mind-killer): if you want people to think rationally or to learn something, don't use politics unless you have to. Six is also kind of weird, but I do like the mention of EA.
But I have never seen another Public Policy exam, so maybe other ones are worse.
I agree on how easy this test is, especially compared to the one used previously.
How were the scores so low? It seems pretty easy. I am not saying I would do as well as the AI, but just from reading your blog posts, I would have scored pretty well.
I'm incredibly surprised that this is what you use to grade your students. All the questions are incredibly loaded, easy, and basically boil down to "did you read me? paraphrase my beliefs back to me."
Honestly, I'm shocked that this passes for an exam, and I would have been shocked if GPT-4 had failed it, given the things I've seen it accomplish.
So now the question is, what are you going to do about it? How do you give a fair test when GPT-4 makes it so easy to cheat on a test? The way I see it there are three options, none of them good.
Option 1: The chump. Just keep doing things the way they are now. Some students will use GPT-4 to cheat, some will try to do it themselves. The ones who do it themselves will be punished by a lower score, especially with GPT-4 bumping up the grading curve. Eventually, all the top students will be cheaters.
Option 2: The prison ward. All students are forced to take the tests in person, with pencil and paper only. All of them are forced to turn over all cell phones and electronic devices before the test, maybe going through something like airport security. The teacher must sit and watch them the entire time. Everyone will be miserable, but at least it's still fair.
Option 3: Give up. Classes will no longer be graded. This will make it very hard for students to motivate themselves to study, and also very hard for grad schools/employers to tell which students did a good job in the classes. The value of a college degree continues its downward slide.
Like I said, I really don't like any of these options, but what else is there? It seems terrible that we've invented such a powerful cheating device with no defense.
Option 2 is exactly how I had to take ALL exams. It's not really that burdensome or unreasonable. It does feel a bit like prison, but then you are essentially applying for parole from the educational system!
Option 2 is really not that difficult: smartphones have been around for a while, and it's not that hard to prevent their use in person. The problem is no more take-home work.
Even airport-style security might not prevent prepositioning or smuggling of devices into the test area. Another layer of security to add to Option 2 would be to use electronic countermeasures to cut off wireless internet access locally, like building a Faraday cage around the test area and/or employing a jamming device during the test. Random seating assignments, unknown to the test-takers beforehand, would also be a prudent move.
Of course, there may be an Option 4 in the not-too-distant future: advanced text-based machine learning tools will likely be developed for detecting cheating on answers to essay-type questions, including collusion among multiple test-takers as well as the use of ChatGPT-like tools or straight copying from online sources. I'm not sure, though, that the fear of an electronic minder flagging one's answers as suspicious would make a student any less miserable than old-fashioned in-person proctoring by humans does.
Since the trend in higher education these days is towards giving up on merit-based standards anyway in favor of "equity" in handing out credentials, I expect that Option 3 will gain traction, and in the long run earning a college degree will cease to be associated with one's intellectual skills. The point of collectivism, of course, is to deny individuals any independent capacity to think and act for themselves; a properly-trained machine serves their purposes much better than an educated human does.
I'm sorry, "the point in collectivism" is emphatically not "to deny people any capacity to think for themselves". That's an argument against collectivism but it's sure as hell not what collectivists themselves actually believe.
Many collectivists don't grasp what the logical implications of their own beliefs are, or why their particular ideology puts them in the same broad family of belief systems as rival collectivist ideologies that they often believe to be a polar opposite of their own.
Fine, but "these people don't understand the implications of their own beliefs" is a very different and much more defensible criticism than "these people do understand the implications and think those implications are good".
Sure, but I described the denial of the intellectual and moral autonomy of individuals as the point of collectivism (i.e. a defining characteristic of a broad family of ideologies), which isn't necessarily the point that would be made or emphasized by a given collectivist.
You could give people an oral exam instead. There doesn't seem to be any way to cheat at that yet, though it will become possible eventually.
This is honestly an astoundingly vacuous exam.
The response to the second question (asking why it's surprising that liberal Californians are moving to Texas) tells me that GPT-4 is ready to host its own television show.
Stumbled across this due to my interest in the AI content, but more interesting is the content of this "economics" exam. Q1 is really the only objective, non-loaded question on the exam, and I'd wager any real economist who didn't write this exam would score ChatGPT's answers better than Prof. Caplan's "suggested answers." Want to Bet On It?
Q3: "It also misses the real point of a UBI experiment: To convince believers that this obviously misguided philanthropic strategy is misguided." Perfect example of why so many scientists don't consider economics to be a science at all- Lack of testable hypotheses, lack of consensus, and inherent political overtones.
Q2: Every year, 60-80,000 Californians move to Texas and 35-40,000 Texans move to California. What would be surprising is if all those individuals moved for political reasons. In liberal California, 49% of residents identify as Democrat, 30% as Republican, and 21% have no lean. In conservative Texas, 40% identify as Democrat, 39% as Republican, and 21% have no lean. Should we be surprised that literally half of each state's population doesn't move across the country to better self-sort by political affiliation? There's more to life than party affiliation, and party affiliation is more complex than any single issue such as guns or taxes. https://www.pewresearch.org/religion/religious-landscape-study/compare/party-affiliation/by/state/
Q5: Even Milton Friedman agreed that the benefits of occupational licensing can outweigh the costs in some cases where there is a public safety concern. An economist should also consider external costs, not only individual costs. Should we let a budget airline save a few bucks by hiring cheaper unlicensed pilots? Should we allow insurance companies to save a few bucks by approving only cheaper unlicensed surgeons?
By contrast, GPT-4 failed my final exam egregiously: https://www.thebigquestions.com/2023/04/05/gpt-4-fails-economics/
ChatGPT (not sure whether it was GPT-4) scores a zero on Steven Landsburg's exam: https://www.thebigquestions.com/2023/04/04/chatgpt-fails-economics/
Does this update any of your priors about other things machine learning programs may be able to do in the near future and its likely consequences?
For what it's worth, I had trouble replicating results for Question 1 on Bing Chat, which I thought was powered by GPT-4. I tried all three of "Creative Mode", "Balanced Mode", and "Precise Mode". In Balanced Mode, Bing Chat actually found this Bet On It post and saw Bryan Caplan give the answer as 80M but went through its own calculation anyways and came up with the wrong answer. In Creative Mode, it just gave an incorrect answer. In Precise Mode, it did the correct calculation.
However, I was suspicious because it clearly already knew about this Bet On It post. So, I decided to ask a similar question in Precise Mode to see whether it could answer. I asked, "Suppose that local authorities are considering banning a harmful technology. They estimate that the ban will reduce quality of life by 10% for a year but save 732 life-years. How high does total population have to be before the ban fails CBA?"
Bing Chat replied: "Cost-benefit analysis (CBA) is a technique used to compare the total costs of a policy or project with its benefits. In this case, the cost of the ban is a 10% reduction in quality of life for one year and the benefit is 732 life-years saved. To determine whether the ban fails CBA, we need to know the value of a statistical life-year (VSLY) and the total population. The ban fails CBA if the total cost of the ban (10% reduction in quality of life for one year for the entire population) is greater than the total benefit (732 life-years saved * VSLY). Without knowing the VSLY and total population, it is not possible to determine whether the ban fails CBA."
This version of the problem is even easier than the original because one doesn't need to subtract two numbers to get the net benefit of 732 life-years. I eliminated the required subtraction step because I didn't want Bing Chat to just be able to use the original solution in this Bet On It post. Using the same solution method as the original problem, the answer is that the ban fails CBA for populations above 732/(0.10) = 7320. (VSLY appears on both the cost and benefit side and, thus, cancels.)
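For what it's worth, the modified problem reduces to a single division. Here is a minimal sketch in Python, just restating the calculation above (the function name is mine):

```python
def population_threshold(qol_reduction, life_years_saved):
    """Smallest population at which the ban fails cost-benefit analysis.

    Cost    = qol_reduction * population * VSLY  (one year, whole population)
    Benefit = life_years_saved * VSLY
    VSLY appears on both sides and cancels, so the ban fails CBA
    whenever population > life_years_saved / qol_reduction.
    """
    return life_years_saved / qol_reduction

print(population_threshold(0.10, 732))  # 7320.0
```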
Was there any special prompting that Collin Gray had to use to get GPT to work? It's possible that Bing Chat's implementation is just worse than other GPT-4 implementations. I have found Bing Chat makes many mathematical errors, e.g., algebraic errors, even though others have reported GPT-4 doing very well on SAT-Math and GRE-Quant. It's also possible that I'm not prompting it skillfully.
I didn't do any prompt engineering in giving the test to ChatGPT; it was identical to the original test in content and format.
As a sanity check, I fed your modified question into ChatGPT (GPT-4) and it got the right answer, following the same steps as in its previous answer.
I'm guessing that some of the difference in responses between Bing Chat and ChatGPT comes down to fine-tuning direction. Along the Pareto frontier between giving a plain answer and not over-simplifying, Bing Chat strays more towards not over-simplifying, causing it to give disclaimers in places where ChatGPT just rolls with the implied simplifying assumptions.
It seems that the AI has been trained to become a teacher's pet and docilely submit to indoctrination, aligning with whatever fuzzy philosophy the teacher is espousing.
But doesn't testing GPT-4 on your exams contradict your free-market principles?
Shouldn't consumers be free to (blindly) trade off quality against price in markets with asymmetric information (the market for lemons)? Didn't Milton Friedman urge us to be "free to AI hallucinate"?
How much time did people have to complete the test? Do students usually get a chance to answer every question? If students had more time to do the test, would they do a lot better?
Consider that it's quite easy to make an exam that humans suck at and ChatGPT sucks at less:
1. Make a very long test of easy questions
2. Give students a short amount of time to take the test—the shorter the better
Predictably, ChatGPT is able to answer all of the questions. The questions are fairly easy, so it does okay on most of them and gets partial credit on all of them. Humans weren't able to answer every question, so they do a lot worse. However, you might expect the number of questions humans got perfect scores on to be higher than the number ChatGPT did.
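To make that concrete, here is a toy scoring model in Python; all the numbers are invented for illustration:

```python
def expected_score(num_questions, num_attempted, credit_per_attempt):
    """Fraction of total marks earned when only `num_attempted` questions
    are answered, each earning `credit_per_attempt` (1.0 = full marks)."""
    return num_attempted * credit_per_attempt / num_questions

# 50 easy questions under a tight time limit:
human = expected_score(50, num_attempted=30, credit_per_attempt=1.0)   # perfect answers, but runs out of time
model = expected_score(50, num_attempted=50, credit_per_attempt=0.75)  # partial credit on everything
print(f"human: {human:.0%}, model: {model:.0%}")  # human: 60%, model: 75%
```

The human has thirty perfect answers to the model's zero, yet still loses on total score, which is the effect described above.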