The real question is whether the AI’s exam performance means anything at all. Studies show very little overlap between what AIs do in school and the skills they actually need on the job.
My very limited experience with ChatGPT is that it will give you a shallow summary of anything that has a lot of data on the internet, without taking much of a side.
That probably passes the test in many tasks, but not all.
Do you need mediocre but cheap answers to things without deep understanding? We've got a Voxsplainer writer in a box!
This is partially because the designers are terrified of it being offensive. They have explicitly said they've tried to make it as inoffensive as possible.
Ok, but I asked it a question about my industry that doesn't touch on race or sex or anything and the output was just as mediocre.
If your goal is to actually identify breakthrough technologies even slightly ahead of the curve, then I don't think it's helpful to apply base rates, for this exact reason. You will always predict “no”, you will be right 95+% of the time, and you will miss every transformative technology until it's too obvious to ignore.
I think AI is on a strong trajectory to be extremely useful, but I'm not sure I would take this bet. “Passing exams” is not an economically useful function (except to students who want to cheat?) and it's not clear to me that AI will be engineered or optimized for this. If you picked something with a clear economic value, like generating marketing copy or writing scripts for TV and movies, I would be much more likely to take the bet.
https://astralcodexten.substack.com/p/heuristics-that-almost-always-work
If you interpret 'apply a 95% negative base rate' as 'just say no to all transformative techs', then of course you're right. But that's not really how one should apply a base rate. You just use Bayes rule, and allow the negative base rate to pre-weight your odds that a given tech will be transformative appropriately low.
Good point, but if you're really seriously doing that then I don't see how you could dismiss everything that AI has just become capable of in the last couple of years. That is an extremely strong trajectory towards some very fundamental capabilities—far more than enough to overcome 19:1 odds.
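For concreteness, here is the arithmetic these two comments are pointing at, as a minimal sketch; the likelihood ratio is an illustrative number, not an estimate of anything:

```python
# Bayes update: start from the 5% base rate (19:1 odds against) and see
# what evidence strength is needed to push the posterior past 50%.
prior = 0.05  # P(transformative) before any evidence

# Likelihood ratio: how much more likely the recent capability gains are
# if the tech IS transformative than if it is not. Illustrative only.
likelihood_ratio = 30

prior_odds = prior / (1 - prior)               # 1:19, i.e. ~0.053
posterior_odds = prior_odds * likelihood_ratio
posterior = posterior_odds / (1 + posterior_odds)

print(f"posterior P(transformative) = {posterior:.2f}")  # ~0.61
```

Any likelihood ratio above 19 flips the posterior past even odds, which is exactly the sense in which strong enough evidence can "overcome 19:1 odds."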
This boils down to what we mean by transformative, at least in my view. I mean, my personal evaluation is that AI is 90+% likely to be very useful as a tool in many fields by 2030. It's FAR less likely to replace entire fields. I'm not clear exactly what Bryan is estimating here.
This seems like the correct interpretation to me. In any event, ChatGPT (or a similar tech) purportedly has passed assorted medical exams and bar exams. So I don't know what insight is gained by this bet. You can make a test arbitrarily difficult, such that ChatGPT or its future descendants can't pass it, but what does that prove other than arbitrary difficulty?
Since he expects his students to pass it, it can't be arbitrarily difficult.
A D is not bad for a guy who didn't attend your lectures.
Shouldn't you use a third-party grader or even a set of graders? Grading is inherently subjective. What you consider a D, another professor might consider a C, depending on the rubric, their mood, student quality, etc. And even if we assume no progress in this technology, which seems unlikely, a beta version of a new tech scored a marginally passing grade in an advanced economics course - probably as good as or better than what a substantial percentage of all college students in the country could do. That seems pretty amazing to me.
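To make "a set of graders" concrete, one could have several professors grade the same exams blind and check how often they agree. A minimal sketch with hypothetical grades and a simple exact-agreement rate (a chance-corrected statistic like Cohen's kappa would be more rigorous):

```python
from itertools import combinations

# Hypothetical blind grades from three graders for the same five exams.
grades = {
    "grader_a": ["B", "C", "D", "A", "C"],
    "grader_b": ["B", "C", "C", "A", "D"],
    "grader_c": ["B", "D", "D", "A", "C"],
}

def agreement(g1, g2):
    """Fraction of exams on which two graders assign the same grade."""
    return sum(a == b for a, b in zip(g1, g2)) / len(g1)

for (name1, g1), (name2, g2) in combinations(grades.items(), 2):
    print(f"{name1} vs {name2}: {agreement(g1, g2):.0%} agreement")
```

If agreement between independent graders is low, then "the AI got a D rather than a C" tells us as much about the grader as about the AI.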
Caplan's exams also seem hard. His grading seems particularly demanding (as his Rate My Professor reviews confirm).
How about blinding the AI's exam by including it with all the other students' exams for grading? That way, Bryan won't know whether he's grading a human student or the AI.
Seems fun, but I don't think Caplan is that biased.
As in it would be fun to see Bryan's reaction to it being an AI.
Wouldn't the right way to do this be to include the AI test among the exams you actually grade during the semester, without identifying it as an AI test? Grading without knowing the identity of the student who wrote the test is probably good for a variety of reasons (though it can introduce complications if you're dealing with essays that students have worked on drafts of) and would make the test more fair.
You should do this in a blinded way! You likely will grade the AI very differently because you know it is an AI. My old econ teacher used to do this to avoid bias – have students write their name on the back of the last page.
+1
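To make the blinding suggestion above concrete, here is a minimal sketch, assuming one AI-written exam mixed into a stack of thirty hypothetical student exams:

```python
import random

# Hypothetical exam set: student exams plus one AI-written exam.
exams = [{"author": f"student_{i}", "text": "..."} for i in range(30)]
exams.append({"author": "AI", "text": "..."})

# Shuffle and assign neutral IDs so authorship is hidden during grading.
random.shuffle(exams)
blinded = {f"exam_{i:03d}": e["text"] for i, e in enumerate(exams)}
key = {f"exam_{i:03d}": e["author"] for i, e in enumerate(exams)}

# The grader sees only `blinded`; `key` stays sealed and is used to
# unmask the AI's exam after all grades are recorded.
```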
This is probably the first Bryan bet I've thought he was way off the mark on. Exciting!
I have a feeling that Caplan will either become an especially hard grader or lose this bet!
Now we need a prediction market on this bet. I'd go for the AI's side, certainly at evens.
Somebody posted this joke on Twitter (apologies, I forgot who), but I think it is highly relevant:
I was in the park the other day, and walked past a man playing chess against a dog. "Wow," I said, "That's a smart dog."
"Not that smart," the man replied. "I'm winning 3 games to 1."
Seriously, what % of the population could get a D or higher on a labor econ midterm? Maybe 10%?
For certain tasks, ChatGPT is already outperforming humans (e.g., some coding tasks, organizing rough notes into a coherent structure). It's underperforming on internal consistency of answers and on general knowledge. But I can't imagine those things won't be fixed in six years.
One thing I don't understand: if Matthew is right, why would he pick the 6 latest midterms from ~2028? If he's right, professors might be forced to change their assignments and midterms by that point. I think you should use the 6 latest midterms from today, not from 6 years from now.
Additionally, by allowing "any AI selected by Matthew" does that mean you'd allow Matthew to train an AI on your class lectures and midterms? Because if so, there's a chance ChatGPT could pass right now with the right training.
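If the bet allowed that kind of training, the mechanics would look roughly like the sketch below: turn past midterms (and lecture transcripts) into prompt/completion pairs and fine-tune on them. The file names and JSON layout here are hypothetical, and the fine-tuning call itself varies by provider:

```python
import json

# Hypothetical input: past midterm questions paired with model answers.
with open("past_midterms.json") as f:
    midterms = json.load(f)  # e.g. [{"question": "...", "answer": "..."}, ...]

# Write JSONL prompt/completion pairs, a common format for fine-tuning
# hosted language models on custom material.
with open("train.jsonl", "w") as out:
    for item in midterms:
        record = {
            "prompt": f"Labor economics midterm question:\n{item['question']}\n\nAnswer:",
            "completion": " " + item["answer"],
        }
        out.write(json.dumps(record) + "\n")
```

Whether the bet's terms permit this kind of targeted preparation seems worth pinning down before 2029.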
You miss 100% of the moonshots you don't take - that's the problem with the base rate argument
That said, I think you're correct when it comes to Generative AI
I think ChatGPT was a PR stunt for potentially more valuable but far less flashy use cases, such as B2B automation, data aggregation, workplace tooling, etc.
There is a reason that Microsoft is the biggest investor in OpenAI.
My prediction: By 2029, it will be common knowledge that AI aces college exams, in general.
However, Bryan's exams are idiosyncratic enough that the AI might not quite hit this high bar, since it will have been trained on conventional economics textbooks (Krugman, etc.). So I think Bryan will win the bet. The AI would need to be trained on his lecture transcripts to avoid this issue.
"2. Bryan will then grade the AI's work, as if it were one of his students"
How will you know you're being fair? Would you accept having the 6 AI exams shuffled in among your students' exams and grading them anonymously?
Great bet! I've posted this elsewhere, but you (and other commenters) may be interested in seeing a working data scientist's opinion about ChatGPT that I wrote about a month ago, wherein I more-or-less agree with the sentiment of "grossly overpromising and underdelivering:" https://ipsherman.substack.com/p/an-opinion-about-ai-chatgpt-and-more
Unrelatedly, in 2021 I did a post on how much to worry about COVID for kids. I wouldn't usually comment at all, let alone about something unrelated, but in this post I refer to Kahneman’s maxim as well (using the same terminology!): https://ipsherman.wordpress.com/2021/09/11/why-i-dont-make-my-kids-wear-masks/ <- I was (at least partially) inspired to write this and a previous post by your questions about how much worse was COVID than the normal flu.
Thank you Professor Caplan for your years of insightful, prolific, social-desirability-bias-eschewing blogging!