The real question is whether the AI’s exam performance means anything at all. Studies show very little overlap between what AIs do in school and the skills they actually need on the job.
My very limited experience with ChatGPT is that it will give you a shallow summary of anything well covered on the internet, without taking much of a side.
That probably passes the test for many tasks, but not all.
Do you need mediocre but cheap answers to things without deep understanding? We've got a Voxsplainer writer in a box!
This is partially because the designers are terrified of it being offensive. They have explicitly said they've tried to make it as inoffensive as possible.
Ok, but I asked it a question about my industry that doesn't touch on race or sex or anything and the output was just as mediocre.
If your goal is to actually identify breakthrough technologies even slightly ahead of the curve, then I don't think it's helpful to apply base rates, for this exact reason. You will always predict “no”, you will be right 95+% of the time, and you will miss every transformative technology until it's too obvious to ignore.
I think AI is on a strong trajectory to be extremely useful, but I'm not sure I would take this bet. “Passing exams” is not an economically useful function (except to students who want to cheat?) and it's not clear to me that AI will be engineered or optimized for this. If you picked something with a clear economic value, like generating marketing copy or writing scripts for TV and movies, I would be much more likely to take the bet.
https://astralcodexten.substack.com/p/heuristics-that-almost-always-work
If you interpret 'apply a 95% negative base rate' as 'just say no to all transformative techs', then of course you're right. But that's not really how one should apply a base rate. You just use Bayes' rule, and let the negative base rate pre-weight your odds that a given tech will be transformative appropriately low.
Good point, but if you're really seriously doing that then I don't see how you could dismiss everything that AI has just become capable of in the last couple of years. That is an extremely strong trajectory towards some very fundamental capabilities—far more than enough to overcome 19:1 odds.
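To put numbers on the "19:1 odds" point: below is a minimal sketch of the odds-form Bayes update described above. The 5% prior comes from the thread; the likelihood ratios (and the `posterior_prob` helper) are purely illustrative, not anyone's actual estimates.

```python
def posterior_prob(prior_prob: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.

    likelihood_ratio = P(evidence | transformative) / P(evidence | not transformative)
    """
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# A 95% negative base rate means prior odds of 1:19 that a given tech
# is transformative. The evidence must be more than 19x likelier under
# "transformative" than under "not" to push the posterior past 50%.
print(posterior_prob(0.05, 19))   # 0.5 -- the break-even likelihood ratio
print(posterior_prob(0.05, 50))   # ~0.72 -- if the evidence is very strong
```

So the base rate doesn't force a blanket "no"; it just sets how strong the recent evidence has to be before "transformative" becomes the better bet.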
This boils down to what we mean by transformative, at least in my view. I mean, my personal evaluation is that AI is 90+% likely to be very useful as a tool in many fields by 2030. It's FAR less likely to replace entire fields. I'm not clear exactly what Bryan is estimating here.
This seems like the correct interpretation to me. In any event, ChatGPT (or a similar tech) purportedly has passed assorted medical exams and bar exams. So I don't know what insight is gained by this bet. You can make a test arbitrarily difficult, such that ChatGPT or its future descendants can't pass it, but what does that prove other than arbitrary difficulty?
Since he expects his students to pass it, it can't be arbitrarily difficult.
A D is not bad for a guy who didn't attend your lectures.
Shouldn't you use a third-party grader, or even a set of graders? Grading is inherently subjective: what you consider a D, another professor might consider a C, depending on the rubric, their mood, student quality, etc. And even if we assume no progress in this technology, which seems unlikely, a beta version of a new tech scored a marginally passing grade in an advanced economics course, probably as good as or better than a substantial percentage of all college students in the country. That seems pretty amazing to me.
Caplan's exams also seem hard. His grading seems particularly demanding (as his Rate My Professor reviews confirm).
How about blinding the AI’s exam by including it with all the other students’ exams for grading? That way, Bryan won’t know whether he’s grading a human student or the AI.
Seems fun, but I don't think Caplan is that biased.
As in it would be fun to see Bryan's reaction to it being an AI.
Wouldn't the right way to do this be to include the AI test among the exams you actually grade during the semester, without identifying it as an AI test? Grading without knowing the identity of the student who wrote the test is probably good for a variety of reasons (though it can introduce complications if you're dealing with essays that students have worked on drafts of) and would make the test more fair.
You should do this in a blinded way! You likely will grade the AI very differently because you know it is an AI. My old econ teacher used to do this to avoid bias – have students write their name on the back of the last page.
+1
This is probably the first Bryan bet I've thought he was way off the mark on. Exciting!
I have a feeling that Caplan will either become an especially hard grader or that he will lose this bet!
Now we need a prediction market on this bet. I'd go for the AI's side, certainly at evens.
I am a pretty big proponent of AI, and I think it will be transformative sooner rather than later. That being said, I actually like your odds of winning this bet. Getting an A- on 5/6 midterms is a *really* stringent criterion.
The reason I think the AI will struggle has nothing to do with intelligence. I don't think Larry Summers would get an A- on 5/6 GMU econ exams if he were given them with little context. A big part of getting good grades is knowing the context of the class: what concepts did the teacher emphasize? How much detail is expected? Do you need to use exact jargon, or are more colloquial synonyms appropriate?
I think this bet would be more "fair" if the AI had access to either (a) past exams with solutions or (b) lecture notes/lecture videos for the entire class leading up to the midterm. This would be more comparable to the situation a student finds themselves in when taking the exam.
Somebody posted this joke on Twitter (apologies, I forget who), but I think it's highly relevant:
I was in the park the other day and walked past a man playing chess against a dog. "Wow," I said, "that's a smart dog."
"Not that smart," the man replied. "I'm winning 3 games to 1."
Seriously, what % of the population could get a D or higher on a labor econ midterm? Maybe 10%?
For certain tasks, ChatGPT is already outperforming humans (e.g., some coding tasks, organizing rough notes into a coherent structure). It's underperforming on internal consistency of answers and on general knowledge. But I can't imagine those things won't be fixed in six years.
One thing I don't understand: if Matthew is right, why would he pick the 6 latest midterms from ~2028? If he's right, professors might be forced to change their assignments and midterms by that point. I think you should use the 6 latest midterms from today, not from 6 years from now.
Additionally, by allowing "any AI selected by Matthew" does that mean you'd allow Matthew to train an AI on your class lectures and midterms? Because if so, there's a chance ChatGPT could pass right now with the right training.
You miss 100% of the moonshots you don't take; that's the problem with the base rate argument.
That said, I think you're correct when it comes to generative AI.
I think ChatGPT was a PR stunt for potentially more valuable but far less flashy use cases, such as B2B automation, data aggregation, and workplace tools.
There is a reason Microsoft is the biggest investor in OpenAI.
My prediction: By 2029, it will be common knowledge that AI aces college exams, in general.
However, Bryan's exams are idiosyncratic enough that the AI might not quite hit this high grading bar, since it will have been trained on conventional economics textbooks (Krugman, etc.). So I think Bryan will win the bet. The AI would need to be trained on his lecture transcripts to avoid this issue.
"2. Bryan will then grade the AI's work, as if it were one of his students"
How will you know you'll be fair? Will you accept the 6 manuscripts shuffled in among your students' exams and grade them anonymously?