
For a few years now, AI sceptics have argued "well it can answer question A, but it still gets harder question B wrong", ignoring that six months ago it couldn't answer A either and it's the direction of travel that's important. It feels like we are now beginning to run out of room to make the questions harder (unless it's to ask questions that humans can't answer either); and the rate of AI improvement shows no sign of slowing down.


Now do this for self-driving cars in 2016.

I'm not making a prediction here! I have very little sense of when the large language model S-curve will change slope. But I think it's pretty clearly the case that the end of progress can come very suddenly.


Self-driving cars are already safer than the average driver (not saying a lot, tbh); the main reason we don't see more of them is government regulation.


This is just not true (or perhaps it has become true very recently, I don't have real-time information about Waymo's results or anything, but it wasn't true say a year ago). It's something that people who like their simple narratives on AI progress tell themselves as cope.


Are you aware that Waymo cars are being actively used as an autonomous taxicab service in the Bay Area and a few other places?


I sure am!

I'm also aware that they can't be used as an autonomous taxicab service almost anywhere else, and that Waymo is not pushing hard for expansion.

So, look, there's certainly some nuance here. Waymo's cars may well be safer than the average driver, *in places where they have extremely good inch-by-inch mapping data*, and *in some weather conditions*, and *while driving in somewhat restricted ways that don't express the range of driving that normal people do* (such as taking unprotected left turns and going around double-parked cars and so forth). And that's legitimate. I think we can round that off to "not safer than the average driver," but if you want to express that as "safer than the average driver but not well-suited to driving in all the places and conditions where the average driver can," then that's cool too.

But what's very clear is that in 2016 or so, we'd seen these great strides since 2007, where each year autonomous cars got vastly better than the previous, and where a straight-line extrapolation put full-self-driving in like 2018 or so, maybe skeptically 2020. And then, just as abruptly as that progress started, it flattened way the hell out. And it has not been because the government jumped in their way.


This kind of just glosses over the fundamental fact that we don't have a way to measure the improvements here like we do for other things like LLMs.

This very article you're commenting on has a nice benchmark. What's the equivalent for self-driving cars?


You couldn't have in 2016, though, because there were no meaningful benchmarks or problem sets you could run a Waymo car on. A Waymo car's ability in 2016 was... '???'. A Waymo car's ability in 2023 is... '???'. (And if you think that DL benchmarks like MMLU are flawed, wait until you look at the California numbers like 'miles per disengagement' everyone is forced to use because there's literally nothing else!) This has been one of the biggest frustrations in following self-driving car progress. There just aren't any relevant benchmarks you can even begin to extrapolate on. They surely exist *internally*, but self-driving car companies are extraordinarily opaque, and what numbers they release tend to be either stripped of any context which would make them meaningful or actively deceptive (looking at you, Tesla).

There is no comparison here with language models which have lots of suites and benchmarks, excellent scaling curves on relevant properties, active prediction markets & forecasters on them, and so on. Thanks to all that, we can say that there are no indications of an S-curve kicking in (note for example the beautiful fit OA shows for GPT-4, with no 'bounce' indicating an unexpected flatlining or divergence from the projected loss).


This strikes me as a lot of handwaving and cope. Do we need something that we can perfectly plot on a graph here?

In 2007 (or 2006 or something, I forget), autonomous vehicles couldn't navigate through the open desert and reach a finish line. The idea that they could be on a road with people was laughable -- it was obvious they'd kill everyone and then themselves, basically instantly. In successive DARPA challenges, they started finishing the course, then finishing harder courses, going through simulated traffic.

By 2013, we had cars that could, in certain situations, safely drive in traffic (on freeways). By 2015, we had cars that could handle (a subset of) urban traffic situations, probably not actually safely compared to human drivers, but like not two orders of magnitude worse or anything. By 2018, Waymo launched autonomous vehicles to non-employees in Scottsdale. And then... we've inched forward. We have a small fleet driving, almost certainly deeply unprofitably, in San Francisco. The Scottsdale area has increased a bit.

This clearly was a surprise to companies working in the autonomous vehicle space. Their internal metrics didn't give them any better prediction that this slowdown was coming.

Does this mean that LLMs will suddenly have a giant slowdown in progress post GPT-4?

It absolutely does not.

Does this mean that people should rein in their confident predictions that LLMs will increase steadily with no end in sight? It does.


This is true, but I think if a technology is showing consistent progress over time, the null hypothesis should be that the progress should continue, and the burden of proof should be on people who think it will stop to give reasons why this should be the case.

Sometimes there are good reasons. Moore's Law held true for decades but we're now running up against a limit imposed by the discrete/atomic nature of matter itself. But I can't think of a good reason why LLM progress should suddenly stop any time soon.


I don't think there is a burden of proof! We aren't in a courtroom, and I suggested a bit of humility in your predictions, I didn't say your wife was ugly.

There is clearly a lot of team thinking here. Like, "Oh, you have to prove that my side is wrong, or else we're right." But there aren't actually sides. I suggest that you shouldn't identify with team "AI will progress quickly."


Chains of thought! Forcing the model to give the "TRUE" or "FALSE" response first robs it of any chance to actually use reason in working out its answer. Instead, I recommend something like prompting the AI to give:

- a set of relevant points,
- further inferences that can be made from those points,
- THEN an explanation leading up to its ultimate answer,
- THEN the actual TRUE or FALSE.

This may seem like a lot of effort, but keep in mind that the AI does not have a consciousness or thought as we do; if we want it to actually "think about a problem", the thinking has to take place inside the text it outputs. Even students get to use a scratch pad.
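
Something like this, as a rough sketch (the wording and the sample question here are my own placeholders, not anything from Bryan's exam; the only point is the ordering, with the reasoning first and the verdict last):

```python
# Rough sketch of the structured prompt described above. The wording and the
# sample question are invented placeholders, not taken from the exam.

def build_prompt(question: str) -> str:
    return (
        "Consider the following statement:\n\n"
        f"{question}\n\n"
        "Work through it before answering:\n"
        "1. List the relevant points.\n"
        "2. State any further inferences that follow from those points.\n"
        "3. Give an explanation leading up to your conclusion.\n"
        "4. Only THEN, on the final line, write TRUE or FALSE.\n"
    )

print(build_prompt(
    "A binding price ceiling set below the market price creates a shortage."
))
```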


Wait I thought that the bet was inflation adjusted! That's a really unfair bet!


It's only unfair to Bryan. If he won, he would only get the payoff at the end.


In the bet post, Bryan says this:

"Matthew will prepay the $500 in January 2023; the preceding terms have been pre-adjusted to compensate Matthew for expected inflation. "

So according to this, the bet does adjust for inflation, and I'm not sure why Bryan says here that it isn't. And it is Matthew who is getting the payoff at the end.


He's pretty clearly saying they factored in inflation when making the odds of the initial bet, so that they don't have to do some sort of weird 'true-up' based on CPI at the end or something.


So it's not an equal odds bet? I think that should have been mentioned in the original post more explicitly. Also, it is better to disentangle the odds from inflation as the odds are supposed to provide an idea about the belief of the bettors.


Don't know what to tell you, it was clear to me and apparently to the bettors.

It wouldn't be that hard to reverse-engineer implied inflation rates and the effective odds of the bet -- either party could have hedged were they so inclined.
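
As a toy illustration of what I mean (the nominal payoff and horizon below are made-up numbers, not the actual terms of the bet), the effective odds in today's dollars fall straight out of the prepaid stake, the nominal payoff, and whatever inflation rate you assume:

```python
# Toy numbers only: the $1,000 payoff and 7-year horizon are hypothetical,
# not the actual terms of the bet. The point is that a fixed nominal payoff
# years from now implies different effective odds for each assumed inflation rate.

def effective_odds(prepaid_now: float, nominal_payoff: float,
                   years: int, annual_inflation: float) -> float:
    """Payoff-to-stake ratio with the payoff deflated to today's dollars."""
    real_payoff = nominal_payoff / (1 + annual_inflation) ** years
    return real_payoff / prepaid_now

# $500 prepaid now against a hypothetical $1,000 nominal payoff in 7 years.
for infl in (0.02, 0.05, 0.08):
    print(f"{infl:.0%} inflation -> effective odds "
          f"{effective_odds(500, 1000, 7, infl):.2f} : 1")
```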


In this case the odds depend on the future inflation, which is what makes it weird. Like if the US undergoes higher than expected inflation within this decade, Matthew loses money even though the bet had nothing to do with the US economy. I'm not sure how you think this isn't weird.


I mean, he is not a Professor of Economics for nothing...


I was curious how much of these results is being driven by the perfect-score responses to "what did notable person X say about subject Y", since those are questions where GPT has a predictable advantage over the typical student with a flesh-brain. Replacing the scores for those questions with the average score of the other T/F/E questions gets 63.5, still an A, but a more marginal one. Still doesn't augur well for your bet, but I'd buy derivatives of your bet up to 20 cents on the dollar from 10-15, up to 30 if you wanted to maximize your chance of winning by dropping "what does X think about Y" questions from future exams.


I want to know what happens to students if you write an exam that GPT does badly on. Use more complex examples that don't all have a similar form or (for quantitative problems) aren't carefully designed to have simple nice answers.

I bet you end up with an exam that is much better at differentiating students who actually have a conceptual understanding from those who have basically done what GPT is doing here: kinda learned the basic problem forms and how to pick out what the instructor is testing, but don't necessarily have any ability to apply that in the real world.


Dang, that sucks about your bet. I think you probably could have predicted this differently though. John Carmack, for example, was talking about how strong LLMs are, and he is usually much more sober than most people with technology-related predictions.

Now that you've lost, what are your predictions for AI capabilities? I think that they're likely to automate large swaths of knowledge work, but I'm not sure when we reach the point where very little knowledge work is done by humans. I would guess that happens within 40 years, but I'm not sure why I guess that time specifically. It could happen a lot sooner, but I don't think that exponentials can continue indefinitely.


When you made the bet, I commented that it was dumb. Basically, with the right training data, GPT-3 could have passed your test. So the bet has always been: is the right data available to answer your questions, and is the AI smart enough to make use of it? The answer to both for this test is clearly yes. You can design a test the AI doesn't have the data to answer, or a test that exploits a common mistake the AI makes to ensure wrong answers. So you can win the bet if you choose, but you lost the war. The AI can pass a test like this; at this point the mechanical chance you'll win is not worth the time. Just concede and pay it out. You've lost; learn from it.


https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks

"As further evidence for this hypothesis, we tested it on Codeforces problems from different times in 2021. We found that it could regularly solve problems in the easy category before September 5, but none of the problems after September 12.

In fact, we can definitively show that it has memorized problems in its training set: when prompted with the title of a Codeforces problem, GPT-4 includes a link to the exact contest where the problem appears (and the round number is almost correct: it is off by one). Note that GPT-4 cannot access the Internet, so memorization is the only explanation."

It sounds like it trained on questions that are more similar to your exam questions. Have you tried asking for similar exam questions to the ones you gave to try to determine what it used for training? Another thing to try would be to ask new questions to test similar concepts.


I am curious how you would grade this answer to question 5:

FALSE. The part about externalities is correct, since by definition they don't affect private returns, so selfish students won't care about them. Severe credit market imperfections, however, even if we presume they only result in too little credit rather than too much, need not make you more eager to continue your education either. If the goal of education is to increase your human capital and social capital, but imperfections make it more difficult to access financial capital, then spending down your financial capital and access to credit is more costly, and can stop you from having the complements necessary to profit from your education. One cannot assume that credit market imperfections will impact student loans or that loan rates reflect expected returns to education, since the student loan market is heavily subsidized by the government, so this does not provide strong evidence that returns to education are likely to be higher. One possible way, however, that excess returns could be implied is if, lacking good information on borrowers, those providing loans are using education as a proxy measure when giving out loans. In that case, continuing one's education would provide additional access to more credit on better terms, which is valuable, increasing returns to education.


You were not my professor but I (more or less) remember virtually all of these questions in some form or another. The set of training data for basic economics across the web (never mind all the "helper/tutor/cheater" sites) is breathtaking, widely distributed, and essentially syndicated. I'm not surprised at all. Yet, this is not intelligence - this is distribution of intelligence (or otherwise). That said, the culmination of the world's syndicated data will lead to increasing reliance (business models) on "good-enough" answers to many (most?) regular requirements across a very broad range of industries. People worry about the Red Herring known as AGI, and they spend time worrying about "alignment" as if it's even an applicable word/concept outside of theoretical models that left the station long ago. Alignment without understanding the power of markets is an academic exercise. Don't worry about alignment, worry about markets. Don't worry about AGI when "good enough" narrow AI is already here. We have arrived. Where we are going is a fun but useless exercise.


> Yet, this is not intelligence - this is distribution of intelligence (or otherwise)

I think the interesting question that we'll be discovering the answer to in the next few years is how much of current human "intelligence" is really some form of regurgitating patterns and applying them to semi-novel conditions. I suspect that encompasses a large part of what we think of as intelligence. Given a goal, identify a set of known techniques that chart a path toward that goal. It leaves out creativity - except that much creativity can also basically be simulated as recombinations of existing content and patterns. It moves the goalposts of intelligence to "novel insights and highly creative solutions".


We love models. Some people work diligently to establish, but mostly to extend, models. We are a model people. Models become tokens. That's ultimately who we are. In this sense, the language models seem so natural. Just imperfect. The amazing thing is the clarity of the moment: we can all agree that the language models are imperfect. That agreement won't last too long, regardless of how imperfect they will always be - that is, unless we (as in some universities, religions, etc.) begin to believe the models are truth itself.


Interesting that it got the lowest score on what was, in my opinion, the easiest question (licensing requirements & prices).


I think the AI didn't like the use of "definitely" (or maybe that's just me, because it set my teeth on edge). That assumes the absence of any other outside factors, because lots of things could possibly happen that would have the overall result of lower wages. The question should reflect that (or maybe there's a universal rule for the test about assuming nothing else changes, which would help). Or the question should say probably, or it will cause upward pressure, etc.


I have played with this question a lot recently, and it is the "definitely" that GPT-4 gets hung up on. It will endlessly argue that there are many, many factors that could make it not happen, so you can't say "definitely". Which is fair!

If you simply change the wording to "almost certainly" then it gets the question correct with the correct reasoning.

Interestingly, a second way to make it answer True, while retaining the "definitely" wording, is to prompt it to answer "according to the principles of classical economics". This gets it to discard all these other nuances that it's worried about and just think about the supply/demand stuff.
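
Roughly, the three variants look like this (the statement below is a generic stand-in for illustration, not the exact exam wording):

```python
# Sketch of the wording variants discussed above. The base statement is a
# generic stand-in, not the actual exam question.

base = ("Stricter licensing requirements for electricians will {qualifier} "
        "raise the price of electrical work.")

variants = {
    "original-style wording": base.format(qualifier="definitely"),
    "softened wording": base.format(qualifier="almost certainly"),
    "framed wording": ("Answer according to the principles of classical "
                       "economics: " + base.format(qualifier="definitely")),
}

for name, prompt in variants.items():
    print(f"{name}:\n  {prompt}\n")
```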


I think this is why Tyler has been bullish on ChatGPT: he saw this exponential when it was first released.


Has this bet been conceded yet? (has it been run again with all 6 exams?)


Could the success of GPT-4 simply be a result of you posting the test results, answer key, and discussion of the correct answers?


For the next exam, do you mind sharing where we can read certain passages? I have The Accidental Theorist and probably whatever Landsburg's passage is from (unless it's from an article). However, I really didn't know the context of the passages, so it felt really difficult to answer them correctly. I got the Krugman answer correct, but, as you (being besties with a philosopher) know, it may not have been a justified belief.


> "The most natural explanation to my mind was that my blog post made it into the new training data, but multiple knowledgeable friends assure me that there is no new training data."

Bryan, are you 100% sure about this? I've seen ChatGPT reference things after 2021. Example confirmation: https://twitter.com/tszzl/status/1638655356346957824

They might well be correct that your blog post is entirely separate from the training data. But I've heard "the training data ends in 2021" point referenced a few times, and at least in that simple form, it doesn't seem to be the full story. And the results of a bet like this absolutely hinge on the precise details.
