I went on The Booking Club podcast. We recorded at The Cheshire Cheese and talked about late bloomers, including Samuel Johnson. Spotify. Apple. Acast.
Several people I know think that the new model from OpenAI, o3, is AGI. Who am I to disagree? But the current way of assessing whether we have AGI is through our own impressions. There is no benchmark. So here are some of my early impressions.
I’m not, as you know, a downer on AI capability, and like everyone else I am very impressed, but I don’t think o3 is what I expected AGI to be. The progress is real! I just don’t think this is AGI yet.
Every time there is a new model, I ask it a few of my standard questions so I can compare. My favourite is this A-level question from the 1970s, to which you could give either a fairly mundane answer or a very, very impressive answer indeed.
“Whatever else its merits, The Clerk’s Tale does not achieve its effects through surprise.” Discuss.
Overall, o3 gave a great response, better than any other answer I have had to this question from an LLM. But there are problems.
It calls The Clerk’s Tale an “exemplum”, which is not quite right. Exemplums were tales told within a sermon, used to demonstrate one particular moral, such as “money is at the root of all evil”. In Chaucer, The Friar’s Tale and The Pardoner’s Tale are often called exemplums, but not The Clerk’s Tale. Some call it a moral apologue. Here is what Harvard has to say:
Most interpretations of the tale assume it is a “religious fable,” as Petrarch seemed to believe. The tale is taken as purely symbolic and Griselda is regarded as a type of Job. Yet there are suggestions of depth to the characters of Walter and Griselda that make it difficult to dismiss her as merely a symbol of Christian patience in the face of adversity.
You might think I am quibbling, but if an LLM were this loose with terminology in other fields, it would be a much clearer case of failure. Whether or not “exemplum” is the right word is a large part of the question being asked… Also, Job is, as always, not mentioned by o3. This is a major oversight, not a quibble. At what point does the model need to know this tale’s most basic relationship to the Bible before the G in AGI becomes undeniable?
Again, my overall impression is good! When I asked it the All Souls exam question about the Wife of Bath, it gave an excellent response, though I don’t know whether All Souls would admit anyone who gave this answer. Likewise, o3 understands why Prospero repents so suddenly. It’s getting better at the “was Baroness Schraeder a Nazi?” question too. The commentary it wrote on Sonnet 27 was good, but hardly PhD level, I think.
When it comes to writing poetry, though, o3 is still far behind the standard for AGI. I asked for a quatrain in slant rhyme and got this.
Moonlight drips through lattice, thin as mist,
echoing footsteps graze the quiet dust;
morning waits, a clock whose gears insist
on turning dreams to silver‑flecked rust.
There are many examples here, and most of them are flat-out failures, even though o3 knows what slant rhyme is.
If you click on that link, you’ll see at the bottom that I dropped in a line from an old ballad and it explained it perfectly. For that sort of thing, o3 is very impressive. But any assessment of whether we have AGI has to take some of these points seriously: AGI has to involve being able to deal with humanistic ambiguity and complexity. It has to involve being able to rhyme reliably in assonance, no?
The smarter the model is claimed to be, the more easily it should be able to tell me that the Baroness is sort of a Nazi and that The Clerk’s Tale isn’t quite an exemplum (or else make a strong case for its claim, with proper context). And it should be able to do slant rhymes! School pupils can do that if you teach them!
What does the “general” in general intelligence mean if all of this is deemed to be hair-splitting?
Finally, I will let you judge whether it does a good job of writing like me…
This is my previous piece about AI as literary critic.
How good is AI at literary criticism?
Until recently, although I was excited about LLMs, I thought they were pretty hopeless literary critics. (And the poetry they write was, and is, the purest dross.) While they were doing well on maths and law exams, they made little progress on English exams.
And here is my piece about AI and taste.
Will AI have a taste all of its own?
Humanities people need to use AI if they want to have good opinions about it
The difference between the models has now become completely baffling to me. o1 seems to have disappeared; o3 apparently "uses advanced reasoning" whilst o4-mini is "fastest at advanced reasoning". What *on earth* do the numbers even mean anymore?!
This is what my Claude bot, which has been trained on my home library of music, DVDs and literature, has to say about the Baroness:
"In the narrative architecture of the film, the Baroness functions as what film theorists would call a "moral antagonist"—not villainous in the traditional sense but representing values antithetical to the film's ethical framework. She embodies the complicity that enabled Nazism's rise: the privileged class that prioritized social position over moral resistance.
The Baroness doesn't need to wear a swastika to serve the film's critique of those who stood by. Her elegant dresses and Vienna sophistication create a more insidious representation—the tacit supporters who never had to declare allegiance because their class position already allied them with power.
So while textually she's not a card-carrying Nazi, subtextually she represents something perhaps more historically accurate and disturbing: the comfortable accommodation that allowed fascism to flourish. The film's romantic resolution isn't just about the Captain finding true love—it's about rejecting the moral compromise the Baroness represents."