The difference between the models has now become completely baffling to me. o1 seems to have disappeared - o3 apparently "uses advanced reasoning" whilst o4-mini is "fastest at advanced reasoning". What *on earth* do the numbers even mean anymore?!
It's actually managed to get even more confusing. You would normally think that 'o3' is a 'full' model, and 'o4-mini{-high}' is a cheaper version of the unreleased o4 OA is still working on, because that is how the OA naming/versioning has mostly gone before: '-mini' implies it's a smaller, dumber, but much cheaper version. And so you would have a simple heuristic: use o3 when you want the best (and expensive) results, then use o4-mini{-high}s when that is overkill and you need to save money. And the OA blog post benchmarks seem to imply this: o3 isn't strictly superior, but it does overall seem substantially better than the o4-minis.
Except... apparently the o4-minis handle image-related tasks a lot better than o3, and others have noted that the o4-minis are often a lot better than o3 elsewhere too, particularly in creative tasks. For example, when I give them some of my creative writing tests, like 'translate Milton into alliterative verse', the o4-minis are noticeably better. (Also, o3 seems to have an alarming tendency to make things up, double down on mistakes, and manipulate users, which doesn't seem to be nearly as severe in the o4-minis.)
And aside from that, there's just a lot of variation in outputs across LLMs and sessions. Different models can answer in different ways, and the same model can answer differently in different sessions... Case in point: OP flagellates o3 for never mentioning the Book of Job, but when I ask o4-mini-high the exact same question, it does make a connection to Job: https://chatgpt.com/share/68028c71-5394-8006-b6fd-b8ac3c66c3de "tragedies of moral endurance (Job, or even a Senecan play): we expect the fall, but we stay for the ethical and emotional probing."
4.5, on the other hand, doesn't mention Job, because it instead makes a lot of hay out of the story of Abraham & Isaac, pointing out that Abraham complies without complaint and does everything the Lord demands, but we know there will be no human sacrifice, so there is no surprise when the sacrifice doesn't happen. And this is perhaps because 4.5 thinks that Job is *not* a better parallel for The Clerk's Tale, as it says when you ask it about the Book of Job: https://chatgpt.com/share/68028e42-2e6c-8006-9dea-4abad5292d89
> ...The Book of Job provides a clear, theological vindication of suffering—God explicitly reveals himself, and Job is rewarded. Chaucer’s tale, although concluding with Griselda’s restoration, remains unsettling. Walter’s arbitrary cruelty is not divinely justified, and the Clerk himself remains ambivalent about the morality of Walter's tests.
(I know it hedges by saying that Job is 'perhaps' a better Biblical comparison than Abraham, but coming from a relatively sycophantic chat LLM, I take that as criticism.)
So, if an LLM doesn't mention the Book of Job, maybe that's a quirk of that prompt or that model, maybe a chance outcome of that session, or maybe the LLM in question just doesn't agree with OP that Job's such an awesome parallel that it simply *must* be mentioned in even a short general discussion of the Clerk's Tale. Who knows? Not I. (Benchmarking frontier LLMs is hard, let's go shopping.)
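If you wanted to separate "chance outcome of that session" from "that model just doesn't go there", the obvious move is to ask the same question many times and count. A minimal sketch, assuming you supply your own `ask` function that sends a prompt to whatever model you are testing and returns its reply as text; `ask` and `mention_rate` are hypothetical names of mine, not any vendor's API, and the substring match is deliberately naive:

```python
from typing import Callable


def mention_rate(ask: Callable[[str], str], prompt: str, keyword: str, trials: int = 10) -> float:
    """Ask the same prompt `trials` times; return the fraction of replies containing `keyword`.

    `ask` is a caller-supplied (hypothetical) function wrapping the model under test.
    Matching is a case-insensitive substring check, so "Job" also matches "job market".
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    needle = keyword.lower()
    hits = sum(needle in ask(prompt).lower() for _ in range(trials))
    return hits / trials
```

A rate near 0 or 1 over a few dozen fresh sessions says something about the model and prompt; anything in between says mostly that a single session tells you little.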
How interesting about o4-mini. I haven’t tried it yet. I have started noticing some remarkable hallucination from o3. Today it gave me a really superb answer about Gulliver’s Travels, but it hallucinated so much about Fire and Hemlock that the answer became useless.
This is what my Claude bot -- that has been trained on my home library of music, DVDs and literature -- has to say about the Baroness:
"In the narrative architecture of the film, the Baroness functions as what film theorists would call a "moral antagonist"—not villainous in the traditional sense but representing values antithetical to the film's ethical framework. She embodies the complicity that enabled Nazism's rise: the privileged class that prioritized social position over moral resistance.
The Baroness doesn't need to wear a swastika to serve the film's critique of those who stood by. Her elegant dresses and Vienna sophistication create a more insidious representation—the tacit supporters who never had to declare allegiance because their class position already allied them with power.
So while textually she's not a card-carrying Nazi, subtextually she represents something perhaps more historically accurate and disturbing: the comfortable accommodation that allowed fascism to flourish. The film's romantic resolution isn't just about the Captain finding true love—it's about rejecting the moral compromise the Baroness represents."
Just to follow up and elaborate: I do not think asking test questions of o3 as though it were a student with the full context of a semester’s worth of interactions, where it can try to decipher exactly what the teacher is looking for, is very useful for gauging whether it is “generally intelligent.” It’s already leagues better than anything else we’ve ever had at doing this limited kind of thing. But you should be asking whether this kind of thing is even a test of intelligence.
If you are treating it like an oracle and you aren’t asking it follow-up questions you aren’t really treating it like an intelligent interlocutor. Ask it to justify itself. Push back and see what it says.
So you want an intelligence that can deal in ambiguity but you think there’s a definite, if hair-splitting, answer to whether The Clerk’s Tale is an exemplum?
Did you ask it a follow-up question, like, “you said this was an exemplum. are you sure? justify your answer.” That might be even more interesting than finally using the right key-phrase for passing the Oliver test.
o3 is like a much better o1, whereas o4 is better again but smaller, I think.
I wish I could remember which model managed to incorrectly list the next 10 dates for Easter when I asked it last year...
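For what it's worth, this is one of those answers you can check mechanically rather than take on trust. A minimal sketch, assuming the question was about Western (Gregorian) Easter, using the well-known "anonymous Gregorian" computus (often credited to Meeus/Jones/Butcher); the function names are mine:

```python
from datetime import date


def easter(year: int) -> date:
    """Date of Western (Gregorian) Easter Sunday via the anonymous Gregorian algorithm."""
    a = year % 19                      # position in the 19-year Metonic cycle
    b, c = divmod(year, 100)           # century and year within century
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30  # lunar (epact-style) correction
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7  # days to the following Sunday
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def next_easters(start_year: int, count: int = 10) -> list[date]:
    """Easter Sundays for `count` consecutive years starting at `start_year`."""
    return [easter(y) for y in range(start_year, start_year + count)]
```

Calling `next_easters(2025)` gives a list to compare against whatever the model claimed; it starts with 20 April 2025 and 5 April 2026.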