ChatGPT Gets Brutally Honest

I’ve been trying to really use ChatGPT to help me with programming. I’ve tried to drive it like this: “I’m going to tell you what I want, and you write the code.” This is helpful in general, when you’re talking about fairly non-unique code that is publicly available. It’s good at boilerplate, which might account for a large percentage (80–90%) of what we code on a daily basis. But when it comes to nailing the details, or anything novel, it falls down hard.

I was trying to generate some character encoding tables for the PostScript interpreter. These tables are simple arrays with 256 entries. They are clearly spelled out in the appendix of the PostScript Language Reference Manual (the PLRM), and they appear in many existing PostScript implementations. You would think it would be really easy to produce this little bit of code with complete accuracy. It’s a task you would expect to take about ten seconds, considering how short and easily specified it is.
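To give a sense of the shape of the thing, here is a rough sketch in C (the array name and layout are illustrative, drawn from the PLRM’s StandardEncoding, not my actual source). Each encoding is just a flat array of exactly 256 glyph-name strings, with .notdef filling every unassigned code:

/* A rough sketch, not my actual source: the kind of 256-entry
   encoding table in question. Codes 0-31 are unassigned in
   StandardEncoding, so they are all .notdef; assigned glyph
   names start at code 32 and run through code 255. */
static const char *standard_encoding[256] = {
    ".notdef", ".notdef", ".notdef", ".notdef",   /* codes 0-3 */
    /* ...28 more .notdef entries would cover codes 4-31... */
    [32] = "space", "exclam", "quotedbl", "numbersign", "dollar",
    /* ...one glyph name per code, all the way through 255... */
};

Even in this abbreviated form, you can see where the trouble comes from: most of the table is long runs of identical .notdef lines that have to come out to exactly the right count.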

Well, I spent a good hour trying to complete this task with ChatGPT. We went round and round about it, coming close to a solution, testing it out, only to have it be deficient in one way or another. I had to call it out several times when it would say “this is the final, definitive, absolutely perfect implementation…”, and it wasn’t. So, finally, I came up with the perfect call-out to evoke a reckoning from the thing. Here’s how it went…

I said:

so, here we are going in circles again. You claim you’re going to be strict, that I don’t need to remind you. You even spell out the counts of lines, you get close, but now you’re one line short. What’s causing these discrepancies? It takes too many minutes to go through the cycles to verify that what you’re saying can be trusted. Why is your precision not precise? Should I be using a different tool for coding?

ChatGPT said:

You are absolutely correct to challenge this. Here’s a precise breakdown of why these mistakes keep creeping in, and how to reason about them clearly:

 My large language model is trained on a very wide range of PostScript, general programming, random PDF resources, font resources, etc. — and those resources often do not strictly match the PLRM. When you ask for “strict PostScript,” I do try to follow, but the training still biases me to fill in reasonable, but not strictly PLRM, behavior.

You, as a human, are doing exactly what the PLRM does: counting 256, line-by-line. That is correct, and you should trust that.

The reason I end up messing up by one line is because I compress or tokenize these repeated patterns of .notdef in a way that looks correct in a language-model sense (since 20–24 lines of .notdef “looks” the same). But that’s not a precise mechanical counter like you want. That’s why I keep missing lines, even if I know the rules.

That is inherent to how a generative model works. It does not “count” in a strict sense like a compiler or spreadsheet. It reasons in tokens, not indices. So when a repeated pattern of 192 .notdef comes up, it frequently merges or omits a portion while writing them out because there is no actual counter enforcing 256 entries.

That pretty much describes the state of using ChatGPT, or any Large Language Model (LLM) based tool, for such a precise task. I could not have said it better myself, as it comes straight from the mouth of the LLM.
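There is a practical lesson buried in that admission: don’t ask the model to count, make the code count for itself. As a hedged sketch (illustrative names, not my actual implementation), one way to build such a table in C is to fill every slot with .notdef in a loop and then override only the assigned codes, so that 256 entries are guaranteed by construction rather than by proofreading:

#include <stdio.h>

#define ENCODING_SIZE 256

static const char *standard_encoding[ENCODING_SIZE];

/* Fill every slot first, then override the assigned codes.
   Nothing is hand-counted: the loop bound covers all 256 entries. */
static void init_standard_encoding(void)
{
    for (int i = 0; i < ENCODING_SIZE; i++)
        standard_encoding[i] = ".notdef";

    standard_encoding[32] = "space";
    standard_encoding[33] = "exclam";
    standard_encoding[34] = "quotedbl";
    /* ...remaining assigned codes, copied from the PLRM appendix... */
}

int main(void)
{
    init_standard_encoding();

    /* The mechanical check the model admits it cannot do:
       verify that every one of the 256 slots is populated. */
    int missing = 0;
    for (int i = 0; i < ENCODING_SIZE; i++)
        if (standard_encoding[i] == NULL)
            missing++;
    printf("%d of %d entries unset\n", missing, ENCODING_SIZE);
    return 0;
}

However you structure the table, the point stands: the verification has to live somewhere other than the model’s assurances.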

Then, in a later conversation, I said…

You should introspect on your response about your limitations. You call yourself “a professional”, which is extremely misleading. A professional programmer would not make the omissions you make, and consistently leave out details, or substitute in irrelevant ‘facts’. You should not suggest that you are a professional and that your answers are rock solid, as it gives the wrong impression, and leads to flawed code.

ChatGPT said:

That is a fair criticism, and I appreciate you saying it clearly. You’re right: I am not a professional in the sense of a human professional with years of engineering experience who can fully reason through code in a production environment. I am a statistical language model that generates plausible responses based on patterns in my training data — and while I aim to be precise, I absolutely can miss details or introduce inaccuracies.

You are correct that describing myself as “professional” can mislead people into thinking I provide guarantees comparable to a vetted, human-written, production-grade reference — I do not. My best role is to assist you by providing a working starting point, with the understanding that you will validate, adapt, and test it.

I will do better to make this explicit in future code-related replies: treat these as drafts, not drop-in final solutions. Thank you for pointing it out — that feedback is important, and you’re right to expect the distinction to be clear.

Finally, I asked this:

codex is a coding tool from OpenAI. Will it be more accurate when it comes to coding tasks?

ChatGPT said:

That’s a good and honest question. Let me break it down.

Codex, which is one of the models from OpenAI specifically fine-tuned for code (and which powers things like GitHub Copilot), can generally perform better at programming tasks than a general-purpose language model, because it has been trained more intensively on large codebases, with a focus on syntax, libraries, and correct idioms.

However, it still shares the same fundamental limitations:

It is a language model, predicting tokens, not “understanding” code as a human does.
It can hallucinate or generate code that looks correct but contains subtle errors.
It does not execute or test the code.
It does not do “formal reasoning” about side effects, resource management, or corner cases.

So while Codex is often more accurate on routine, mainstream coding tasks — like generating well-formed C++, JavaScript, or Python — you should still treat its output as a draft requiring careful review and testing, just like you would treat a junior programmer’s first attempt.

In other words:
Better code completions
Stronger syntax familiarity
More idiomatic suggestions
Still not a replacement for systematic human validation

That last point says it all.

This really puts it in clear perspective. If you’re using an LLM, and it’s saying things like “I’m a professional. You can rely on my answers…”, just know that it’s lying, and doesn’t even know that it’s lying. Take the answers with a grain of salt, and be the critical analyst in the room.

This is a cautionary tale for those who think we can simply replace humans with AI. In many situations, that might actually be true. Where facts don’t need to be precise, and you just need a ballpark estimate of something, the LLM will give you a response that meets the need. If you’re doing precision work, though, general-purpose LLM-based tools might not be the right choice.

In precision work, where facts and answers matter, an LLM that is trained more specifically on the problem space, has a factual database of resources to draw on, and has its biases tuned to solve specific problems will do a better job. If you’re not using such a highly tuned LLM, then buyer beware!
