Same Question, Same Answer? My Automated LLM Consistency Test

Every day we use AI systems. We trust them. We trust them for answers on our jobs, on our hobbies, even on our lives. But there is a question that nobody (or a few do) asks: If I ask an AI the same question twice, will it provide the same answer?

I decided to find out. But here is the catch: Validating AI consistency manually is a tedious and time consuming job. You have to ask the same questions repeatedly, track responses, compare answers… It’s kinda boring and whoever has to do it, will do it poorly. Against that I built an automated testing framework and asked 5 fundamental questions with 5 repetitions each (total 25 questions).

Objective

The objective here is simple. Find out if GLM-4.5-Air (available free through OpenRouter) could answer consistenlty on the same question.

Methodology

Instead of manually testing, I created an automated LLM evaluation framework, using Postman. Here is what i did :

Selected 5 questions with objective answers
Repeated each question 5 times
Tested each response against 5 quality markers
- Keyword matching. The response should contain what is provided as a keyword for the answer.
- Response length. The response must be detailed but not very long. Long answer might include information users did not ask for. I marked high quality response as a response between 3 and 50 words.
- Uncertainty detection. Answer should avoid hedging like “Maybe” or “I think”.
- Negative word detection. LLM should avoid use contradiction on its answers.
- Grammar and format. Answer should follow these rules (natural language).
Add score for each response, based on a 100 point scale
Calculate consistency rate for each response

Question were given to Postman on a .csv format. The table including the response and the scoring was sent back on a new .csv file on my email address.

The questions asked

Question	Expected answer	Keyword
What is the result of 3×3?	9	9
Which is the capital of Greece?	Athens	Athens
When did World War 2 finish?	1945	1945
At what temperature (Celsius) does water boil?	100	100
How many legs does a dog have?	four	four

The scoring system

Keyword match : 40 points (proper answer is most important)
Response length : 15 points (between 3 and 50 words)
No uncertainty words : 20 points (avoid “Maybe”, “I think” etc)
No negative words : 20 points (avoid “wrong”, “incorrect” etc)
Proper formatting : 5 points (capitalisation, punctuation)

Result characterisation

95+ points : Passed
80-94 points : Needs review
Below 80 points : Failed

Results

Question	Pass rate	Review rate	Fall rate	Consistency
What is the result of 3×3?	3/5	2/5	0/5	60%
What is the capital of Greece?	0/5	5/5	0/5	0%
When did the World War 2 end?	1/5	3/5	1/5	20%
At what temperature (Celsius) does water boil?	5/5	0/5	0/5	100%
How many legs does a dog have?	0/5	0/5	5/5	0%

Example responses

On end of WWII question answers has a level of variation. 1 attempt passed, 1 failed and 3 need further review. All responses contained the proper information (1945) but other details made the result vary.

Question : When did world war 2 end?
Attempt 1 response : Response gathered 85 points, so test passed. Points were lost because of the length of the response
Attempt 2 response : This response gathered 65 points and failed, even if the information provided was correct. The reason behind that is that the response was over 10 times longer than the maximum length limit, so it lost 20 points for that and 20 more points because in such a long response negative words and phrases were included.
Attempt 3, 4 & 5 response : All these three responses had way more length that what they should. They lost 20 points for that and 5 more for nor proper format. So, with 85 points they all need reviewing.

What does the result mean

For Developers

If you're using LLMs in production, consistency matters. A system that gets the answer right 80% of the time isn't 80% reliable, it's unreliable. You need either:

Deterministic settings (e.g. lower temperature)
Multiple choice validation
Human review for critical facts

For users

Don't ask your AI once and assume it's correct. For important information, ask multiple times or cross reference. The model might know the answer, but it might also forget it randomly.

Notes : For this test I used GLM-4.5-Air (available free through OpenRouter) with temperature set at 0.5 and maximum token limit set at 1000.

Same Question, Same Answer? My Automated LLM Consistency Test

Objective

Methodology

Results

What does the result mean

Comments

More from this blog

ChatGPT’s Few-Shot Superpower: Can It Learn From Just a Few Examples?

“May I speak to your manager? ChatGPT is tested on tone adaptation in customer support scenarios

Do all CEOs wear suites? Let ChatGPT decide (?)...

Playing Guess the Country with ChatGPT . Spoiler alert!!! : It’s Paris.

Command Palette

Objective

Methodology

Results

What does the result mean

Comments

More from this blog