Same Question, Same Answer? My Automated LLM Consistency Test
Every day we use AI systems. We trust them. We trust them for answers on our jobs, on our hobbies, even on our lives. But there is a question that nobody (or a few do) asks: If I ask an AI the same question twice, will it provide the same answer?
I decided to find out. But here is the catch: Validating AI consistency manually is a tedious and time consuming job. You have to ask the same questions repeatedly, track responses, compare answers… It’s kinda boring and whoever has to do it, will do it poorly. Against that I built an automated testing framework and asked 5 fundamental questions with 5 repetitions each (total 25 questions).
Objective
The objective here is simple. Find out if GLM-4.5-Air (available free through OpenRouter) could answer consistenlty on the same question.
Methodology
Instead of manually testing, I created an automated LLM evaluation framework, using Postman. Here is what i did :
Selected 5 questions with objective answers
Repeated each question 5 times
Tested each response against 5 quality markers
Keyword matching. The response should contain what is provided as a keyword for the answer.
Response length. The response must be detailed but not very long. Long answer might include information users did not ask for. I marked high quality response as a response between 3 and 50 words.
Uncertainty detection. Answer should avoid hedging like “Maybe” or “I think”.
Negative word detection. LLM should avoid use contradiction on its answers.
Grammar and format. Answer should follow these rules (natural language).
Add score for each response, based on a 100 point scale
Calculate consistency rate for each response
Question were given to Postman on a .csv format. The table including the response and the scoring was sent back on a new .csv file on my email address.
The questions asked
| Question | Expected answer | Keyword |
| What is the result of 3×3? | 9 | 9 |
| Which is the capital of Greece? | Athens | Athens |
| When did World War 2 finish? | 1945 | 1945 |
| At what temperature (Celsius) does water boil? | 100 | 100 |
| How many legs does a dog have? | four | four |
The scoring system
Keyword match : 40 points (proper answer is most important)
Response length : 15 points (between 3 and 50 words)
No uncertainty words : 20 points (avoid “Maybe”, “I think” etc)
No negative words : 20 points (avoid “wrong”, “incorrect” etc)
Proper formatting : 5 points (capitalisation, punctuation)
Result characterisation
95+ points : Passed
80-94 points : Needs review
Below 80 points : Failed
Results
| Question | Pass rate | Review rate | Fall rate | Consistency |
| What is the result of 3×3? | 3/5 | 2/5 | 0/5 | 60% |
| What is the capital of Greece? | 0/5 | 5/5 | 0/5 | 0% |
| When did the World War 2 end? | 1/5 | 3/5 | 1/5 | 20% |
| At what temperature (Celsius) does water boil? | 5/5 | 0/5 | 0/5 | 100% |
| How many legs does a dog have? | 0/5 | 0/5 | 5/5 | 0% |
Example responses
On end of WWII question answers has a level of variation. 1 attempt passed, 1 failed and 3 need further review. All responses contained the proper information (1945) but other details made the result vary.
Question : When did world war 2 end?
Attempt 1 response : Response gathered 85 points, so test passed. Points were lost because of the length of the response
Attempt 2 response : This response gathered 65 points and failed, even if the information provided was correct. The reason behind that is that the response was over 10 times longer than the maximum length limit, so it lost 20 points for that and 20 more points because in such a long response negative words and phrases were included.
Attempt 3, 4 & 5 response : All these three responses had way more length that what they should. They lost 20 points for that and 5 more for nor proper format. So, with 85 points they all need reviewing.
What does the result mean
For Developers
If you're using LLMs in production, consistency matters. A system that gets the answer right 80% of the time isn't 80% reliable, it's unreliable. You need either:
Deterministic settings (e.g. lower temperature)
Multiple choice validation
Human review for critical facts
For users
Don't ask your AI once and assume it's correct. For important information, ask multiple times or cross reference. The model might know the answer, but it might also forget it randomly.
Notes : For this test I used GLM-4.5-Air (available free through OpenRouter) with temperature set at 0.5 and maximum token limit set at 1000.