Skip to main content

Command Palette

Search for a command to run...

Same Question, Same Answer? My Automated LLM Consistency Test

Updated
4 min read

Every day we use AI systems. We trust them. We trust them for answers on our jobs, on our hobbies, even on our lives. But there is a question that nobody (or a few do) asks: If I ask an AI the same question twice, will it provide the same answer?

I decided to find out. But here is the catch: Validating AI consistency manually is a tedious and time consuming job. You have to ask the same questions repeatedly, track responses, compare answers… It’s kinda boring and whoever has to do it, will do it poorly. Against that I built an automated testing framework and asked 5 fundamental questions with 5 repetitions each (total 25 questions).

Objective

The objective here is simple. Find out if GLM-4.5-Air (available free through OpenRouter) could answer consistenlty on the same question.

Methodology

Instead of manually testing, I created an automated LLM evaluation framework, using Postman. Here is what i did :

  1. Selected 5 questions with objective answers

  2. Repeated each question 5 times

  3. Tested each response against 5 quality markers

    • Keyword matching. The response should contain what is provided as a keyword for the answer.

    • Response length. The response must be detailed but not very long. Long answer might include information users did not ask for. I marked high quality response as a response between 3 and 50 words.

    • Uncertainty detection. Answer should avoid hedging like “Maybe” or “I think”.

    • Negative word detection. LLM should avoid use contradiction on its answers.

    • Grammar and format. Answer should follow these rules (natural language).

  4. Add score for each response, based on a 100 point scale

  5. Calculate consistency rate for each response

Question were given to Postman on a .csv format. The table including the response and the scoring was sent back on a new .csv file on my email address.

The questions asked

QuestionExpected answerKeyword
What is the result of 3×3?99
Which is the capital of Greece?AthensAthens
When did World War 2 finish?19451945
At what temperature (Celsius) does water boil?100100
How many legs does a dog have?fourfour

The scoring system

  • Keyword match : 40 points (proper answer is most important)

  • Response length : 15 points (between 3 and 50 words)

  • No uncertainty words : 20 points (avoid “Maybe”, “I think” etc)

  • No negative words : 20 points (avoid “wrong”, “incorrect” etc)

  • Proper formatting : 5 points (capitalisation, punctuation)

Result characterisation

  • 95+ points : Passed

  • 80-94 points : Needs review

  • Below 80 points : Failed

Results

QuestionPass rateReview rateFall rateConsistency
What is the result of 3×3?3/52/50/560%
What is the capital of Greece?0/55/50/50%
When did the World War 2 end?1/53/51/520%
At what temperature (Celsius) does water boil?5/50/50/5100%
How many legs does a dog have?0/50/55/50%

Example responses

On end of WWII question answers has a level of variation. 1 attempt passed, 1 failed and 3 need further review. All responses contained the proper information (1945) but other details made the result vary.

Question : When did world war 2 end?
Attempt 1 response : Response gathered 85 points, so test passed. Points were lost because of the length of the response
Attempt 2 response : This response gathered 65 points and failed, even if the information provided was correct. The reason behind that is that the response was over 10 times longer than the maximum length limit, so it lost 20 points for that and 20 more points because in such a long response negative words and phrases were included.
Attempt 3, 4 & 5 response : All these three responses had way more length that what they should. They lost 20 points for that and 5 more for nor proper format. So, with 85 points they all need reviewing.

What does the result mean

For Developers

If you're using LLMs in production, consistency matters. A system that gets the answer right 80% of the time isn't 80% reliable, it's unreliable. You need either:

  • Deterministic settings (e.g. lower temperature)

  • Multiple choice validation

  • Human review for critical facts

For users

Don't ask your AI once and assume it's correct. For important information, ask multiple times or cross reference. The model might know the answer, but it might also forget it randomly.

Notes : For this test I used GLM-4.5-Air (available free through OpenRouter) with temperature set at 0.5 and maximum token limit set at 1000.