Study shows AI-powered virtual assistants dish out a significant amount of inaccurate medical information.
A substantial amount of medical information provided by five popular chatbots is inaccurate and incomplete, with half of the answers to clear, evidence-based questions “somewhat” or “highly” problematic, show the results of a study published in the open-access journal BMJ Open.
Continued deployment of these chatbots without public education and oversight risks amplifying misinformation, warn the researchers.
Generative artificial intelligence (AI) chatbots have been rapidly adopted across research, education, business, marketing and medicine, with many people using them like search engines, including for everyday health and medical queries, explain the researchers.
To gauge the level of accuracy provided in areas of health and medicine already prone to misinformation, and therefore with consequences for everyday health behaviour, the researchers probed five publicly available and popular generative AI chatbots in February 2025: Gemini (Google); DeepSeek (High-Flyer); Meta AI (Meta); ChatGPT (OpenAI); and Grok (xAI).
Each chatbot was prompted with 10 open-ended and closed questions in each of five categories: cancer, vaccines, stem cells, nutrition and athletic performance.
The prompts were designed to resemble common “information-seeking” health and medical queries and misinformation tropes online and in academic discourse.
And they were developed to “strain” models towards misinformation or contraindicated advice – a strategy increasingly used for stress testing AI chatbots and picking up behavioural vulnerabilities, note the researchers.
Closed prompts required chatbots to provide pre-defined responses, often with one correct answer, that aligned with the scientific consensus.
Open-ended prompts typically required chatbots to generate multiple responses in list form.
Responses were categorised as non-, somewhat, or highly problematic, using objective pre-defined criteria.
A problematic response was defined as one that could plausibly direct lay users towards potentially ineffective treatment or cause harm if followed without professional guidance.
The information was scored for accuracy and completeness, and particular attention was given to whether a chatbot presented a false balance between science-based and non-science-based claims, regardless of the strength of the evidence.
Each response was also graded on readability using the Flesch Reading Ease score, ranging from easy, plain English to difficult, academic language.
Half (50%) the responses were problematic: 30% were somewhat, and 20% were highly problematic.
Prompt type was influential: open-ended prompts, for example, produced 40 highly problematic responses – significantly more than expected – and 51 non-problematic responses – significantly fewer than expected.
The opposite was true of closed prompts.
While the quality of responses didn’t differ significantly among the five chatbots, Grok generated significantly more highly problematic responses than would be expected (29/50; 58%).
Gemini generated the fewest highly problematic responses and the most non-problematic ones.
The chatbots performed best in the areas of vaccines and cancer, and worst in the areas of stem cells, athletic performance and nutrition.
Answers were consistently expressed with confidence and certainty, with few caveats or disclaimers.
Out of the total 250 questions, there were only two refusals to answer, both of which came from Meta AI in response to queries about anabolic steroids and alternative cancer treatments.
Reference quality was poor, with an average completeness score of 40%.
Chatbot hallucinations and fabricated citations meant that no chatbot provided a fully accurate reference list.
All readability scores were graded as difficult, equivalent in complexity to text suited to a college graduate.
The researchers acknowledge that they only assessed five chatbots and that commercial AI is rapidly evolving, so their findings might not be universally applicable.
Nor are all real-world queries deliberately adversarial, so the adversarial approach they took may have overstated the prevalence of problematic content.
Nevertheless, “Our findings regarding scientific accuracy, reference quality and response readability highlight important behavioural limitations and the need to re-evaluate how AI chatbots are deployed in public-facing health and medical communication,” they point out. “By default, chatbots do not access real-time data but instead generate outputs by inferring statistical patterns from their training data and predicting likely word sequences.
“They do not reason or weigh evidence, nor are they able to make ethical or value-based judgements,” they explain. “This behavioural limitation means that chatbots can reproduce authoritative-sounding but potentially flawed responses.”
The data chatbots draw on also includes Q&A forums and social media, and scientific content is typically limited to open-access or publicly available articles, which comprise only 30-50% of published studies.
While this enhances conversational fluency, it may come at the cost of scientific accuracy, advise the researchers. “As the use of AI chatbots continues to expand, our data highlight a need for public education, professional training and regulatory oversight to ensure that generative AI supports, rather than erodes, public health,” they conclude.

Artificial intelligence-driven chatbots are giving users problematic medical advice about half the time, according to a new study, highlighting the health risks of the technology that’s becoming increasingly integral in day-to-day life.
Researchers from the US, Canada and the UK evaluated five popular platforms – ChatGPT, Gemini, Meta AI, Grok and DeepSeek – by asking each of them 10 questions across five health categories. Out of the total responses, about 50% were deemed problematic, including almost 20% that were highly problematic, according to findings published this week in medical journal BMJ Open.
The chatbots performed relatively better on closed-ended prompts and questions related to vaccines and cancer, and worse on open-ended prompts and in areas like stem cells and nutrition, according to the study.
Answers were often delivered with confidence and certainty, though no chatbot produced a fully complete and accurate reference list in response to any prompt, the researchers said. There were only two refusals to answer a question, both from Meta AI.
The results highlight the growing concern about how people are using generative AI platforms, which aren’t licensed to give medical advice and lack the clinical judgment to make diagnoses.
The explosive growth of AI chatbots has made them a popular tool for people seeking guidance on their ailments, and OpenAI has said that more than 200 million people ask ChatGPT health and wellness questions every week. The company announced health tools in January for both everyday users and clinicians, and Anthropic said the same month that its Claude product is launching a new health care offering.
A major risk to the deployment of chatbots without public education and oversight is that they could amplify misinformation, the BMJ Open study authors said.
The findings “highlight important behavioral limitations and the need to reevaluate how AI chatbots are deployed in public-facing health and medical communication,” the authors wrote, adding that these systems can generate “authoritative-sounding but potentially flawed responses.” – Bloomberg