
Image: Science Daily
A new study reveals ChatGPT's surprising inaccuracies in understanding scientific claims. Discover the implications for AI's reliability in critical decisions.
A recent study conducted by Washington State University professor Mesut Cicek and his research team has uncovered alarming shortcomings in ChatGPT's ability to evaluate scientific claims. In a rigorous testing environment, the researchers sought to determine whether the AI could accurately identify hypotheses drawn from scientific literature as true or false. Over a span of two years, the team scrutinized a total of 719 hypotheses, posing the same question multiple times to gauge ChatGPT's consistency and reliability.
The initial phase of the experiment, carried out in 2024, found that ChatGPT achieved an accuracy rate of 76.5%. By the following year, that figure had edged up to 80%. However, adjusting for random guessing painted a less favorable picture: since a coin flip would already score 50% on true/false questions, the AI's effective performance worked out to only 60% above chance, closer to a low D than an impressive grade.
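The article does not spell out how the chance adjustment was done, but a standard correction for guessing on binary questions reproduces the reported figure. The formula below is an assumption for illustration, not taken from the study itself:

```python
def chance_corrected(observed: float, chance: float = 0.5) -> float:
    """Correct an observed accuracy for random guessing.

    With two answer options (true/false), blind guessing already
    scores ``chance`` (50%), so only the margin above chance is
    credited, rescaled to the achievable range above chance.
    """
    return (observed - chance) / (1.0 - chance)

# 2025 figure: 80% raw accuracy corrects to 60% above chance.
print(round(chance_corrected(0.80), 2))   # 0.6

# 2024 figure: 76.5% raw accuracy corrects to 53% above chance.
print(round(chance_corrected(0.765), 2))  # 0.53
```

Under this reading, the headline 80% is flattering: half of it is what a coin toss would earn on its own.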
One of the most concerning findings was the AI's struggle to identify false statements: ChatGPT labeled these claims correctly only 16.4% of the time. Significant inconsistencies also emerged during testing. Even when presented with identical prompts on multiple occasions, ChatGPT produced varying answers around 73% of the time.
Professor Cicek emphasized the implications of these inconsistencies, stating, "If you ask the same question repeatedly, you should expect similar responses. Instead, we found stark differences in answers. In several instances, responses alternated between true and false, indicating a lack of reliability in the AI's reasoning."
Published in the Rutgers Business Review, these findings serve as a stark reminder of the limitations of AI in contexts that demand complex reasoning and nuanced understanding. While generative AI models like ChatGPT can produce text that appears fluent and persuasive, they lack a true grasp of context and meaning. According to Cicek, this suggests that the dream of achieving artificial general intelligence—a form of AI that can think and reason like humans—remains a distant goal.
"Current AI tools don't understand the world the way we do—they don’t have a 'brain.' They memorize and regurgitate information without true comprehension," he explained. This limitation underscores the necessity for cautious application of AI, especially in critical decision-making scenarios.
The research involved collaboration among several academic institutions, with Cicek working alongside co-authors Sevincgul Ulu from Southern Illinois University, Can Uslay from Rutgers University, and Kate Karniouchina from Northeastern University. The hypotheses utilized in the study were derived from scientific journals focused on business, all published since 2021. The nature of these questions was typically complex, requiring sophisticated reasoning to discern whether a hypothesis was supported or not.
The experiments tested both the free version of ChatGPT-3.5 in 2024 and the upgraded ChatGPT-5 mini in 2025. Interestingly, the performance metrics remained largely unchanged between the two iterations, and after factoring out random chance, the AI's effective accuracy was significantly lower than the raw figures suggest.
These results highlight a fundamental shortcoming in large language model AIs. While they are capable of generating coherent and compelling narratives, they often falter when tasked with intricate reasoning. Cicek cautioned that this limitation can lead to responses that, while sounding plausible, may ultimately be misguided or incorrect.
Given the findings of this study, Cicek and his colleagues recommend that business leaders and professionals exercise a degree of skepticism when utilizing AI-generated content. They advocate for thorough verification of information produced by AI systems, particularly in high-stakes environments where accuracy is paramount. The researchers also stress the importance of training in understanding the capabilities and limitations of AI tools.
Although this research focused specifically on ChatGPT, Cicek pointed out that similar studies conducted on other AI platforms have yielded comparable results, reinforcing the notion that AI should not be relied upon as an infallible source of truth.
As AI technology continues to evolve, the insights garnered from this study may shape future developments in the field. The gap between AI performance and genuine understanding raises critical questions about the role of artificial intelligence in scientific research and decision-making. To advance the capabilities of AI, ongoing research and refinement will be necessary.
Looking ahead, stakeholders in academia, business, and technology must remain vigilant and informed about the evolving landscape of AI. As tools become more sophisticated, they should also be scrutinized with a critical eye to ensure that their use enhances, rather than compromises, the integrity of scientific inquiry and informed decision-making.
The journey toward a more reliable AI will require collaboration among researchers, developers, and users to create systems that can not only communicate effectively but also understand the complexities of the world around them.
