OpenAI's ChatGPT logo./Courtesy of Yonhap News

OpenAI's recently unveiled reasoning artificial intelligence (AI) models 'o3' and 'o4-mini' deliver more powerful performance than the previous generation, but evaluations indicate that the hallucination phenomenon has worsened. Hallucination refers to the phenomenon in which generative AI presents information that does not actually exist as if it were true.

According to a report on the 20th by the IT-focused outlet TechCrunch, citing OpenAI's internal benchmark 'PersonQA,' the o3 model hallucinated on 33% of the questions. This figure is more than double that of o1 (16%) and o3-mini (14.8%). More seriously, o4-mini recorded a hallucination rate of 48%, showing even greater instability than existing models, including GPT-4o.

OpenAI introduced these models on the 16th, describing them as 'the first models capable of integrating images into the reasoning process.' This means they can use visual information itself in their reasoning, going beyond simply recognizing images. In practice, o3 and o4-mini can analyze whiteboard drawings, diagrams, and graphs uploaded by users, and can also process blurred or rotated images.
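For readers curious what 'integrating images into the reasoning process' looks like in practice, the following is a minimal illustrative sketch of sending a whiteboard photo to o4-mini through OpenAI's Python SDK and Chat Completions API. The file name, prompt, and model identifier string are placeholders chosen for illustration, not details taken from the article.

```python
# Illustrative sketch only: assumes the OpenAI Python SDK (openai >= 1.x) and an
# OPENAI_API_KEY environment variable. The image path and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI()

# Encode a local whiteboard photo as a base64 data URL so it can be attached
# to the request alongside the text prompt.
with open("whiteboard.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="o4-mini",  # assumed model identifier; check OpenAI's docs for the exact name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the argument sketched on this whiteboard."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```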

In terms of performance, o3 scored 69.1% and o4-mini scored 68.1% on the coding benchmark SWE-bench Verified, outperforming the previous model o3-mini (49.3%) as well as the competing model Claude 3.7 Sonnet (62.3%). Despite this technical advance, however, the hallucination rate has actually increased. Because hallucination had generally improved with each new model release, this result is seen as unusual.

OpenAI has yet to offer a clear explanation for the cause of this phenomenon. Its technical report analyzed that 'as the model responds to more user requests than before, both accurate and incorrect results appear to be increasing together,' and added that 'more research is needed' to identify the exact cause of the increase in hallucinations.

The AI industry views this case as one that could raise questions about the reliability of reasoning models. Concerns have been raised that in fields where high accuracy is required, such as law, accounting, and taxation, adopting reasoning AI may prove difficult if the hallucination problem is not resolved.

OpenAI noted that 'completely eliminating hallucinations across all problem areas is an ongoing research challenge' and stated that 'efforts to improve accuracy and reliability are continuing.'