Kakao announced on the 1st that it has published the performance results and development process of its integrated multimodal language model "Kanana-o" and its audio language model "Kanana-a" on its official tech blog.
Kanana-o is the first artificial intelligence (AI) model in the country capable of simultaneously understanding and processing various forms of information, such as text, voice, and images.
Users can pose questions in any combination of text, voice, and images, and Kanana-o processes them to generate context-appropriate responses as text or natural-sounding speech.
Kakao efficiently developed Kanana-o in a short period by integrating Kanana-v, which specializes in image processing, and Kanana-a, which specializes in audio understanding and generation, based on model merging technology that combines different models.
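Kakao has not disclosed the exact merging recipe it used, but the general idea behind model merging can be illustrated with simple weight-space interpolation between two models that share an architecture. The function and parameter names below are illustrative assumptions, not Kakao's implementation.

```python
# Minimal sketch of model merging via parameter averaging.
# Assumes both models share the same architecture (identical keys and
# shapes), as when merging modality-specialized variants of one base model.
import numpy as np

def merge_models(state_a, state_b, alpha=0.5):
    """Interpolate two models' parameters key by key.

    alpha=0.5 gives a plain average; other values weight one model
    more heavily. Real merging schemes are often more elaborate
    (e.g. per-layer weights), but the principle is the same.
    """
    merged = {}
    for key in state_a:
        merged[key] = alpha * state_a[key] + (1 - alpha) * state_b[key]
    return merged

# Toy example: two hypothetical specialized models with one weight matrix each.
vision_model = {"proj.weight": np.array([[1.0, 2.0], [3.0, 4.0]])}
audio_model  = {"proj.weight": np.array([[3.0, 4.0], [5.0, 6.0]])}

merged = merge_models(vision_model, audio_model, alpha=0.5)
print(merged["proj.weight"])  # element-wise average of the two matrices
```

Because the merged model starts from parameters already trained for each modality, it typically needs far less additional training than building a multimodal model from scratch, which is consistent with the short development period described above.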
In particular, it accurately reflected the unique speech structure, intonation, and tense changes of the Korean language by utilizing a large-scale Korean dataset.
As a result, Kanana-o can recognize regional dialects, such as those of Jeju and Gyeongsang provinces, and render them as natural speech in standard Korean.
According to Kakao, Kanana-o performed on par with leading global AI models on Korean and English benchmarks, and showed a clear advantage on Korean benchmarks in particular.
In emotion recognition, it outperformed by a significant margin in both Korean and English, demonstrating the potential of an AI model that can understand and communicate emotions.
Kim Byeong-hak, Kanana performance lead at Kakao, said, "Based on our proprietary multimodal technology, we plan to strengthen our competitiveness in AI while continuing to contribute to the growth of the domestic AI ecosystem by sharing our research results."