The Learning Curve, Part 3: Taking AI Data From Good to Great

This visit to Samsung Research in Vietnam is part of a series about the people and innovations allowing mobile AI to enhance more lives.

Samsung is pioneering premium mobile AI experiences. To learn how Galaxy AI is maximizing the potential of its users, we are visiting Samsung Research centers around the world. Now supporting 16 languages, Galaxy AI is enabling more people to expand their language capabilities, even when offline, thanks to on-device translation in features such as Live Translate, Interpreter, Note Assist and Browsing Assist. We recently visited Jordan to learn the complexities of developing an AI model for Arabic, a language with many dialects. This time, we’re going to Vietnam to explore how data is prepared to train AI models.

What is the difference between a ghost, a grave and a mother in Vietnamese? For a language spoken by 97 million people worldwide, very little. The words translate to “ma,” “mả” and “má,” respectively, and can be distinguished only by tone. This illustrates how difficult it can be for AI models to learn a language, given that they cannot perceive firsthand the context and emotions of a conversation or the intentions of the speakers.
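
To make the “ma” example concrete, here is a minimal Python sketch (an illustration only, not SRV's tooling) that separates a written Vietnamese syllable into its base letters and its tone mark using Unicode decomposition. The function name split_tone and the tone labels are hypothetical choices for this example.

```python
import unicodedata

# Combining marks that encode the five marked Vietnamese tones.
# Other diacritics (as in â, ê, ô, ơ, ư, ă) change the vowel, not the tone.
TONE_MARKS = {
    "\u0300": "huyền (grave)",
    "\u0301": "sắc (acute)",
    "\u0303": "ngã (tilde)",
    "\u0309": "hỏi (hook above)",
    "\u0323": "nặng (dot below)",
}


def split_tone(syllable):
    """Return (syllable without its tone mark, tone label)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = "ngang (unmarked)"
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC recomposes any remaining vowel-quality marks (for example â).
    return unicodedata.normalize("NFC", "".join(kept)), tone


for word in ["ma", "mả", "má"]:
    print(word, split_tone(word))
```

All three words reduce to the same base syllable and differ only in the tone label, which is the distinction a speech model has to recover from sound rather than from spelling.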

Samsung R&D Institute Vietnam (SRV) used carefully refined data to help its AI model properly recognize even the most subtle differences in language.

The quality of data used directly affects the accuracy of automatic speech recognition (ASR), neural machine translation (NMT) and text-to-speech (TTS) — processes that help Galaxy AI features such as Live Translate, Interpreter, Chat Assist and Browsing Assist break down language barriers.

A Typhoon of Challenges
“Vietnamese is a complex and diverse language with rich expressions, many of which are challenging to capture,” says Ngô Hồng Thái, NMT lead at SRV. Of the 16 languages that Galaxy AI supports, Vietnamese was particularly difficult to develop a model for.

“Personally, creating an AI model for Vietnamese was more daunting than our typhoons!” he adds before explaining the hurdles faced during the development process.

Vietnamese is a tonal language with six distinct tones. As evident in the “ma” example above, small nuances in vocalization can drastically alter the meanings of words. Therefore, a meticulous and detailed approach was necessary.

“When similar sounding words are broken down, one word consists of several short segments, or ‘frame sets’,” says Bui Ngoc Tung, ASR lead at SRV. “The AI model differentiates between the short audio frames of around 20 milliseconds to recognize what words correspond to a certain set of consecutive frames. As such, it is critical to put great effort into the early stages of the AI learning process.”
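
As a rough illustration of the framing Tung describes, the sketch below cuts a sequence of audio samples into consecutive frames of 20 milliseconds. It assumes a 16 kHz sample rate and non-overlapping frames for simplicity; neither detail comes from SRV, and the function name split_into_frames is hypothetical.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20):
    """Split a list of audio samples into consecutive fixed-length frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    return [
        samples[start:start + frame_len]
        for start in range(0, last_start + 1, frame_len)
    ]


# One second of silence at an assumed 16 kHz sample rate.
one_second = [0.0] * 16000
frames = split_into_frames(one_second, sample_rate_hz=16000)
print(len(frames), len(frames[0]))
```

Under these assumptions each frame holds 320 samples and one second of audio yields 50 frames. A recognizer then has to decide which words correspond to which runs of consecutive frames.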

Furthermore, homophones and homonyms are common in Vietnamese. People can normally rely on context and nonverbal elements in conversations to differentiate between words that sound the same or are written the same but have different meanings. However, AI models need to be taught to accurately identify and differentiate between tones and similar words.

“This isn’t a straightforward task,” Thái explains. “Apart from the amount, the data needs to be accurate to ensure it is capable of recognizing the linguistic nuances that exist in Vietnamese.”

Rigorous Preparation
The data refinement process consists of three steps. First, the audio and text used to train the AI model must be reviewed and corrected. Then, this dataset goes through random checks for overall quality. Finally, the dataset is normalized and cleaned before use in training.
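
The three steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not SRV's pipeline: the record layout (a dict with "id" and "text"), the function names, and the specific cleaning rules (Unicode NFC normalization, whitespace collapsing, dropping empty transcripts) are all hypothetical.

```python
import random
import unicodedata


def apply_corrections(records, corrections):
    """Step 1: swap in reviewer-corrected text, keyed by record id."""
    return [
        {**rec, "text": corrections.get(rec["id"], rec["text"])}
        for rec in records
    ]


def draw_spot_check(records, k, seed=0):
    """Step 2: pick a reproducible random sample for a manual quality check."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def normalize_and_clean(records):
    """Step 3: normalize Unicode and whitespace, and drop empty transcripts."""
    cleaned = []
    for rec in records:
        # NFC gives accented Vietnamese letters a single canonical encoding.
        text = unicodedata.normalize("NFC", rec["text"])
        text = " ".join(text.split())
        if text:
            cleaned.append({**rec, "text": text})
    return cleaned


records = [
    {"id": 1, "text": "  xin   chào "},
    {"id": 2, "text": "   "},
    {"id": 3, "text": "ma\u0301"},  # "má" typed with a combining accent
]
reviewed = apply_corrections(records, {})
sample = draw_spot_check(reviewed, k=2)
print(normalize_and_clean(reviewed))
```

Normalizing to one Unicode form is worth the extra step for Vietnamese text, since the same accented letter can be stored either precomposed or as a base letter plus combining marks.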

“We thoroughly performed a series of tests to check the accuracy of our dataset,” says Nguyen Manh Duy, TTS lead at SRV who oversees database creation. “We faced a number of unexpected problems including misspelled words in scripts and background noise or incorrect pronunciation during audio recordings. We spent significant time refining and improving our training data.”

A vital part of the data refinement process, and of the journey of taking AI data from good to great, is the work of the Software Quality Engineering (SQE) team. The team tests and improves the quality of AI language data, working closely with the AI language development project team.

In addition to the unique linguistic challenges of Vietnamese, there is less universally accessible data for it than for more widely spoken languages. “This is another reason why the data refinement stage is so important,” Duy adds. “Since we had limited sources, every piece of data had to be fully reliable. There was no margin for error.”

Moreover, the AI model for Vietnamese must consider both tonal and regional differences. To improve the AI model’s accuracy, the team collected vast amounts of data with Vietnam’s northern, central and southern accents — resulting in an enormous amount of information to refine and verify.

Continued Improvement
Developers at SRV completed the project after months of hard work, and Vietnamese became one of the first languages to be supported by Galaxy AI. Despite this success, the team is ceaselessly working to improve the Vietnamese Galaxy AI experience.

“We’re continuing to enhance the AI model by incorporating user feedback about the relevance of words and phrases in Galaxy AI,” says Tran Tuan Minh, leader of the AI language development project at SRV. “We have just taken our first steps into a more open world — and we have so much more to explore together.”

In the next episode of The Learning Curve, we will head to China to dig into how AI models are trained and fine-tuned.