Microsoft: Here’s how we fix bad spelling in 100 languages to get you the right search results

As many as 15% of queries are misspelled and, when they are, search engines can deliver bad answers.


Image: Getty Images/iStockphoto

Microsoft has explained how it uses a variety of technologies and techniques to fix misspellings that would otherwise cause queries to its Bing search engine to return the wrong results.

The software giant is getting back to basics in its latest push by focussing on spelling errors when people search online. It reckons that 15% of queries are misspelled and, when they are, search engines can deliver bad answers. 

So, Microsoft has figured out that it needs to automatically fix users’ poor spelling in order to improve the experience of Bing.

“Spelling correction is the very first component in the Bing search stack because searching for the correct spelling of what users mean improves all downstream search components,” Microsoft notes.

Microsoft has had “high-quality spelling correction” for about two dozen languages for a while, but is now expanding Bing spelling correction to cater for over 100 languages.

“In order to make Bing more inclusive, we set out to expand our current spelling correction service to 100-plus languages, setting the same high bar for quality that we set for the original two dozen languages. We’ve found we need a very large number of data points to train a high-quality spelling correction model for each language, and sourcing data in over 100 languages would be incredibly difficult logistically – not to mention costly in both time and money,” it says.

This rapid increase in languages covered was enabled by Microsoft researchers leveraging recent advances in AI, including zero-shot learning combined with carefully designed large-scale pre-training tasks, plus historical linguistics theories.

Its engineers acknowledge the benefits of using web documents for language models, but they call out the approach’s shortcomings for minority languages. 

“For precise and high-performing error models, search engines have largely leveraged user feedback on autocorrection recourse links. This practice has been very effective, especially for languages where user feedback data has been gathered on a large scale. For a language with very little web presence and user feedback, it’s challenging to gather an adequate amount of training data.”
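To make the split between a language model and an error model concrete, here is a minimal sketch in Python of a classic dictionary-based corrector. It is not Microsoft's Speller100 method: the word counts are invented stand-ins for frequencies a search engine might derive from web documents, and the error model is simply assumed to be "any candidate one edit away", rather than one learned from user feedback. The names `WORD_COUNTS`, `edits1`, `correct` and `correct_query` are hypothetical.

```python
from collections import Counter

# Toy "language model": made-up word counts standing in for
# frequencies that would normally come from web documents.
WORD_COUNTS = Counter({
    "spelling": 50,
    "search": 120,
    "engine": 80,
    "results": 60,
    "language": 40,
})

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one delete, transpose, replace or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + c + right[1:] for left, right in splits if right for c in ALPHABET
    ]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Pick the most frequent known word within one edit, else keep the word."""
    if word in WORD_COUNTS:
        return word
    # Assumed error model: all one-edit candidates are equally likely,
    # so the language model (frequency) alone decides between them.
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word
    return max(candidates, key=lambda w: WORD_COUNTS[w])


def correct_query(query):
    """Correct each whitespace-separated token of a lowercased query."""
    return " ".join(correct(token) for token in query.lower().split())


print(correct_query("serch engin resuls"))
```

A corrector like this depends entirely on having a large vocabulary and reliable counts for the language in question, which is the data problem the engineers describe for languages with little web presence.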

Microsoft’s Speller100 tool is focussed on rarer languages by targeting language families that share characteristics. 

“Imagine someone had taught you how to spell in English and you automatically learned to also spell in German, Dutch, Afrikaans, Scots, and Luxembourgish. That is what zero-shot learning enables, and it is a key component in Speller100 that allows us to expand to languages with very little to no data.”

After conducting online A/B testing on Bing using the new tool, Microsoft said the number of pages with no results fell by up to 30%, the number of times users had to manually reformulate their query fell by 5%, and the number of times users clicked on any item on the page went from single digits to 70%.