Can a Computer Recognise Dialect?
Can a computer tell whether a sentence is written in Drents, Limburgs, or Gronings? Researchers at the Meertens Institute decided to take up this question. Using over a thousand Dutch dialect novels and a specially trained language model, they show that artificial intelligence performs remarkably well at identifying regional speech. This offers not only technical insights, but cultural ones too.
The library of the Meertens Institute holds over 1,100 dialect novels, written in a wide range of regional languages across different periods. “We knew that there was something special in that collection,” says Nikki Beyer, who started on the project as an intern and is now involved as a PhD student. “All those dialects preserved in writing, we felt there had to be more to discover.” The first step was digitising the books, a labour-intensive process that took over a year. Only then could the real research begin: Can a language model tell the difference between dialect and Standard Dutch?
Covers of dialect novels written by Bart Veenstra. © Meertens Institute
From BERTje to Meertje
For the experiment, Beyer and her colleagues used an existing language model: BERTje, developed at the University of Groningen. They retrained the model on tens of thousands of sentences taken from dialect novels. The result was a new model with a new name: Meertje. Meertje may look similar to generative AI systems like ChatGPT, but it works differently. “This isn’t a model that generates text,” Beyer stresses. “Meertje analyses language. It reads sentences and figures out whether they are written in dialect.”
To teach the model this, Beyer manually labelled around thirty thousand sentences as either ‘dialect’ or ‘no dialect’. The model was first tested on Drents, using texts from one specific writer: Bart Veenstra. It soon became clear that Meertje was starting to pick up on patterns that set Drents apart from Standard Dutch. Then came the real surprise: The model could identify other Dutch dialects too, from Gronings to Limburgs, with an accuracy rate of around 95 percent. “That was honestly a dream outcome,” says Beyer.
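As a rough illustration of what an accuracy figure like this means, the sketch below compares manual labels with model predictions and reports the share that match. The function name `accuracy` and the four example labels are made up for illustration; they are not the project's data or evaluation code.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of sentences whose predicted label equals the manual label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count positions where the model agrees with the human annotator.
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up labels, for illustration only (not the Meertje results).
gold = ["dialect", "no dialect", "dialect", "dialect"]
predicted = ["dialect", "no dialect", "no dialect", "dialect"]

score = accuracy(gold, predicted)
print(f"{score:.0%}")  # prints 75%
```

In practice such a score would be computed on sentences the model did not see during training, so that it reflects recognition rather than memorisation.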
Nikki Beyer, first as an intern and now as a PhD candidate affiliated with the project Dialect Novels at the Meertens Institute.
More than just spelling
What Meertje detects is more than just distinctive spelling. A simple phrase such as ‘wat zult dat’ was immediately identified as dialect by the model. That is because the model also pays attention to structure, rhythm, and grammatical patterns. Vowel combinations that are rare in Standard Dutch but common in Drents also play a role. The same goes for subtle differences in verb forms, noun endings, and sentence structure. What stands out is that Meertje learns how dialects differ from Standard Dutch and is then able to use that knowledge to recognise other regional languages.
Dialect as a social signal
The research also yielded literary insights. In many of these novels, the main characters speak dialect, while doctors, officials, and other people in positions of power speak Standard Dutch. “Even among writers who speak dialect themselves, you can see that language is never socially neutral,” says Beyer. Dialect signals closeness, identity, and sometimes inferiority, whereas Standard Dutch can convey distance and authority. The language model makes these patterns visible at scale.
Dialect shows where you come from
Beyer first studied literature and later linguistics. This project brings those two worlds together. “It sits exactly at the intersection of what I love,” she says. “Using computational methods to understand something human and cultural.” That human aspect is more relevant than ever. Dialects are losing ground in everyday life, yet they are also making a comeback. “In Limburg and Friesland, young people are embracing their regional language with pride,” says Beyer. “Dialect is not outdated. It shows exactly where you come from.” Dialect novels are especially important because they express identity and culture through literature.
A large digital collection of dialect texts
The next research question is why Meertje identifies a sentence as dialect. Which features matter most: sounds, grammar, or word order? The answers should ultimately tell us more about the linguistic structure of dialects.
At the same time, the researchers are building a large digital collection of dialect texts. After removing digitisation errors, each text is enriched with metadata on location, time period, and language use. By linking these texts to dialect dictionaries and other databases, researchers will be better able to compare how written dialect relates to spoken language.
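To make the idea of an enriched text record concrete, here is a minimal sketch of how one entry in such a collection might be represented. The class `DialectText`, its field names, the helper `by_dialect`, and the placeholder titles are all assumptions for illustration; the article does not describe the actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DialectText:
    """One digitised text plus the kinds of metadata the article mentions."""
    title: str
    author: str
    dialect: str                      # language use, e.g. "Drents"
    location: Optional[str] = None    # place associated with the text
    period: Optional[str] = None      # time period, left empty when unknown
    # Hypothetical identifiers pointing into dialect dictionaries or databases.
    dictionary_links: List[str] = field(default_factory=list)


def by_dialect(texts: List[DialectText], dialect: str) -> List[DialectText]:
    """Return the texts labelled with the given dialect (case-insensitive)."""
    wanted = dialect.casefold()
    return [t for t in texts if t.dialect.casefold() == wanted]


# Placeholder records, not real catalogue entries.
collection = [
    DialectText(title="Placeholder novel A", author="Bart Veenstra", dialect="Drents"),
    DialectText(title="Placeholder novel B", author="Unknown", dialect="Limburgs"),
]

drents = by_dialect(collection, "drents")
print([t.title for t in drents])  # prints ['Placeholder novel A']
```

Keeping location and period optional reflects that such details may be missing for some novels and can be filled in as the collection is enriched.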










