The Surprising Language That Outsmarted AI — And It’s Not English or Chinese

  • Writer: Tomasz Kruk
  • Oct 29
  • 3 min read

Updated: Oct 30

For AI, the ideal language is one that’s grammatically rich — packing tense, gender, and mood into each word — and written in the Latin alphabet, which tokenizers handle more efficiently than complex scripts like Chinese or Tamil. More grammar per sentence means more meaning per token — and faster learning.

This may sound like an obscure linguistic theory. But it’s actually a key reason why Polish — yes, Polish — just outperformed English and Chinese in a benchmark designed to test artificial intelligence’s understanding of long, complex texts.

It’s the kind of twist that rewrites assumptions.

A recent paper introducing the OneRuler benchmark tested large language models (LLMs) across 26 different languages. These are the models behind tools like ChatGPT and Claude, the engines trying to learn how humans think through language.

The expectation? English would dominate. Chinese might follow closely behind.

The actual winner? Polish.

Wait. Polish? Really?

Let’s set the scene: English dominates the internet. Chinese dominates global population and state-sponsored AI development. These are the heavyweights of AI training data.

So when researchers tested which language an AI model understood best — especially in long documents requiring memory, inference, and context — most bets were on English.

But Polish came out on top.

English placed sixth. Chinese didn’t even make the top 15.

This wasn’t a statistical anomaly. It was a wake-up call.

Not All Languages Are Equal — Especially to AI

LLMs don’t “understand” language like we do. They’re pattern machines. They learn by reading enormous volumes of text — books, Wikipedia, Reddit threads, code — and predicting what word comes next.

This process is hugely sensitive to structure. And here’s the thing: Polish gives AI more structure per sentence.

It’s what linguists call a morphologically rich language — one that crams grammatical information (person, number, tense, gender, mood) into single words. Where English uses helper words and fixed word order, Polish uses endings and inflections. That makes each token — each unit of meaning — far more dense.

That density helps language models learn more efficiently.
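A toy sketch makes the density point concrete. The English–Polish pairs below are illustrative examples chosen for this post, not data from the OneRuler benchmark; they show how Polish fuses tense, person, gender, and case into single inflected words where English needs helper words:

```python
# Toy illustration: Polish packs grammatical information into single
# inflected words where English spreads it across helper words.
pairs = [
    # (English phrase, Polish equivalent)
    ("I will read it", "przeczytam to"),     # future tense + person fused into the verb
    ("of the new books", "nowych książek"),  # case endings replace "of the"
    ("she was writing", "pisała"),           # tense + gender + aspect in one word
]

for en, pl in pairs:
    # Whitespace word counts stand in for tokens here; real tokenizers
    # split differently, but the density gap points the same way.
    print(f"{en!r}: {len(en.split())} words  vs  {pl!r}: {len(pl.split())} words")
```

Each Polish rendering carries the same meaning in fewer words, so each unit the model sees is doing more grammatical work.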

And there’s a technical advantage, too: Polish uses the Latin script. Languages with Latin alphabets tend to be easier for tokenizers — the software tools that chop up text for processing. In contrast, Chinese, Tamil, and other non-Latin languages introduce complications that reduce model performance, especially in long-form contexts.
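One concrete way to see the script effect: many modern tokenizers operate on UTF-8 bytes before merging them into tokens, and scripts differ sharply in bytes per character. The sample phrases below are illustrative, not taken from the benchmark:

```python
# Byte-level tokenizers start from UTF-8 bytes, and scripts differ in
# how many bytes each character takes: Latin letters are 1 byte,
# accented Latin letters like "ę" are 2, Chinese characters are 3.
samples = {
    "English": "language model",
    "Polish": "model językowy",
    "Chinese": "语言模型",
}

for name, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{name}: {len(text)} chars -> {n_bytes} UTF-8 bytes")
```

Chinese packs the same meaning into fewer characters but more bytes per character, which gives a byte-level tokenizer more raw units to merge before it reaches anything word-like. That is one mechanism (though not the only one) behind the Latin-script advantage the article describes.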

Language Is a Cognitive System — Not Just a Medium

But this isn’t just a quirk of machine learning. There’s a deeper implication: language shapes cognition.

We’ve long known that the language we speak can influence how we think. Speakers of languages with grammatical gender, for example, tend to describe the same object differently depending on the gender their language assigns it. Speakers of languages with complex case systems often score higher on tasks involving pattern recognition or working memory.

Poland’s track record here is striking:

  • Maria Skłodowska-Curie — the first person ever to win two Nobel Prizes (Physics, 1903; Chemistry, 1911); her disciplined, evidence-based work set a standard that still guides modern science

  • Marian Rejewski cracked the Enigma code before Alan Turing.

  • Stanisław Ulam helped invent the Monte Carlo method.

  • Wojciech Zaremba co-founded OpenAI.

This isn’t a coincidence. It’s the result of a cognitive culture steeped in linguistic rigor, logical structure, and constraint-based reasoning.

For AI, More Grammar Means Better Learning

So what does this mean for AI development?

Right now, most models are still heavily trained in English. That’s convenient, but it’s also limiting. English is relatively low on grammatical inflection — and high on ambiguity. It works for casual communication, but it may not be the best training ground for systems we want to be precise, logical, and fair.

The OneRuler results suggest that training on languages like Polish — or Finnish, Hungarian, Turkish — could offer models better exposure to structural nuance. It might even make them better at handling code, contracts, or complex reasoning.

If you’re designing systems to understand humans, start by understanding how language encodes thought.

Polish didn’t win because it’s “better.” It won because its structure gave the AI model more to learn from.

So... Should You Learn Polish?

Only if you enjoy puzzles.

The point isn’t to replace your Duolingo language streak — it’s to recognize that languages are not interchangeable data formats. They are cognitive architectures, evolved over centuries, shaping how people perceive, reason, and relate to the world.

If we want AI to truly engage with human language, we need to stop treating it as just another input-output stream — and start treating it as thinking material.

Language Is Code — In More Ways Than One

As AI systems get smarter, we have a choice: continue training them in convenience, or train them in complexity.

Polish shows us that complexity pays off.

The question isn’t which language is popular.

It’s which one teaches your system to think.


Sources

Kim et al., OneRuler: Benchmarking Multilingual Long-Context Language Models. arXiv:2503.01996v3 – https://arxiv.org/abs/2503.01996

