Why Large Language Models Struggle with Niche Programming Languages

When you ask ChatGPT or Code Llama to write a Python function, it performs like a seasoned developer. But ask the same model to write something in COBOL, Verilog, or Prolog, and the results quickly fall apart — strange syntax, non-compiling code, or logic that simply doesn’t make sense.

This isn’t because those languages are mysterious or obsolete. It’s because today’s large language models (LLMs) are trained for the common case. They speak the programming equivalents of English and Mandarin — not the rare dialects. In this piece, let’s unpack why these models struggle with domain-specific or lesser-known languages, what the data reveals, and how researchers are trying to fix it.

The Uneven Polyglot

Modern LLMs like GPT-4, Code Llama, and StarCoder are astonishingly capable across mainstream languages. They’ve seen massive quantities of Python, JavaScript, Java, and C++ — languages that dominate GitHub and Stack Overflow.

But the long tail of programming languages tells a different story. Languages such as COBOL, VHDL, Verilog, Haskell, Idris, or Prolog make up only a tiny fraction of the internet’s code corpus. Some, like COBOL, still power core financial systems worldwide, yet they remain underrepresented in the datasets used to train most LLMs.

The result is a model that’s an eloquent polyglot — but with an accent. It knows just enough about niche languages to sound confident, while often getting the details disastrously wrong.

Why This Happens

1. Training Data Is Skewed

LLMs learn from what they see. And what they see is overwhelmingly dominated by popular languages. Public repositories and question-answer sites provide hundreds of millions of Python functions for every handful of Verilog modules or COBOL procedures.

In the MultiPL-E benchmark, a multilingual extension of the HumanEval dataset, LLMs post their best scores on Python, but accuracy drops sharply on low-resource languages like R or Julia. When COBOL equivalents of those tasks were introduced in COBOLEval, GPT-4 solved only about 10% of them, compared to roughly 67% in Python, and almost half of its outputs didn’t compile at all.

The reason is simple: the statistical patterns of those niche languages barely appear during training, so the model never learns their deeper structure. It’s like trying to learn Icelandic by reading a few tweets.

2. Syntax and Tokenization Challenges

Another, subtler reason lies in how LLMs see code. Most use tokenizers — algorithms that split text into fragments. These tokenizers are optimized for English words and familiar programming symbols.

When they encounter something unusual — say, COBOL’s verbose PERFORM UNTIL or VHDL’s signal assignments — the tokenizer may chop it into awkward fragments the model has rarely seen. Each fragment carries little statistical signal of its own, which makes prediction harder.
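
You can see the effect directly with OpenAI’s open-source tiktoken library. The sketch below compares how the cl100k_base encoding (used by GPT-4-era models) splits a familiar Python line versus a COBOL PERFORM statement; the exact fragments vary by tokenizer, so treat this as an illustration rather than a measurement.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

snippets = {
    "Python": "for i in range(10):",
    "COBOL": "PERFORM VARYING WS-IDX FROM 1 BY 1 UNTIL WS-IDX > 10",
}

for lang, code in snippets.items():
    tokens = enc.encode(code)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{lang}: {len(tokens)} tokens -> {pieces}")
```

Common Python keywords tend to survive as single tokens, while COBOL identifiers like WS-IDX are typically shredded into several rarely seen pieces.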

Then there’s structure. COBOL requires variables to be declared up front in the DATA DIVISION and used much later in the PROCEDURE DIVISION. LLMs, however, generate code linearly, one token after another. They’re good at top-to-bottom reasoning, but not at jumping around a rigid skeleton. That’s why even large models tend to produce misplaced or missing declarations in COBOL, or to mix Verilog and SystemVerilog syntax unintentionally.
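
To make that structural constraint concrete, here is a deliberately tiny, regex-based sketch (nothing close to a real COBOL parser) that checks whether every WS- variable referenced in the PROCEDURE DIVISION was actually declared in WORKING-STORAGE. It flags exactly the class of error that linear, top-to-bottom generation tends to produce:

```python
import re

COBOL = """\
       IDENTIFICATION DIVISION.
       PROGRAM-ID. DEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01 WS-TOTAL PIC 9(4) VALUE 0.
       PROCEDURE DIVISION.
           ADD 1 TO WS-TOTAL.
           DISPLAY WS-COUNT.
           STOP RUN.
"""

# Split the program at the PROCEDURE DIVISION boundary.
data_part, proc_part = re.split(r"PROCEDURE DIVISION\.", COBOL)

# Toy heuristic: declarations look like "01 NAME", references like "WS-NAME".
declared = set(re.findall(r"^\s*\d{2}\s+([A-Z0-9-]+)", data_part, re.M))
referenced = set(re.findall(r"\bWS-[A-Z0-9-]+\b", proc_part))

for name in sorted(referenced - declared):
    print(f"undeclared variable referenced: {name}")  # -> WS-COUNT
```

A compiler catches this instantly; a model generating token by token, with the declarations already hundreds of tokens behind it, often doesn’t.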

3. Paradigm and Semantics Mismatch

Some languages embody paradigms fundamentally different from procedural or object-oriented code. Prolog relies on logical inference; Haskell enforces pure functional composition; Idris encodes proofs within its type system.

LLMs are, at heart, probabilistic next-token predictors. They can mimic style, but they don’t perform logical inference or type checking internally. So while they might produce syntactically valid code, the semantics often fail: a Prolog predicate may look correct but not solve the problem, and a Haskell function may look idiomatic yet fail to type-check.

4. Fine-Tuning Priorities

Even if these languages appear in pretraining data, they’re rarely included during fine-tuning, where models learn to follow instructions and correct themselves. Companies optimize for user demand, which means Python, JavaScript, and SQL dominate.

As one engineer aptly put it: “Every parameter we dedicate to COBOL is one we take away from Python.”

This explains why even powerful models like GPT-4, which theoretically “know” dozens of languages, stumble without specific domain fine-tuning. They have broad exposure but shallow specialization.

What the Data Shows

The evidence is consistent:

  • COBOL: GPT-4 solves ~10% of tasks on COBOLEval; Code Llama 34B solves ~2%. A smaller model fine-tuned on COBOL outperformed both.
  • Verilog: Models confuse Verilog with SystemVerilog; accuracy drops as complexity increases.
  • Prolog and Haskell: LLMs manage beginner-level examples but fail on logical reasoning or type inference.
  • Rust: Even with rising popularity, models still produce code that breaks Rust’s borrow checker.

Across benchmarks, the pattern is clear: high performance correlates with data abundance, not inherent language difficulty.

Bridging the Gap

The good news: researchers are already working on it.

1. Broader, Balanced Datasets
Open-source efforts like The Stack v2 and StarCoder2 deliberately include hundreds of programming languages. This ensures even rare ones appear often enough for models to notice patterns. When real data is scarce, synthetic translation — converting Python tasks into COBOL or Verilog — helps fill the gap.

2. Domain-Specific Fine-Tuning
Fine-tuning on curated corpora changes everything. Bloop AI fine-tuned Code Llama on COBOL, producing a variant that surpassed GPT-4 on that benchmark. Similar successes have emerged for Verilog, showing that targeted retraining outperforms sheer model size.

3. Compiler-in-the-Loop Learning
An emerging idea is to include compilers in the feedback loop. The model generates code, checks it with a compiler, learns from the error messages, and retries. Over time, it internalizes what “compiles” and what doesn’t — grounding its predictions in executable reality.

4. Benchmarks and Incentives
Every time a new benchmark appears — COBOLEval, VerilogEval, MultiPL-E — the research community rallies to improve results. Benchmarks create accountability. They make blind spots visible and give researchers a reason to care about the long tail.

Toward a Truly Multilingual AI

Today’s code models resemble skilled linguists who can converse fluently in English but fumble through Ancient Greek. Their limitations stem not from a lack of intelligence but from what they’ve been exposed to.

Expanding training data, embracing fine-tuning, and closing the feedback loop will gradually make them fluent in the full spectrum of programming languages. When that happens, AI won’t just assist with popular frameworks — it will help maintain legacy mainframes, design next-generation hardware, and even reason in logic languages most developers have never touched.

The road to that future starts with a simple realization: if we want truly intelligent coding assistants, they must learn every dialect of computation — not just the ones trending on GitHub.
