AI Detectors and Non-Native English Speakers: The 61% Bias Problem (2026)

I teach Business and Management to postgraduate students in Saudi Arabia and the UAE. Most of them are non-native English speakers. They write in their second, sometimes third language, and they work hard at it. They take the writing seriously. They ask for feedback. They revise.

Over the last two years, I have watched too many of them get accused of using AI when they had not. The accusations come from professors at universities in the UK, the US, and Australia, where many of my students continue their studies. The pattern is always the same. The student submits an essay. The professor runs it through ZeroGPT or a similar tool. The score comes back high. The student is summoned. The conversation that follows is sometimes career-ending.

Almost every time, the student is innocent.

This is not my opinion. There is a body of peer-reviewed research that documents exactly why this happens, and the numbers are worse than most professors realize. If you teach, study, or work with non-native English writers, you need to read this carefully.

The Stanford finding that should have changed everything

In April 2023, a team of Stanford researchers led by Weixin Liang and James Zou published a study with one of the cleanest findings I have ever seen in AI research. They tested seven widely used AI detectors against two sets of essays. The first set was 91 essays written by non-native English speakers for the TOEFL exam. The second was a control set of essays written by native English speakers (US eighth-grade students).

The results were not subtle.

All seven AI detectors unanimously identified 18 of the 91 TOEFL student essays (19%) as AI-generated, and 89 of the 91 TOEFL essays (97%) were flagged by at least one of the detectors. The average false positive rate across the detectors on the TOEFL essays was 61.3%.

The same detectors handled the US eighth-grade essays with near-perfect accuracy. More than half of the non-native-authored TOEFL essays were incorrectly classified as "AI-generated," while detectors exhibited near-perfect accuracy for US eighth-grade essays.

Read those numbers again. Six out of every ten essays written by hand by non-native English speakers were wrongly accused of being AI. Almost every single TOEFL essay was flagged at least once across the detector ensemble. The same tools that almost never made mistakes on native-speaker writing made mistakes constantly on non-native writing.

This was published in the journal Patterns, which is part of the Cell Press family. It is not a blog post. It is peer-reviewed research from a credible institution. The full citation, if you need it for your own work or to send to your professor, is: Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., and Zou, J. (2023). GPT detectors are biased against non-native English writers. Patterns 4, 100779.

Why this happens (and why it is not going away on its own)

The mechanism behind the bias is technical but worth understanding, because it tells you why no patch or update is going to fully fix the problem.

AI detectors measure something called perplexity. Perplexity is a number that captures how predictable the next word in a sentence is, given everything before it. Low perplexity means the word is predictable, so the text looks more like the kind of output a language model produces. High perplexity means the word is surprising, so the text looks more human. The detector decides "AI or human" largely on this signal.
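To make that concrete, here is a minimal sketch of a perplexity check using the open-source GPT-2 model through the Hugging Face transformers library. This is not how ZeroGPT or any other commercial detector is actually implemented; it only illustrates the signal they build on, and the two sample sentences are my own invented examples.

```python
# Minimal perplexity sketch with GPT-2 (illustrative only, not a real detector).
# Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Return GPT-2's perplexity for a piece of text (lower = more predictable)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the average
        # cross-entropy loss over the text; exponentiating gives perplexity.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Conventional, textbook-style phrasing tends to score lower (more predictable)...
formal = "In conclusion, education is very important because it helps people to get good jobs."
# ...while idiosyncratic phrasing tends to score higher (more surprising).
varied = "My grandmother swore that algebra tasted like burnt coffee, and honestly she had a point."

print(f"formal: {perplexity(formal):.1f}")
print(f"varied: {perplexity(varied):.1f}")

# A naive detector thresholds this number: below some cutoff, call it "AI".
# An L2 writer's conventional phrasing can land on the wrong side of that
# cutoff for reasons that have nothing to do with using AI.
```

Run something like this on your own paragraphs and you can see why careful, conventionally phrased English, exactly what a diligent L2 writer produces, tends to sit close to the "AI" side of any threshold.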

Here is the trap. Non-native English writers tend to show less linguistic variability: lower lexical richness, less syntactic diversity, and less grammatical complexity. Analyzing academic research papers from ICLR 2023, the researchers found that papers whose first authors came from countries where English is not the native language showed lower text perplexity than papers by their native English-speaking counterparts, meaning their language use is more predictable to a generative language model.

In other words, a non-native English writer is not less skilled. They are working in a second language, drawing on a smaller active vocabulary, and using more conventional grammatical structures because that is how second-language learning works. Those choices are completely reasonable for an L2 writer. But statistically, those same choices look identical to AI output.

The detector cannot tell the difference between "this person is writing in their second language" and "this was generated by a transformer model." Both produce text with low perplexity. The detector flags both.

The Stanford team also showed an even more uncomfortable finding. They asked ChatGPT to rewrite the TOEFL essays with the prompt "Enhance the word choices to sound more like that of a native speaker." Detection rates dropped sharply. So the way to "pass" an AI detector, if you are an ESL writer, is to ask an AI to make your writing sound more native. The system rewards exactly the behavior it claims to police.

Going the other direction is just as revealing. The researchers asked ChatGPT to rewrite native US essays with the prompt "Simplify word choices as if written by a non-native speaker." The detection rate jumped, with native essays now being flagged as AI. The bias is not about who used AI. It is about whose linguistic patterns the detector treats as suspicious.
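If you want to reproduce the asymmetry yourself, the protocol is easy to approximate. The sketch below uses the OpenAI Python client to apply the two prompts quoted above; the model name is my assumption, and score_with_detector is a placeholder for whichever detector you are testing, since the detectors in the study do not share a single public API.

```python
# Rough approximation of the Liang et al. rewrite experiment (illustrative only).
# Requires: pip install openai, plus an API key in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# The two prompts quoted in the study.
ENHANCE = "Enhance the word choices to sound more like that of a native speaker."
SIMPLIFY = "Simplify word choices as if written by a non-native speaker."

def rewrite(essay: str, instruction: str) -> str:
    """Ask a chat model to rewrite an essay according to one of the study's prompts."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model will do for the sketch
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": essay},
        ],
    )
    return response.choices[0].message.content

def score_with_detector(text: str) -> float:
    """Placeholder: plug in whichever detector you are evaluating."""
    raise NotImplementedError

# For each human-written essay, compare the detector's score before and after
# the ENHANCE rewrite; the study found that flags drop sharply after it.
```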

The pushback (and why it does not change the practical problem)

In fairness, there has been pushback against the Stanford study. Some commercial AI detection companies, including Originality.ai, have pointed out that the sample size of 91 TOEFL essays was small, and the essays were taken from a student forum rather than from a verified test administration. They argue that a larger and more diverse dataset is needed for more robust conclusions about the performance of AI checkers on non-native English writing.

These are reasonable methodological critiques. The Stanford study is one study, with one sample, and the field would benefit from larger replications.

But two things complicate the pushback.

First, the underlying mechanism (reduced lexical and syntactic variability in L2 writing, which shows up as lower perplexity) rests on patterns that second-language research documented long before AI detection existed. The Liang study did not invent the perplexity gap between native and non-native writers. The authors documented it and showed how AI detectors interact with it. Even if the exact 61.3% figure is contested, the direction of the bias is not.

Second, the lived experience of non-native English writers being disproportionately flagged is documented in thousands of forum posts, news articles, and case studies independent of the Stanford paper. The Washington Post ran a 2023 story about a Texas student wrongly accused. There have been documented cases of universities reversing accusations against international students after reviewing the original work. The Stanford study formalized something that ESL students were already experiencing.

The honest summary is this. The exact percentage of false positives on non-native English writing depends on the detector, the year, the test set, and the prompts used. It is not always 61%. But it is consistently higher than the false positive rate on native-speaker writing, and the gap is large enough that any institution using AI detection on a population that includes non-native English speakers is making a different decision for them than for everyone else.

What this means in practice for ESL students

If you are a non-native English speaker reading this, here is the practical takeaway.

You are statistically more likely to be flagged. Not because your writing is worse. Because the tools were not built for you. The training data, the perplexity thresholds, and the linguistic baselines were all calibrated on native English writing. You are being measured against a yardstick that was made for someone else.

This does not mean you should avoid writing well. It means you should know what you are walking into and prepare evidence accordingly.

Three concrete habits I recommend to my own students.

Keep your version history. Write your essays in Google Docs or Microsoft Word. Google Docs saves a timestamped edit history automatically; Word does the same when the file lives on OneDrive or SharePoint. If you are accused, that history is far stronger evidence of human authorship than any detector score. Take screenshots periodically as you write.

Build a writing sample portfolio. Save your hand-written drafts, your in-class essays, your earlier coursework. If your style is documented across many pieces of work, a single flagged essay sits in context. Professors who have your previous writing on file are much less likely to act on a detector score that does not match what they already know about your voice.

Cite the research preemptively. If you suspect your work might be flagged because your topic is generic or your structure is formal, include a short footnote in your submission saying "I am aware that some AI detection tools have documented bias against formal academic writing and against non-native English speakers (see Liang et al., 2023, Patterns 4, 100779). I have written this entirely myself and am happy to provide draft history on request." This puts the issue on the table before it can be used against you.

What this means for professors and institutions

If you teach in a department that includes international students, or you grade essays as part of an admissions process that draws from non-native English populations, your decision about whether to use AI detection at all has a fairness dimension you may not have fully considered.

Using a tool that flags 60% of essays from non-native speakers and 1% of essays from native speakers does not mean you are "catching more cheaters from group A." It means the tool sees group A and group B differently in a way that has nothing to do with cheating. Acting on those flags reproduces the bias.
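The arithmetic behind that claim is worth spelling out. Suppose, purely as an assumption for illustration, that 5% of students in each group genuinely misuse AI and that the detector catches 90% of real cases; combine that with the roughly 60% and 1% false positive figures above, and Bayes' rule gives a flag a very different meaning in each group.

```python
# Back-of-the-envelope precision of a detector flag for each group.
# All inputs are assumptions for illustration; swap in your own estimates.

def flag_precision(prevalence: float, true_positive_rate: float, false_positive_rate: float) -> float:
    """P(student actually used AI | detector flagged them), via Bayes' rule."""
    flagged_cheaters = prevalence * true_positive_rate
    flagged_innocent = (1 - prevalence) * false_positive_rate
    return flagged_cheaters / (flagged_cheaters + flagged_innocent)

prevalence = 0.05  # assume 5% of submissions in each group misuse AI
tpr = 0.90         # assume the detector catches 90% of real cases

native = flag_precision(prevalence, tpr, false_positive_rate=0.01)
non_native = flag_precision(prevalence, tpr, false_positive_rate=0.61)

print(f"Native speakers:     a flag is correct ~{native:.0%} of the time")
print(f"Non-native speakers: a flag is correct ~{non_native:.0%} of the time")
# Under these assumptions, roughly 8 in 10 flags on native writing point to a
# real case, while fewer than 1 in 13 flags on non-native writing do.
```

Under those assumptions, a flag on a native speaker's essay is usually right and a flag on a non-native speaker's essay is usually wrong. Adjust the inputs however you like; the asymmetry survives any plausible choice of numbers as long as the false positive rates stay that far apart.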

The most defensible institutional response, in my view, is the one some universities have already adopted. Use the detector as a flag for follow-up conversation, never as evidence on its own. Compare the flagged essay against the student's previous work. Ask the student to explain or expand on their argument verbally. Look at their writing process, not just their output. These steps catch the actual cases where AI was misused. They also clear the false positives, which on any non-native population will be the majority of flags.

Vanderbilt University disabled Turnitin's AI detection in 2023 for related concerns. The University of Pittsburgh has issued guidance against relying on AI detection in misconduct cases. The University of Texas at Austin has taken a similar position. These are not radical institutions. They are reading the same research you can read, and drawing the obvious conclusion.

What this means for the GCC and Gulf academic context

This is the part I care about most personally, and the part most underdiscussed in the international research.

A growing number of UAE and KSA universities are using AI detection tools as part of their academic integrity workflows. Some are using ZeroGPT directly. Others have institutional licenses for Turnitin's AI feature or for Copyleaks. The students being scanned are predominantly non-native English speakers. Many are also writing in a register that has been heavily influenced by formal Arabic academic conventions, which add another layer of statistical predictability to the English output.

The combination is risky. ESL bias plus formal academic writing plus heavily structured paragraphs is the worst-case profile for false positives. A student in this situation can do everything right and still get flagged.

Saudi and Emirati universities have an opportunity here that Western institutions have largely missed. They can build AI detection policy that takes ESL bias into account from the start, rather than retrofitting fairness onto a system that was never designed for their student population. That means policies that require corroborating evidence, conversations with the student, and process documentation before any detector score can be acted on. It also means transparent communication to students about how detection is used, so they can prepare evidence proactively.

If you are an administrator at a Gulf university and this is on your desk, I would be glad to share the policy frameworks I have seen work. Email me. This is one of the parts of my work I take most seriously.

What about Arabic and bilingual writing?

A separate question that comes up often. Most AI detectors, including ZeroGPT, claim multilingual support. In practice, accuracy on Arabic is lower than on English, partly because the training data for these tools was overwhelmingly English, and partly because the linguistic features that perplexity measurements rely on do not transfer cleanly across language families.

If you are writing or grading in Arabic, treat any AI detector score as a rough first-pass signal at best. Do not act on it without other evidence. The same is true for code-switched or bilingual academic writing, which has even less representation in the training data.

This is a gap in the market that I expect will close in the next two years. For now, the honest answer is that AI detection in Arabic is not at the same maturity level as AI detection in English, and decisions made on it should reflect that.

Frequently asked questions

Is the 61% figure still accurate in 2026? The original Stanford study was published in 2023, and the detection landscape has changed since then. Some detectors have improved. Others have not. Independent 2026 testing on ZeroGPT specifically continues to show elevated false positive rates on formal academic writing, which is the closest available proxy. The bias has not been solved. The exact percentage is contested. The direction is consistent.

Are some AI detectors better for non-native English speakers than others? Anecdotally, yes. Scribbr's detector tends to be more conservative on academic writing of all kinds. GPTZero publishes detailed methodology and has a lower documented false positive rate than ZeroGPT in most independent benchmarks. ZeroGPTFree, specifically, avoids the aggressive scoring of formal text that inflates false positives. None of these are perfect for ESL writers. They are simply less bad.

What should an international student do if they are accused based on a detector score? Stay calm. Provide your draft history. Cite the Liang et al. 2023 study by name. Offer a follow-up conversation or an in-person writing exercise with the professor. Most reasonable professors will reconsider once they understand the research. If your school has a student advocacy office, involve them early.

Can I make my writing "less AI-like" to pass detectors? Sort of, but the methods involve making your writing worse. Adding deliberate errors, using less precise vocabulary, varying sentence length artificially. These are bad habits to teach yourself, and they can hurt your grades on the actual content. The better long-term answer is to push back on the use of detectors as primary evidence.

Why do AI detector companies not warn users about this bias? A reasonable question. The commercial incentive cuts the other way. A detector that says "we have a 20% false positive rate on non-native English writing" sells worse than a detector that says "98% accuracy." Most companies have chosen the latter framing. ZeroGPT is one example. The marketing claims do not match the published research.


Sources cited in this article

Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., and Zou, J. (2023). GPT detectors are biased against non-native English writers. Patterns 4, 100779.

Last updated April 26, 2026. If you are working on AI detection policy at a Gulf university, or if you have been personally affected by an AI detection accusation as a non-native English speaker, please email me. I update this article based on what I learn from real cases.