{"id":993,"date":"2026-05-15T05:14:51","date_gmt":"2026-05-14T22:14:51","guid":{"rendered":"https:\/\/blog.datacore.vn\/?p=993"},"modified":"2026-05-15T05:14:53","modified_gmt":"2026-05-14T22:14:53","slug":"ai-training-data-vietnam-synthetic-shortcuts","status":"publish","type":"post","link":"https:\/\/blog.datacore.vn\/en\/ai-training-data-vietnam-synthetic-shortcuts\/","title":{"rendered":"AI Training Data: 3 Hidden Costs of Synthetic Shortcuts in Vietnam"},"content":{"rendered":"\n<p>By July 2024, the question of where the next generation of language models would get its <strong>AI training data<\/strong> had stopped being theoretical. Public internet text was already saturating, and the projected supply of high-quality human-written text had been overtaken by demand from frontier labs. Synthetic AI training data, meaning text generated by other large language models, was being floated as the obvious answer.<\/p>\n\n\n\n<p>A paper published that month on arXiv, &#8220;Regurgitative Training: The Value of Real Data in Training Large Language Models&#8221; by Zhang, Qiao, Yang and Wei, set out to test whether synthetic AI training data actually works. The findings are sobering for anyone betting their roadmap on synthetic shortcuts, and they have particular weight for an emerging market like Vietnam. (Related: see our take in <a href=\"https:\/\/blog.datacore.vn\/blitzscaling-in-a-world-of-ai-and-why-vietnam-needs-its-own-datacore\/\">Blitzscaling in a world of AI<\/a>.)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The experiment, plainly<\/h2>\n\n\n\n<p>The authors ran two parallel tests of AI training data quality. They fine-tuned GPT-3.5 on a machine translation task using either real human translations or text generated by other LLMs. Then they trained transformer models from scratch under the same conditions. 
In both cases, models trained on machine-generated AI training data performed worse than those trained on human data.<\/p>\n\n\n\n<p>That alone is not surprising. What matters is the size of the gap and what is driving it. The authors point to two mechanisms at work in synthetic AI training data. The first is straightforward: LLM-generated training data carries a higher error rate than the human equivalent. The second is more interesting: LLM output has lower lexical diversity. Machine-generated text repeats itself, in a sense, more than human-written text does. Train on that narrow distribution and you inherit it.<\/p>\n\n\n\n<p>The team did not stop at diagnosis. They tested three remedies. They built a quality metric and fed high-quality machine data to the model first. They mixed outputs from several different LLMs to widen the lexical range. They trained a classifier to detect which synthetic samples looked most human, and ordered training accordingly. Each method helped. None of them fully closed the gap. The paper&#8217;s conclusion is blunt: real, human-generated AI training data &#8220;cannot be easily substituted by synthetic, LLM-generated data.&#8221;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why this matters more for Vietnamese AI than for English<\/h2>\n\n\n\n<p>In English, the supply of authentic human text is enormous. A team building a frontier model can paper over many of the weaknesses of synthetic AI training data through sheer volume. In Vietnamese, that volume does not exist. Indexed Vietnamese web text is a small fraction of indexed English text, much of it duplicated, machine-translated, or auto-generated. The base distribution any Vietnamese-focused model is drawing from is already narrow. 
We touched on the broader Vietnam data scarcity question in <a href=\"https:\/\/blog.datacore.vn\/from-raw-data-to-strategic-advantage-how-datacore-turns-information-into-decisions\/\">From Raw Data to Strategic Advantage<\/a>.<\/p>\n\n\n\n<p>If you accept the result, the implication for Vietnamese AI work is uncomfortable. Synthetic-data shortcuts compound an existing diversity problem rather than solve it. Every additional generation of model output added to a training set narrows the distribution a little more. The regurgitation effect, mild in English, is sharper in a small-language setting.<\/p>\n\n\n\n<p>The same logic applies to fine-tuning and alignment. RLHF preferences scraped from machine outputs reflect machine preferences. Hallucination corrections drafted by an LLM tend to repeat the LLM&#8217;s blind spots. Supervised fine-tuning examples auto-written by a model often look fluent and miss the cultural register a human annotator would catch immediately. The error compounds across each pass of low-quality AI training data.<\/p>\n\n\n\n<p>There is a concrete texture to this for Vietnamese specifically. The language has six tones, regional dialects with measurably different lexicons across the North, Central and South, code-switching with English and Chinese loanwords in technical domains, and a writing system that is often diacritic-stripped on social platforms. Each of those features carries real information that a current LLM does not reliably reproduce. A synthetic training corpus written by a model that already smooths these features will train the next model to smooth them further.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What is actually being built in Vietnam&#8217;s AI training data sector<\/h2>\n\n\n\n<p>Domestic infrastructure for human-generated AI training data is beginning to appear. 
Crowdsourcing platforms, university partnerships, and small specialist agencies are starting to compete for a market that did not meaningfully exist five years ago. Why these communities form and how they sustain output is a question we explored separately in <a href=\"https:\/\/blog.datacore.vn\/open-innovation-in-the-age-of-data-why-people-participate-and-what-smart-platforms-enable\/\">Open Innovation in the Age of Data<\/a>.<\/p>\n\n\n\n<p>One example is <a href=\"http:\/\/questlab.vn\" data-type=\"link\" data-id=\"questlab.vn\" target=\"_blank\" rel=\"noopener\">QuestLab<\/a>, a Vietnam-built crowdsourcing platform that describes itself on its homepage as &#8220;Vietnam&#8217;s leading community data platform&#8221; and reports a network of more than 50,000 verified contributors. Its public catalogue covers the categories that map directly to the gaps Zhang et al. identify: RLHF preference data and reward modelling, hallucination audits with fact-checking against trusted sources, supervised fine-tuning instruction datasets, image and video annotation, OCR for handwritten and structured Vietnamese documents, multi-region Vietnamese speech recording, and content moderation. It also runs market research and field operations such as retail audits and mystery shopping, which is a useful reminder that human-data infrastructure rarely scales on AI demand alone.<\/p>\n\n\n\n<p>What is worth flagging objectively is that platforms like this operate at scale only because they sit on top of distributed contributor networks. A 50,000-person contributor base is not a marketing number for an AI team. It is a practical constraint on how many parallel labelling lines a project can run, how many dialects a speech dataset can cover, and how quickly a hallucination audit can be turned around. 
The same is true of regional rivals in Southeast Asia.<\/p>\n\n\n\n<p><a href=\"http:\/\/questlab.vn\" data-type=\"link\" data-id=\"questlab.vn\" target=\"_blank\" rel=\"noopener\">QuestLab<\/a> is one of several players. The point is not that any single vendor solves the AI training data problem the paper raises, but that the paper has changed the cost calculation for buyers. If a Vietnamese AI team budgets purely for compute and synthetic pipelines and treats human-data spend as discretionary, the evidence suggests that team is structurally disadvantaged against one that does not. Procurement teams running these contracts can borrow discipline from the supplier verification approach we outlined in <a href=\"https:\/\/blog.datacore.vn\/vietnamese-supplier-verification-2026-playbook\/\">Vietnamese supplier verification in 2026<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The harder question<\/h2>\n\n\n\n<p>The three mitigation strategies the authors test deserve a closer look, because they are the strategies most enterprise AI teams reach for when human-data budgets get cut. Order high-quality synthetic data first. Mix outputs from multiple model families. Filter by human-likeness. All three are reasonable. All three help. None of them are full substitutes. The most plausible reading of the paper is that synthetic AI training data is a useful complement to real data, never a replacement.<\/p>\n\n\n\n<p>That framing is not new. It echoes findings from Shumailov et al. (Nature, 2024) on &#8220;model collapse&#8221; under recursive training, and earlier work on data diversity in machine translation. What Zhang et al. 
add is a clean, controlled experiment on a model class people actually deploy, and a clear statement that the published mitigations for synthetic AI training data are not yet enough.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1264\" height=\"784\" src=\"https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/03-mitigations.png\" alt=\"Three mitigation strategies tested for synthetic AI training data\" class=\"wp-image-1013\" srcset=\"https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/03-mitigations.png 1264w, https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/03-mitigations-300x186.png 300w, https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/03-mitigations-1024x635.png 1024w, https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/03-mitigations-768x476.png 768w, https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/03-mitigations-18x12.png 18w\" sizes=\"auto, (max-width: 1264px) 100vw, 1264px\" \/><figcaption class=\"wp-element-caption\">Three mitigations tested. All three help. None fully close the gap for AI training data.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Three proven costs Vietnamese AI teams should price in<\/h2>\n\n\n\n<p>Tying the paper back to procurement reality, three concrete costs follow from over-reliance on synthetic AI training data, and Vietnamese teams should price each of them in.<\/p>\n\n\n\n<p><strong>Cost one: accuracy decay on downstream metrics.<\/strong> Higher error rates in synthetic AI training data flow straight into model outputs. 
For applications where accuracy matters (legal, medical, financial), that decay is a direct cost in correction, refunds, and trust loss.<\/p>\n\n\n\n<p><strong>Cost two: lexical and cultural blandness.<\/strong> The diversity collapse the authors document means Vietnamese-language models trained on heavily synthetic AI training data will sound flatter, miss regional nuance, and underperform on long-tail vocabulary. The cost shows up in user retention and brand perception.<\/p>\n\n\n\n<p><strong>Cost three: compounding model debt.<\/strong> Each retraining pass on cheaper synthetic AI training data narrows the distribution further. The cost is invisible quarter to quarter but accumulates as a real ceiling on what the next model generation can do.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What to watch over the next twelve months<\/h2>\n\n\n\n<p>Three things are worth tracking. First, whether the published mitigations for synthetic AI training data improve. There is a plausible technical path to closing more of the gap, through better quality metrics, better detection classifiers, and ensembling at scale, and any progress in those directions lowers the premium on human data. Second, whether Vietnamese-language data buyers start contracting differently. A shift from per-task pricing to retainer-based contributor-network access would be one signal that the buying side is internalising the result. Third, whether regulators in Hanoi start treating training data provenance as a compliance question. They have not yet.<\/p>\n\n\n\n<p>For now, the practical takeaway is the unromantic one. Real human AI training data is expensive for a reason, and the cost of skipping it is no longer hypothetical. 
For Vietnamese teams in particular, building on a base distribution that is small to begin with, the case for investing in domestic human-data infrastructure is harder to argue against than it was a year ago.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">References<\/h3>\n\n\n\n<p>Zhang, J., Qiao, D., Yang, M., &amp; Wei, Q. (2024). <em>Regurgitative Training: The Value of Real Data in Training Large Language Models.<\/em> arXiv:2407.12835. <a href=\"https:\/\/arxiv.org\/abs\/2407.12835\" target=\"_blank\" rel=\"noopener\">https:\/\/arxiv.org\/abs\/2407.12835<\/a><\/p>\n\n\n\n<p>Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., &amp; Gal, Y. (2024). <em>AI models collapse when trained on recursively generated data.<\/em> Nature.<\/p>\n\n\n\n<p>QuestLab. (Accessed May 2026). <em>QuestLab: Vietnam&#8217;s Leading Community Data Platform.<\/em> <a href=\"https:\/\/questlab.vn\" target=\"_blank\" rel=\"noopener\">https:\/\/questlab.vn<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A 2024 paper proves AI training data from LLMs underperforms human-written text. 
3 hidden costs for Vietnam AI teams and the case for real human data.<\/p>\n","protected":false},"author":5,"featured_media":1010,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","_swt_meta_header_display":false,"_swt_meta_footer_display":false,"_swt_meta_site_title_display":false,"_swt_meta_sticky_header":false,"_swt_meta_transparent_header":false,"footnotes":""},"categories":[6,57],"tags":[],"class_list":["post-993","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog","category-cong-nghe"],"uagb_featured_image_src":{"full":["https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/00-featured.png",1240,640,false],"thumbnail":["https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/00-featured-150x150.png",150,150,true],"medium":["https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/00-featured-300x155.png",300,155,true],"medium_large":["https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/00-featured-768x396.png",768,396,true],"large":["https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/00-featured-1024x529.png",1024,529,true],"1536x1536":["https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/00-featured.png",1240,640,false],"2048x2048":["https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/00-featured.png",1240,640,false],"trp-custom-language-flag":["https:\/\/blog.datacore.vn\/wp-content\/uploads\/2026\/05\/00-featured-18x9.png",18,9,true]},"uagb_author_info":{"display_name":"Mike","author_link":"https:\/\/blog.datacore.vn\/en\/author\/mike\/"},"uagb_comment_info":0,"uagb_excerpt":"A 2024 paper proves AI training data from LLMs underperforms human-written text. 
3 hidden costs for Vietnam AI teams and the case for real human data.","_links":{"self":[{"href":"https:\/\/blog.datacore.vn\/en\/wp-json\/wp\/v2\/posts\/993","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.datacore.vn\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.datacore.vn\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.datacore.vn\/en\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.datacore.vn\/en\/wp-json\/wp\/v2\/comments?post=993"}],"version-history":[{"count":7,"href":"https:\/\/blog.datacore.vn\/en\/wp-json\/wp\/v2\/posts\/993\/revisions"}],"predecessor-version":[{"id":1030,"href":"https:\/\/blog.datacore.vn\/en\/wp-json\/wp\/v2\/posts\/993\/revisions\/1030"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.datacore.vn\/en\/wp-json\/wp\/v2\/media\/1010"}],"wp:attachment":[{"href":"https:\/\/blog.datacore.vn\/en\/wp-json\/wp\/v2\/media?parent=993"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.datacore.vn\/en\/wp-json\/wp\/v2\/categories?post=993"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.datacore.vn\/en\/wp-json\/wp\/v2\/tags?post=993"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}