How AI is transforming Khmer language documentation
For most of the 20th century, Khmer lived more in the air than on the page. Court proceedings, hospital intake, university lectures, NGO field interviews, family histories - spoken in full, captured in fragments. The cost of writing it all down was simply too high.
At SMEAN, we’ve watched that cost collapse. An hour of Khmer audio used to take a human transcriber a full working day and cost somewhere around $80. It now runs through our pipeline in minutes for less than the price of a coffee. That shift sounds technical, but it reshapes who gets to keep a record of what they said.
The ministry
A government ministry we work with used to summarize internal meetings from memory. Decisions made in Khmer were filed in Khmer-flavored English, and the original reasoning quietly evaporated. With transcription in the loop, the meeting itself becomes the document. Nothing is translated away before it’s recorded.
The NGO
Community health workers in the provinces conduct hundreds of interviews a year - the kind of conversation that used to be summarized in a notebook and then discarded. Those same interviews are now preserved as a domain-specific corpus, feeding a health vocabulary that didn’t exist in any model six months ago. The interviews stopped being disposable the moment they became searchable.
The hospital
Patient intake in Khmer, dictated rather than typed, has cut documentation time for clinicians without forcing them to work in a second language. The records are richer because nobody is compressing them on the fly.
The common thread across all three: capturing speech used to be the bottleneck. Storage is cheap now. Search is cheap. Transcription is cheap. The hard question moved one step up the chain - what should we be listening for in the first place?
That’s the question we spend most of our time on now. The technology caught up. The judgment about what’s worth preserving is the work that’s left.