Engineering
A dream of three vector spaces
Jun 5, 2025
A vector space is a multi-dimensional space in which each point is represented by a list of numbers. In math you'll have seen this with 2-dimensional spaces - [x, y] coordinates - or maybe 3-dimensional spaces - [x, y, z] coordinates.
These days, it's common to represent text in a vector space with 1,000+ dimensions - [x, y, z, …] with a thousand or more entries. These are called vector embeddings, semantic embeddings, or often just "embeddings". You can convert single words, whole sentences, or nowadays whole books into a list of numbers that captures a vast amount of information.
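The key property is that nearby vectors mean related things. Here's a minimal sketch with tiny made-up vectors (real embeddings have hundreds or thousands of dimensions, and these numbers are purely illustrative):

```python
from math import sqrt

# Toy 4-dimensional "embeddings" - the vectors are made up for illustration.
embeddings = {
    "cat":    [0.9, 0.1, 0.3, 0.0],
    "kitten": [0.8, 0.2, 0.4, 0.1],
    "bridge": [0.1, 0.9, 0.0, 0.7],
}

def cosine_similarity(a, b):
    """Similarity of two vectors: near 1.0 means same direction, near 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

# Related meanings land close together; unrelated ones don't.
print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))  # high
print(cosine_similarity(embeddings["cat"], embeddings["bridge"]))  # low
```

Everything interesting about embeddings - search, clustering, analogy - falls out of this one idea of measuring distance between points.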
We use vector embeddings a bunch at Phrasing, but as the industry evolves and as Phrasing matures, they're becoming less useful for our purposes.
One day, and hopefully one day soon, I would like to develop our own vector spaces to use alongside the current and future semantic embeddings, specifically for language learning: language embeddings, phonetic embeddings, and inflectional embeddings.
Language embeddings
Most applications just refer to different languages by their two-letter language code. Occasionally, you'll get an additional two letters tacked onto the end for some regional localization.
However, this poorly encapsulates the differences between languages. Cantonese is often abbreviated as zh-HK, and Mandarin as zh-CN. These are two mutually unintelligible languages, yet they're labeled as more closely related than, say, Croatian and Serbian (hr and sr respectively), which are entirely mutually intelligible.
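To make the failure concrete, here's a sketch of the naive heuristic most systems implicitly use - "same primary subtag means same language" - and the two cases above where it gives exactly the wrong answer:

```python
# Naive heuristic implicit in most language-code handling:
# same primary subtag => "same language".
def code_says_related(a: str, b: str) -> bool:
    return a.split("-")[0] == b.split("-")[0]

# zh-HK (Cantonese) vs zh-CN (Mandarin): codes agree, speakers don't.
print(code_says_related("zh-HK", "zh-CN"))  # True - but mutually unintelligible
# hr (Croatian) vs sr (Serbian): codes differ, speakers understand each other.
print(code_says_related("hr", "sr"))        # False - yet mutually intelligible
```

A language embedding would replace that boolean with a distance, so Croatian and Serbian could sit close together while Cantonese and Mandarin sit far apart.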
Language often exists on a continuum. We think of Spanish and French and Italian as entirely different languages, but as you start to add languages like Catalan and Occitan and Romansh, the borders start to blur. The Dutch spoken on the border of Germany is different from the Dutch spoken in Amsterdam, which is very different from the Dutch spoken in Rotterdam (by the time you've traveled from Amsterdam to Rotterdam, you've already changed dialects 10 times).
There's also a temporal aspect to language - people weren't talking about rizz 50 years ago, just like nobody calls anything groovy today.
Language classifications are just begging for vector spaces. I think the sooner we can encode languages into such a space, the better we'll be able to parse semantics, lemmas and morphology.
Phonetic embeddings
Perhaps the least related to semantic embeddings, phonetic embeddings would encode the general pronunciation of a word. Words like "eye" and "I" would end up in the same place.
"There" and "their" would be the same too, right next to "they're" (which in some English accents is pronounced differently).
The tricky bit here would be encoding regional differences and accents. For example, when I say "there" and "their" you would not be able to hear a difference. However, if I say "they're", it's quite distinct. This isn't the case for all English accents though. And don't get me started on the great "pecan" debate.
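As a toy sketch: if each word maps to a pronunciation key, homophones collapse onto the same point in the phonetic space. The IPA strings below are rough, for one accent only, and purely illustrative - a real system would need per-accent vectors, which is exactly the tricky bit:

```python
# Toy phonetic lookup: homophones share a key, so they'd share a point
# in a phonetic space. Rough IPA for one accent, for illustration only.
pronunciation = {
    "eye":     "aɪ",
    "I":       "aɪ",
    "there":   "ðɛər",
    "their":   "ðɛər",
    "they're": "ðɛər",  # identical in this accent; distinct in some others
}

print(pronunciation["eye"] == pronunciation["I"])        # True
print(pronunciation["there"] == pronunciation["their"])  # True
```

The hard part isn't this table - it's that the table is different for every accent, which is why a continuous space beats a lookup.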
The great part is that linguists have been categorizing sounds for thousands of years - phonetic analysis is evident as far back as the earliest Sanskrit grammarians.
Inflectional embeddings
This is perhaps the simplest, or the most complex of the embeddings.
On the simple end, this could just be a one-hot encoding of the different features of a word. Basically a long list of boolean values identifying all the grammatical features you are concerned with: [is-masculine?, is-feminine?, is-neuter?, …]
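On that simple end, a sketch might look like this (the feature list here is tiny and made up - a real one would cover case, number, tense, and much more):

```python
# Hypothetical, minimal feature inventory for illustration.
FEATURES = ["masculine", "feminine", "neuter", "singular", "plural"]

def one_hot(features: set) -> list:
    """Encode a word's grammatical features as a boolean (0/1) vector."""
    return [1 if f in features else 0 for f in FEATURES]

# e.g. German "Hund" (dog): masculine, singular.
print(one_hot({"masculine", "singular"}))  # [1, 0, 0, 1, 0]
```

Each position answers one "is-X?" question, which is exactly the long list of booleans described above.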
However, what I would like to be able to encode are the precise uses of different inflections. For example, some sentences in Russian use the Genitive case where German uses the Dative case.
Why these three embeddings?
If we combine these with semantic embeddings, we can stop being just a tool for learning languages and start actually teaching languages.
Ideally, we would like to be able to show sentences at precisely the n+1 level. Even in a completely new language, we could find cognates with familiar grammar constructions, and introduce just one new word or grammar structure at a time.
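The n+1 idea can be sketched in a few lines: given the set of words a learner already knows, keep only the sentences that introduce exactly one unfamiliar word. (Tokenization and the sample sentences here are heavily simplified for illustration.)

```python
# Words the learner already knows - a toy set for illustration.
known = {"the", "cat", "sat", "on", "mat", "a", "dog"}

def unknown_words(sentence: str) -> set:
    """Words in the sentence the learner hasn't seen yet (naive tokenizer)."""
    return {w for w in sentence.lower().split() if w not in known}

sentences = [
    "the cat sat on the mat",       # 0 new words - too easy
    "the dog slept on the mat",     # 1 new word ("slept") - just right
    "the ferret gnawed the cable",  # 3 new words - too hard
]

n_plus_one = [s for s in sentences if len(unknown_words(s)) == 1]
print(n_plus_one)  # ["the dog slept on the mat"]
```

A real version would also count grammar structures as "words" in this sense, so a sentence could be n+1 by introducing one new construction instead.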
We could set up drills that train on false friends, or on places where grammar structures differ. You could build sentence builders, taking familiar patterns and introducing new patterns with words the learner already knows.
You could start to teach people hyper-localized language. English varies greatly from California to Boston; London to Dublin; Melbourne to Perth. This happens in every language, no matter its geographical size, and yet most people only think of learning Dutch with a generic "good Dutch accent". My own theory is that if you focus in on these local varieties, you can dramatically improve your accent and your ability to communicate.
Any nerdy reasons?
Why yes, thank you for asking. I don't see how this would be super relevant to language learning, but it would be really cool to see if we could map "Linguistic Drift" across both Space and Time. It would be amazing to use this to try and computationally reconstruct past languages.
As time progresses, one could better analyze how languages evolve, and using what historical data we have from ancient languages like Latin and Greek and Chinese, perhaps produce a model that could someday help in reconstructing Proto-Indo-European or deciphering the Rongorongo script, recover dozens or hundreds of Native American languages, or maybe find one or two new clues as to how Ancient Egyptian may have sounded.
Linguists have to learn little bits of tons of languages and look for patterns. What if instead, we could find the patterns, then go study those parts of the languages?