Humanity stands at a truly exciting juncture, propelled forward by the rapid advancements in Artificial Intelligence. As a Cloud Solution Architect at Microsoft, I’ve had a front-row seat to this evolution, especially within the realm of Language AI. What was once considered science fiction is now becoming an everyday reality, revolutionizing how we interact with information and technology.

The year 2022 was pivotal, with the public release of ChatGPT demonstrating the profound potential of Large Language Models (LLMs). This wasn’t just another chatbot; it was a paradigm shift that reached an estimated 100 million monthly active users in a mere two months, showcasing how effectively LLMs handle tasks like text generation, translation, and summarization. The spotlight on LLMs has been intense, and for good reason, but it’s crucial to understand that the field of Language AI is far richer and more diverse than just LLMs.

What is Language AI?

At its core, Language AI, often used interchangeably with Natural Language Processing (NLP), is a subfield of AI dedicated to enabling computer systems to understand, process, and generate human language. It’s about bridging the gap between the nuanced complexities of human communication and the logical structures of machines. From recognizing speech to translating languages, Language AI empowers systems to perform tasks that were once exclusively within the domain of human intelligence.

A Glimpse into History: Representing Language for Machines

For computers, language is inherently unstructured. A sentence like “My cat is cute” is just a string of characters. To make sense of it, we need ways to represent this unstructured text in a structured, numerical format.

One of the earliest foundational techniques is the Bag-of-Words model, whose roots go back decades but which saw widespread use around the 2000s. Imagine you have two sentences you want to analyze: “That is a cute dog” and “My cat is cute.”

  1. Tokenization: The first step of the Bag-of-Words model is tokenization, the process of splitting sentences into individual words or subwords (tokens). For example, “My cat is cute” would be split into the tokens [“my”, “cat”, “is”, “cute”]. Similarly, “That is a cute dog” would become [“that”, “is”, “a”, “cute”, “dog”]. The most common method for tokenization is splitting on whitespace to create individual words. However, this has its disadvantages, as some languages, like Mandarin, do not place whitespace between individual words.
  2. Vocabulary Creation: After tokenization, we combine all unique words from our sentences to create a vocabulary. For our two example sentences, the unique words would be [“that”, “is”, “a”, “cute”, “dog”, “my”, “cat”]. This forms a complete list of all words we know about.
  3. Vector Representation: Using this vocabulary, we then represent each sentence by counting how often each word from our vocabulary appears in it. This literally creates a “bag of words” for each sentence, represented as a numerical vector.
    • For the sentence “My cat is cute,” using the vocabulary order from step 2 ([“that”, “is”, “a”, “cute”, “dog”, “my”, “cat”]), the vector is [0, 1, 0, 1, 0, 1, 1]: a ‘0’ marks the absence of a word like ‘that’ or ‘dog’, and a ‘1’ marks one occurrence of ‘is’, ‘cute’, ‘my’, or ‘cat’.
    • For “That is a cute dog,” the vector is [1, 1, 1, 1, 1, 0, 0].
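The three steps above can be sketched in a few lines of Python. This is a minimal from-scratch illustration (the function names are my own, not a standard API); libraries like scikit-learn provide production-ready equivalents such as CountVectorizer.

```python
from collections import Counter

def tokenize(sentence):
    # Step 1: whitespace tokenization -- lowercase, then split on spaces.
    return sentence.lower().split()

def build_vocabulary(sentences):
    # Step 2: collect unique tokens, in order of first appearance.
    vocab = []
    for sentence in sentences:
        for token in tokenize(sentence):
            if token not in vocab:
                vocab.append(token)
    return vocab

def bag_of_words(sentence, vocab):
    # Step 3: count how often each vocabulary word occurs in the sentence.
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocab]

sentences = ["That is a cute dog", "My cat is cute"]
vocab = build_vocabulary(sentences)
print(vocab)
# ['that', 'is', 'a', 'cute', 'dog', 'my', 'cat']
print(bag_of_words("My cat is cute", vocab))
# [0, 1, 0, 1, 0, 1, 1]
print(bag_of_words("That is a cute dog", vocab))
# [1, 1, 1, 1, 1, 0, 0]
```

Running this reproduces exactly the vocabulary and vectors from the example above.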

These numerical representations are also called vectors or vector representations. Throughout this series, we refer to these kinds of models as representation models.

While elegant and still useful in certain contexts, the Bag-of-Words model has a significant limitation: it treats words in isolation, completely ignoring their semantic meaning or the context in which they appear. “Dog” and “puppy” might be related, but a simple count won’t capture that.
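A tiny sketch makes this limitation concrete: for a bag-of-words model, “dog” and “puppy” are just two unrelated vocabulary slots, so two nearly synonymous sentences get no credit for their shared meaning (the sentences and vocabulary here are illustrative).

```python
# A fixed vocabulary covering both sentences.
vocab = ["that", "is", "a", "cute", "dog", "puppy"]

def vectorize(sentence):
    # Count occurrences of each vocabulary word in the sentence.
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocab]

v1 = vectorize("That is a cute dog")
v2 = vectorize("That is a cute puppy")
print(v1)  # [1, 1, 1, 1, 1, 0]
print(v2)  # [1, 1, 1, 1, 0, 1]

# The 'dog' and 'puppy' components contribute zero overlap: to this
# model, the pair is exactly as unrelated as 'dog' and 'banana'.
print(v1[4] * v2[4] + v1[5] * v2[5])  # 0
```

Dense embeddings, covered next, fix this by placing related words near each other in vector space.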

This is where the journey of Language AI truly began to accelerate, leading to more sophisticated representations that understand the meaning of words. Stay tuned for the next post where we delve into these “dense vector embeddings” and how they capture the very essence of language!

By Saad Mahmood

Saad Mahmood is a Principal Cloud Solution Architect on the Global Cloud Architecture Engineering (CAE) team at Microsoft, with expertise in Azure and AI technologies. He is a former Microsoft MVP for Azure, a recognition given to exceptional technical community leaders, and the author of "Cloud Native Application in .NET Core 2.0." He is also a popular speaker and actively contributes to the Microsoft Azure community through blogs, articles, and mentoring initiatives.
