
Introduction to Large Language Models

The introduction of large language models marks a significant step forward in the journey of natural language processing (NLP) and artificial intelligence. The journey begins with the early developments in language modeling, where traditional approaches relied on rule-based systems and statistical methods.

The first significant breakthroughs of deep learning in natural language processing (NLP) were the introduction of recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, which paved the way for more sophisticated language models.

The advent of the transformer architecture and its application to language modeling marked a paradigm shift in NLP. In 2017, the seminal paper "Attention Is All You Need" by Vaswani et al. introduced the transformer model and its attention mechanism, allowing the model to efficiently capture contextual relationships between words in a sequence and overcoming the limitations of sequential processing in RNNs.

The application of the transformer architecture, along with the attention mechanism, gained momentum. The real change came in 2020, when OpenAI released GPT-3 (Generative Pre-trained Transformer), particularly noteworthy for its massive scale of 175 billion parameters; OpenAI subsequently followed it with larger versions in the series (GPT-3.5, GPT-4), reportedly reaching as many as 1.7 trillion parameters. The pre-training strategy, where models are trained on diverse and massive datasets before being fine-tuned for specific tasks, proved to be a game-changer. GPT-3, GPT-3.5, and GPT-4 demonstrated unparalleled capabilities in natural language understanding and generation, along with a form of implicit reasoning that emerged at massive scale.

These large language models have found applications in various domains, such as language translation, question-answering, text completion, and even creative writing. Their ability to understand context, generate coherent text, and perform diverse language tasks has led to widespread adoption in industry and research.

While language models have serious potential across many spheres of informatics and its applications, the introduction of large language models has also raised ethical concerns, including biases present in training data, potential misuse, and the environmental impact of training such massive models. It should also be noted that it is not entirely clear how these models demonstrate common-sense reasoning, given that the underlying mechanism is not based on any explicit reasoning algorithm or breakthrough in deep neural reasoning. As a result, for all their strengths, these models are often mired in hallucinations, biases, and outright wrong answers.

Despite these limitations, the models in their current form are quite useful and represent a clear improvement over the previous generation of NLP methods.

A Brief on Encoders and Decoders

Encoders

The encoder is responsible for processing the input sequence and extracting relevant information from it. It consists of multiple layers, each containing a self-attention mechanism and a feedforward neural network.

The self-attention mechanism allows the encoder to consider the entire input sequence simultaneously, capturing contextual relationships between words.

The output of the encoder is a set of context-aware representations for each element in the input sequence. These representations, often called encoder hidden states, contain information about the input sequence's content and structure.
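As a concrete (and deliberately tiny) illustration, the sketch below builds a small encoder stack with PyTorch's built-in transformer layers. All dimensions and the random token ids are made-up values for demonstration, not parameters of any real model.

```python
# A minimal sketch of a transformer encoder stack using PyTorch
# (hypothetical toy sizes chosen only for illustration).
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 64, 4, 2          # assumed toy sizes
vocab_size, seq_len, batch = 1000, 10, 2

embedding = nn.Embedding(vocab_size, d_model)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=128, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

tokens = torch.randint(0, vocab_size, (batch, seq_len))   # dummy token ids
hidden_states = encoder(embedding(tokens))                # context-aware representations
print(hidden_states.shape)   # torch.Size([2, 10, 64]) -- one vector per input token
```

Each row of `hidden_states` is one of the encoder hidden states described above: a representation of a token that already reflects its surrounding context.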

The decoder, on the other hand, takes the context-aware representations generated by the encoder and produces the output sequence step by step.

Decoders

Similar to the encoder, the decoder comprises multiple layers with self-attention mechanisms and feedforward networks.

In addition to the self-attention mechanism, the decoder uses an attention mechanism to focus on different parts of the input sequence (encoder hidden states) while generating each element of the output sequence.

During training, the decoder is provided with the target sequence (or a shifted version of it) to learn the mapping from input to output. During inference, the generated output is fed back into the decoder for autoregressive sequence generation.
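The following sketch shows that autoregressive loop in PyTorch, assuming an untrained decoder stack with made-up sizes: at inference time each newly generated token is appended to the sequence and fed back in, while a causal mask keeps every position from attending to future positions.

```python
# A minimal sketch of autoregressive decoding with a transformer decoder;
# the model is untrained and all sizes are illustrative.
import torch
import torch.nn as nn

d_model, n_heads, vocab_size = 64, 4, 1000
embedding = nn.Embedding(vocab_size, d_model)
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
to_logits = nn.Linear(d_model, vocab_size)

memory = torch.randn(1, 8, d_model)      # stand-in for the encoder hidden states
generated = torch.tensor([[0]])          # assume token id 0 is a start-of-sequence token

for _ in range(5):                       # greedy, token-by-token generation
    tgt = embedding(generated)
    causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
    out = decoder(tgt, memory, tgt_mask=causal_mask)      # cross-attends to `memory`
    next_token = to_logits(out[:, -1]).argmax(dim=-1, keepdim=True)
    generated = torch.cat([generated, next_token], dim=1)  # feed the output back in

print(generated)   # the growing output sequence (random here, since nothing is trained)
```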

Attention Mechanism

Both the encoder and the decoder use attention mechanisms to selectively focus on different parts of the input sequence when processing each element. The attention mechanism helps the model capture long-range dependencies and understand the relationships between words in the input and output sequences.

Traditionally, when computers processed sentences or sequences of words, they did so one word at a time, in a linear fashion. This approach had limitations in capturing the relationships between different words in a sentence.

The "Attention" mechanism introduced in this seminal paper Attention is all you need by Vaswani et al changes that. Imagine reading a sentence and, instead of going through it word by word, you could pay more attention to certain words that are more relevant to understand the meaning of the sentence. This is similar to how humans focus on specific words or parts of a sentence to grasp the overall message. So, "Attention is All You Need" essentially suggests that by allowing the computer to focus on important parts of a sentence, rather than processing words sequentially, we can achieve much more effective and nuanced language understanding in artificial intelligence systems.

Transformer Architecture and Its Types

Transformer architectures can be categorized into three main types based on their functionality: encoder-only, decoder-only, and encoder-decoder. Each type is designed to address specific tasks and requirements. Here's an overview of each:

Encoder-Decoder Transformers

  • Transformer (Original): The original transformer introduced in "Attention is All You Need" features both an encoder and a decoder. This architecture is commonly used for sequence-to-sequence tasks, such as machine translation.

  • T5 (Text-to-Text Transfer Transformer): T5 represents a unified approach to NLP tasks by framing all tasks as text-to-text problems. It uses a transformer architecture where both input and output are treated as text, allowing it to handle various tasks through a consistent paradigm.

  • BART (Bidirectional and Auto-Regressive Transformers): BART is a sequence-to-sequence model with an encoder-decoder architecture. It is trained on both autoencoding and autoregressive objectives, making it suitable for tasks like text summarization and generation.
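As a usage illustration of the text-to-text paradigm, here is a small sketch that runs T5 through the Hugging Face transformers library. It assumes the library (plus sentencepiece) is installed and the t5-small checkpoint can be downloaded; it is meant only as an example, not the canonical interface for these models.

```python
# Sketch: an encoder-decoder model (T5) via the Hugging Face transformers library.
# Assumes `pip install transformers sentencepiece torch`.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5 frames every task as text-to-text, so the task is stated in the prompt itself.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```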

Encoder-Only Transformers

Encoder-only models are designed to generate contextualized representations of input sequences. Unlike traditional models that process sequences from left to right or right to left, encoder-only models consider the entire input sequence in both directions simultaneously.

They are well suited for tasks where understanding bidirectional context is essential, such as sentence embeddings, contextualized word representations, and downstream tasks like text classification or sentiment analysis. Using these models typically means feeding the encoded representations into a downstream task; they are not inherently designed for sequence generation.

The following are the broad steps involved in training encoder-only models.

  1. Input Embeddings: The input sequence is tokenized into individual units (words or subwords), and each token is assigned an embedding vector.

  2. Encoder layers: The core of the encoder-only model consists of multiple layers of transformer blocks. Each transformer block contains two main components: self-attention mechanism and feedforward neural networks.

  3. Layer stacking: Multiple transformer layers are stacked on top of each other. Each layer refines the contextualized representations obtained from the previous layer.

  4. Contextualized representations: After processing through all layers, the model generates contextualized representations for each token in the input sequence. These representations contain information about the surrounding context of each token, allowing the model to understand the meaning and relationships within the input sequence.

  5. Pre-training on unlabeled text: Encoder-only models are typically pre-trained on large amounts of unlabeled text using objectives like masked language modeling. During pre-training, some tokens are randomly masked, and the model is trained to predict these masked tokens based on the surrounding context. This objective helps the model learn rich contextual representations (see the sketch after this list).

  6. Fine-tuning for specific tasks:

Once pre-trained, the encoder-only model can be fine-tuned for specific downstream tasks such as sentiment analysis, named entity recognition, or question answering. During fine-tuning, task-specific layers may be added, and the model is trained on labeled data for the target task.
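To make the masked-language-modeling objective from step 5 concrete, the sketch below uses a fill-mask pipeline from the Hugging Face transformers library (an assumed dependency) to let a pre-trained BERT predict a masked token from its bidirectional context.

```python
# Sketch: the masked language modeling objective, via the Hugging Face pipeline.
# Assumes `pip install transformers torch`.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the [MASK] token from the words on both sides of it.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```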

Examples of Encoder-Only Models

  • BERT (Bidirectional Encoder Representations from Transformers): BERT is a transformer-based model designed for pre-training on large amounts of unlabeled text. It captures bidirectional contextual information by considering both left and right context during training. BERT has been highly successful in various NLP tasks, including question answering and sentiment analysis.

  • DistilBERT: DistilBERT is a smaller and more efficient version of BERT, designed for faster inference and reduced resource requirements while maintaining competitive performance.

  • RoBERTa (Robustly optimized BERT approach): RoBERTa is an optimized version of BERT that modifies key hyperparameters and removes the Next Sentence Prediction objective during pre-training. This results in improved performance on downstream tasks.
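For a sense of what these encoder-only models actually return, the sketch below (again assuming the Hugging Face transformers library) extracts the contextualized hidden states that downstream tasks such as classification are built on.

```python
# Sketch: extracting contextualized token representations (encoder hidden states)
# from a pre-trained BERT model. Assumes `pip install transformers torch`.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Encoders read the whole sentence at once.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One context-aware vector per input token: (batch, tokens, hidden_size).
print(outputs.last_hidden_state.shape)
```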

Decoder-Only Transformers

Decoder-only models take in context (typically the prompt and any previously generated tokens) and produce the output sequence token by token. They are autoregressive in nature, predicting one token at a time. Unlike encoder models, decoder-only models use a masked self-attention mechanism during training so that each position can only attend to previous positions, enforcing the autoregressive generation process.
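A minimal sketch of that causal masking in plain PyTorch, with dummy attention scores: positions above the diagonal are set to negative infinity so the softmax assigns them zero weight, leaving each position able to attend only to itself and earlier positions.

```python
# Sketch: the causal (masked) self-attention pattern used by decoder-only models.
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)                              # dummy attention scores
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
weights = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

print(weights)   # row i has non-zero weights only for positions 0..i
```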

These models excel in tasks that involve sequence generation, such as text completion, language modeling, machine translation, and text summarization.

Decoder-only models, such as the GPT (Generative Pre-trained Transformer) family, are designed to generate sequences of tokens in an autoregressive manner. They are often used for tasks where the output depends on the entire input sequence and the goal is to generate text token by token.

  • GPT (Generative Pre-trained Transformer):

    GPT-2: This model, an improvement upon the original GPT, is characterized by a massive number of parameters (1.5 billion) and demonstrated superior performance in tasks such as language modeling and text generation.

    GPT-3: With 175 billion parameters, GPT-3 achieved remarkable results in a wide array of NLP tasks and demonstrated the capabilities of extremely large language models.

    GPT-4: This is a multimodal large language model and the fourth in OpenAI's series of GPT foundation models. It was launched on March 14, 2023 and made publicly available via the paid chatbot product ChatGPT Plus and via OpenAI's API. As a transformer-based model, GPT-4 was pre-trained on both public data and "data licensed from third-party providers" to predict the next token. After this step, the model was fine-tuned with reinforcement learning from human and AI feedback for alignment and policy compliance.
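To close the section, here is a brief sketch of decoder-only generation in practice, using GPT-2 through the Hugging Face transformers pipeline (an assumed dependency). GPT-3 and GPT-4 are only available through OpenAI's hosted API, so the freely downloadable GPT-2 stands in here; sampled output will vary between runs.

```python
# Sketch: autoregressive text generation with GPT-2 via the Hugging Face pipeline.
# Assumes `pip install transformers torch`.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Large language models are", max_new_tokens=30, do_sample=True)
print(result[0]["generated_text"])
```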
