Unravelling the Fine-Tuning Odyssey in LLMs

Rajesh K · Published in Generative AI · Feb 13, 2024

Large language models (LLMs) have revolutionized natural language processing by offering sophisticated solutions and advanced capabilities. Trained on extensive text datasets, these models excel at various tasks such as generating text, translating, summarizing, and answering questions. Despite their prowess, LLMs may not always suit specific tasks or domains.

Fine-tuning emerges as a solution to adapt pre-trained LLMs to specialized tasks. Through fine-tuning, users can enhance a model’s performance on a particular task by training it on a smaller dataset tailored to that task, all while retaining its broad language understanding. For instance, a study conducted by Google revealed that fine-tuning a pre-trained LLM for sentiment analysis led to a 10 percent increase in accuracy.

In this article, we delve into the significant enhancements achievable through fine-tuning LLMs, which include improved model performance, reduced training expenses, and the facilitation of more precise and context-specific outcomes. Furthermore, we explore various fine-tuning techniques and applications to underscore the crucial role fine-tuning plays in LLM-powered solutions.

Key stages involved in LLM Application Lifecycle

Language models, similar to intricate systems, undergo a well-defined lifecycle encompassing four pivotal stages:

Initiation:

  • Defining the Problem: Clearly articulating the purpose and specific task that the language model (LLM) is intended to excel in.
  • Collecting Data: Accumulating extensive and pertinent textual data relevant to the specified task, ensuring diversity, high quality, and meticulous curation.
  • Cleaning & Preprocessing Data: Refining the data by eliminating errors, inconsistencies, and biases to guarantee that the LLM learns from a set of clean and representative information.
  • Selecting Model Architecture: Opting for a suitable LLM architecture (e.g., Transformer-based) that aligns with desired capabilities and the available computational resources.

Experimentation:

  • Pre-training: Training the LLM on the compiled data using algorithms like backpropagation to establish fundamental language understanding and relationships between words and concepts.
  • Fine-tuning: Further refining the LLM on task-specific data to customize its abilities for the designated purpose, involving adjustments to parameters and techniques such as transfer learning.
  • Evaluation & Analysis: Monitoring the LLM’s performance on benchmarks and specific tasks, identifying strengths, weaknesses, and areas for improvement.

Evaluation & Refinement:

  • Bias Detection & Mitigation: Scrutinizing the LLM for potential biases in the training data and implementing techniques to minimize their impact.
  • Safety & Fairness Assessment: Ensuring that the LLM’s outputs are safe, fair, and aligned with ethical considerations, potentially involving human oversight and the establishment of guardrails to prevent harmful or discriminatory outputs.
  • Explainability & Interpretability: Striving to comprehend the reasoning behind the LLM’s outputs, making its decision-making process more transparent to foster trust and accountability.

Production & Deployment:

  • Infrastructure Setup: Deploying the LLM into a production environment with the requisite hardware and software configurations for efficient usage.
  • Monitoring & Maintenance: Continuously overseeing the LLM’s performance, addressing issues such as drift and decay, and conducting regular retraining as necessary.
  • Updates & Improvements: Incorporating new data, advancements in LLM architecture, and feedback to enhance the model’s capabilities and address evolving needs.

LLM Training and Tuning

Modern language models are trained in a multi-step process:

  1. Self-supervised pre-training: Models are trained on large unlabeled corpora to develop general linguistic capabilities through objectives like auto-encoding or masked language modeling.
  2. Task-specific fine-tuning: The pre-trained models are then fine-tuned on downstream tasks by adding task-specific prediction heads and training on labeled task data as supervision.

This staged transfer learning approach allows models to build capabilities in steps — first learning general language representations, then adapting to specialized end tasks. The specifics of the training process vary across model types and architectures, but the broad phases of self-supervised pre-training followed by supervised task fine-tuning are consistent.

Pretraining

Here are some common pre-training objectives and their suitability for encoder and decoder models:

1. Masked Language Modeling (MLM):

  • Goal: Predict masked words in a sentence based on surrounding context.
  • Strengths: Encourages model to understand relationships between words.
  • Applicability: Both encoder and decoder models can benefit, especially encoders for language understanding.

2. Replaced Token Detection (RTD):

  • Goal: Detect which tokens in a sentence have been replaced with plausible but incorrect alternatives.
  • Strengths: Improves understanding of word meaning and context.
  • Applicability: Primarily beneficial for encoders in understanding semantic relationships.

3. Next Sentence Prediction (NSP):

  • Goal: Predict if two given sentences are consecutive in a text.
  • Strengths: Encodes sentence-level coherence and discourse structure.
  • Applicability: More useful for encoder models to learn long-range dependencies and structure.

4. Contrastive Learning with Negative Sampling (CLNS):

  • Goal: Learn representations that bring similar sentences closer together while pushing dissimilar (negative) samples apart.
  • Strengths: Captures nuanced semantic relationships and improves generalization.
  • Applicability: Can benefit both encoders and decoders, although more research is ongoing for decoder applications.

5. Autoregressive Language Modeling:

  • Goal: Predict the next word in a sequence given the previous words.
  • Strengths: Learns basic language patterns and syntax.
  • Applicability: Primarily used for pre-training decoder models for language generation tasks.

The alignment step fine-tunes generative models to meet expected human norms. It involves two sequential steps:

  1. Supervised Fine-Tuning (SFT): Models are trained on high-quality reference texts to mimic desired response structure and style.
  2. Reinforcement Learning from Human Feedback (RLHF): Models generate texts, humans provide feedback scoring appropriateness, and models are trained to maximize feedback scores, thus learning societal preferences.

SFT teaches models structural norms, and RLHF shapes model outputs to suit user expectations regarding helpfulness, harmlessness, instruction following, and so on. Together, they align generative language model behavior with human values.

Finetuning

Large language models are first pre-trained to gain broad linguistic capabilities. Then, transfer learning fine-tunes the models for specialized tasks. This two-step process is efficient:

  • Pre-trained models are available publicly as strong starting points for fine-tuning.
  • Specialization requires less data/compute than initial pre-training.
  • Fine-tuned models outperform training from scratch.

However, fine-tuning can still be expensive for massive models. More efficient tuning would allow wider access to capable large language models.

The goal is to make specialization:

  • More economical computationally
  • Feasible with fewer/smaller GPUs
  • Quicker regarding training time

These techniques enable a broader range of practitioners to fine-tune quality models despite limited resources. The focus is on efficiency: doing more with less, making large language model capabilities achievable without massive investment.

Before delving into fine-tuning methods, let's first understand quantization.

Quantization in LLMs (Large Language Models) is a technique used to reduce the memory footprint and computational cost of these models, making them smaller and faster to run. This is achieved by representing numerical values with lower precision data types, like 8-bit integers instead of 32-bit floats, without significantly sacrificing accuracy.

Quantization essentially approximates the information stored in high-precision numbers (typically 32-bit floats) with lower-precision data types. This can be integers with fewer bits (like 8-bit) or other compressed representations.
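To make this concrete, here is a small, self-contained sketch of a simple affine (scale and zero-point) quantization scheme that maps 32-bit floats to 8-bit integers. The values and the scheme are illustrative only; real libraries use more sophisticated variants of this idea.

```python
import numpy as np

# Toy illustration: quantize 32-bit float weights to unsigned 8-bit integers
# using an affine (scale + zero-point) mapping, then reconstruct them.
weights = np.array([-0.62, -0.10, 0.0, 0.35, 0.91], dtype=np.float32)

qmin, qmax = 0, 255                                    # uint8 range
scale = (weights.max() - weights.min()) / (qmax - qmin)
zero_point = round(qmin - weights.min() / scale)

quantized = np.clip(np.round(weights / scale) + zero_point, qmin, qmax).astype(np.uint8)
dequantized = (quantized.astype(np.float32) - zero_point) * scale

print(quantized)     # e.g. [  0  86 103 161 255], stored in 4x less memory than float32
print(dequantized)   # approximate reconstruction of the original weights
```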

Quantization Methods:

Below are details of two of the most popular methods:

1. Post-training Quantization (e.g., bitsandbytes, GPTQ):

  • Converts a pre-trained model to lower precision after training is complete; GPTQ additionally uses a small calibration set to minimize the quantization error (see the sketch after this list).
  • Simpler to implement, with no need to modify the training process.
  • Achieves decent accuracy retention, but usually lower compression ratios than quantization-aware approaches.

2. Quantization-aware Training:

  • Integrates quantization into the training process itself.
  • Requires modifying the training pipeline and model architecture.
  • Can achieve higher compression ratios and speed improvements, with potential slight accuracy drops.
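As a concrete illustration of post-training quantization, the sketch below loads a pretrained causal language model in 8-bit precision through the bitsandbytes integration in Hugging Face transformers. The model name is an arbitrary assumption, and running this requires a CUDA GPU with transformers, accelerate, and bitsandbytes installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"                         # illustrative model choice

quant_config = BitsAndBytesConfig(load_in_8bit=True)   # store weights as int8

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                                 # place layers on available devices
)

# The quantized model is used exactly like its full-precision counterpart.
inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```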

Conventional Fine tuning methods

Instruction Fine Tuning

This involves feeding the LLM additional training data specific to the target task alongside instructions explaining what kind of output is desired. The model’s internal parameters are then adjusted based on this data to improve its performance on the specific task.

Instruction fine-tuning aims to improve model performance on tasks by directly showing desired behaviors. It trains models on input-output examples that provide demonstrations responding to queries.

The training data structure maps instructions to meaningful responses. For summarization, instances would be like:

Input: “Summarize this passage:” <text> Output: <concise summary>

For translation:

Input: “Translate this to French:” <sentence in English> Output: <translation in French>

These structured pairs, with prompts instructing the intended model actions, allow tailoring specialty "thinking" for tasks. Models learn niche skills fitting user needs rather than just general capabilities. Explicit prompt-completion demonstrations enable models to aptly serve specific functions as instructed. The key is training data that guides by concretely showing how the model should respond when users ask for something.
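A minimal sketch of what such instruction-tuning records could look like as structured data. The field names, template, and examples are illustrative assumptions rather than a fixed standard; different datasets and frameworks use their own conventions.

```python
# Illustrative instruction-tuning records: each pairs an instruction (plus an
# optional input) with the desired completion.
instruction_dataset = [
    {
        "instruction": "Summarize this passage:",
        "input": "Large language models are trained on vast text corpora ...",
        "output": "LLMs learn general language skills from large text corpora.",
    },
    {
        "instruction": "Translate this to French:",
        "input": "The weather is nice today.",
        "output": "Il fait beau aujourd'hui.",
    },
]

# During fine-tuning, each record is typically flattened into a single
# prompt/completion string using a template, for example:
template = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
print(template.format(**instruction_dataset[0]))
```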

Full Fine tuning

The simplest approach is full fine-tuning, where we retrain the entire model end-to-end on the new data, updating all parameters. However, this has some drawbacks:

  • The fine-tuned model retains the full parameter set, becoming cumbersome for large models like LLMs.
  • We must save all parameters whenever we retrain or apply the model to new tasks.
  • It demands substantial compute and memory to train the whole model.
  • More hyperparameter tuning or data may be needed to avoid overfitting and get optimal performance.

Repeatedly retraining or fine-tuning the same large model across many tasks multiplies these issues. We end up with multiple copies of an already gigantic model to store and deploy. Managing many independent versions of a huge model poses challenges.

Benefits

  • Efficient Data Use — Fine-tuning taps into pretrained knowledge, allowing a small dataset to be effective.
  • Performance Gains — Tailoring the model to a domain enables it to capture niche nuances and specifics.
  • Increased Versatility — More exposure to edge cases makes the model handle a greater variety of inputs correctly.

Drawbacks

  • Massive Compute Needs — Updating all parameters of a huge model requires vast amounts of computing power.
  • Impractical Hardware Demands — Large models need specialized high-memory chips which can be costly and unavailable.
  • Time & Complexity Burdens — Distributing training across multiple GPUs takes expertise. Fine-tuning is a lengthy process for very large models.

A common alternative to full fine-tuning is only fine-tuning the output layers. With this approach, we leave the parameters of the pretrained language model frozen. We only update the weights of the newly appended output layers during training, similar to training a simple logistic regression or small neural network classifier on top of extracted embeddings. Rather than tuning the entire massive model end-to-end, we leverage it as a fixed feature extractor and solely adjust the small feed-forward output adaptation layers for the downstream task. This allows us to reuse the pretrained model’s representations while specializing the classifier layers for our specific dataset and objectives.
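A minimal sketch of this head-only approach in PyTorch, using a Hugging Face encoder as the frozen feature extractor and a hypothetical two-class sentiment task. Only the newly added classifier receives gradient updates; the model name, labels, and hyperparameters are assumptions.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

# Freeze the pretrained encoder and train only a new output head.
backbone = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for param in backbone.parameters():
    param.requires_grad = False                          # keep pretrained weights fixed

classifier = nn.Linear(backbone.config.hidden_size, 2)   # 2 classes (assumption)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

with torch.no_grad():                                    # encoder acts as a fixed feature extractor
    features = backbone(**batch).last_hidden_state[:, 0] # [CLS] token embeddings

loss = nn.functional.cross_entropy(classifier(features), labels)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```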

Prefix Tuning

Prefix tuning uses prompts in a novel way — by allowing end-to-end optimization of a continuous prompt embedding. This prompt sequence adapts the language model to downstream tasks lightly and efficiently, without the need for extensive hyperparameter searches. By injecting a learnable prefix, we can insert task-specific knowledge and guidance into the pretrained model. Unlike traditional prompts, the prefix gets updated based on the dataset rather than needing manual tuning. This provides a lightweight yet optimized method to steer these powerful models.

Source : https://arxiv.org/pdf/2101.00190.pdf
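As a sketch of how prefix tuning can be applied in practice, the Hugging Face peft library provides a PrefixTuningConfig that attaches a trainable continuous prefix to a frozen model. The base model and the number of virtual tokens below are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

# Wrap a frozen causal LM with a learnable continuous prefix.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,          # length of the learned prefix (assumption)
)

model = get_peft_model(base_model, prefix_config)
model.print_trainable_parameters()  # only the prefix parameters are trainable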

Adapters

Standard fine-tuning takes a pre-trained network and adjusts all the weights for a new task, requiring a separate tuned parameter set for each downstream use case. Even though some layers may be useful across tasks, each task still ends up with its own fully adapted copy of the model.

The adapter approach introduces small modular adapters inserted between the pretrained model’s layers. Rather than overwriting weights, the base model parameters stay fixed, copied from pre-training. Only the compact adapters contain task-specific trainable parameters. This enables efficiently reusing the frozen pretrained representations while learning customizable transformations for each individual task in the adapters. New tasks can be added by appending new dedicated adapter modules without interfering with previous ones. So knowledge gained during pretraining is preserved alongside specialized task adaptations.

https://arxiv.org/pdf/1902.00751.pdf
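A minimal PyTorch sketch of a bottleneck adapter block of the kind described above: a down-projection, a nonlinearity, an up-projection, and a residual connection. The hidden and bottleneck sizes are illustrative assumptions, and in practice the adapter would be inserted after each (frozen) transformer sub-layer.

```python
import torch
from torch import nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, apply a nonlinearity, up-project, add residual."""

    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.activation = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the frozen pretrained representation
        # while the small adapter learns a task-specific correction on top of it.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))

# Usage sketch: only the adapter parameters would be passed to the optimizer.
adapter = Adapter(hidden_size=768)
frozen_layer_output = torch.randn(2, 16, 768)            # (batch, sequence, hidden)
print(adapter(frozen_layer_output).shape)                 # torch.Size([2, 16, 768])
```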

Limitations in Adapters and Prefix tuning

  • Adapters add additional layers that create inference latency and slow training, despite having few parameters. The extra layers must be processed sequentially.
  • Prefix tuning tends to be more difficult to optimize and stabilize during fine-tuning. The process is less stable.
  • Prefixes reduce the available context window for the model which can undermine performance.
  • Neither approach guarantees performance gains that scale linearly with more parameters tuned. Monotonic improvement is not ensured.
  • Prefix tuning constrains the model architecture and context, while adapters can become computational bottlenecks that reduce runtime efficiency.

So in summary, while reducing computational requirements from full fine-tuning, these approaches come with drawbacks around trainability, model architecture, and inference efficiency.

Parameter-Efficient Finetuning Methods

The core concept behind prompt tuning and other parameter-efficient methods is to introduce a small set of new parameters into a pretrained language model, and only fine-tune those to adapt the model for better performance on a target dataset and task. This allows specialization to new domains and objectives without extensively updating the full model.

Source : https://magazine.sebastianraschka.com/

Given our analysis so far, an ideal fine-tuning technique should satisfy several key criteria:

  1. Computational Efficiency — The training procedure must be fast with low computational requirements.
  2. Memory Efficiency — Fine-tuning should not require massive GPUs.
  3. Ease of Deployment — We should not need to deploy a separate copy of the model for each specialized task.

The Low-Rank Adaptation (LoRA) fine-tuning method fulfills these criteria. LoRA enables efficient adaptation comparable to full fine-tuning, easy switching between model versions, and no added inference latency. Moreover, as a practically useful approach, LoRA has received much research community attention, with many variants and extensions.

https://arxiv.org/abs/2106.09685

Low-Rank Adaptation (LoRA) is a technique that enables parameter-efficient fine-tuning. It works by introducing low-rank matrices alongside the original full-rank weight matrices in the model. The core concept behind LoRA is to update the model's parameters using a low-rank decomposition, implemented by adding two linear projection matrices. LoRA keeps the pretrained layers of the large language model fixed and inserts a trainable low-rank matrix into each layer of the model.

The decomposition of ΔW means that we represent the large matrix ΔW with two smaller LoRA matrices, A and B. If A has the same number of rows as ΔW and B has the same number of columns as ΔW, we can write the decomposition as ΔW = AB, where AB is the matrix product of A and B. (Source: https://magazine.sebastianraschka.com)
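Below is a minimal sketch of a LoRA-style linear layer in PyTorch that follows the ΔW = AB decomposition described above. The rank, scaling, and initialization choices are simplified assumptions rather than a faithful reproduction of any particular implementation.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update, delta_W = A @ B."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Pretrained weight stays frozen (random here, standing in for a real checkpoint).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # A has the same number of rows as delta_W, B the same number of columns.
        self.A = nn.Parameter(torch.randn(out_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, in_features))   # zero init: delta_W starts at 0
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.A @ self.B                                # low-rank update
        return x @ (self.weight + self.scaling * delta_w).T

layer = LoRALinear(in_features=512, out_features=512, rank=8)
print(layer(torch.randn(4, 512)).shape)   # torch.Size([4, 512])
# Only A and B (512*8 + 8*512 = 8,192 values) are trainable, versus 512*512 = 262,144
# for full fine-tuning of this single layer.
```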

Open-source libraries such as Hugging Face's peft make it straightforward to fine-tune large language models with LoRA.
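Here is a minimal sketch using peft; the base model, rank, and target modules are illustrative assumptions and would need to be adapted to whichever model is being fine-tuned.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Wrap a pretrained causal LM with LoRA adapters via peft.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank matrices
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in GPT-2 (model-specific assumption)
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# The wrapped model can be trained with any standard loop or the transformers Trainer;
# only the LoRA matrices are updated, and they can be merged into the base weights afterwards.
```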

LoRA Variations

QLoRA

QLoRA is a technique that makes it possible to fine-tune very large language models on a single GPU by dramatically reducing memory requirements. It does this by using 4-bit quantization of the parameters in the pretrained model, which allows the model to be stored using far less memory while retaining performance. During fine-tuning, the 4-bit quantized pretrained model is held frozen, and gradients are propagated through it into a small set of adapter layers called Low Rank Adapters (LoRA). Clever innovations like representing values in a new 4-bit data type called NormalFloat, and double quantization of parameters, minimize memory usage even further. By enabling complex models to be fine-tuned on widely available hardware, QLoRA breaks down barriers to working with state-of-the-art large language models.

https://arxiv.org/abs/2305.14314
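A minimal sketch of a QLoRA-style setup using the transformers and peft libraries: the frozen base model is loaded in 4-bit NF4 precision with double quantization, and LoRA adapters are added on top. The model name and hyperparameters are illustrative assumptions, and this requires a CUDA GPU with bitsandbytes installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit NF4 precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # illustrative (gated) base model; an assumption
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training, then attach trainable LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32, lora_dropout=0.05)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```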

QA-LoRA

QA-LoRA builds upon LoRA/QLoRA to significantly reduce the computational weight of training and deploying large language models (LLMs). It achieves this by combining two key techniques:

  1. Parameter-Efficient Finetuning (PEFT): This approach involves training pre-trained LLMs with a limited number of tunable parameters. LoRA is one example of a PEFT method.
  2. Quantization: Here, the trained weights of an LLM are converted into more compact, lower-precision representations.

Because the quantization is applied during training, there is no need for a separate post-training quantization step.

https://arxiv.org/abs/2309.14717

LongLoRA

LongLoRA is an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) at limited computation cost. It is designed to speed up the context extension of LLMs by using sparse local attention during fine-tuning, which yields non-trivial computation savings with performance similar to fine-tuning with vanilla attention.

LongLoRA uses two key techniques:

  • Shifted short attention: This method focuses the model’s attention on the most relevant parts of the context, reducing the computational cost.
  • Supervised fine-tuning: The model is fine-tuned on a dataset of long-context question-answer pairs, helping it learn to effectively use the extended context.

https://arxiv.org/abs/2309.12307

The success of LoRA has sparked various innovative advancements:

  • LQ-LoRA: This work refines LoRA's quantization scheme within QLoRA, leading to superior performance and the ability to adapt to specific memory limitations.
  • MultiLoRA: Expanding on LoRA, this approach excels in handling complex scenarios requiring learning across multiple tasks.
  • LoRA-FA: To minimize memory usage further, LoRA-FA "freezes" half of the low-rank decomposition, specifically the A matrix within the product AB.
  • Tied-LoRA: By employing weight tying, this method enhances the parameter efficiency of LoRA even further.
  • GLoRA: This extension builds upon LoRA by adapting both pretrained model weights and activations to individual tasks, in addition to utilizing an adapter for each layer.

A Wrap-up

Both LoRA and QLoRA are techniques for efficiently fine-tuning Large Language Models (LLMs), meaning they allow you to adapt a pre-trained LLM to a specific task without requiring massive amounts of data or computational resources.

The best choice depends on your specific needs:

  • If memory is a major constraint: QLoRA is the clear winner due to its significant memory efficiency.
  • If fine-tuning speed is crucial: LoRA might be preferable due to its faster training times.
  • If both memory and speed are important: QLoRA offers a good balance between the two.
  • If cost is a factor: LoRA is slightly less expensive than QLoRA.

Several other parameter-efficient fine-tuning techniques exist, each with its own strengths and weaknesses. Choosing the best option depends on your specific needs and resources. It’s always good practice to experiment and compare different approaches to find the optimal fit for your project.

Going ahead, we’ll delve into some hands-on PEFT code examples. By working through real-world implementations, you’ll gain a deeper understanding of how these techniques function and how to apply them to your own projects.

Thank you for joining me on this exploration!
