Retrieval Augmented Generation (RAG)
What is RAG
- Motivation:
- Challenges when working with LLMs include domain knowledge gaps, factuality issues, and hallucination.
- Retrieval Augmented Generation (RAG) mitigates these issues by augmenting LLMs with external knowledge sources such as databases.
- A key advantage of RAG is that it does not require retraining the LLM for task-specific applications.
- RAG helps reduce hallucination and performance degradation when addressing problems in a rapidly evolving environment.
Definition
RAG takes an input, retrieves a set of relevant/supporting documents, and indicates their source (e.g., Wikipedia). These documents are combined with the original input prompt as context and passed to a text generator, which produces the final output. This makes RAG well suited to situations where facts change over time, which matters because an LLM's parametric knowledge is static. RAG lets language models access up-to-date information without retraining and produce reliable, retrieval-grounded outputs. Source: Retrieval Augmented Generation (RAG) for LLMs | Prompt Engineering Guide (promptingguide.ai); Retrieval Augmented Generation (RAG) | Prompt Engineering Guide (promptingguide.ai)
RAG research
While RAG research has also explored optimizing pre-training methods, current approaches have largely shifted to combining the strengths of RAG with powerful fine-tuned models such as ChatGPT and Mixtral. The chart below shows the evolution of RAG-related research:
A typical RAG application workflow
We can explain the different steps/components as follows:
- Input: The question to which the LLM system responds is referred to as the input. If no RAG is used, the LLM is directly used to respond to the question.
- Indexing: If RAG is used, then a series of related documents are indexed by chunking them first, generating embeddings of the chunks, and indexing them into a vector store. At inference, the query is also embedded in a similar way.
- Retrieval: The relevant documents are obtained by comparing the query against the indexed vectors, also denoted as “Relevant Documents”.
- Generation: The relevant documents are combined with the original prompt as additional context (e.g., "Please answer the question based on the following information: …"). The combined context and prompt are then passed to the model for response generation, which is prepared as the final output of the system to the user.
In the example provided, using the model directly fails to respond to the question due to a lack of knowledge of current events. On the other hand, when using RAG, the system can pull the relevant information needed for the model to answer the question appropriately.
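The flow above can be summarized in a minimal sketch. Everything here is an illustrative assumption: the toy bag-of-words embedding, the two-document corpus, and the prompt template stand in for a real embedding model, vector store, and LLM call.

```python
# Minimal sketch of the Input -> Indexing -> Retrieval -> Generation flow.
# The bag-of-words "embedding" and the printed prompt are placeholders for a
# real embedding model, vector store, and LLM.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Indexing: chunk the documents (one chunk per document here) and embed each chunk.
documents = [
    "The 2023 Nobel Prize in Physics was awarded for attosecond pulses of light.",
    "RAG combines a retriever with a text generator.",
]
index = [(doc, embed(doc)) for doc in documents]

# Retrieval: embed the query the same way and rank chunks by similarity.
query = "What was the 2023 Nobel Prize in Physics awarded for?"
query_vec = embed(query)
relevant = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)[:1]

# Generation: combine the retrieved context with the original prompt.
context = "\n".join(doc for doc, _ in relevant)
prompt = f"Please answer the question based on the following information:\n{context}\n\nQuestion: {query}"
print(prompt)  # in a real system this prompt is sent to the LLM
```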
RAG Paradigms
Naive RAG
Naive RAG follows the traditional process described above: indexing, retrieval, and generation.
Limitations:
- low precision (misaligned retrieved chunks)
- low recall (failure to retrieve all relevant chunks)
- may pass outdated information
These limitations lead to hallucination and poor, inaccurate responses.
Challenges:
- Redundancy and repetition when multiple retrieved chunks contain similar information
- Ensuring the generation task does not over-rely on the augmented information, which can cause the model to merely restate the retrieved content
Advanced RAG
Advanced RAG helps address the problems of Naive RAG, for example by improving retrieval quality; this may involve optimizing the pre-retrieval, retrieval, and post-retrieval processes.
Pre-retrieval process
Involves optimizing data indexing to enhance the quality of the indexed data through five stages: enhancing data granularity, optimizing index structures, adding metadata, alignment optimization, and mixed retrieval.
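As one illustration of the "adding metadata" stage, the sketch below attaches metadata to each chunk at indexing time so that chunks can be pre-filtered before similarity scoring; the field names and the filtering rule are assumptions, not a prescribed schema.

```python
# Sketch: attach metadata to chunks at indexing time and pre-filter at query time.
# The metadata fields (source, year) and the filter are illustrative assumptions.
chunks = [
    {"text": "Q3 revenue grew 12% year over year.", "source": "10-Q", "year": 2023},
    {"text": "The company was founded in 1998.",    "source": "wiki", "year": 2020},
]

def search(query: str, year_min: int):
    # Pre-filter on metadata, then score only the surviving chunks.
    candidates = [c for c in chunks if c["year"] >= year_min]
    return candidates  # a real system would now rank `candidates` by embedding similarity

print(search("latest revenue figures", year_min=2023))
```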
Retrieval stage
Involves optimizing the embedding model itself, which directly impacts the quality of the chunks that make up the context. This can be done by fine-tuning the embedding model to optimize retrieval relevance, or by employing dynamic embeddings that better capture contextual understanding (e.g., OpenAI's text-embedding-ada-002 model).
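Below is a hedged sketch of the fine-tuning option, using the sentence-transformers library and its MultipleNegativesRankingLoss on (query, relevant passage) pairs; the base model name and the two training pairs are placeholders for a real in-domain dataset.

```python
# Sketch: fine-tune an embedding model on (query, relevant passage) pairs so that
# in-domain queries land closer to their supporting chunks.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model

train_examples = [  # placeholder in-domain pairs
    InputExample(texts=["What is the dosage of drug X?", "Drug X is administered at 5 mg/kg daily."]),
    InputExample(texts=["Side effects of drug X", "Common side effects of drug X include nausea."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # uses in-batch negatives

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("finetuned-retrieval-embedder")
```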
Post-retrieval
Focuses on avoiding context window limits and dealing with noisy or potentially distracting information. A common approach is re-ranking, which can involve relocating the most relevant context to the edges of the prompt or recalculating the semantic similarity between the query and the retrieved text chunks. Prompt compression may also help with these issues.
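A small sketch of these two post-retrieval tricks: re-scoring retrieved chunks against the query and then relocating the strongest chunks to the edges of the prompt (a "lost in the middle" mitigation). The term-overlap scorer is a stand-in for a cross-encoder or embedding similarity.

```python
# Sketch: re-rank retrieved chunks, then reorder them so the most relevant ones
# sit at the beginning and end of the context, with the weakest in the middle.
def rescore(query: str, chunk: str) -> float:
    """Stand-in for a cross-encoder or embedding similarity score."""
    q_terms = set(query.lower().split())
    return len(q_terms & set(chunk.lower().split())) / max(len(q_terms), 1)

def rerank_and_reorder(query: str, chunks: list[str]) -> list[str]:
    ranked = sorted(chunks, key=lambda c: rescore(query, c), reverse=True)
    # Interleave: best chunks go to the edges, weakest end up in the middle.
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

chunks = ["about pricing", "about the 2023 roadmap and pricing changes", "unrelated note"]
print(rerank_and_reorder("What changed in 2023 pricing?", chunks))
```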
Modular RAG
Modular RAG enhances and extends the functional modules of RAG. Extended modules include search, memory, fusion, routing, predict, and task adapter, each addressing a different problem. Modular RAG benefits from greater diversity and flexibility.
Other important optimization techniques
- Hybrid search, combining keyword-based search and semantic search (see the fusion sketch after this list)
- Recursive Retrieval and Query Engine
- StepBack-prompt
- Sub-Queries
- Hypothetical Document Embeddings
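As a sketch of the hybrid search idea, the example below fuses a keyword-based ranking with a semantic ranking using reciprocal rank fusion (RRF). Both input rankings are made up and stand in for BM25 and a dense retriever; k=60 is the commonly used RRF constant.

```python
# Sketch: hybrid search via reciprocal rank fusion (RRF) of a keyword ranking
# and a semantic ranking. Both rankers are toy stand-ins for BM25 and a dense retriever.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_ranking = ["doc3", "doc1", "doc7"]   # e.g., from BM25
semantic_ranking = ["doc1", "doc5", "doc3"]  # e.g., from a dense retriever
print(rrf_fuse([keyword_ranking, semantic_ranking]))  # doc1 and doc3 rise to the top
```

RRF is attractive here because it only needs ranks, not comparable scores, so the keyword and semantic retrievers never have to be calibrated against each other.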
RAG Framework
This section summarizes the key developments in the three components of a RAG system: Retrieval, Generation, and Augmentation.
Retrieval
Retrieval is the component of RAG that deals with retrieving highly relevant context from a retriever.
Ways to enhance the retrieval component include:
- Enhancing Semantic Representations. Considerations include:
- Chunking: One important step is choosing the right chunking strategy, which depends on the content you are dealing with and the application you are generating responses for. Different embedding models also display different strengths at different chunk sizes (see the chunking sketch after this list).
- Fine-tuned Embedding Models: Once you have determined an effective chunking strategy, it may be required to fine-tune the embedding model if you are working with a specialized domain.
- Aligning Queries and Documents
- Aligning Queries and LLM
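To make the chunking consideration concrete, here is a simple fixed-size chunker with overlap; the chunk size and overlap values are arbitrary assumptions that would normally be tuned per content type and per embedding model.

```python
# Sketch: fixed-size chunking with overlap. Chunk size and overlap are assumptions
# to be tuned for the content type and the embedding model's strengths.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "word " * 500  # placeholder document
print(len(chunk_text(doc)))  # a handful of overlapping chunks
```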
Generation
The generator in a RAG system is responsible for converting retrieved information into a coherent text that will form the final output of the model.
This process involves diverse input data and sometimes requires effort to refine the language model so that it adapts well to the input coming from both the query and the retrieved documents.
This can be addressed using post-retrieval processing and fine-tuning:
Post-retrieval with Frozen LLM
Post-retrieval processing leaves the LLM untouched and instead focuses on improving the quality of retrieval results through operations such as information compression and result re-ranking.
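A sketch of the information-compression side: keep only the sentences in the retrieved context that overlap the query above a threshold, shrinking the prompt before it reaches the frozen LLM. The term-overlap score and the threshold are toy assumptions standing in for a learned compressor.

```python
# Sketch: extractive context compression for a frozen LLM. Only sentences that
# overlap the query above a threshold are kept, shrinking the prompt. The scoring
# and threshold are toy assumptions standing in for a learned compressor.
import re

def compress_context(query: str, context: str, threshold: float = 0.2) -> str:
    q_terms = set(query.lower().split())
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", context):
        terms = set(sentence.lower().split())
        overlap = len(q_terms & terms) / max(len(q_terms), 1)
        if overlap >= threshold:
            kept.append(sentence)
    return " ".join(kept)

context = "The plan was announced in May. Pricing rises 10% in 2024. The office has a gym."
print(compress_context("When does pricing rise?", context))  # keeps only the pricing sentence
```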
Fine-tuning LLM for RAG
Augmentation
Augmentation involves effectively integrating context from retrieved passages with the current generation task.
It can be applied at many different stages, such as pre-training, fine-tuning, and inference.
- Augmentation Stages
- Augmentation Source: A RAG model's effectiveness is heavily impacted by the choice of augmentation data source. Data can be categorized into unstructured, structured, and LLM-generated data.
- Augmentation Process: For many problems (e.g., multi-step reasoning), a single retrieval isn't enough, so several methods have been proposed (see the iterative retrieval sketch after this list):
- Iterative retrieval
- Recursive retrieval
- Adaptive retrieval
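A minimal sketch of iterative retrieval: the model's intermediate answer is fed back as the next query, so each round can pull in evidence the previous round revealed was missing. `retrieve` and `generate` are placeholders for a real retriever and LLM call.

```python
# Sketch of iterative retrieval: each round's intermediate answer becomes the next
# query, letting multi-step questions pull in evidence a single retrieval would miss.
# `retrieve` and `generate` are placeholders for a real retriever and LLM call.
def retrieve(query: str) -> list[str]:
    return [f"<chunks relevant to: {query}>"]  # placeholder retriever

def generate(prompt: str) -> str:
    return f"<partial answer for: {prompt[:40]}...>"  # placeholder LLM call

def iterative_rag(question: str, rounds: int = 3) -> str:
    query, evidence, answer = question, [], ""
    for _ in range(rounds):
        evidence.extend(retrieve(query))
        prompt = f"Context:\n{chr(10).join(evidence)}\n\nQuestion: {question}"
        answer = generate(prompt)
        query = answer  # follow-up retrieval conditioned on the intermediate answer
    return answer

print(iterative_rag("Which author influenced the founder of company X?"))
```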
RAG vs Finetuning
Research in these two areas suggests that RAG is useful for integrating new knowledge, while fine-tuning can be used to improve model performance and efficiency by improving internal knowledge, refining output format, and teaching complex instruction following.
RAG Evaluation
RAG evaluation targets are determined for both retrieval and generation where the goal is to evaluate both the quality of the context retrieved and the quality of the content generated.
- Retrieval quality: measured with metrics from other knowledge-intensive domains such as recommendation systems and information retrieval, e.g., NDCG and Hit Rate (see the metric sketch at the end of this section).
- Generation quality: evaluated on aspects such as relevance and harmfulness for unlabeled content, or accuracy for labeled content.
Evaluation focuses on three primary quality scores and four abilities:
- Quality scores include measuring context relevance (i.e., the precision and specificity of retrieved context), answer faithfulness (i.e., the faithfulness of answers to the retrieved context), and answer relevance (i.e., the relevance of answers to posed questions).
- four abilities that help measure the adaptability and efficiency of a RAG system: noise robustness, negative rejection, information integration, and counterfactual robustness.
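To ground the retrieval-quality metrics, the sketch below computes Hit Rate@k and NDCG@k for a single query with binary relevance labels; the ranked list and relevant set are made-up illustrative data.

```python
# Sketch: Hit Rate@k and NDCG@k for one query with binary relevance labels.
# The ranked ids and the relevant set are made-up illustrative data.
import math

def hit_rate_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

ranked = ["doc4", "doc1", "doc9"]
relevant = {"doc1", "doc9"}
print(hit_rate_at_k(ranked, relevant, k=3))  # 1.0: at least one relevant doc in the top 3
print(ndcg_at_k(ranked, relevant, k=3))      # relevant docs at ranks 2 and 3 vs. ideal ranks 1 and 2
```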
Conclusion
References
Retrieval Augmented Generation (RAG) for LLMs | Prompt Engineering Guide (promptingguide.ai)
Paper references:
Retrieval-Augmented Generation for Large Language Models: A Survey (Gao et al., 2023)
https://github.com/HKUST-AI-Lab/Awesome-LLM-with-RAG
https://github.com/horseee/Awesome-Efficient-LLM
https://github.com/XiaoxinHe/iclr2024_learning_on_graph
Coding: LangChain, RAG From Scratch: Part 1 (Overview) - YouTube