[AI LLM Application Study Notes] RAG, Embedding, and Vector Fundamentals

Published: 2025-04-15

Updated: 2025-05-10

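I. What Is Retrieval-Augmented Generation (RAG)?

RAG (Retrieval-Augmented Generation) combines information retrieval with a generative language model. It retrieves relevant information from an external knowledge base and passes it to a large language model (LLM) as part of the prompt, which improves the model's performance on knowledge-intensive tasks such as question answering, summarization, and content generation. RAG was first proposed in 2020 by the Facebook AI Research (FAIR) team and quickly became a popular approach in LLM applications.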

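1.1 Inherent limitations of current LLMs

1. An LLM's knowledge is not real-time
2. An LLM may not know your private domain or business knowledge

Analogy: think of RAG as an open-book exam. The LLM looks things up first, then answers the question.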

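1.2 Retrieval-Augmented Generation (RAG)

What is RAG?

RAG is an AI framework that combines the strengths of traditional information retrieval systems (such as databases) with the capabilities of generative large language models (LLMs).

By combining this additional knowledge with its own language skills, the LLM can write text that is more accurate, more up to date, and better tailored to specific needs.

[Figure: What is RAG?]

How should we understand RAG?

RAG combines information retrieval, augmentation, and text generation. Its goal is to help the LLM produce more accurate and richer text by retrieving relevant information from an external knowledge base. So what exactly do "retrieval", "augmentation", and "generation" mean here?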

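Retrieval: the first step of the RAG flow. Information relevant to the question is retrieved from a pre-built knowledge base, providing useful context and knowledge for the generation step that follows.
Augmentation: the retrieved information is supplied to the generative model (the LLM) as context, improving its ability to understand and answer the specific question. This step brings external knowledge into the generation process so that the output is richer, more accurate, and closer to what the user needs.
Generation: the last step of the RAG flow. The LLM takes the retrieved information as context and generates an answer that meets the user's need.

With "retrieval, augmentation, generation", it matters who augments whom and who generates the answer. The content retrieved from the knowledge base (for example, question-answer pairs) augments the LLM's prompt, and the LLM generates the answer from the augmented prompt.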

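How do we use RAG?

The following walks through the concrete steps of building a knowledge Q&A system with RAG.

Data preparation and knowledge base construction:

Collect data: first, gather the data the Q&A system needs. It can come from documents, web pages, databases, and other sources.
Clean data: remove noise, duplicates, and irrelevant information to ensure data quality and accuracy.
Build the knowledge base: split the cleaned text into smaller chunks, convert the chunks into vectors with a text embedding model (such as one from the GLM family), and store the vectors in a vector database (such as FAISS or Milvus).

Retrieval module design:

Vectorize the question: when the user submits a query, convert it into a vector with the same text embedding model.
Similarity search: retrieve the knowledge base chunks whose vectors are most similar to the question vector, typically by computing a similarity measure such as cosine similarity.
Rank results: sort the retrieved chunks by similarity score and pass the most relevant ones to the generation step.

Generation module design:

Context fusion: merge the retrieved chunks with the original question to form a richer context.
LLM generation: use an LLM (such as GLM) to generate an answer from that context. The model is not retrained; it conditions on the retrieved information to produce an accurate, useful answer.

You can apply this to your own domain, for example medical, legal, or product Q&A. Build a demo first, then keep improving the knowledge base as you use it at work.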

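1.3 How does RAG work?

LLMs face two problems: hallucination and knowledge cutoff.

Hallucination: when the training data does not contain the answer to a question, the model may confidently give a wrong response.
Knowledge cutoff: the information an LLM returns can be out of date. Every foundation model has a knowledge cutoff, meaning its knowledge is limited to the data available at training time.

RAG works around these limits by bringing in external data: it retrieves relevant information from an external knowledge base to strengthen the model's generation.

[Figure: How RAG works]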

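1.4 Basic RAG build flow

RAG merges the user's query with indexed knowledge and uses an LLM to generate an accurate answer.

Knowledge preparation: collect knowledge documents, convert them to text, and preprocess them.
Embedding and indexing: convert the text into vectors with an embedding model and store them in a vector database.
Query retrieval: convert the user's query into a vector and retrieve the related knowledge from the database.
Prompt augmentation: build an augmented prompt from a template and the retrieved results.
Answer generation: the LLM generates an answer from the augmented prompt.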
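The five steps above can be sketched end to end without any external service. This is only a toy under stated assumptions: `toy_embed` (a bag-of-words counter) stands in for a real embedding model, `fake_llm` stands in for a real LLM call, and a Python list stands in for the vector database. All three names are hypothetical.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def fake_llm(prompt):
    """Hypothetical stand-in for an LLM call: it only reports the prompt size."""
    return f"(an LLM would answer here, given a {len(prompt)}-character prompt)"

# Steps 1-2: knowledge preparation, embedding and indexing
docs = [
    "llama 2 is released with 7B 13B and 70B parameters",
    "chroma is an open source vector database",
    "rag retrieves documents before generating an answer",
]
index = [(doc, toy_embed(doc)) for doc in docs]

def rag_answer(query, top_n=1):
    # Step 3: query retrieval
    q_vec = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    context = [doc for doc, _ in ranked[:top_n]]
    # Step 4: prompt augmentation
    prompt = "Known information:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    # Step 5: answer generation
    return context, fake_llm(prompt)

context, answer = rag_answer("how many parameters does llama 2 have")
print(context)
print(answer)
```

Swapping `toy_embed` for a real embedding model and `fake_llm` for a real chat completion call gives the shape of the system built in the rest of these notes.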

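1.5 RAG architecture

The RAG architecture consists of two core modules: the retrieval module (Retriever) and the generation module (Generator).

Retrieval module (Retriever):

Text embedding: a pretrained text embedding model (such as one from the GLM family) converts queries and documents into vectors so that similarity can be computed in vector space.
Vector search: an efficient vector search engine (such as FAISS or Milvus) finds the documents or passages most similar to the query vector.
Dual-encoder model: the retrieval module often uses a dual-encoder (two-tower) model for efficient vector retrieval. It consists of two independent encoders, one for queries and one for documents, which map both into the same vector space so that their similarity can be computed.

Generation module (Generator):

Strong generative model: the generation module usually uses a generative model pretrained on large-scale data (such as GLM), which is good at producing natural language text.
Context fusion: the retrieved documents are merged with the original query to form a richer context, which becomes the model's input.
Generation: the model produces a coherent, accurate, and informative answer from that context.

Combining an efficient Retriever with a strong Generator gives natural language generation that is grounded in external knowledge.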

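II. How RAG Works and the Basic Build Process

[Figure: basic RAG build flow]

[Figure: RAG build process]

1. Load the documents and split them into chunks by some criterion
2. Load the chunks into the retrieval engine
3. Wrap the retrieval interface
4. Build the call flow: Query -> retrieval -> Prompt -> LLM -> reply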

2.1 Loading and splitting documents

Install the PDF parsing library

pip install pdfminer.six

Define a function that extracts text from a document

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

def extract_text_from_pdf(filename, page_numbers=None, min_line_length=1):
    '''Extract text from a PDF file (optionally only from the given 0-based page numbers)'''

    paragraphs = []
    buffer = ''
    full_text = ''

    # Extract all text
    for i, page_layout in enumerate(extract_pages(filename)):
        # If a page list is given, skip pages that are not in it
        if page_numbers is not None and i not in page_numbers:
            continue
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                full_text += element.get_text() + '\n'

    # Regroup lines into paragraphs; lines shorter than min_line_length act as separators
    lines = full_text.split('\n')
    for text in lines:
        if len(text) >= min_line_length:
            if buffer.endswith('-'):
                # The previous line ended in a hyphenated word break: join without a space
                buffer = buffer[:-1] + text
            elif buffer:
                buffer += ' ' + text
            else:
                buffer = text
        elif buffer:
            paragraphs.append(buffer)
            buffer = ''
    if buffer:
        paragraphs.append(buffer)
    return paragraphs

Load a local document

paragraphs = extract_text_from_pdf("llama2.pdf", min_line_length=10)

# Print the first 4 paragraphs
for para in paragraphs[:4]:
    print(para+"\n")
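To see what the paragraph regrouping step does without needing a PDF, the sketch below applies the same idea to a plain string. `group_lines` is a hypothetical helper written for this illustration, not part of pdfminer: lines shorter than `min_line_length` act as paragraph breaks, and a line ending in a hyphen is joined to the next line without a space.

```python
def group_lines(full_text, min_line_length=1):
    """Regroup raw lines into paragraphs, treating short lines as paragraph breaks."""
    paragraphs, buffer = [], ''
    for line in full_text.split('\n'):
        if len(line) >= min_line_length:
            if buffer.endswith('-'):
                # Word split across two lines: drop the hyphen and join directly
                buffer = buffer[:-1] + line
            elif buffer:
                buffer += ' ' + line
            else:
                buffer = line
        elif buffer:
            paragraphs.append(buffer)
            buffer = ''
    if buffer:
        paragraphs.append(buffer)
    return paragraphs

sample = "Llama 2 is a family of pre-\ntrained language models.\n\nWe release three sizes."
result = group_lines(sample, min_line_length=10)
print(result)
# → ['Llama 2 is a family of pretrained language models.', 'We release three sizes.']
```

One trade-off of this rule is that a genuinely hyphenated compound that happens to break at a line end also loses its hyphen.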

2.2 Wrapping the LLM API

Install the openai and python-dotenv libraries

pip install --upgrade openai
pip install -U python-dotenv

Create a .env file in the project root and put your OPENAI_API_KEY in it. Store the raw key only: the SDK adds the "Bearer " prefix to the request header itself. The file looks like this:

OPENAI_API_KEY=hk-xxxxxxxxxxxxxxx
OPENAI_BASE_URL=https://api.openai-hk.com/v1

Load the environment variables in code

import os
from openai import OpenAI

# Load environment variables
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(), verbose=True)  # reads the local .env file, which defines OPENAI_API_KEY

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url=os.getenv("OPENAI_BASE_URL"),  # the Hong Kong API endpoint configured in .env
)

Install the requests package

pip install requests

Wrap the OpenAI API

def get_completion(prompt, model="gpt-4o"):
    '''Wrap the OpenAI chat completions API'''

    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7,   # controls output randomness (the API accepts 0 to 2); lower is more deterministic
    )
    return response.choices[0].message.content

2.3 Prompt template

def build_prompt(prompt_template, **kwargs):
    '''Fill in the prompt template'''

    inputs = {}
    for k, v in kwargs.items():
        # A list of strings (e.g. retrieved chunks) is joined with blank lines
        if isinstance(v, list) and all(isinstance(elem, str) for elem in v):
            val = '\n\n'.join(v)
        else:
            val = v
        inputs[k] = val

    return prompt_template.format(**inputs)
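A quick way to check how the template filling behaves is to run it on a tiny template. The sketch below repeats the function so that it runs on its own; `demo_template` and the two chunks are made up for illustration.

```python
def build_prompt(prompt_template, **kwargs):
    """Fill a template; lists of strings are joined with blank lines."""
    inputs = {}
    for k, v in kwargs.items():
        if isinstance(v, list) and all(isinstance(e, str) for e in v):
            inputs[k] = '\n\n'.join(v)
        else:
            inputs[k] = v
    return prompt_template.format(**inputs)

# Made-up template and chunks, only to show the substitution
demo_template = "Known information:\n{context}\n\nUser question:\n{query}\n"
prompt = build_prompt(
    demo_template,
    context=["Chunk A.", "Chunk B."],
    query="What is in chunk A?",
)
print(prompt)
```

Because the function relies on `str.format`, any literal braces in the template itself would need to be doubled (`{{` and `}}`); braces inside the substituted values are left alone.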

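# The template is written in Chinese. It tells the model to answer only from the
# given information, to reply "我无法回答您的问题" ("I cannot answer your question")
# when that information is insufficient, and to answer in Chinese.
# {context}: the original text retrieved from the vector database
# {query}: the user's question
prompt_template = """
你是一个问答机器人。
你的任务是根据下述给定的已知信息，回答用户的问题。

已知信息：
{context}

用户问：
{query}

如果已知信息中不包含用户问题的答案，或者已知信息不足以回答用户问题，请回答“我无法回答您的问题”。
请不要输出已知信息中不包含的信息或答案。
请用中文回答用户问题。
"""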

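III. Vector Retrieval

3.1 What is a vector

A vector is a mathematical object with magnitude and direction. It can be represented as a directed line segment from one point to another. For example, a vector in two-dimensional space can be written as $(x,y)$, the directed segment from the origin $(0,0)$ to the point $(x,y)$.

[Figure: vector coordinates]

By extension, a set of coordinates $(x_0, x_1, \ldots, x_{N-1})$ represents a vector in $N$-dimensional space, and $N$ is called the dimension of the vector.

3.1.1 Text vectors (Text Embeddings)

Text is converted into a set of $N$ floating-point numbers; this text vector is also called an embedding
Distances can be computed between vectors, and the distance reflects semantic similarity: the closer, the more similar

[Figure: embeddings]

3.1.2 How text vectors are obtained

Build sentence-pair samples that are related (positive examples) and unrelated (negative examples)
Train a dual-encoder (two-tower) model so that positive pairs end up close together and negative pairs far apart

For example:
[Figure: SBERT]

Further reading: https://www.sbert.net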
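The training objective described above (positives close, negatives far) can be illustrated with a triplet-style loss on hand-made vectors. This is a sketch of one common contrastive objective, not a claim about which loss any particular model uses; the vectors and the `margin` value are made up.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, 1 for orthogonal ones."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

anchor = np.array([1.0, 0.0])     # made-up embedding of a sentence
positive = np.array([0.9, 0.1])   # made-up embedding of a related sentence
negative = np.array([0.0, 1.0])   # made-up embedding of an unrelated sentence

good = triplet_loss(anchor, positive, negative)   # embeddings already well separated
bad = triplet_loss(anchor, negative, positive)    # roles swapped: large penalty
print(good, bad)
```

During training, the encoder's weights would be updated to push this loss toward zero over many such triplets.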

3.2 Computing similarity between vectors

Two common measures of vector similarity are Euclidean distance and cosine similarity.
[Figure: vector similarity computation]

Install numpy

pip install numpy

Define the similarity functions:

import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(a, b):
    '''Cosine similarity -- larger means more similar'''
    return dot(a, b)/(norm(a)*norm(b))

def l2(a, b):
    '''Euclidean distance -- smaller means more similar'''
    x = np.asarray(a)-np.asarray(b)
    return norm(x)

Wrap OpenAI's embedding model API

def get_embeddings(texts, model="text-embedding-ada-002", dimensions=None):
    '''Wrap the OpenAI embedding model API'''
    if model == "text-embedding-ada-002":
        # ada-002 does not accept the dimensions parameter
        dimensions = None

    if dimensions:
        data = client.embeddings.create(
            input=texts, model=model, dimensions=dimensions).data
    else:
        data = client.embeddings.create(input=texts, model=model).data

    return [x.embedding for x in data]

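Choosing an embedding model: embed a corpus that is relevant to your use case with each candidate model and evaluate retrieval quality on it.
In most scenarios, off-the-shelf open-source embedding models perform only moderately; to improve retrieval recall, consider fine-tuning the model.

Run a test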

test_query = ["测试文本"]
vec = get_embeddings(test_query)[0]
print(f"Total dimension: {len(vec)}")
print(f"First 10 elements: {vec[:10]}")

The output is:

Total dimension: 1536
First 10 elements: [-0.007280503865331411, -0.006169671658426523, -0.010576579719781876, 0.001448634546250105, -0.010707695037126541, 0.02919485792517662, -0.019725468009710312, 0.0053902678191661835, -0.016957491636276245, -0.01203340943902731]

Next, we compare the vector similarity of a few example texts

query = "国际争端"  # "international disputes"

documents = [
    "联合国就苏丹达尔富尔地区大规模暴力事件发出警告",
    "土耳其、芬兰、瑞典与北约代表将继续就瑞典“入约”问题进行谈判",
    "日本岐阜市陆上自卫队射击场内发生枪击事件 3人受伤",
    "国家游泳中心（水立方）：恢复游泳、嬉水乐园等水上项目运营",
    "我国首次在空间站开展舱外辐射生物学暴露实验",
]

query_vec = get_embeddings([query])[0]
doc_vecs = get_embeddings(documents)

# Cosine similarity of the query with itself
print("Cosine similarity of the query with itself: {:.2f}".format(cos_sim(query_vec, query_vec)))

# Cosine similarity between the query and each document
print("Cosine similarity between the query and the documents:")
for vec in doc_vecs:
    print(cos_sim(query_vec, vec))

print()

# Euclidean distance of the query to itself
print("Euclidean distance of the query to itself: {:.2f}".format(l2(query_vec, query_vec)))

# Euclidean distance between the query and each document
print("Euclidean distance between the query and the documents:")
for vec in doc_vecs:
    print(l2(query_vec, vec))

The output is:

Cosine similarity of the query with itself: 1.00
Cosine similarity between the query and the documents:
0.8218706620454886
0.8293571683832858
0.7977047336154321
0.766980176753734
0.7930490196304245

Euclidean distance of the query to itself: 0.00
Euclidean distance between the query and the documents:
0.5968741071718394
0.5841966059330832
0.6360743166598111
0.6826709412992098
0.6433521233350966
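In the numbers above, the two measures rank the documents in the same order. That is expected when the vectors have unit length, because then $\|a-b\|^2 = 2 - 2\cos(a,b)$. The sketch below checks the identity on random unit vectors and on the first pair of printed values; it assumes the embeddings were (approximately) normalized, which the printed values appear to confirm.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)
a /= np.linalg.norm(a)   # scale to unit length
b /= np.linalg.norm(b)

cos = float(np.dot(a, b))
dist = float(np.linalg.norm(a - b))
print(dist ** 2, 2 - 2 * cos)   # the two values agree for unit vectors

# Same check on the first document's scores printed above
printed_cos = 0.8218706620454886
printed_l2 = 0.5968741071718394
print(printed_l2 ** 2, 2 - 2 * printed_cos)
```

For unit-length embeddings, then, choosing between cosine similarity and Euclidean distance does not change the ranking.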

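3.3 Vector databases

A vector database is middleware designed specifically for vector retrieval

Install the chromadb package

pip install chromadb

Parse the document

# For the demo, take only two pages (Chapter 1); page_numbers is 0-based, so these are the 3rd and 4th pages
paragraphs = extract_text_from_pdf(
    "llama2.pdf",
    page_numbers=[2, 3],
    min_line_length=10
    )

Create the MyVectorDBConnector class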

import chromadb
from chromadb.config import Settings

# Define the MyVectorDBConnector class
class MyVectorDBConnector:
    def __init__(self, collection_name, embedding_fn):
        chroma_client = chromadb.Client(Settings(allow_reset=True))
        # For the demo only; in practice you would not reset() every time
        chroma_client.reset()

        # Create a collection
        self.collection = chroma_client.get_or_create_collection(name=collection_name)
        self.embedding_fn = embedding_fn

    def add_documents(self, documents):
        '''Add documents and their vectors to the collection'''
        self.collection.add(
            embeddings=self.embedding_fn(documents),  # vector of each document
            documents=documents,  # original text of each document
            ids=[f"id{i}" for i in range(len(documents))]  # id of each document
        )

    def search(self, query, top_n):
        '''Query the vector database'''
        results = self.collection.query(
            query_embeddings=self.embedding_fn([query]),
            n_results=top_n
        )

        return results

# Create a vector database object
vector_db = MyVectorDBConnector("demo", get_embeddings)
# Add the documents to the vector database
vector_db.add_documents(paragraphs)

user_query = "Llama 2有多少参数？"  # "How many parameters does Llama 2 have?"

# user_query = "Does Llama 2 have a conversational variant"
results = vector_db.search(user_query, 2)

for para in results['documents'][0]:
    print(para+"\n")

The output is:

1. Llama 2, an updated version of Llama 1, trained on a new mix of publicly available data. We also increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention (Ainslie et al., 2023). We are releasing variants of Llama 2 with 7B, 13B, and 70B parameters. We have also trained 34B variants, which we report on in this paper but are not releasing.§

 In this work, we develop and release Llama 2, a family of pretrained and ﬁne-tuned LLMs, Llama 2 and Llama 2-Chat, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested, Llama 2-Chat models generally perform better than existing open-source models. They also appear to be on par with some of the closed-source models, at least on the human evaluations we performed (see Figures 1 and 3). We have taken measures to increase the safety of these models, using safety-speciﬁc data annotation and tuning, as well as conducting red-teaming and employing iterative evaluations. Additionally, this paper contributes a thorough description of our ﬁne-tuning methodology and approach to improving LLM safety. We hope that this openness will enable the community to reproduce ﬁne-tuned LLMs and continue to improve the safety of those models, paving the way for more responsible development of LLMs. We also share novel observations we made during the development of Llama 2 and Llama 2-Chat, such as the emergence of tool usage and temporal organization of knowledge.

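A few key concepts to clarify:

The point of a vector database is fast retrieval;
A vector database does not generate vectors itself; the vectors are produced by an embedding model;
Vector databases complement traditional relational databases rather than replacing them, and in practice the two are often used together.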
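To make the second point concrete, here is a minimal in-memory store that only keeps and searches vectors handed to it. `TinyVectorStore` is a hypothetical class for illustration: it scans every vector (brute force), whereas real vector databases typically use approximate nearest-neighbour indexes to stay fast at scale.

```python
import numpy as np

class TinyVectorStore:
    """Hypothetical in-memory store: holds the vectors it is given and searches them by brute force."""

    def __init__(self):
        self.ids, self.docs, self.vecs = [], [], []

    def add(self, ids, embeddings, documents):
        # The store never computes embeddings; the caller supplies them
        self.ids.extend(ids)
        self.docs.extend(documents)
        self.vecs.extend(np.asarray(e, dtype=float) for e in embeddings)

    def query(self, query_embedding, n_results=2):
        matrix = np.vstack(self.vecs)
        q = np.asarray(query_embedding, dtype=float)
        # Cosine similarity of the query against every stored vector
        sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
        top = np.argsort(-sims)[:n_results]
        return [(self.ids[i], self.docs[i], float(sims[i])) for i in top]

store = TinyVectorStore()
store.add(
    ids=["id0", "id1", "id2"],
    embeddings=[[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],   # made-up 2-D "embeddings"
    documents=["about cats", "about cars", "about cats and cars"],
)
hits = store.query([0.9, 0.1], n_results=2)
print(hits)
```

The interface mirrors the `add` / `query` split used with Chroma above: embeddings go in alongside the documents, and a query embedding comes back as ranked documents.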

3.3.1 Running the vector database as a service

Server side

chroma run --path /db_path

Client side

import chromadb

chroma_client = chromadb.HttpClient(host='localhost', port=8000)

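3.3.2 Feature comparison of mainstream vector databases

[Figure: feature comparison of mainstream vector databases]

FAISS: vector search library open-sourced by Meta https://github.com/facebookresearch/faiss
Pinecone: commercial vector database, cloud service only https://www.pinecone.io/
Milvus: open-source vector database, also offered as a cloud service https://milvus.io/
Weaviate: open-source vector database, also offered as a cloud service https://weaviate.io/
Qdrant: open-source vector database, also offered as a cloud service https://qdrant.tech/
PGVector: open-source vector search extension for Postgres https://github.com/pgvector/pgvector
RediSearch: open-source vector search module for Redis https://github.com/RediSearch/RediSearch
ElasticSearch also supports vector search https://www.elastic.co/enterprise-search/vector-search

Further reading: https://guangzhengli.com/blog/zh/vector-database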

3.4 RAG based on vector retrieval

Build the RAG_Bot class

class RAG_Bot:
    def __init__(self, vector_db, llm_api, n_results=2):
        self.vector_db = vector_db
        self.llm_api = llm_api
        self.n_results = n_results

    def chat(self, user_query):

        # 1. Retrieve
        search_results = self.vector_db.search(user_query, self.n_results)

        # 2. Build the prompt
        prompt = build_prompt(
            prompt_template, context=search_results['documents'][0], query=user_query)

        # 3. Call the LLM
        response = self.llm_api(prompt)
        return response

Create a RAG bot object

# Create a RAG bot
bot = RAG_Bot(
    vector_db,
    llm_api=get_completion
)

user_query = "llama 2有多少参数?"  # "How many parameters does llama 2 have?"

response = bot.chat(user_query)

print(response)

The output is:

Llama 2有7B、13B和70B参数的变体。

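3.5 Two new embedding models from OpenAI

On January 25, 2024, OpenAI released two new embedding models

text-embedding-3-large
text-embedding-3-small

Their key feature is support for shortening the embedding to a custom dimension, which lowers the cost of vector retrieval and similarity computation with little loss in quality.

In plain terms: larger is more accurate, smaller is faster. Official evaluation results:

[Figure: official evaluation results]

Note: MTEB is a public, large-scale, multi-task benchmark for embedding models

A RAG system basically needs two models
Embedding model: for example OpenAI's hosted model; the accuracy of the embedding model directly affects document recall in query similarity search
Text generation (chat) model: for example a privately deployed local model

Test the text-embedding-3-large embedding model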

model = "text-embedding-3-large"
dimensions = 128

# query = "国际争端"

# Cross-language queries also work
query = "global conflicts"

documents = [
    "联合国就苏丹达尔富尔地区大规模暴力事件发出警告",
    "土耳其、芬兰、瑞典与北约代表将继续就瑞典“入约”问题进行谈判",
    "日本岐阜市陆上自卫队射击场内发生枪击事件 3人受伤",
    "国家游泳中心（水立方）：恢复游泳、嬉水乐园等水上项目运营",
    "我国首次在空间站开展舱外辐射生物学暴露实验",
]

query_vec = get_embeddings([query], model=model, dimensions=dimensions)[0]

doc_vecs = get_embeddings(documents, model=model, dimensions=dimensions)

print("Vector dimension: {}".format(len(query_vec)))

print()

print("Cosine similarity between the query and the documents:")
for vec in doc_vecs:
    print(cos_sim(query_vec, vec))

print()

print("Euclidean distance between the query and the documents:")

for vec in doc_vecs:
    print(l2(query_vec, vec))

The output is:

Vector dimension: 128

Cosine similarity between the query and the documents:
0.33418796172379334
0.35462977252280126
0.31364599128817017
0.22422448391215455
0.12849126788491727

Euclidean distance between the query and the documents:
1.1539601565325632
1.1361075718324967
1.1716262617987068
1.2456127243145834
1.320233859479998

Further reading: the technique behind these variable-length embeddings is called Matryoshka Representation Learning
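With models trained in this Matryoshka style, a shorter embedding can also be obtained client-side by keeping the leading dimensions and rescaling to unit length. The sketch below shows only that arithmetic on a made-up vector; `shorten_embedding` is a hypothetical helper, and whether truncation preserves quality depends on the model having been trained for it.

```python
import numpy as np

def shorten_embedding(vec, dimensions):
    """Keep the first `dimensions` values, then rescale the result to unit length."""
    head = np.asarray(vec, dtype=float)[:dimensions]
    return head / np.linalg.norm(head)

full = np.array([0.5, 0.5, 0.5, 0.5])   # made-up unit-length "embedding"
short = shorten_embedding(full, 2)
print(short, np.linalg.norm(short))
```

The rescaling matters because cosine similarity and Euclidean distance only rank identically on unit-length vectors.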

IV. Advanced Topics for Practical RAG Systems

4.1 Granularity of text splitting

Problems

Chunks that are too large can make retrieval imprecise; chunks that are too small can leave information incomplete
The answer to a question may span two chunks

# Create a vector database object
vector_db = MyVectorDBConnector("demo_text_split", get_embeddings)

# Add the documents to the vector database
vector_db.add_documents(paragraphs)

# Create a RAG bot
bot = RAG_Bot(
    vector_db,
    llm_api=get_completion
)

# user_query = "llama 2有商用许可协议吗?"
user_query="llama 2 chat有多少参数?"  # "How many parameters does llama 2 chat have?"
search_results = vector_db.search(user_query, 2)

for doc in search_results['documents'][0]:
    print(doc+"\n")

response = bot.chat(user_query)
print("====回复====")  # "reply"
print(response)

The output is:

In this work, we develop and release Llama 2, a family of pretrained and ﬁne-tuned LLMs, Llama 2 and Llama 2-Chat, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested, Llama 2-Chat models generally perform better than existing open-source models. They also appear to be on par with some of the closed-source models, at least on the human evaluations we performed (see Figures 1 and 3). We have taken measures to increase the safety of these models, using safety-speciﬁc data annotation and tuning, as well as conducting red-teaming and employing iterative evaluations. Additionally, this paper contributes a thorough description of our ﬁne-tuning methodology and approach to improving LLM safety. We hope that this openness will enable the community to reproduce ﬁne-tuned LLMs and continue to improve the safety of those models, paving the way for more responsible development of LLMs. We also share novel observations we made during the development of Llama 2 and Llama 2-Chat, such as the emergence of tool usage and temporal organization of knowledge.

 2. Llama 2-Chat, a ﬁne-tuned version of Llama 2 that is optimized for dialogue use cases. We release

====回复====
Llama 2-Chat的参数规模可以达到70B。

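The second retrieved chunk above stops at "We release", so the sentence listing the parameter sizes never reaches the LLM.

Improvement: split the text at a chosen granularity with partial overlap between chunks, so that each chunk carries more complete context

Install the nltk package

pip install nltk

Re-split the document

from nltk.tokenize import sent_tokenize
import json

# sent_tokenize needs NLTK's Punkt data; if it raises LookupError, download it once with
# nltk.download('punkt') (newer NLTK releases ask for 'punkt_tab' instead)

# chunk_size is usually chosen based on the document content or size
# overlap_size is commonly 10%-20% of chunk_size (this demo uses a larger overlap)
def split_text(paragraphs, chunk_size=300, overlap_size=100):
    '''Split text into overlapping chunks of about chunk_size characters, with up to overlap_size characters of overlap'''

    sentences = [s.strip() for p in paragraphs for s in sent_tokenize(p)]
    chunks = []
    i = 0

    while i < len(sentences):
        chunk = sentences[i]
        overlap = ''
        prev = i - 1

        # Walk backwards to collect the overlap
        while prev >= 0 and len(sentences[prev])+len(overlap) <= overlap_size:
            overlap = sentences[prev] + ' ' + overlap
            prev -= 1
        chunk = overlap+chunk
        nxt = i + 1
        # Walk forwards to fill the current chunk
        while nxt < len(sentences) and len(sentences[nxt])+len(chunk) <= chunk_size:
            chunk = chunk + ' ' + sentences[nxt]
            nxt += 1
        chunks.append(chunk)
        i = nxt

    return chunks

chunks = split_text(paragraphs, 300, 100)

The sent_tokenize used here is the English implementation; for a Chinese implementation, see chinese_utils.py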

# Create a vector database object
vector_db = MyVectorDBConnector("demo_text_split", get_embeddings)

# Add the re-split chunks to the vector database
vector_db.add_documents(chunks)

# Create a RAG bot
bot = RAG_Bot(
     vector_db,
     llm_api=get_completion
)

# user_query = "llama 2有商用许可协议吗"
user_query="llama 2 chat有多少参数"  # "How many parameters does llama 2 chat have"

search_results = vector_db.search(user_query, 2)

for doc in search_results['documents'][0]:
    print(doc+"\n")

response = bot.chat(user_query)
print("====回复====")  # "reply"
print(response)

The output is:

2. Llama 2-Chat, a ﬁne-tuned version of Llama 2 that is optimized for dialogue use cases. We release variants of this model with 7B, 13B, and 70B parameters as well. We believe that the open release of LLMs, when done safely, will be a net beneﬁt to society.

In this work, we develop and release Llama 2, a family of pretrained and ﬁne-tuned LLMs, Llama 2 and Llama 2-Chat, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested, Llama 2-Chat models generally perform better than existing open-source models.

====回复====
Llama 2-Chat 具有 7B、13B 和 70B 三种不同的参数规模。

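Notes on text splitting

Overlapping splits using chunk_size and overlap
- Rule-based splitting, for example on \n or \n\n
Splitting more complex text
- Fine-tune a model on the NSP (next sentence prediction) task using your own business data
- Predict whether two sentences (or paragraphs) A and B are related
If they are related, merge them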
4.2 Post-retrieval reranking

Problem: sometimes the most suitable answer is not ranked first in the retrieval results. For example:

user_query = "how safe is llama 2"
search_results = vector_db.search(user_query, 5)

for doc in search_results['documents'][0]:
    print(doc+"\n")

response = bot.chat(user_query)
print("====回复====")
print(response)

The output is:

We believe that the open release of LLMs, when done safely, will be a net beneﬁt to society. Like all LLMs, Llama 2 is a new technology that carries potential risks with use (Bender et al., 2021b; Weidinger et al., 2021; Solaiman et al., 2023).

We also share novel observations we made during the development of Llama 2 and Llama 2-Chat, such as the emergence of tool usage and temporal organization of knowledge. Figure 3: Safety human evaluation results for Llama 2-Chat compared to other open-source and closed source models.

In this work, we develop and release Llama 2, a family of pretrained and ﬁne-tuned LLMs, Llama 2 and Llama 2-Chat, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested, Llama 2-Chat models generally perform better than existing open-source models.

Additionally, these safety evaluations are performed using content standards that are likely to be biased towards the Llama 2-Chat models. We are releasing the following models to the general public for research and commercial use‡: 1.

We provide a responsible use guide¶ and code examples‖ to facilitate the safe deployment of Llama 2 and Llama 2-Chat. More details of our responsible release strategy can be found in Section 5.3.

====回复====
我无法回答您的问题。

Approach:

First recall a larger set of candidate chunks at retrieval time
Then use a ranking model to rescore and reorder the (query, document) pairs
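The two-stage shape (recall broadly, then rescore each pair) can be sketched with a pluggable scoring function. Here `overlap_score` is a hypothetical stand-in for a real reranking model, and the candidate list is made up to play the role of vector search results.

```python
def overlap_score(query, doc):
    """Hypothetical stand-in for a reranking model: fraction of query words found in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def rerank(query, candidates, score_fn, top_n=2):
    """Score every (query, candidate) pair and keep the `top_n` best."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

# Pretend these came back from vector search, in this order
candidates = [
    "llama 2 is a new technology",
    "we share observations on tool usage",
    "safety evaluations show how safe llama 2 is",
]
reranked_demo = rerank("how safe is llama 2", candidates, overlap_score)
for score, doc in reranked_demo:
    print(f"{score:.2f}\t{doc}")
```

A real reranker scores the query and document jointly with a model, which is slower per pair than vector search; that is why it is applied only to the small recalled set.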

[Figure: recall, then rerank]

The code below needs access to Hugging Face to download the model.

Install the `sentence_transformers` library

pip install sentence_transformers

from sentence_transformers import CrossEncoder

# Load the model from Hugging Face
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512) # English, small model
# model = CrossEncoder('BAAI/bge-reranker-large', max_length=512) # multilingual, developed in China, larger model


user_query = "how safe is llama 2"
# user_query = "llama 2安全性如何"

scores = model.predict([(user_query, doc) for doc in search_results['documents'][0]])

# Sort by score
sorted_list = sorted(zip(scores, search_results['documents'][0]), key=lambda x: x[0], reverse=True)

for score, doc in sorted_list:
    print(f"{score}\t{doc}\n")

The output is below; the documents are now reordered by score:

6.613734722137451	We believe that the open release of LLMs, when done safely, will be a net beneﬁt to society. Like all LLMs, Llama 2 is a new technology that carries potential risks with use (Bender et al., 2021b; Weidinger et al., 2021; Solaiman et al., 2023).

5.310717582702637	In this work, we develop and release Llama 2, a family of pretrained and ﬁne-tuned LLMs, Llama 2 and Llama 2-Chat, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested, Llama 2-Chat models generally perform better than existing open-source models.

4.709955215454102	We provide a responsible use guide¶ and code examples‖ to facilitate the safe deployment of Llama 2 and Llama 2-Chat. More details of our responsible release strategy can be found in Section 5.3.

4.5439653396606445	We also share novel observations we made during the development of Llama 2 and Llama 2-Chat, such as the emergence of tool usage and temporal organization of knowledge. Figure 3: Safety human evaluation results for Llama 2-Chat compared to other open-source and closed source models.

4.0338897705078125	Additionally, these safety evaluations are performed using content standards that are likely to be biased towards the Llama 2-Chat models. We are releasing the following models to the general public for research and commercial use‡: 1.

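Note: chunking matters here as well. The new splitting method above raised an error in my environment, so this run does not use the re-split text, and some of the output differs from what the new chunks would give.

Some Rerank API services

Cohere Rerank: supports multiple languages
Jina Rerank: English only at the time of writing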

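4.3 Hybrid Search

In production, traditional keyword retrieval (sparse representation) and vector retrieval (dense representation) each have strengths and weaknesses.

For example, when documents contain long technical terms, keyword retrieval is often more precise, while vector retrieval tends to confuse related concepts.

# Background: in medicine, "小细胞肺癌" (small cell lung cancer) and "非小细胞肺癌" (non-small cell lung cancer) are two different cancers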
query = "非小细胞肺癌的患者"

documents = [
    "玛丽患有肺癌，癌细胞已转移",
    "刘某肺癌I期",
    "张某经诊断为非小细胞肺癌III期",
    "小细胞肺癌是肺癌的一种"
]

query_vec = get_embeddings([query])[0]
doc_vecs = get_embeddings(documents)

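print("Cosine similarity:")
for vec in doc_vecs:
    print(cos_sim(query_vec, vec))

The output is:

Cosine similarity:
0.8912871311166322
0.8896135883660553
0.9039894669142209
0.9131454293290566

Note that the highest score goes to the fourth document, which is about small cell lung cancer, rather than to the third document, which actually mentions a non-small cell lung cancer patient.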
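A keyword-style check treats the two terms differently. The sketch below uses a plain substring test as the crudest possible form of keyword matching; real keyword engines tokenize the text and score matches (for example with BM25), so this only illustrates the exact-term behaviour.

```python
query_term = "非小细胞肺癌"   # non-small cell lung cancer

documents = [
    "玛丽患有肺癌，癌细胞已转移",
    "刘某肺癌I期",
    "张某经诊断为非小细胞肺癌III期",
    "小细胞肺癌是肺癌的一种",
]

# Exact phrase match: only documents containing the full term are kept
keyword_hits = [doc for doc in documents if query_term in doc]
print(keyword_hits)
# → ['张某经诊断为非小细胞肺癌III期']
```

The fourth document contains "小细胞肺癌" but not the full term, so it is excluded here even though it scored highest by cosine similarity.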

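So we sometimes need to combine different retrieval algorithms to get better results than any single one. This is hybrid search.
The core of hybrid search is to combine the rank of document $d$ under each retrieval algorithm into one final ranking.
One of the most commonly used algorithms is Reciprocal Rank Fusion (RRF)
$rrf(d)=\sum_{a\in A}\frac{1}{k+rank_a(d)}$
where $A$ is the set of retrieval algorithms used, $rank_a(d)$ is the rank of document $d$ when retrieved with algorithm $a$, and $k$ is a constant.
Many vector databases support hybrid search, for example Weaviate and Pinecone. You can also implement it yourself from the formula above.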

4.3.1 A simple hand-written example

Note: an Elasticsearch server must be installed and running!

1. Ranking based on keyword retrieval

pip install elasticsearch7

import time
from elasticsearch7 import Elasticsearch, helpers

class MyEsConnector:
    def __init__(self, es_client, index_name, keyword_fn):
        self.es_client = es_client
        self.index_name = index_name
        self.keyword_fn = keyword_fn

    def add_documents(self, documents):
        '''Index the documents (recreating the index if it already exists)'''
        if self.es_client.indices.exists(index=self.index_name):
            self.es_client.indices.delete(index=self.index_name)
        self.es_client.indices.create(index=self.index_name)
        actions = [
            {
                "_index": self.index_name,
                "_source": {
                    "keywords": self.keyword_fn(doc),
                    "text": doc,
                    "id": f"doc_{i}"
                }
            }
            for i, doc in enumerate(documents)
        ]
        helpers.bulk(self.es_client, actions)
        time.sleep(1)

    def search(self, query_string, top_n=3):
        '''Keyword search; returns {doc id: {"text", "rank"}} with rank 0 as the best hit'''
        search_query = {
            "match": {
                "keywords": self.keyword_fn(query_string)
            }
        }
        res = self.es_client.search(
            index=self.index_name, query=search_query, size=top_n)
        return {
            hit["_source"]["id"]: {
                "text": hit["_source"]["text"],
                "rank": i,
            }
            for i, hit in enumerate(res["hits"]["hits"])
        }

from chinese_utils import to_keywords  # keyword extraction function for Chinese text

# Read the connection settings from environment variables
ELASTICSEARCH_BASE_URL = os.getenv('ELASTICSEARCH_BASE_URL')
ELASTICSEARCH_PASSWORD = os.getenv('ELASTICSEARCH_PASSWORD')
ELASTICSEARCH_NAME= os.getenv('ELASTICSEARCH_NAME')

es = Elasticsearch(
    hosts=[ELASTICSEARCH_BASE_URL],
    http_auth=(ELASTICSEARCH_NAME, ELASTICSEARCH_PASSWORD),  # username, password
)

# Create the ES connector
es_connector = MyEsConnector(es, "demo_es_rrf", to_keywords)

# Index the documents
es_connector.add_documents(documents)

# Keyword search
keyword_search_results = es_connector.search(query, 3)

print(json.dumps(keyword_search_results, indent=4, ensure_ascii=False))

2. Ranking based on vector retrieval

# Create the vector database connector
vecdb_connector = MyVectorDBConnector("demo_vec_rrf", get_embeddings)

# Index the documents
vecdb_connector.add_documents(documents)

# Vector search
vector_search_results = {
    "doc_"+str(documents.index(doc)): {
        "text": doc,
        "rank": i
    }
    for i, doc in enumerate(
        vecdb_connector.search(query, 3)["documents"][0]
    )
}  # convert the results to the same format as the keyword search results above

print(json.dumps(vector_search_results, indent=4, ensure_ascii=False))

3. Fused ranking based on RRF

Reference: https://learn.microsoft.com/zh-cn/azure/search/hybrid-search-ranking

import json

def rrf(ranks, k=1):
    ret = {}
    # Go through each ranking
    for rank in ranks:
        # Go through each document in the ranking
        for id, val in rank.items():
            if id not in ret:
                ret[id] = {"score": 0, "text": val["text"]}
            # Accumulate the RRF score
            ret[id]["score"] += 1.0/(k+val["rank"])
    # Sort by RRF score and return
    return dict(sorted(ret.items(), key=lambda item: item[1]["score"], reverse=True))

# Fuse the rankings from the two searches
reranked = rrf([keyword_search_results, vector_search_results])
print(json.dumps(reranked, indent=4, ensure_ascii=False))
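Since the real run needs Elasticsearch and an embedding API, the fusion step can also be tried on hand-written rankings. The sketch below repeats the RRF function so it runs on its own; the two input rankings are made up (they are not actual Elasticsearch or Chroma output) and use the same 0-based `rank` convention with `k=1`.

```python
def rrf(ranks, k=1):
    """Reciprocal Rank Fusion over rankings of the form {doc_id: {"text", "rank"}}."""
    ret = {}
    for rank in ranks:
        for doc_id, val in rank.items():
            if doc_id not in ret:
                ret[doc_id] = {"score": 0, "text": val["text"]}
            ret[doc_id]["score"] += 1.0 / (k + val["rank"])
    return dict(sorted(ret.items(), key=lambda item: item[1]["score"], reverse=True))

# Made-up rankings (rank 0 is best)
keyword_ranks = {
    "doc_2": {"text": "张某经诊断为非小细胞肺癌III期", "rank": 0},
    "doc_0": {"text": "玛丽患有肺癌，癌细胞已转移", "rank": 1},
}
vector_ranks = {
    "doc_3": {"text": "小细胞肺癌是肺癌的一种", "rank": 0},
    "doc_2": {"text": "张某经诊断为非小细胞肺癌III期", "rank": 1},
    "doc_0": {"text": "玛丽患有肺癌，癌细胞已转移", "rank": 2},
}

fused = rrf([keyword_ranks, vector_ranks])
for doc_id, val in fused.items():
    print(doc_id, round(val["score"], 3))
# → doc_2 1.5
# → doc_3 1.0
# → doc_0 0.833
```

doc_2 wins because it is ranked well by both searches (1/1 + 1/2), while doc_3 tops only one of them. Larger values of `k` flatten the difference between adjacent ranks.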

V. Handling Tables in PDF Documents

[Figure: table processing in PDFs]

5.1 Converting each PDF page to an image

Install the PyMuPDF and matplotlib libraries

pip install PyMuPDF
pip install matplotlib

import os
import fitz
from PIL import Image

def pdf2images(pdf_file):
    '''Convert each page of a PDF into a PNG image'''

    # The output directory is the PDF file name without its extension
    output_directory_path, _ = os.path.splitext(pdf_file)
    if not os.path.exists(output_directory_path):
        os.makedirs(output_directory_path)
    # Open the PDF file
    pdf_document = fitz.open(pdf_file)
    # One image per page
    for page_number in range(pdf_document.page_count):
        # Get the page
        page = pdf_document[page_number]
        # Render it to a pixmap
        pix = page.get_pixmap()
        # Build a PIL image from the pixmap
        image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        # Save it as a PNG file
        image.save(os.path.join(output_directory_path, f"page_{page_number + 1}.png"))
    # Close the PDF file
    pdf_document.close()

import matplotlib.pyplot as plt

def show_images(dir_path):
    '''Display the PNG images in a directory'''
    for file in os.listdir(dir_path):
        if file.endswith('.png'):
            # Open the image
            img = Image.open(os.path.join(dir_path, file))
            # Show the image
            plt.imshow(img)
            plt.axis('off')  # hide the axes
            plt.show()

pdf2images("llama2_page8.pdf")
show_images("llama2_page8")

The output is:

[Figure: PDF page with tables rendered as an image]

5.2 Detecting tables in a document (image)

class MaxResize(object):
    '''Resize an image so that its longer side equals max_size'''
    def __init__(self, max_size=800):
        self.max_size = max_size

    def __call__(self, image):
        width, height = image.size
        current_max_size = max(width, height)
        scale = self.max_size / current_max_size
        resized_image = image.resize(
            (int(round(scale * width)), int(round(scale * height)))
        )

        return resized_image

Install three packages:

pip install torchvision
pip install transformers
pip install timm

import torchvision.transforms as transforms

# Image preprocessing
detection_transform = transforms.Compose(
    [
        MaxResize(800),
        # Convert the PIL image into a tensor that PyTorch can process
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]
)

from transformers import AutoModelForObjectDetection

# Load the Table Transformer model
model = AutoModelForObjectDetection.from_pretrained(
    "microsoft/table-transformer-detection"
)

Coordinate conversion and post-processing after detection

import torch

# Coordinate conversion and post-processing after detection
def box_cxcywh_to_xyxy(x):
    '''Convert boxes from (center_x, center_y, width, height) to (x_min, y_min, x_max, y_max)'''
    x_c, y_c, w, h = x.unbind(-1)
    b = [(x_c - 0.5 * w), (y_c - 0.5 * h), (x_c + 0.5 * w), (y_c + 0.5 * h)]
    return torch.stack(b, dim=1)

def rescale_bboxes(out_bbox, size):
    '''Scale normalized boxes to pixel coordinates of the original image'''
    width, height = size
    boxes = box_cxcywh_to_xyxy(out_bbox)
    boxes = boxes * torch.tensor(
        [width, height, width, height], dtype=torch.float32
    )
    return boxes

def outputs_to_objects(outputs, img_size, id2label):
    '''Extract labels, scores, and bounding boxes from the model output'''
    m = outputs.logits.softmax(-1).max(-1)
    pred_labels = list(m.indices.detach().cpu().numpy())[0]
    pred_scores = list(m.values.detach().cpu().numpy())[0]
    pred_bboxes = outputs["pred_boxes"].detach().cpu()[0]
    pred_bboxes = [
        elem.tolist() for elem in rescale_bboxes(pred_bboxes, img_size)
    ]

    objects = []
    for label, score, bbox in zip(pred_labels, pred_scores, pred_bboxes):
        class_label = id2label[int(label)]
        # Skip predictions assigned to the "no object" class
        if not class_label == "no object":
            objects.append(
                {
                    "label": class_label,
                    "score": float(score),
                    "bbox": [float(elem) for elem in bbox],
                }
            )
    return objects
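The two coordinate steps can be checked by hand on a single box. The sketch below redoes the same arithmetic in plain Python (no torch) on a made-up normalized box; the function names are hypothetical and exist only for this illustration.

```python
def cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to (x_min, y_min, x_max, y_max)."""
    x_c, y_c, w, h = box
    return (x_c - 0.5 * w, y_c - 0.5 * h, x_c + 0.5 * w, y_c + 0.5 * h)

def rescale(box, size):
    """Scale a normalized xyxy box to pixels for an image of `size` = (width, height)."""
    width, height = size
    x0, y0, x1, y1 = box
    return (x0 * width, y0 * height, x1 * width, y1 * height)

normalized = (0.5, 0.5, 0.5, 0.25)   # made-up model output: centered, half as wide, a quarter as tall
pixels = rescale(cxcywh_to_xyxy(normalized), size=(800, 600))
print(pixels)
# → (200.0, 225.0, 600.0, 375.0)
```

The resulting tuple is in the (left, upper, right, lower) order that `PIL.Image.crop` expects, which is why the detected boxes can be passed to `crop` directly.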

Detect tables and save each table region as a separate image file

import torch

# Detect tables and save each table region as a separate image file

def detect_and_crop_save_table(file_path):
    # Load the image (a PDF page)
    image = Image.open(file_path)
    filename, _ = os.path.splitext(os.path.basename(file_path))
    # Output directory
    cropped_table_directory = os.path.join(os.path.dirname(file_path), "table_images")
    if not os.path.exists(cropped_table_directory):
        os.makedirs(cropped_table_directory)
    # Preprocess
    pixel_values = detection_transform(image).unsqueeze(0)
    # Detect tables
    with torch.no_grad():
        outputs = model(pixel_values)

    # Post-process to get the table regions; copy the label map so the model config is not modified
    id2label = dict(model.config.id2label)
    id2label[len(id2label)] = "no object"
    detected_tables = outputs_to_objects(outputs, image.size, id2label)

    print(f"number of tables detected {len(detected_tables)}")

    for idx in range(len(detected_tables)):
        # Save each detected table region as its own image
        cropped_table = image.crop(detected_tables[idx]["bbox"])
        cropped_table.save(os.path.join(cropped_table_directory,f"{filename}_{idx}.png"))

# Run detection on the page image produced in 5.1
detect_and_crop_save_table("llama2_page8/page_1.png")

The result:

[Figure: table regions cropped and saved as images]

5.3 Table question answering with the GPT-4 Vision API

import base64
from openai import OpenAI

client = OpenAI()

def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

def image_qa(query, image_path):
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        seed=42,
        messages=[{
            "role": "user",
              "content": [
                  {"type": "text", "text": query},
                  {
                      "type": "image_url",
                      "image_url": {
                          "url": f"data:image/png;base64,{base64_image}",  # the table crops are saved as PNG
                      },
                  },
              ],
        }],
    )

    return response.choices[0].message.content


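# Query: "Which model performs best on the AGI Eval dataset, and what is its score?"
response = image_qa("哪个模型在AGI Eval数据集上表现最好。得分多少","llama2_page8/table_images/page_1_0.png")

print(response)

The output is:

在AGI Eval数据集上表现最好的模型是 **Llama 2 70B**，得分为 **54.2**。

As you can see, the model answers based on the content of the table image above.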
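When sending images this way, the MIME type in the data URL should match the actual file format. A small helper can derive it from the file name instead of hard-coding it. `to_data_url` is a hypothetical helper for illustration; it uses only the standard library.

```python
import base64
import mimetypes

def to_data_url(image_bytes, filename):
    """Build a base64 data URL whose MIME type is guessed from the file name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"   # fallback when the extension is unknown
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

# The first 8 bytes of any PNG file, used here as stand-in image data
url = to_data_url(b"\x89PNG\r\n\x1a\n", "page_1_0.png")
print(url)
```

In `image_qa`, the bytes would come from reading the image file, and the returned string would go into the `url` field of the `image_url` content part.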

5.4 Generating table (image) descriptions with GPT-4 Vision and embedding them for retrieval

import chromadb
from chromadb.config import Settings

class NewVectorDBConnector:
    def __init__(self, collection_name, embedding_fn):
        chroma_client = chromadb.Client(Settings(allow_reset=True))
        # For the demo only; in practice you would not reset() every time
        chroma_client.reset()
        # Create a collection
        self.collection = chroma_client.get_or_create_collection(
            name=collection_name)
        self.embedding_fn = embedding_fn

    def add_documents(self, documents):
        '''Add documents and their vectors to the collection'''
        self.collection.add(
            embeddings=self.embedding_fn(documents),  # vector of each document
            documents=documents,  # original text of each document
            ids=[f"id{i}" for i in range(len(documents))]  # id of each document
        )

    def add_images(self, image_paths):
        '''Add images to the collection via LLM-generated descriptions'''
        # Prompt: "Briefly describe the information in the image"
        documents = [
            image_qa("请简要描述图片中的信息",image)
            for image in image_paths
        ]
        self.collection.add(
            embeddings=self.embedding_fn(documents),  # vector of each description
            documents=documents,  # the description text
            ids=[f"id{i}" for i in range(len(documents))],  # id of each description
            metadatas=[{"image": image} for image in image_paths] # metadata records the source image path
        )

    def search(self, query, top_n):
        '''Query the vector database'''
        results = self.collection.query(
            query_embeddings=self.embedding_fn([query]),
            n_results=top_n
        )
        return results

images = []
dir_path = "llama2_page8/table_images"
for file in os.listdir(dir_path):
    if file.endswith('.png'):
        # Collect the path of each table image
        images.append(os.path.join(dir_path, file))

new_db_connector = NewVectorDBConnector("table_demo",get_embeddings)
new_db_connector.add_images(images)

query = "Which model performs worst on the AGI Eval dataset, and what is its score?"

results = new_db_connector.search(query, 1)
metadata = results["metadatas"][0]
print("====Retrieval result====")
print(metadata)
print("====Answer====")
response = image_qa(query, metadata[0]["image"])
print(response)

The output is as follows:

====Retrieval result====
[{'image': 'llama2_page8/table_images\\page_1_0.png'}]
====Answer====
The worst-performing model on the AGI Eval dataset is Falcon 7B, with a score of 21.2.

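Some document-parsing tools for RAG

PyMuPDF: a basic PDF processing library with rule-based table and image extraction (not very accurate)
RAGFlow: an open-source RAG engine built on deep document understanding, supporting many document formats (very popular and worth knowing)
Unstructured.io: a document-parsing library offered as open source plus SaaS, supporting many document formats
LlamaParse: a paid API service from the LlamaIndex team; parsing is not guaranteed to be 100% accurate, and in my tests text was occasionally dropped or misplaced
Mathpix: a paid API service with good results; it can parse paragraph structure, tables, formulas, and more, but it is expensive

In engineering terms, PDF parsing is complex and fiddly work. None of the tools above is perfect, so test them on your own real-world documents before choosing one.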

6. About GraphRAG

GraphRAG

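What is GraphRAG: the core idea is to preprocess knowledge into a knowledge graph ahead of time
Pros: well suited to complex questions, especially query-focused summarization, for example: "What did team XXX contribute last year?"
Cons: building, cleaning, maintaining, and updating the knowledge graph all carry considerable cost
Recommendations:
- GraphRAG is not a cure-all
- Understand its core idea
- Use it selectively when you hit a problem that traditional RAG cannot solve no matter how you tune it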
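To make the core idea concrete, here is a toy sketch of answering a query-focused question from a graph instead of from isolated chunks. Everything in it is an assumption for illustration: the triples are made up, build_graph and collect_facts are hypothetical helper names, and real GraphRAG systems do far more (for example, using an LLM to extract the entities and relations in the first place).

```python
from collections import defaultdict

# Hypothetical triples that an extraction step might produce from documents
TRIPLES = [
    ("Team A", "shipped", "Search v2"),
    ("Team A", "published", "Latency report"),
    ("Team B", "shipped", "Billing API"),
    ("Search v2", "reduced", "query latency"),
]


def build_graph(triples):
    """Index triples by subject so an entity's outgoing edges are easy to read."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def collect_facts(graph, entity, max_hops=2):
    """Breadth-first walk from an entity, returning facts within max_hops edges."""
    facts, frontier, seen = [], [entity], {entity}
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for relation, obj in graph.get(node, []):
                facts.append(f"{node} {relation} {obj}")
                if obj not in seen:
                    seen.add(obj)
                    next_frontier.append(obj)
        frontier = next_frontier
    return facts


graph = build_graph(TRIPLES)
context = collect_facts(graph, "Team A")
print("\n".join(context))
```

The collected facts would then be placed into the prompt as context. The second hop is what lets a question about "Team A" also pick up the effect of "Search v2", which plain chunk retrieval could easily miss.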

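Summary

The RAG workflow

Offline steps:

Load the documents
Split the documents into chunks
Vectorize the chunks
Load the vectors into the vector database

Online steps:

Receive the user's question
Vectorize the user's question
Search the vector database
Fill the retrieved results and the user's question into the Prompt template
Call the LLM with the final Prompt
Have the LLM generate the reply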
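The offline and online steps above can be sketched end to end with the standard library alone. This is a minimal sketch under loud assumptions: embed is a toy word-count "embedding" standing in for a real embedding model, a Python list stands in for the vector database, and the final LLM call is left out, so the code stops at the assembled prompt.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Offline: load, split, vectorize, and store (a list stands in for the vector DB)
document = "Llama 2 was released in 2023. Falcon is another open model. RAG retrieves context before generation."
chunks = [s.strip() for s in document.split(".") if s.strip()]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question, top_n=1):
    """Online: vectorize the question and return the most similar chunks."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


def build_prompt(question, contexts):
    """Fill the retrieved chunks and the question into a prompt template."""
    joined = "\n".join(contexts)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {question}"


question = "When was Llama 2 released?"
prompt = build_prompt(question, retrieve(question))
print(prompt)  # this prompt would then be sent to the LLM
```

Swapping embed for a real embedding function and the list for a vector database changes the quality of retrieval, but not the shape of the pipeline.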

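What if you adopt an open-source RAG system and it doesn't work well?

Check the preprocessing: are the documents loaded correctly, and are they split sensibly?
Test retrieval quality: do the text chunks retrieved for a question actually contain the answer?
Test the LLM's ability: given the question and a chunk that contains the answer, can the LLM answer correctly?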
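The second check, whether retrieval actually brings back the answer, is easy to automate. Below is a rough sketch, assuming you have a small set of (question, expected answer text) pairs. retrieval_hit_rate and keyword_retrieve are hypothetical names, and the keyword retriever is only a stand-in for your real search function.

```python
def retrieval_hit_rate(test_cases, retrieve_fn, top_n=3):
    """Fraction of questions whose expected answer text appears in the retrieved chunks."""
    hits = 0
    misses = []
    for question, expected in test_cases:
        chunks = retrieve_fn(question, top_n)
        if any(expected.lower() in chunk.lower() for chunk in chunks):
            hits += 1
        else:
            misses.append(question)
    rate = hits / len(test_cases) if test_cases else 0.0
    return rate, misses


# Hypothetical stand-in retriever: naive keyword overlap over a tiny corpus
CORPUS = [
    "Llama 2 70B scores 54.2 on AGI Eval",
    "Falcon 7B scores 21.2 on AGI Eval",
]


def keyword_retrieve(question, top_n):
    words = set(question.lower().split())
    ranked = sorted(CORPUS, key=lambda c: len(words & set(c.lower().split())), reverse=True)
    return ranked[:top_n]


cases = [("What does Falcon 7B score on AGI Eval", "21.2"),
         ("What is the capital of France", "Paris")]
rate, misses = retrieval_hit_rate(cases, keyword_retrieve, top_n=1)
print(rate, misses)
```

The list of missed questions is the useful part: if the answer is not in the retrieved chunks, the problem lies in splitting or retrieval, not in the LLM.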

Troubleshooting notes

1. Error when testing the Embeddings model

Code that raised the error:

test_query = ["测试文本"]
vec = get_embeddings(test_query)[0]
print(f"Total dimension: {len(vec)}")
print(f"First 10 elements: {vec[:10]}")

Error output:

---------------------------------------------------------------------------
BadRequestError                           Traceback (most recent call last)
Cell In[11], line 2
      1 test_query = ["测试文本"]
----> 2 vec = get_embeddings(test_query)[0]
      3 print(f"Total dimension: {len(vec)}")
      4 print(f"First 10 elements: {vec[:10]}")
Cell In[10], line 9, in get_embeddings(texts, model, dimensions)
      6     data = client.embeddings.create(
      7         input=texts, model=model, dimensions=dimensions).data
      8 else:
----> 9     data = client.embeddings.create(input=texts, model=model).data
     10 return [x.embedding for x in data]
File d:\Soft\Dev_Soft\anaconda3\envs\rag-learn\lib\site-packages\openai\resources\embeddings.py:128, in Embeddings.create(self, input, model, dimensions, encoding_format, user, extra_headers, extra_query, extra_body, timeout)
    122             embedding.embedding = np.frombuffer(  # type: ignore[no-untyped-call]
    123                 base64.b64decode(data), dtype="float32"
    124             ).tolist()
    126     return obj
--> 128 return self._post(
    129     "/embeddings",
    130     body=maybe_transform(params, embedding_create_params.EmbeddingCreateParams),
    131     options=make_request_options(
    132         extra_headers=extra_headers,
    133         extra_query=extra_query,
    134         extra_body=extra_body,
    135         timeout=timeout,
    136         post_parser=parser,
    137     ),
    138     cast_to=CreateEmbeddingResponse,
    139 )
File d:\Soft\Dev_Soft\anaconda3\envs\rag-learn\lib\site-packages\openai\_base_client.py:1242, in SyncAPIClient.post(self, path, cast_to, body, options, files, stream, stream_cls)
   1228 def post(
   1229     self,
   1230     path: str,
   (...)
   1237     stream_cls: type[_StreamT] | None = None,
   1238 ) -> ResponseT | _StreamT:
   1239     opts = FinalRequestOptions.construct(
   1240         method="post", url=path, json_data=body, files=to_httpx_files(files), **options
   1241     )
-> 1242     return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File d:\Soft\Dev_Soft\anaconda3\envs\rag-learn\lib\site-packages\openai\_base_client.py:919, in SyncAPIClient.request(self, cast_to, options, remaining_retries, stream, stream_cls)
    916 else:
    917     retries_taken = 0
--> 919 return self._request(
    920     cast_to=cast_to,
    921     options=options,
    922     stream=stream,
    923     stream_cls=stream_cls,
    924     retries_taken=retries_taken,
    925 )
File d:\Soft\Dev_Soft\anaconda3\envs\rag-learn\lib\site-packages\openai\_base_client.py:1023, in SyncAPIClient._request(self, cast_to, options, retries_taken, stream, stream_cls)
   1020         err.response.read()
   1022     log.debug("Re-raising status error")
-> 1023     raise self._make_status_error_from_response(err.response) from None
   1025 return self._process_response(
   1026     cast_to=cast_to,
   1027     options=options,
   (...)
   1031     retries_taken=retries_taken,
   1032 )
BadRequestError: Error code: 400 - {'error': {'message': 'Unknown model: text-embedding-ada-002 (request id: 2025050722295246817707943411229) (request id: 2025050722295245370563533xRaOCs)', 'type': '', 'param': '', 'code': 'unknown_model'}}

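Notes on the error:

The repeated failures here happened because the API call did not succeed when the credentials were loaded through dotenv and a .env file. I later fixed it by passing the API key and the request URL directly to the OpenAI() constructor at the point where the client is created. I still don't know why the .env approach caused this error; I'll look into it later.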
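One way to narrow this down is to check which values the process actually sees. This is an unconfirmed guess, not a diagnosis. The 400 unknown_model response shows that the request reached an endpoint that does not serve text-embedding-ada-002, so the key or URL in effect may not have been the ones in the .env file. By default, load_dotenv() from python-dotenv does not override variables already set in the environment. The openai client reads OPENAI_API_KEY and OPENAI_BASE_URL when they are not passed explicitly. The sketch below uses a hypothetical helper, describe_openai_env, to print what is set:

```python
import os


def describe_openai_env(env=None):
    """Report which OpenAI-related variables are set, masking the key."""
    env = os.environ if env is None else env
    report = {}
    for name in ("OPENAI_API_KEY", "OPENAI_BASE_URL"):
        value = env.get(name)
        if not value:
            report[name] = "<not set>"
        elif name == "OPENAI_API_KEY":
            # Never print a full key; show only enough to tell two keys apart
            report[name] = value[:3] + "..." + value[-4:]
        else:
            report[name] = value
    return report


# Run this after load_dotenv() and compare against the values in your .env file
for name, value in describe_openai_env().items():
    print(f"{name}: {value}")
```

If the printed values differ from the .env file, a stale system-level variable is a likely suspect. Passing the key and URL directly to the constructor, as described above, sidesteps that.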

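More practical articles and posts on AI LLM application development are available on my personal blog: 墨宇Logic

https://ismoyuai.github.io/ai-da-mo-xing-ying-yong-xue-xi-bi-ji-rag-embedding-vector-zhi-shi-dian-xue-xi/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 墨宇Logic when reposting.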
