A 2023 Introduction to Deep Learning (6) - Running Large Models on Your Own Computer

Posted on 2023-8-23 11:48:24

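The previous installment covered the foundations of large models: the self-attention mechanism and an implementation of the Transformer module. Because PyTorch, TensorFlow and other frameworks support the Transformer directly, all we needed was a framework with GPU (or other accelerator) support configured, and the code ran.
Running a large model is unlikely to be that easy, and you may well need a Linux machine. Today's popular AI software typically depends on a large number of open-source tools, and when optimization is involved it often has to be compiled from source. Once open-source software and compilation enter the picture, doing it on Windows becomes hard mode.
Most developers work on open-source systems themselves and pay little attention to Windows support, sometimes none at all. Cygwin, MinGW, CMake and WSL have all done a great deal to bring Linux open-source libraries to Windows, but just as Linux has far fewer games than Windows, this is an ecosystem problem.
We start with a few projects that are reasonably Windows-friendly, so that Windows users can also try a large model on their own machine.
Nomic AI gpt4all (based on LLaMA)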

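After ChatGPT burst onto the scene at the end of 2022, Meta took the view that OpenAI had drifted away from the "open" in its name, and released its own large model, LLaMA, on a semi-open basis. Semi-open, because the network weights have to be requested from Meta.
LLaMA was trained mainly on English text, along with some other languages written in the Latin and Cyrillic alphabets. Its tokenizer can handle Chinese and Japanese, but no Chinese or Japanese corpora were used in training.
Since the weights are not open to everyone, walking through LLaMA itself would be of little use. We can, however, try projects built on LLaMA, such as Nomic AI's gpt4all.
A nice touch of gpt4all is that it is adapted for Windows, M1 Mac and Intel Mac, and Linux is of course supported by default. Better still, inference runs on the CPU alone.
Let's get it running.
First, download the gpt4all code: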

git clone https://github.com/nomic-ai/gpt4all

Step 2: download the quantized network weights file: https://the-eye.eu/public/AI/models/nomic-ai/gpt4all/gpt4all-lora-quantized.bin
Step 3: put the downloaded gpt4all-lora-quantized.bin in the chat directory of gpt4all.
Step 4: run the gpt4all-lora-quantized executable. On Windows that is gpt4all-lora-quantized-win64.exe; you can run it from PowerShell or simply double-click it.
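The steps above can be checked with a small helper before launching anything. This is only a sketch: the article names the Windows executable, while the macOS and Linux file names in the table below are assumptions about how the release binaries are named, and `pick_binary` and `check_chat_dir` are hypothetical helpers, not part of gpt4all.

```python
from pathlib import Path
from typing import List

# Only the Windows name appears in the article; the other three names are
# assumptions and should be checked against the files in the chat directory.
BINARIES = {
    ("Windows", "AMD64"): "gpt4all-lora-quantized-win64.exe",
    ("Darwin", "arm64"): "gpt4all-lora-quantized-OSX-m1",
    ("Darwin", "x86_64"): "gpt4all-lora-quantized-OSX-intel",
    ("Linux", "x86_64"): "gpt4all-lora-quantized-linux-x86",
}

WEIGHTS = "gpt4all-lora-quantized.bin"


def pick_binary(system: str, machine: str) -> str:
    """Return the chat executable name for a platform, or raise if unknown."""
    try:
        return BINARIES[(system, machine)]
    except KeyError:
        raise ValueError(f"no prebuilt binary known for {system}/{machine}")


def check_chat_dir(chat_dir: Path, system: str, machine: str) -> List[str]:
    """List the files still missing from the chat directory before launching."""
    missing = []
    if not (chat_dir / WEIGHTS).is_file():
        missing.append(WEIGHTS)
    binary = pick_binary(system, machine)
    if not (chat_dir / binary).is_file():
        missing.append(binary)
    return missing
```

Calling `check_chat_dir(Path("gpt4all/chat"), platform.system(), platform.machine())` (with `import platform`) returns an empty list once both the weights and the executable are in place.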
Once it has started and the model has finished loading, we can chat with gpt4all:

Below is the C++ quick sort that gpt4all generated:
Please generate quick sort algorithm in C++ language
Here’s how you can implement Quick Sort Algorithm using C++ :

// Implementation of QuickSort() function for array A[] with size N and pivot value P[N/2]
void QS(int *A, int n) { // Function prototype declaration
    if (n <=1 ) return; /* If there is only one element in the input Array */
    // QuickSort() function for array A[] with size N and pivot value P[N/2]
    else{ //else statement to check condition of quick sort algorithm.
        int i = n-1, j=0; /* Indexes used in the loop*/
        while (i>= 0) { /* Loop for sorting elements */
            if(A[j]> A[i]){ // Swap condition to check element's position.
                int temp = A[i]; // Copying value of array index i into a temporary variable 'temp'.
                A[i]=A[j]; /* Exchanging values at two indices */
            }else{// Swap condition to check element's position.
                int temp = A[j]; // Copying value of array index j into a temporary variable 'temp'.
                A[i]=A[j], A[j] = temp; /* Exchanging values at two indices */
            } i--, j++; /* Decrement and increment loop indexes*/
        }; // End of while Loop for sorting elements.
    }//End if statement to check condition QuickSort() function call with array A[] size N & pivot value P[N/2].

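Congratulations! You have finally got a pretrained large model, the same remarkable kind of thing as ChatGPT, running on your own computer.
This model is considerably less capable than ChatGPT (the code above is not a working quick sort, for example), but its model comes from LLaMA, part of its training data is drawn from ChatGPT conversation records, and it uses low-rank adaptation (LoRA: Low-Rank Adaptation) and quantization to shrink the model. Each of these is a current frontier technique.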
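To make the low-rank adaptation idea concrete, here is a minimal pure-Python sketch. It assumes the usual LoRA formulation, in which a frozen weight matrix W is adjusted by a product B·A of two thin matrices scaled by alpha / rank; the function names are illustrative and not taken from gpt4all.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_update(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged weight used at inference."""
    delta = matmul(b, a)          # (d_out x rank) @ (rank x d_in) -> d_out x d_in
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, rank):
    """Compare full fine-tuning with LoRA for one d_out x d_in weight matrix."""
    return {"full": d_out * d_in, "lora": rank * (d_in + d_out)}


# A 4096 x 4096 projection adapted with rank 8 trains 256 times fewer numbers.
counts = trainable_params(4096, 4096, 8)
print(counts["full"] // counts["lora"])
```

Only A and B are trained, which is why the adapter is small enough to ship separately from the base weights.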
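The Tsinghua team's single-GPU large model: ChatGLM-6B

ChatGLM-6B from the Tsinghua team is another large model designed to run on a single GPU. Its foundation is ChatGLM, which has 130 billion parameters.
The figure below shows Stanford University's October 2022 evaluation of the major large models of the time:

Judging by the data in the figure, ChatGLM trails the InstructGPT-based ChatGPT, but it holds up well against the other models.
Let's see how the compressed model built on ChatGLM performs.
First download the code: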

git clone https://github.com/THUDM/ChatGLM-6B

Then install the libraries it depends on:

pip install -r requirements.txt

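The Python programming interface of gpt4all above does not support Windows, whereas ChatGLM-6B is built on the Hugging Face Transformers library and can be programmed in Python on Windows. Let's write some code to call it: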

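from transformers import AutoTokenizer, AutoModel

# 4-bit quantized checkpoint; load the tokenizer and the model from the same repository
MODEL_NAME = "THUDM/chatglm-6b-int4-qe"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True).half().cuda()
model = model.eval()  # inference mode
# The prompt asks, in Chinese, for a C++ implementation of quick sort
response, history = model.chat(tokenizer, "用C++实现快速排序", history=[])
print(response)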

The output is as follows:

#include <iostream>
using namespace std;
void swap(int& a, int& b) {
    int temp = a;
    a = b;
    b = temp;
}
int partition(int arr[], int low, int high) {
    int pivot = arr[high];
    int i = (low - 1);
    for (int j = low; j < high - 1; j++) {
        if (arr[j] < pivot) {
            i++;
            swap(arr[i], arr[j]);
        }
    }
    swap(arr[i+1], arr[high]);
    return i+1;
}
void quicksort(int arr[], int low, int high) {
    if (low < high) {
        int pi = partition(arr, low, high);
        quicksort(arr, low, pi - 1);
        quicksort(arr, pi + 1, high);
    }
}
int main() {
    int arr[] = {5, 2, 9, 1, 6, 3, 8};
    int n = sizeof(arr) / sizeof(arr[0]);
    quicksort(arr, 0, n-1);
    cout << arr[0] << endl;
    return 0;
}

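Not bad, is it? It has a bit of the ChatGPT feel. One caveat: the loop condition j < high - 1 in the generated partition function never looks at the element at index high - 1, so the array is not always sorted correctly; the bound should be j < high. The generated main also prints only arr[0].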
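For comparison, here is a minimal Python sketch of the same Lomuto-style partition scheme with the loop visiting every index from low to high - 1. It is written for this article as a reference point and is not model output.

```python
def partition(arr, low, high):
    """Lomuto partition: place arr[high] at its final position and return it."""
    pivot = arr[high]
    i = low - 1
    for j in range(low, high):  # visits low .. high - 1, including high - 1
        if arr[j] < pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1


def quicksort(arr, low=0, high=None):
    """Sort arr in place between low and high (inclusive) and return it."""
    if high is None:
        high = len(arr) - 1
    if low < high:
        pi = partition(arr, low, high)
        quicksort(arr, low, pi - 1)
        quicksort(arr, pi + 1, high)
    return arr


print(quicksort([5, 2, 9, 1, 6, 3, 8]))
```

Running the generated C++ bound on the same array by hand leaves 3 to the right of the pivot 8 after the first partition, which is how the off-by-one shows up.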
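If GPU support for your PyTorch or TensorFlow installation is set up, this inference runs on the GPU. I chose 4-bit quantization, which needs the least video memory; with a better graphics card you can pick a less heavily compressed model.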
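A rough sense of why 4-bit quantization saves so much video memory can be had from a few lines of arithmetic. The sketch below assumes about 6.2 billion parameters for ChatGLM-6B and counts the weights only (no activations or cache). The `quantize_int4` function is a simplified symmetric scheme with one scale per list, shown for illustration; it is not the scheme used by the actual checkpoint.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def quantize_int4(values):
    """Symmetric 4-bit quantization: map floats to integers in [-7, 7]."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # avoid a zero scale
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


N_PARAMS = 6.2e9  # assumed parameter count
for bits in (16, 8, 4):
    print(f"{bits:2d}-bit weights: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Each stored value is off by at most half a quantization step, which is the "some loss" one trades for the fourfold saving over 16-bit weights.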
This brings us to the gateway of the Transformer era, Hugging Face. The transformers library that the code above imports from is a Hugging Face product.

from transformers import AutoTokenizer, AutoModel

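As the figure above shows, Hugging Face is essentially the hub for Transformer models of every kind. Through the Hugging Face interface you can use virtually all open-source large models.
How a large model is made

Although the network weights have to be requested, the model code of Meta's LLaMA is open source. Let's see how LLaMA's Transformer differs from the standard Transformer we built in the previous installment: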

class Transformer(nn.Module):
    def __init__(self, params: ModelArgs):
        super().__init__()
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers
        self.tok_embeddings = ParallelEmbedding(
            params.vocab_size, params.dim, init_method=lambda x: x
        )
        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))
        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = ColumnParallelLinear(
            params.dim, params.vocab_size, bias=False, init_method=lambda x: x
        )
        self.freqs_cis = precompute_freqs_cis(
            self.params.dim // self.params.n_heads, self.params.max_seq_len * 2
        )

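As we can see, to support parallel training Meta uses its own ColumnParallelLinear for the fully connected layers, and its word embedding layer is likewise a parallel version of its own.
It then stacks n_layers TransformerBlocks, one per layer.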
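The idea behind a column-parallel linear layer can be sketched in plain Python: the weight matrix is cut into blocks of output columns, each block computes its slice of the output, and the slices are concatenated. This is an illustration of the principle under simple assumptions (a single input vector, shards processed one after another), not the library's implementation.

```python
def linear(x, w):
    """y = x @ W for a vector x (length d_in) and W given as d_in rows of d_out columns."""
    return [sum(x_i * row[j] for x_i, row in zip(x, w)) for j in range(len(w[0]))]


def split_columns(w, n_shards):
    """Cut W into n_shards blocks of consecutive output columns."""
    d_out = len(w[0])
    assert d_out % n_shards == 0, "d_out must divide evenly across shards"
    step = d_out // n_shards
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(n_shards)]


def column_parallel_linear(x, w, n_shards):
    """Each shard computes its slice of the output; the slices are concatenated."""
    out = []
    for shard in split_columns(w, n_shards):  # in practice: one shard per GPU
        out.extend(linear(x, shard))
    return out


x = [1, 2]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
print(linear(x, w), column_parallel_linear(x, w, n_shards=2))
```

Both calls give the same vector, which is the point: splitting by columns changes where the work happens, not the result.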
Now let's look at the block:

class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim, hidden_dim=4 * args.dim, multiple_of=args.multiple_of
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(self, x: torch.Tensor, start_pos: int, freqs_cis: torch.Tensor, mask: Optional[torch.Tensor]):
        h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward.forward(self.ffn_norm(h))
        return out
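The forward pass above normalizes the input before each sublayer and then adds the result back to the input. A minimal pure-Python sketch of that pattern follows, assuming RMSNorm divides a vector by its root mean square and multiplies by a learned weight; `pre_norm_block` and its callable arguments are illustrative stand-ins, not LLaMA code.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def pre_norm_block(x, attention, feed_forward, attn_weight, ffn_weight):
    """Mirror of the forward pass: normalize, apply the sublayer, add the residual."""
    h = [a + b for a, b in zip(x, attention(rms_norm(x, attn_weight)))]
    return [a + b for a, b in zip(h, feed_forward(rms_norm(h, ffn_weight)))]


# Stand-in sublayers: attention halves its input, the feed-forward layer returns zeros.
x = [3.0, 4.0]
ones = [1.0, 1.0]
out = pre_norm_block(x, lambda v: [0.5 * t for t in v], lambda v: [0.0] * len(v), ones, ones)
print(out)
```

Unlike LayerNorm, this normalization subtracts no mean and adds no bias, so it is slightly cheaper per token.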

We find that it does not use a standard multi-head attention module but implements an attention class of its own.

class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.n_local_heads = args.n_heads // fs_init.get_model_parallel_world_size()
        self.head_dim = args.dim // args.n_heads
        self.wq = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wk = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wv = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wo = RowParallelLinear(
            args.n_heads * self.head_dim,
            args.dim,
            bias=False,
            input_is_parallel=True,
            init_method=lambda x: x,
        )
        self.cache_k = torch.zeros(
            (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim)
        ).cuda()
        self.cache_v = torch.zeros(
            (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim)
        ).cuda()

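After all that, it turns out to be multi-head attention with parallelism support and a cache added. K, V and Q are wearing a disguise, but underneath it is still multi-head self-attention.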
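What the cache buys can be shown with a single attention head in pure Python. In this sketch each decoding step stores the new key and value, then attends from the newest query to everything stored so far, so earlier keys and values are never recomputed. The class is a simplified illustration: it uses growing lists where the code above preallocates tensors of size max_seq_len, and it leaves out the projections and rotary embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class CachedHead:
    """One attention head that appends each new key/value to a cache."""

    def __init__(self):
        self.cache_k = []
        self.cache_v = []

    def step(self, q, k, v):
        """Attend from the newest token to every token seen so far."""
        self.cache_k.append(k)
        self.cache_v.append(v)
        scale = 1.0 / math.sqrt(len(q))  # scaled dot-product attention
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in self.cache_k]
        weights = softmax(scores)
        return [sum(w * val[i] for w, val in zip(weights, self.cache_v))
                for i in range(len(v))]


head = CachedHead()
print(head.step([1.0, 0.0], [1.0, 0.0], [2.0, 4.0]))  # first token sees only itself
print(head.step([0.0, 0.0], [0.0, 1.0], [4.0, 8.0]))  # zero query: uniform average
```

Because generation only ever appends tokens, no causal mask is needed in this one-token-at-a-time form.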
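Other interesting projects

LM Flow

LM Flow is another project that has been very popular lately. Built by the Hong Kong University of Science and Technology on top of LLaMA, it is open source across the whole pipeline and can be trained on a single 3090 GPU.
It lives at: https://github.com/OptimalScale/LMFlow
What currently makes LMFlow uniquely valuable is how complete its pipeline is.
For example, among today's open-source projects LMFlow is one of the few that provide Instruction Tuning.
Here is an Instruction Tuning example: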

{"id": 0, "instruction": "The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words.", "input": "If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.", "infer30b_before_item": " Output: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n---\nInput: Input: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n Output: Output: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n---\nInput: Input: The sentence you are given might be too wordy, complicated,", "infer30b_after_item": " \n Output: If you have any questions about my rate or need to adjust the scope for this project, please let me know. \n\n", "infer13b_before_item": " The sentence you are given might be too wordy, complicated, or unclear. 
Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n", "infer13b_after_item": " \n Output: If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know. \n\n", "infer7b_before_item": " The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\nInput: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\nOutput: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. 
If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\nInput: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by", "infer7b_after_item": " \n Output: If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know. \n\n"}

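So this is how the model's behavior gets corrected: before tuning, the models mostly echo the prompt back, while after tuning they answer the instruction. This is what plain LLaMA lacks.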
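A record like the one above is easy to inspect with the standard json module. The sketch below uses an abbreviated copy of the record (only the 30b fields, with the instruction and the "before" text cut short). The prompt template in `build_prompt` is a hypothetical format chosen for illustration, not LMFlow's actual template.

```python
import json

# Abbreviated copy of the record above; the field names are the ones it uses.
SAMPLE = r'''
{"id": 0,
 "instruction": "Rewrite the sentence and make your writing clearer by keeping it concise.",
 "input": "If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.",
 "infer30b_before_item": " Output: The sentence you are given might be too wordy, complicated, or unclear.",
 "infer30b_after_item": " \n Output: If you have any questions about my rate or need to adjust the scope for this project, please let me know. \n\n"}
'''


def build_prompt(record):
    """Join instruction and input into one prompt (hypothetical template)."""
    return f"Instruction: {record['instruction']}\nInput: {record['input']}\nOutput:"


def compare_outputs(record, size):
    """Return the (before, after) completions stored for one model size, e.g. '30b'."""
    return (record[f"infer{size}_before_item"].strip(),
            record[f"infer{size}_after_item"].strip())


record = json.loads(SAMPLE)
before, after = compare_outputs(record, "30b")
print(build_prompt(record))
print("before tuning:", before)
print("after tuning: ", after)
```

Printing the two completions side by side makes the effect of tuning visible: the "before" text repeats the instruction, the "after" text is a rewritten sentence.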
HuggingGPT

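More recently, a team from Zhejiang University and Microsoft released the Jarvis project, which takes full advantage of Hugging Face's position as the central hub.

Unfortunately, the two projects above, as well as the advanced uses of the earlier ones, are hard to get working on Windows. We will cover these experiments, which need a Linux environment, together in a later installment.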
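Summary

By pruning, rank reduction, quantization and similar techniques, we can run inference on a resource-constrained computer. Some performance is lost, of course, and the trade-off depends on the use case; if prompt engineering alone solves the problem, so much the better.
Hugging Face is the programming interface and distribution hub for pretrained large models.
The basic principle behind large models is still the self-attention model we studied in the previous installment.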

Source: https://blog.csdn.net/lusing/article/details/130051210