What is this post for?
Meta has released Llama 3.
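Meta、無料で商用可の新LLM「Llama 3」、ほぼすべてのクラウドでアクセス可能に - ITmedia NEWS (in Japanese: "Meta makes its new free, commercially usable LLM 'Llama 3' accessible on almost every cloud")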
It looked like Llama 3 could be run with llama-cpp-python and LocalAI, both of which provide an OpenAI API-compatible server, so I decided to try it.
Llama 3
Llama 3 is an LLM published by Meta.
Introducing Meta Llama 3: The most capable openly available LLM to date
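It comes in two parameter sizes, 8B and 70B, each available as a base model and as an instruction-tuned model.
I want to use this model with llama-cpp-python and LocalAI, so the question is whether they support it.
First, llama.cpp already supports it.
llama-cpp-python supports it as well.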
Add Llama-3 chat format by andreabak · Pull Request #1371 · abetlen/llama-cpp-python · GitHub
For LocalAI, it looks like supplying a prompt template is enough.
How to run llama3? · mudler LocalAI · Discussion #2076 · GitHub
So, let's try it.
The original models are these:
- meta-llama/Meta-Llama-3-8B · Hugging Face
- meta-llama/Meta-Llama-3-8B-Instruct · Hugging Face
- meta-llama/Meta-Llama-3-70B · Hugging Face
- meta-llama/Meta-Llama-3-70B-Instruct · Hugging Face
This time, however, I will use the following model, which is already quantized and in GGUF format.
QuantFactory/Meta-Llama-3-8B-Instruct-GGUF · Hugging Face
Environment
Here is the environment I used.
$ python3 --version
Python 3.10.12

$ pip3 --version
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)
Downloading the model
Download the model from here.
QuantFactory/Meta-Llama-3-8B-Instruct-GGUF · Hugging Face
The model file is about 5 GB.
$ curl -L https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf?download=true -o Meta-Llama-3-8B-Instruct.Q4_K_M.gguf

$ ll -h Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
-rw-rw-r-- 1 xxxxx xxxxx 4.6G 4月 25 00:15 Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
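Since a multi-gigabyte download can be cut short or replaced by an error page, a quick sanity check on the file can be useful. The following is a minimal sketch, assuming only that a GGUF file starts with the four ASCII bytes `GGUF` followed by a little-endian 32-bit format version. The function name `read_gguf_version` is made up for this example and is not part of llama.cpp or llama-cpp-python.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the header is not GGUF."""
    with Path(path).open("rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # The magic is followed by a little-endian uint32 holding the format version.
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    # File name taken from the download command above; skipped if it is not present.
    model = Path("Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
    if model.exists():
        print(f"GGUF version: {read_gguf_version(model)}")
```

This only inspects the header, so it catches an HTML error page saved under the model's name but not a truncated download; comparing the file size with the one shown on Hugging Face covers that case.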
Trying it with llama-cpp-python
First, let's try llama-cpp-python.
Install it.
$ pip3 install llama-cpp-python[server]
Here are the installed versions, including dependencies.
$ pip3 list
Package           Version
----------------- -------
annotated-types   0.6.0
anyio             4.3.0
click             8.1.7
diskcache         5.6.3
exceptiongroup    1.2.1
fastapi           0.110.2
h11               0.14.0
idna              3.7
Jinja2            3.1.3
llama_cpp_python  0.2.64
MarkupSafe        2.1.5
numpy             1.26.4
pip               22.0.2
pydantic          2.7.1
pydantic_core     2.18.2
pydantic-settings 2.2.1
python-dotenv     1.0.1
PyYAML            6.0.1
setuptools        59.6.0
sniffio           1.3.1
sse-starlette     2.1.0
starlette         0.37.2
starlette-context 0.3.6
typing_extensions 4.11.0
uvicorn           0.29.0
Start the server. The --chat_format llama-3 option is required.
$ python3 -m llama_cpp.server --model Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --chat_format llama-3
Let's try it out by asking the model to introduce itself.
$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions -d \
    '{"messages": [{"role": "user", "content": "Could you introduce yourself?"}]}' | jq
{
  "id": "chatcmpl-ff1221b5-5555-4a32-9c1b-c3c2818efc02",
  "object": "chat.completion",
  "created": 1713972674,
  "model": "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "I'd be happy to introduce myself.\n\nI am LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm not a human, but rather a computer program designed to simulate conversation and answer questions to the best of my ability based on the knowledge and data I've been trained on.\n\nI'm constantly learning and improving my responses based on user interactions, so please bear with me if I don't always get it right at first. My goal is to assist and provide helpful information to those who interact with me, while also making our conversation as engaging and natural as possible.\n\nWhat would you like to talk about or ask?",
        "role": "assistant"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 138,
    "total_tokens": 154
  }
}

real    1m9.754s
user    0m0.051s
sys     0m0.003s
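The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the server is listening on localhost:8000 as in the curl command above. The helper names (`build_chat_request`, `extract_reply`, `chat`) are made up for this example; they are not part of llama-cpp-python.

```python
import json
import urllib.request


def build_chat_request(prompt, model=None):
    """Build the JSON body for /v1/chat/completions with a single user message."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    if model is not None:
        # Only included when given; the curl command above sends no model field.
        payload["model"] = model
    return payload


def extract_reply(response):
    """Pull the assistant's text out of a chat.completion response."""
    return response["choices"][0]["message"]["content"]


def chat(prompt, base_url="http://localhost:8000", model=None, timeout=600):
    """POST the prompt and return the reply text (long timeout: CPU inference is slow)."""
    body = json.dumps(build_chat_request(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(chat("Could you introduce yourself?"))` should print just the `content` field of a response like the one above. The `model` and `base_url` parameters are there so the same helper can target another OpenAI API-compatible server, such as the LocalAI setup later in this post (port 8080, model name llama-3-8b-instruct).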
I also tried a prompt in Japanese, but in addition to taking considerably longer, the reply did not come back in Japanese.
$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8000/v1/chat/completions -d \
    '{"messages": [{"role": "user", "content": "あなたの自己紹介をしてください"}]}' | jq
{
  "id": "chatcmpl-cd219a91-a85c-4ceb-ae8c-4b49a28cd881",
  "object": "chat.completion",
  "created": 1713972772,
  "model": "Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "Nice to meet you! 😊\n\nMy name is LLaMA, and I'm a large language model AI trained by Meta AI that can understand and respond to human input in a conversational manner. My primary function is to assist users with information queries, provide helpful responses, and even engage in creative conversations.\n\nHere are some interesting facts about me:\n\n1. **Language skills**: I'm fluent in multiple languages, including English, Japanese, Spanish, French, German, Italian, Chinese, and many more! 🌎\n2. **Knowledge base**: My training data consists of a massive corpus of text from the internet, which allows me to provide accurate answers to a wide range of questions.\n3. **Conversational abilities**: I can understand natural language processing (NLP) and respond accordingly, making it feel like you're having a conversation with a human! 💬\n4. **Creative capabilities**: I can generate text, poetry, stories, dialogues, and even entire scripts!\n5. **Continuous learning**: My training is ongoing, so I'm always improving my understanding of language and updating my knowledge base.\n\nI'm here to help answer your questions, provide information, or simply chat about any topic you're interested in! What would you like to talk about? 🤔",
        "role": "assistant"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 262,
    "total_tokens": 281
  }
}

real    2m42.950s
user    0m0.042s
sys     0m0.018s
It does seem to have understood the request, though.
Let's stick to English when using Llama 3.
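To put the two timings side by side, the `usage.completion_tokens` values and the `real` times from the two runs above can be turned into a rough generation rate. This is only a sketch of the arithmetic: the wall-clock time also includes prompt processing and HTTP overhead, so the numbers are approximate. The helper names are made up for this example.

```python
import re


def parse_real_time(text):
    """Convert a `time` value such as '1m9.754s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unexpected time format: {text}")
    return int(match.group(1)) * 60 + float(match.group(2))


def tokens_per_second(completion_tokens, real_time):
    """Generated tokens divided by wall-clock seconds (a rough rate, not a benchmark)."""
    return completion_tokens / parse_real_time(real_time)


# (completion_tokens, real) copied from the two runs above.
runs = {
    "English prompt": (138, "1m9.754s"),
    "Japanese prompt": (262, "2m42.950s"),
}
for label, (tokens, real) in runs.items():
    print(f"{label}: {tokens_per_second(tokens, real):.2f} tokens/s")
```

This works out to roughly 2.0 tokens/s for the English prompt and 1.6 tokens/s for the Japanese one, so a good part of the longer run time appears to come from the reply simply being longer (262 tokens versus 138).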
By the way, --chat_format llama-3 selects the chat format defined here.
https://github.com/abetlen/llama-cpp-python/blob/v0.2.64/llama_cpp/llama_chat_format.py#L929-L946
It uses the following tokens:

    _roles = dict(
        system="<|start_header_id|>system<|end_header_id|>\n\n",
        user="<|start_header_id|>user<|end_header_id|>\n\n",
        assistant="<|start_header_id|>assistant<|end_header_id|>\n\n",
    )
    _begin_token = "<|begin_of_text|>"
    _sep = "<|eot_id|>"

These tokens are documented here.
Meta Llama 3 | Model Cards and Prompt formats
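To make the role of these tokens concrete, here is a simplified sketch of how a list of chat messages could be turned into a single Llama 3 prompt string. It is written from the token definitions above and is not the actual llama-cpp-python implementation; the function name `render_llama3_prompt` is made up for this example.

```python
BEGIN_TOKEN = "<|begin_of_text|>"
SEP = "<|eot_id|>"
ROLES = {
    "system": "<|start_header_id|>system<|end_header_id|>\n\n",
    "user": "<|start_header_id|>user<|end_header_id|>\n\n",
    "assistant": "<|start_header_id|>assistant<|end_header_id|>\n\n",
}


def render_llama3_prompt(messages):
    """Concatenate messages as header + content + end-of-turn token."""
    prompt = BEGIN_TOKEN
    for message in messages:
        prompt += ROLES[message["role"]] + message["content"] + SEP
    # End with an open assistant header so the model writes the next turn.
    prompt += ROLES["assistant"]
    return prompt


if __name__ == "__main__":
    print(render_llama3_prompt(
        [{"role": "user", "content": "Could you introduce yourself?"}]
    ))
```

The model then generates text until it emits `<|eot_id|>`, which is why that token is used as a stop word when serving the model.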
Trying it with LocalAI
Next, let's try LocalAI.
Download the binary.
$ curl -LO https://github.com/mudler/LocalAI/releases/download/v2.12.4/local-ai-avx2-Linux-x86_64
$ chmod a+x local-ai-avx2-Linux-x86_64
$ ./local-ai-avx2-Linux-x86_64 --version
LocalAI version v2.12.4 (0004ec8be3ca150ce6d8b79f2991bfe3a9dc65ad)
Place the quantized Llama 3 model in the models directory.
$ tree models
models
└── Meta-Llama-3-8B-Instruct.Q4_K_M.gguf

0 directories, 1 file
Prepare a configuration file.
local-ai-config.yaml
- name: llama-3-8b-instruct
  backend: llama-cpp
  mmap: true
  context_size: 8192
  f16: true
  stopwords:
  - <|im_end|>
  - <dummy32000>
  - "<|eot_id|>"
  parameters:
    model: Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
  template:
    chat_message: |
      <|start_header_id|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}<|end_header_id|>

      {{ if .FunctionCall -}}
      Function call:
      {{ else if eq .RoleName "tool" -}}
      Function response:
      {{ end -}}
      {{ if .Content -}}
      {{.Content -}}
      {{ else if .FunctionCall -}}
      {{ toJson .FunctionCall -}}
      {{ end -}}
      <|eot_id|>
    function: |
      <|start_header_id|>system<|end_header_id|>

      You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
      <tools>
      {{range .Functions}}
      {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
      {{end}}
      </tools>
      Use the following pydantic model json schema for each tool call you will make:
      {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
      Function call:
    chat: |
      <|begin_of_text|>{{.Input }}
      <|start_header_id|>assistant<|end_header_id|>
    completion: |
      {{.Input}}
  usage: |
    curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "llama3-8b-instruct",
        "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
    }'
The bulk of the file is the prompt template for Llama 3, which I wrote based on the following discussion.
How to run llama3? · mudler LocalAI · Discussion #2076 · GitHub
Start LocalAI.
$ ./local-ai-avx2-Linux-x86_64 --config-file local-ai-config.yaml --models-path models --threads 4
Check that it works.
$ time curl -s -XPOST -H 'Content-Type: application/json' localhost:8080/v1/chat/completions -d \
    '{"model": "llama-3-8b-instruct", "messages": [{"role": "user", "content": "Could you introduce yourself?"}]}' | jq
{
  "created": 1714056935,
  "object": "chat.completion",
  "id": "059aff63-0d6a-4d29-ba9c-02b4a467f03a",
  "model": "llama-3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "I'd be happy to introduce myself!\n\nI'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm not a human, but I'm designed to simulate conversation and answer questions to the best of my ability. I can provide information on a wide range of topics, and I'm constantly learning and improving my responses.\n\nI don't have personal experiences or emotions like humans do, but I'm here to help you with any questions or topics you'd like to discuss. I'm happy to chat and provide information to the best of my ability."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}

real    0m46.588s
user    0m0.043s
sys     0m0.006s
The initial model load, however, took about 3 minutes.
11:55PM INF Trying to load the model 'Meta-Llama-3-8B-Instruct.Q4_K_M.gguf' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper
11:55PM INF [llama-cpp] Attempting to load
11:55PM INF Loading model 'Meta-Llama-3-8B-Instruct.Q4_K_M.gguf' with backend llama-cpp
11:58PM INF [llama-cpp] Loads OK
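I'll skip showing the Japanese check here. I did try it, but it was slow and the reply again came back in English.
That's about it.
Conclusion
I tried Meta's LLM, Llama 3, with llama-cpp-python and LocalAI.
The smallest model is 8B, slightly larger than Llama 2's smallest (7B), but as I had more or less expected, it was fairly easy to get running, which was nice.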