Trying out Llama 4 Maverick on Hugging Face

dongok218 2025. 4. 14. 18:41

from huggingface_hub import login
login("YOUR_READ_TOKEN")  # placeholder: paste your Hugging Face read token here

First, you have to log in like this.
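Pasting the token straight into a script makes it easy to leak when the file is shared. Here is a minimal sketch of reading it from an environment variable instead. The helper names `read_hf_token` and `login_from_env` are made up for illustration, and I am assuming the token is stored in `HF_TOKEN`.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in env_var, or raise if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to a Hugging Face read token first.")
    return token


def login_from_env(env_var: str = "HF_TOKEN") -> None:
    """Log in to the Hugging Face Hub with the token from the environment."""
    # Imported lazily so read_hf_token() works even without the package installed.
    from huggingface_hub import login

    login(token=read_hf_token(env_var))


# Usage: run `export HF_TOKEN=hf_...` in the shell, then call login_from_env().
```

The token never appears in the source file this way, so the script can be committed or posted as-is.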

from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Use a text-only message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's the difference between RAG and web search in GPT-4?"}
        ]
    }
]

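inputs = processor.apply_chat_template(
    messages,  # fixed: the list defined above is named `messages`, not `message`
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",  # fixed: the keyword is `return_tensors`, not `return_tensor`
).to(model.device)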

outputs = model.generate(
    # **inputs unpacks the inputs dict into keyword arguments for generate()
    **inputs,
    max_new_tokens=256,
)

# Decode the response (slice off the prompt tokens, keep only the newly generated ones)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]

print(response)
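Two details in the code above are easy to miss: the `**inputs` unpacking, and the slice that drops the prompt from the output before decoding. Here is a toy sketch of both using plain Python lists, with no model involved. `fake_generate` is a made-up stand-in for `model.generate()` and only mimics the fact that the returned sequence starts with the prompt tokens.

```python
def fake_generate(input_ids, attention_mask, max_new_tokens=2):
    """Stand-in for model.generate(): echoes the prompt, then appends dummy ids."""
    return [row + [0] * max_new_tokens for row in input_ids]


inputs = {
    "input_ids": [[11, 12, 13]],
    "attention_mask": [[1, 1, 1]],
}

# **inputs expands to input_ids=..., attention_mask=...
outputs = fake_generate(**inputs, max_new_tokens=2)

# The output begins with the prompt, so skip the first prompt_len tokens.
prompt_len = len(inputs["input_ids"][0])
new_tokens = [row[prompt_len:] for row in outputs]

print(outputs)     # [[11, 12, 13, 0, 0]]
print(new_tokens)  # [[0, 0]]
```

In the real code the same slice is written as `outputs[:, inputs["input_ids"].shape[-1]:]` because the values are tensors rather than lists.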

I used the 'Llama-4-Maverick-17B-128E-Instruct' model hosted on Hugging Face. Note that before you can use this model, you must request access to it on Hugging Face and have the request approved.

🧠 What kind of model is this?

meta-llama/Llama-4-Maverick-17B-128E-Instruct

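What the name means:
- Llama-4: Meta's fourth-generation LLM family (the latest at the time of writing)
- Maverick: the high-performance variant of Llama 4
- 17B: 17 billion active parameters per token (the total across all experts is far larger, about 400B)
- 128E: 128 experts (Mixture-of-Experts, MoE, architecture)
- Instruct: the version tuned to follow conversational instructions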
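To make the "128 experts" idea concrete, here is a toy sketch of top-1 expert routing in plain Python. It is a generic illustration of how an MoE layer can run only one expert per token, not Llama 4's actual router. The widths, weights, and function names are all made up.

```python
import random

NUM_EXPERTS = 128  # matches the "128E" in the model name
HIDDEN = 4         # toy width; the real model is far wider


def make_expert(seed: int):
    """Toy expert: scales the input vector by a fixed per-expert factor."""
    scale = random.Random(seed).uniform(0.5, 1.5)
    return lambda x: [scale * v for v in x]


experts = [make_expert(i) for i in range(NUM_EXPERTS)]
rng = random.Random(0)
# Router: one weight vector per expert; score = dot(weights, token).
router = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(NUM_EXPERTS)]


def route(token):
    """Return (expert index, output) for the single best-scoring expert."""
    scores = [sum(w * v for w, v in zip(weights, token)) for weights in router]
    best = max(range(NUM_EXPERTS), key=scores.__getitem__)
    return best, experts[best](token)


token = [0.1, -0.2, 0.3, 0.4]
chosen, out = route(token)
print(f"expert {chosen} of {NUM_EXPERTS} ran; the other {NUM_EXPERTS - 1} were skipped")
```

This is why the active parameter count per token can be much smaller than the total: all the experts have to be stored, but only the selected one does any work for a given token.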

Features:

✅ Accepts text + image input (multimodal)
✅ Usable for a range of tasks such as RAG, chatbots, and image-based QA
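For image input, the message list takes an extra content entry next to the text. Below is a minimal sketch of what such a message could look like. The `{"type": "image", "url": ...}` shape follows the chat-template convention as I understand it from the model card, so treat the exact keys as an assumption and check the card before relying on them. The URL is a placeholder.

```python
# Placeholder URL: replace with a real, reachable image.
image_url = "https://example.com/cat.png"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# This list would then go through the same processor.apply_chat_template(...)
# and model.generate(...) calls as the text-only example.
content_types = [part["type"] for part in messages[0]["content"]]
print(content_types)  # ['image', 'text']
```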

Summary 📝

Item	Content

What?	The 55 weight (.safetensors) files of the Llama 4 Maverick model
Why is it needed?	Required to run the model locally yourself
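How big?	Roughly 800 GB in total in bfloat16 (about 400B total parameters at 2 bytes each), far more than the 17B in the name suggests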
Once it's done?	You can load the model in Python and generate responses to text and image inputs
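Here is a quick back-of-the-envelope check on the download size. bfloat16 stores each parameter in 2 bytes, and with an MoE model every expert has to be downloaded, so the size follows the total parameter count rather than only the 17B active ones. The 400B total is my assumption based on Meta's published figure, and `weights_size_gb` is a made-up helper name.

```python
def weights_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate checkpoint size in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


ACTIVE_PARAMS = 17e9   # the "17B" in the model name
TOTAL_PARAMS = 400e9   # assumption: approximate total across all experts

print(weights_size_gb(ACTIVE_PARAMS))  # 34.0
print(weights_size_gb(TOTAL_PARAMS))   # 800.0
```

The estimate ignores file metadata and any non-bfloat16 tensors, so the real figure on disk will differ somewhat.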