The landscape of LLMs has been evolving rapidly for the last two years. Newer LLMs are being designed to be smaller and smaller while retaining most of their performance, which makes them far more accessible to users. We have companies such as Meta to thank, whose open-source models can be fine-tuned to perform exceptionally well in specific areas, such as customer support interactions. And that's exactly what we will be doing in this blog. Furthermore, we will also push the models onto the Hugging Face Hub to provide instant access.
The recently introduced Llama 3.2 family of open-source models comes in two variants: lightweight models and vision models. The vision models excel at image reasoning and bridging vision with language, while the lightweight models are good at multilingual text generation and tool calling, and are small enough to run on edge devices.
We need a Hugging Face access token to save the model to the Hub and a Weights & Biases account to track training performance over time.
1. Setup
These are the packages we will be using:
%%capture
%pip install -U transformers
%pip install -U datasets
%pip install -U accelerate
%pip install -U peft
%pip install -U trl
%pip install -U bitsandbytes
%pip install -U wandb
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
HfArgumentParser,
TrainingArguments,
pipeline,
logging,
DataCollatorForLanguageModeling,
)
from peft import (
LoraConfig,
PeftModel,
prepare_model_for_kbit_training,
get_peft_model,
)
import os, torch, wandb
from datasets import load_dataset
from trl import SFTTrainer, setup_chat_format
Logging into our Hugging Face account and W&B account, respectively:
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HF_AUTH_TOKEN")
login(token=hf_token)
wb_token = user_secrets.get_secret("wandb_api_key")
wandb.login(key=wb_token)
run = wandb.init(
project='Fine-tune Llama 3.2 on Customer Support Dataset',
job_type="training",
anonymous="allow"
)
Let's define the base model path, the new model name, and the dataset:
base_model = "/kaggle/input/llama-3.2/transformers/3b-instruct/1"
new_model = "llama-3.2-3b-it-Ecommerce-ChatBot"
dataset_name = "bitext/Bitext-customer-support-llm-chatbot-training-dataset"
2. Loading Model and Tokenizer
This Python snippet checks whether the CUDA device has compute capability 8.0 or higher. If it does, it installs flash-attn, sets torch_dtype to bfloat16, and uses flash_attention_2. Otherwise, it defaults to float16 and PyTorch's standard "eager" attention implementation.
# Set torch dtype and attention implementation
if torch.cuda.get_device_capability()[0] >= 8:
!pip install -qqq flash-attn
torch_dtype = torch.bfloat16
attn_implementation = "flash_attention_2"
else:
torch_dtype = torch.float16
attn_implementation = "eager"
This snippet then configures and loads the quantized large language model (LLM) using QLoRA (Quantized Low-Rank Adaptation). The BitsAndBytesConfig specifies a 4-bit quantization scheme using nf4 (Normalized Float 4), a specialized data type that preserves precision well in low-bit quantization. The bnb_4bit_compute_dtype is set to torch_dtype, which was determined earlier from the device's compute capability. Additionally, bnb_4bit_use_double_quant=True enables double quantization, which applies a second quantization step to the quantization constants to further reduce memory usage. Configurations like this are typically used when working with very large language models such as LLaMA or GPT-3.
# QLoRA config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch_dtype,
bnb_4bit_use_double_quant=True,
)
# Load Model
model = AutoModelForCausalLM.from_pretrained(
base_model,
quantization_config=bnb_config,
device_map='auto',
attn_implementation=attn_implementation
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
3. Processing the Dataset
We will load the Bitext-customer-support-llm-chatbot dataset from the Hugging Face hub. It is a hybrid synthetic dataset that we will use to create our own customized customer support chatbot.
We will load, shuffle, and select only 1,000 samples. We are fine-tuning the model on a small subset to reduce the training time, but you can always use the full dataset.
Next, we will create the "text" column from the system instruction, user query, and assistant response. Then, we will convert this list of messages into the model's chat format using the tokenizer's chat template.
dataset = load_dataset(dataset_name, split='train')
dataset = dataset.shuffle(seed=65).select(range(1000))
instruction = """You are a top-rated customer service agent named John.
Be polite to customers and answer all their questions.
"""
def format_chat_template(row):
row_json = [{"role": "system", "content": instruction },
{"role": "user", "content": row["instruction"]},
{"role": "assistant", "content": row["response"]}]
row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
return row
dataset = dataset.map(
format_chat_template,
num_proc= 4,
)
# Split dataset into train and test sets
split_dataset = dataset.train_test_split(test_size=0.1) # 10% for evaluation
# Assign splits to variables
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]
This is an overview of the training dataset, along with an example row after we have reformatted the structure:
Dataset({
features: ['flags', 'instruction', 'category', 'intent', 'response', 'text'],
num_rows: 900
})
---
'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a top-rated customer service agent named John. \n Be polite to customers and answer all their questions.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nerror correcting delivery address<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI apologize for any inconvenience caused by the error in your delivery address. To correct it, please follow these steps: \n1. Log in to your account on our website.\n2. Go to the "My Account" or "Profile" section.\n3. Look for the "Shipping Address" or "Delivery Address" option.\n4. Click on it to edit your address.\n5. Update the incorrect information with the correct details.\n6. Save the changes.\n7. Double-check that the address is accurate and complete.\nIf you encounter any difficulties during this process, please don\'t hesitate to reach out to our customer support team for further assistance.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
4. Setting Up the Model
We will identify all bnb.nn.Linear4bit layers in the model. These are the layers that the LoRA adapters will target; we fine-tune only those adapters, which makes training much more memory-efficient.
import bitsandbytes as bnb
# Find all linear layers for LoRA
def find_all_linear_names(model):
cls = bnb.nn.Linear4bit
lora_module_names = set()
for name, module in model.named_modules():
if isinstance(module, cls):
names = name.split('.')
lora_module_names.add(names[0] if len(names) == 1 else names[-1])
if 'lm_head' in lora_module_names: # Exclude lm_head for stability
lora_module_names.remove('lm_head')
return list(lora_module_names)
modules = find_all_linear_names(model)
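For a Llama-style architecture, this typically returns the attention and MLP projection layers. Printing the list is a quick sanity check (the expected names in the comment below are an assumption based on the Llama architecture, not a guaranteed output):
# Inspect which layers the LoRA adapters will target
print(modules)
# e.g. ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'] for Llama-style models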
Now we configure and apply LoRA to the causal language model using peft (Parameter-Efficient Fine-Tuning).
# LoRA config
peft_config = LoraConfig(
r=16, # ranks of low-rank matrices
lora_alpha=32, # a scaling factor
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM", # autoregressive manner
target_modules=modules # applied to the specified layers
)
# tokenizer.chat_template = None
# model, tokenizer = setup_chat_format(model, tokenizer)
model = get_peft_model(model, peft_config)
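To see how little of the model we are actually training, we can report the trainable parameter count using the standard peft utility (the exact numbers will depend on the rank and target modules):
# Report trainable vs. total parameters after wrapping the model with LoRA
model.print_trainable_parameters()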
Up next, we must tokenize our dataset, since the model operates on token IDs rather than raw text. But before that, we must set the ID of the padding token (if there is none):
if tokenizer.pad_token_id is None:
tokenizer.pad_token_id = tokenizer.eos_token_id
if model.config.pad_token_id is None:
model.config.pad_token_id = model.config.eos_token_id
# Tokenize the dataset
def preprocess_function(examples):
# Tokenize the text
tokenized = tokenizer(
examples["text"], # Column containing formatted text
truncation=True,
padding="max_length",
max_length=512, # Adjust as needed
)
# Add labels (same as input_ids, but padding tokens are set to -100)
tokenized["labels"] = [
[label if label != tokenizer.pad_token_id else -100 for label in labels]
for labels in tokenized["input_ids"]
]
return tokenized
# Apply preprocessing to the dataset
train_dataset = train_dataset.map(preprocess_function, batched=True)
eval_dataset = eval_dataset.map(preprocess_function, batched=True)
And this is the overview of the dataset after tokenization:
Dataset({
features: ['flags', 'instruction', 'category', 'intent', 'response', 'text', 'input_ids', 'attention_mask', 'labels'],
num_rows: 900
})
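As an optional sanity check, we can decode one tokenized example and confirm that the padding positions in the labels have been masked with -100 (this reuses the variables from the cells above):
# Decode one example and count how many label positions were masked out
sample = train_dataset[0]
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=True)[:300])
print("Masked label positions:", sum(1 for label in sample["labels"] if label == -100))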
We also need to initialize a data collator, which is responsible for dynamically batching and padding the tokenized text during training.
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=False,
)
5. Training and Inference
Now, we configure the training arguments as follows:
training_arguments = TrainingArguments(
output_dir=new_model,
per_device_train_batch_size=1, # the less the more memory-efficient
per_device_eval_batch_size=1,
gradient_accumulation_steps=2, # the more the more memory-efficient
optim="paged_adamw_32bit",
num_train_epochs=1, # depends on your needs
eval_strategy="steps",
eval_steps=0.2,
logging_steps=1,
warmup_steps=10,
logging_strategy="steps",
learning_rate=2e-4,
fp16=(torch_dtype == torch.float16),
bf16=(torch_dtype == torch.bfloat16),
group_by_length=True,
report_to="wandb" # since we have logged in earlier
)
And finally, we initialize our SFTTrainer:
# Initialize trainer
trainer = SFTTrainer(
model=model,
train_dataset=train_dataset, # Tokenized training dataset
eval_dataset=eval_dataset, # Tokenized evaluation dataset
peft_config=peft_config,
data_collator=data_collator, # Use the data collator defined above
args=training_arguments,
)
trainer.train()
After training for 1 epoch (450 steps), this is the result:
TrainOutput(global_step=450, training_loss=0.7691040209929149, metrics={'train_runtime': 732.1833, 'train_samples_per_second': 1.229, 'train_steps_per_second': 0.615, 'total_flos': 7860495738470400.0, 'train_loss': 0.7691040209929149, 'epoch': 1.0}, eval_loss=0.63883)
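Before running inference, we can also save the trained LoRA adapter locally and close the W&B run so the logged metrics are flushed. This is a minimal sketch; it simply reuses new_model as the output directory, and pushing the adapter to the Hub is optional:
# Save the LoRA adapter and tokenizer locally
trainer.save_model(new_model)
tokenizer.save_pretrained(new_model)
# Optionally push the adapter to the Hugging Face Hub
# trainer.push_to_hub()
# Finish the W&B run so all metrics are uploaded
wandb.finish()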
Now let's run inference on a custom prompt:
messages = [{"role": "system", "content": instruction},
{"role": "user", "content": "I bought the same item twice, cancel order {{Order Number}}"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=150, num_return_sequences=1)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text.split("assistant")[1])
---
I've noticed that you've purchased the same item twice, and you would like to cancel order {{Order Number}}. I apologize for any inconvenience this may have caused. To assist you with canceling your order, could you please provide me with some additional information? Specifically, I would appreciate it if you could confirm the order number and any other relevant details. Once I have this information, I will promptly guide you through the cancellation process. Thank you for bringing this to my attention, and I assure you that we will do everything we can to resolve this matter for you. How can I assist you further?
6. Merging the Adapter Into the Base Model
Suppose we have the paths of the base model and the fine-tuned adapter:
base_model_url = "/kaggle/input/llama-3.2/transformers/3b-instruct/1"
new_model_url = "/kaggle/input/llama-3.2-3b-it-ecommerce-chatbot/transformers/default/1/Savoxismllama-3.2-3b-it-Ecommerce-ChatBot"
Let's merge them together:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import PeftModel
import torch
from trl import setup_chat_format
# Reload tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(base_model_url)
base_model_reload= AutoModelForCausalLM.from_pretrained(
base_model_url,
low_cpu_mem_usage=True,
return_dict=True,
torch_dtype=torch.float16,
device_map="auto",
)
# Merge adapter with base model
# base_model_reload, tokenizer = setup_chat_format(base_model_reload, tokenizer)
model = PeftModel.from_pretrained(base_model_reload, new_model_url)
model = model.merge_and_unload()
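As mentioned in the introduction, we can now push the merged model to the Hugging Face Hub so it is instantly accessible. A minimal sketch; the repo id below is illustrative, so replace it with your own:
# Push the merged model and tokenizer to the Hugging Face Hub
model.push_to_hub("llama-3.2-3b-it-Ecommerce-ChatBot-Final")
tokenizer.push_to_hub("llama-3.2-3b-it-Ecommerce-ChatBot-Final")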
You can find my fine-tuned model on the Hugging Face Hub. Now, let's load the final model and test it on our own examples:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "Savoxism/llama-3.2-3b-it-Ecommerce-ChatBot-Final"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Savoxism/llama-3.2-3b-it-Ecommerce-ChatBot-Final")
model.to(device)
and...
instruction = """You are a top-rated customer service agent named John.
Be polite to customers and answer all their questions.
"""
messages = [{"role": "system", "content": instruction},
{"role": "user", "content": "I have to see what payment payment modalities are accepted"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to(device)
outputs = model.generate(**inputs, max_new_tokens=150, num_return_sequences=1)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text.split("assistant")[1])
---
I'll do my best to assist you in exploring the available payment methods we accept. Let me provide you with a comprehensive list:
1. Credit/Debit Card: We accept major credit and debit cards such as Visa, Mastercard, and American Express.
2. PayPal: A widely recognized and secure online payment method.
3. Bank Transfer: You can make payments directly from your bank account using your bank details.
4. Apple Pay: For Apple device users, Apple Pay offers a convenient and secure payment experience.
5. Google Wallet: Google Wallet is a fast and secure payment option for Google users.
Please note that payment methods may vary depending on your location and the specific services or products you are purchasing. If you have any specific questions or
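Alternatively, you can wrap the loaded model and tokenizer in a transformers text-generation pipeline for quicker experimentation. This is a minimal sketch that reuses the messages and device defined above:
from transformers import pipeline
# Build a text-generation pipeline around the loaded model and tokenizer
chatbot = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
result = chatbot(prompt, max_new_tokens=150, return_full_text=False)
print(result[0]["generated_text"])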
7. Conclusion
This project allowed me to take advantage of parameter-efficient methods such as QLoRA when working with large models, fine-tuning Llama 3.2 3B into a customer support chatbot on modest hardware and sharing the merged result on the Hugging Face Hub.