Unlocking the True Potential of LLMs: Beyond Text Generation with Custom Heads

Large language models (LLMs) have revolutionized natural language processing, yet their most celebrated application, text generation, barely scratches the surface of their capabilities. Treating LLMs primarily as tools for crafting narratives or responses overlooks their architectural flexibility. As one provocative framing puts it, "If your LLM model is used to generate text, you are not using it correctly." Much of the real power lies in attaching custom heads to the LLM backbone, turning it into a specialized engine for tasks such as classification, embedding, and reward modeling. This approach keeps the LLM's deep contextual understanding while avoiding the computational overhead of autoregressive decoding: a single forward pass replaces a token-by-token loop.

By replacing or augmenting the standard language modeling (LM) head—a linear layer projecting hidden states to vocabulary logits—with task-specific architectures, practitioners can deploy LLMs efficiently for inference-heavy applications. These custom heads, often adding mere megabytes to VRAM, enable real-world deployments in 2025, from content moderation to retrieval-augmented systems. Drawing from established frameworks like Hugging Face Transformers and insights from recent studies, this article explores key custom head types, their implementations, and practical examples. We structure the discussion by head type, highlighting pseudo-code, real-world use cases, and references.

Reward Modeling Heads: Guiding Alignment Without Generation

Reward modeling heads repurpose LLMs to score outputs based on human or AI preferences, forming the backbone of reinforcement learning from human feedback (RLHF) and its scalable variant, reinforcement learning from AI feedback (RLAIF). Far from generating text, these heads output scalars or low-dimensional vectors to quantify qualities like helpfulness, harmlessness, or factual accuracy.

A typical reward head is a simple linear projection from the LLM's hidden states (e.g., 4096 dimensions) to a single scalar. For instance, Starling-RM-7B-alpha, built on Llama-2-7B-Chat, replaces the LM head with this projection and trains on the Nectar preference dataset with a ranking loss in the Bradley-Terry family. During inference, it processes a prompt-response pair and yields a score where higher values indicate preferred outputs, which is what RLHF pipelines need, with no token-by-token generation.

Pseudo-code for a basic reward head:

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_llm):
        super().__init__()
        self.base_llm = base_llm
        hidden_size = base_llm.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)

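    def forward(self, inputs):
        outputs = self.base_llm(**inputs)
        hidden = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
        # Decoder-only LLMs have no CLS token, and under causal attention the
        # first position sees only itself. Pool the last non-padding token
        # instead (assumes right-padded inputs with an attention_mask).
        last_idx = inputs["attention_mask"].sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        reward = self.reward_head(pooled)
        return reward.squeeze(-1)  # One scalar per input sequence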
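
The Bradley-Terry objective mentioned above can be sketched in a few lines. This is a minimal illustration, not the training code of any model named here: the function name bradley_terry_loss and the toy scores are assumptions, and the inputs stand for the scalar rewards a model like RewardModel would produce for the preferred and the rejected response to the same prompt.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected), batch mean."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy scores (hypothetical values) for three preference pairs.
chosen = torch.tensor([2.0, 0.5, 1.0])
rejected = torch.tensor([0.0, 0.5, 3.0])
loss = bradley_terry_loss(chosen, rejected)
```

The loss shrinks as the chosen response is scored further above the rejected one, and equals log 2 when the two scores are identical.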

In practice, models like ArmoRM-L1B or Open-R1-1B use this for RLHF fine-tuning, and the head itself adds negligible VRAM: a 4096-to-1 projection is roughly 8 KB at FP16, far below 1 MB. A 2025 study on RM-R1 demonstrates reasoning-enhanced reward models (ReasRMs) that generate rubrics before scoring, outperforming GPT-4o on RewardBench by 13.8% in preference accuracy. For deployment, plug the reward model into PPO via libraries like TRL from Hugging Face: the policy still samples responses, but each one is scored in a single forward pass, with no generation on the reward side.

References: Starling-RM-7B-alpha on Hugging Face; RM-R1 on arXiv.

Classification Heads: Precision Tasks in Moderation and Detection

Classification heads extend LLMs to multi-class or binary decisions, ideal for sentiment analysis, toxicity detection, or spam filtering. These replace the LM head with a linear layer projecting pooled hidden states to 2–10 classes, consuming trivial resources (roughly 8–40K parameters for a 4096-dimensional backbone, negligible VRAM).
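
The parameter counts above follow directly from the layer shape. The helper below is a hypothetical back-of-the-envelope calculator (linear_head_size is not a library function), assuming a 4096-dimensional backbone, a bias term, and 2 bytes per parameter at FP16.

```python
def linear_head_size(hidden_size, num_outputs, bytes_per_param=2, bias=True):
    """Return (parameter count, size in bytes) of one linear head."""
    params = hidden_size * num_outputs + (num_outputs if bias else 0)
    return params, params * bytes_per_param

for num_classes in (2, 10):
    params, size = linear_head_size(4096, num_classes)
    print(f"{num_classes} classes: {params} parameters, {size / 1024:.1f} KiB at FP16")
```

Even the 10-class head stays under 100 KiB, which is why the head is negligible next to a multi-gigabyte backbone.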

Consider toxicity classification: a linear head on an encoder backbone such as BERT processes text and outputs logits for categories such as toxic, obscene, or threat, as in the Jigsaw Toxic Comment Classification Challenge, where a comment can carry several labels at once. This setup powers moderation in online discussions, where the model classifies intent without generating replies, which is crucial for scalable platforms.

Pseudo-code:

import torch
import torch.nn as nn

class LLMWithClassificationHead(nn.Module):
    def __init__(self, base_llm, num_classes):
        super().__init__()
        self.base_llm = base_llm
        hidden_size = base_llm.config.hidden_size
        self.classifier = nn.Linear(hidden_size, num_classes)

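    def forward(self, inputs):
        outputs = self.base_llm(**inputs)
        # Encoders such as BERT expose a CLS-based pooler_output (it can be None);
        # decoder-only backbones do not, so fall back to the last non-padding
        # token (assumes right-padded inputs with an attention_mask).
        pooled = getattr(outputs, "pooler_output", None)
        if pooled is None:
            hidden = outputs.last_hidden_state
            last_idx = inputs["attention_mask"].sum(dim=1) - 1
            pooled = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        logits = self.classifier(pooled)
        return logits  # (batch, num_classes)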

Real-world deployment includes Starling-7B for RLAIF-based harmlessness scoring, or fine-tuned models on UCI SMS Spam for spam detection. A 2025 paper on personalized harmful content detection uses in-context learning with such heads on Llama-3-8B, achieving 97.1% F1 on TextDetox while allowing user-defined categories via prompts—no retraining needed.

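This head type shines in low-latency scenarios such as real-time comment moderation, where a single forward pass yields the decision and the backbone's contextual representations can outperform traditional bag-of-words classifiers. For training, load datasets with Hugging Face's datasets library and use cross-entropy loss for mutually exclusive classes, or binary cross-entropy per label when categories can co-occur, as in the Jigsaw labels.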
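
The choice of loss depends on whether the classes are mutually exclusive. The sketch below contrasts the two cases on random stand-in logits; the batch size, the six-label setup, and the 0.5 threshold are assumptions chosen for illustration rather than settings from any cited system.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, num_labels = 4, 6  # six labels is an assumption for illustration

# Stand-in for the logits a classification head would return.
logits = torch.randn(batch_size, num_labels)

# Single-label case: exactly one class index per example.
class_targets = torch.tensor([0, 3, 5, 1])
ce_loss = nn.CrossEntropyLoss()(logits, class_targets)

# Multi-label case: each label is an independent yes/no decision.
multi_hot_targets = torch.tensor([
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
], dtype=torch.float)
bce_loss = nn.BCEWithLogitsLoss()(logits, multi_hot_targets)

# At inference, argmax picks one class; sigmoid thresholds each label separately.
single_pred = logits.argmax(dim=-1)
multi_pred = torch.sigmoid(logits) > 0.5
```

The head itself is identical in both cases; only the target format, the loss, and the decision rule change.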

References: Classification of Intent in Moderating Online Discussions on ScienceDirect; Beyond One-Size-Fits-All on arXiv.

Embedding Heads: Fueling Retrieval and Similarity

Embedding heads transform LLMs into dense vector generators for semantic search, reranking, or duplicate detection. Hidden states are pooled (mean or CLS token) and optionally passed through a projection or small MLP (e.g., 4096 → 4096 → 1024) to produce fixed-dimensional vectors (30–80 MB VRAM). Models like Snowflake Arctic Embed L v2.0, built on BAAI's bge-m3-retromae checkpoint, optimize for retrieval with 1024 dimensions and 8192-token contexts.

These heads enable non-generative uses like sentence similarity: Input text yields a vector; cosine similarity ranks candidates. Voyage-lite and BGE-small-v2 deploy this for RAG pipelines, where embeddings index documents without LLM decoding.

Pseudo-code:

import torch
import torch.nn as nn

class EmbeddingModel(nn.Module):
    def __init__(self, base_llm, embed_dim):
        super().__init__()
        self.base_llm = base_llm
        hidden_size = base_llm.config.hidden_size
        self.embed_head = nn.Linear(hidden_size, embed_dim) if hidden_size != embed_dim else None

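    def forward(self, inputs):
        outputs = self.base_llm(**inputs)
        hidden = outputs.last_hidden_state
        # Masked mean pooling: average real tokens only, so padding does not
        # dilute the embedding (assumes inputs carry an attention_mask).
        mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        if self.embed_head is not None:
            embedding = self.embed_head(pooled)
        else:
            embedding = pooled
        return embedding  # (batch, embed_dim)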
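
To make the similarity step concrete, here is a minimal sketch of ranking candidates by cosine similarity. The vectors are random stand-ins for what an embedding model would return, and rank_by_cosine is a hypothetical helper name rather than part of any library.

```python
import torch
import torch.nn.functional as F

def rank_by_cosine(query_vec, doc_vecs):
    """Return (indices, scores) of documents sorted by cosine similarity, best first."""
    q = F.normalize(query_vec, dim=-1)      # (dim,)
    d = F.normalize(doc_vecs, dim=-1)       # (num_docs, dim)
    scores = d @ q                          # cosine similarity per document
    order = torch.argsort(scores, descending=True)
    return order, scores[order]

torch.manual_seed(0)
docs = torch.randn(5, 16)
query = docs[2] + 0.01 * torch.randn(16)    # nearly identical to document 2
order, scores = rank_by_cosine(query, docs)
```

Normalizing once at indexing time lets a vector store use plain inner products, since the dot product of unit vectors is their cosine similarity.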

Contrastive (Siamese) variants use two projection heads of 4–20M parameters each, one for queries and one for documents, and excel in reranking when trained with InfoNCE loss on query-document pairs. A 2025 arXiv paper on Contrastive Retrieval Heads shows that aggregating fewer than 1% of attention heads as CoRe heads boosts BEIR benchmarks by isolating discriminative signals, reducing latency by 20% via layer pruning.
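
The InfoNCE objective can be sketched with in-batch negatives, where every other document in the batch serves as a negative for a given query. This is a minimal version under stated assumptions: the function name info_nce_loss and the temperature of 0.05 are illustrative choices, not values taken from the cited work.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: row i of doc_emb is the positive for query i."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) similarities
    targets = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

torch.manual_seed(0)
queries = torch.randn(8, 32)
aligned_docs = queries + 0.05 * torch.randn(8, 32)   # matched pairs
shuffled_docs = aligned_docs.roll(shifts=1, dims=0)  # deliberately mismatched pairs

aligned_loss = info_nce_loss(queries, aligned_docs)
shuffled_loss = info_nce_loss(queries, shuffled_docs)
```

The loss is low when each query is closest to its own document and high when the pairing is scrambled, which is the signal that pulls matching pairs together during training.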

For implementation, fine-tune with Sentence Transformers; integrate into FAISS for vector search. This powers efficient RAG without full LLM inference.

References: Snowflake Arctic Embed L v2.0 on Hugging Face; Contrastive Retrieval Heads on arXiv.

Sequence Tagging and Span Extraction: Structured Extraction

For named entity recognition (NER), PII redaction, or extractive QA, sequence tagging heads apply a per-token linear projection (4096 × n_tags weights, <50 MB). An optional CRF layer on top models valid tag transitions, while span extraction uses two heads that emit start and end logits (<10 MB).

In NER, tag tokens as B-PER/I-PER for persons; datasets like CoNLL-2003 train via token classification loss. Private AI's ner/text endpoint uses this for PII detection, returning entities like names without redaction.

Pseudo-code for tagging:

import torch
import torch.nn as nn

class SequenceTaggingModel(nn.Module):
    def __init__(self, base_llm, num_tags):
        super().__init__()
        self.base_llm = base_llm
        hidden_size = base_llm.config.hidden_size
        self.tagger = nn.Linear(hidden_size, num_tags)

    def forward(self, inputs):
        outputs = self.base_llm(**inputs)
        hidden_states = outputs.last_hidden_state
        logits = self.tagger(hidden_states)  # Per token
        return logits
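
Per-token logits still have to be turned into entities. Assuming the argmax over the logits has already been mapped to BIO tag strings, the sketch below groups tags into spans; bio_to_spans is a hypothetical helper, and treating a stray I- tag as the start of a span is a design choice, not a standard.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (label, start, end) spans, end exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the final span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]  # a stray I- tag is treated as a span start
    return spans

tokens = ["Ada", "Lovelace", "visited", "London", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
entities = [(label, " ".join(tokens[s:e])) for label, s, e in bio_to_spans(tags)]
```

With subword tokenizers, the token indices would additionally need to be mapped back to character offsets before redacting or highlighting text.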

For QA, SQuAD fine-tuning predicts spans: Start/end heads on BERT yield 88.67 F1. Fin-ExBERT (2025) adapts this for financial transcripts, achieving 0.84 F1 on CreditCall12H via GNN integration.
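
A start/end span head can be sketched as a single linear layer with two outputs per token, followed by a search for the best valid pair. The class SpanHead, the helper best_span, and the max_len limit are illustrative assumptions; the random tensor stands in for backbone hidden states.

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """One linear layer with two outputs per token: a start logit and an end logit."""
    def __init__(self, hidden_size):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states):
        start_logits, end_logits = self.qa_outputs(hidden_states).unbind(dim=-1)
        return start_logits, end_logits  # each (batch, seq_len)

def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) pair maximizing start + end score with start <= end."""
    scores = start_logits[:, None] + end_logits[None, :]      # (seq_len, seq_len)
    valid = torch.ones_like(scores).triu().tril(max_len - 1).bool()
    scores = scores.masked_fill(~valid, float("-inf"))
    flat = scores.argmax().item()
    return divmod(flat, scores.size(1))                       # (start, end) indices

torch.manual_seed(0)
head = SpanHead(hidden_size=64)
start, end = head(torch.randn(1, 12, 64))   # random stand-in for backbone output
span = best_span(start[0], end[0])
```

The upper-triangular mask rules out spans that end before they start, and the band limit caps answer length so that one high start score cannot pair with a distant end.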

These heads support slot filling in dialog systems (e.g., Snips dataset with BiLSTM), extracting parameters like movie names without generative overhead.

References: Token Classification in Hugging Face LLM Course; Fin-ExBERT on arXiv.

Advanced Heads: MoE, Tool-Calling, and Verification

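Mixture-of-Experts (MoE) heads (100–300M parameters, 400 MB–1 GB VRAM) enable multi-tasking: a gating network scores the experts and routes each input to a small subset of them, for example out of 8 experts (e.g., Gorilla-1B). OLMoE (7B total parameters, 1B active per token) is reported to train about 2× faster than dense models with a comparable active parameter count, which makes sparse routing attractive for tool heads.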
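
The routing idea can be sketched as a top-k gate over a handful of linear experts. This is a toy illustration under stated assumptions (8 experts, k = 2, linear experts, no load-balancing loss) and does not reproduce the architecture of any model named above; TopKMoEHead is a hypothetical class name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEHead(nn.Module):
    """Sparse MoE head: a gate picks k experts per input and mixes their outputs."""
    def __init__(self, hidden_size, out_dim, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.out_dim = out_dim
        self.gate = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_size, out_dim) for _ in range(num_experts)]
        )

    def forward(self, pooled):                          # pooled: (batch, hidden_size)
        top_vals, top_idx = self.gate(pooled).topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)           # renormalize over chosen experts
        out = pooled.new_zeros(pooled.size(0), self.out_dim)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                rows = top_idx[:, slot] == e            # inputs routed to expert e
                if rows.any():
                    w = weights[rows, slot].unsqueeze(-1)
                    out[rows] += w * expert(pooled[rows])
        return out, top_idx

torch.manual_seed(0)
moe_head = TopKMoEHead(hidden_size=16, out_dim=4)
moe_out, routed = moe_head(torch.randn(5, 16))
```

Only k experts run per input, so compute grows with k rather than with the total number of experts; production implementations batch the dispatch instead of looping.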

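Tool-calling heads (1–5 MB) output parallel logits for 50–200 tools, so tool selection for a ReAct-style step takes a single forward pass. Argument filling still needs structured output: models such as DeepSeek-R1 expose function calling through JSON schemas, for example to query a weather API, and a tool head can pick the function before any arguments are generated.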

Verification heads (8–20M parameters, e.g., Atlas-1B) support RAG fact-checking by scoring whether retrieved evidence entails a claim: a linear layer maps to 3 classes (entail/contradict/neutral).

Pseudo-code for tool-calling:

import torch
import torch.nn as nn

class ToolCallingModel(nn.Module):
    def __init__(self, base_llm, num_tools):
        super().__init__()
        self.base_llm = base_llm
        hidden_size = base_llm.config.hidden_size
        self.tool_head = nn.Linear(hidden_size, num_tools)

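    def forward(self, inputs):
        outputs = self.base_llm(**inputs)
        hidden = outputs.last_hidden_state
        # Pool the last non-padding token: in a causal decoder it is the only
        # position that has attended to the full prompt (assumes right padding).
        last_idx = inputs["attention_mask"].sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        tool_logits = self.tool_head(pooled)
        return tool_logits  # (batch, num_tools), one independent logit per tool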
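
Turning the parallel logits into tool calls is a thresholding step. The sketch below assumes each tool is an independent yes/no decision; the tool names, the logits, and the 0.5 threshold are hypothetical.

```python
import torch

def select_tools(tool_logits, tool_names, threshold=0.5):
    """Return, per example, the tools whose sigmoid probability exceeds the threshold."""
    probs = torch.sigmoid(tool_logits)
    return [
        [name for name, p in zip(tool_names, row.tolist()) if p > threshold]
        for row in probs
    ]

# Hypothetical tool registry and logits for two requests.
tool_names = ["get_weather", "search_web", "run_calculator"]
tool_logits = torch.tensor([[3.0, -2.0, 1.5], [-4.0, -1.0, -3.0]])
selected = select_tools(tool_logits, tool_names)
```

An empty list for a request means no tool cleared the threshold, which is a natural way to represent "answer without calling anything".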

Regression heads add a scalar confidence or uncertainty estimate at negligible VRAM cost, which supports confidence calibration; FineCE (2025) reports a 39.5% AUROC improvement on GSM8K.

Training template:

def train_llm_with_head(model, data_loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for batch in data_loader:
        inputs = {k: v.to(device) for k, v in batch.items() if k != 'labels'}
        labels = batch['labels'].to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(data_loader)
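
Because the head is tiny compared with the backbone, a common setup is to freeze the backbone and train only the head. The sketch below illustrates one such step; TinyBackbone is a hypothetical stand-in for a pretrained model, and the sizes, learning rate, and last-position pooling are assumptions.

```python
import torch
import torch.nn as nn
from types import SimpleNamespace

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained LLM; returns last_hidden_state."""
    def __init__(self, vocab_size=100, hidden_size=32):
        super().__init__()
        self.config = SimpleNamespace(hidden_size=hidden_size)
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.embed(input_ids))

torch.manual_seed(0)
backbone = TinyBackbone()
head = nn.Linear(backbone.config.hidden_size, 3)

# Freeze the backbone so only the head receives gradient updates.
for param in backbone.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

input_ids = torch.randint(0, 100, (4, 10))
labels = torch.tensor([0, 2, 1, 0])
frozen_before = backbone.embed.weight.detach().clone()
head_before = head.weight.detach().clone()

# One training step; unpadded inputs, so the last position is the last token.
logits = head(backbone(input_ids).last_hidden_state[:, -1])
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Passing only the head's parameters to the optimizer keeps optimizer state small as well, since no moment estimates are stored for the frozen backbone.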

Conclusion: Efficiency and Elegance in LLM Deployment

Custom heads unlock LLMs for precise, non-generative tasks, conserving resources while harnessing contextual depth. From reward scalars in RLHF to embeddings in RAG, these adaptations, evaluated on benchmarks like RewardBench and BEIR, prove indispensable for production systems. Libraries like Hugging Face's Transformers simplify attachment, and projects such as transformer-heads target multi-task finetuning. Explore these via the Hugging Face Model Hub or arXiv surveys on LLM architectures. By shifting focus from generation, we elevate LLMs to sophisticated tools for analysis and decision-making.