Decoding Strategies in Large Language Models – TechToday

The tokenizer, Byte-Pair Encoding in this case, translates each token in the input text into a corresponding token identifier. GPT-2 then uses these token IDs as input and tries to predict the next most likely token. Finally, the model generates logits, which are converted to probabilities using a softmax function.

For example, the model assigns a probability of 17% to “of” being the next token after “I have a dream”. The output as a whole is essentially a ranked list of candidate next tokens. More formally, we denote this probability as P(of | I have a dream) = 17%.
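
To make the logits-to-probabilities step concrete, here is a minimal sketch of the softmax function in plain Python. The four candidate tokens and their logit values are made up for illustration; they are not actual GPT-2 outputs.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtract the maximum logit for numerical stability; the result is unchanged.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens (illustrative values only)
candidates = ["of", "that", "to", "about"]
logits = [2.0, 1.0, 0.5, 0.1]

probs = softmax(logits)

# Rank candidates from most to least likely
ranked = sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)
for token, p in ranked:
    print(f"{token!r}: {p:.1%}")
```

The token with the largest logit always receives the largest probability, so ranking by logits and ranking by probabilities give the same order.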

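Autoregressive models like GPT predict the next token in a sequence based on the preceding tokens. Consider a sequence of tokens w = (w₁, w₂, …, wₜ). The joint probability of this sequence, P(w), can be broken down as:

P(w) = P(w₁, w₂, …, wₜ) = P(w₁) · P(w₂ | w₁) · P(w₃ | w₁, w₂) · … · P(wₜ | w₁, …, wₜ₋₁) = ∏ᵢ₌₁ᵗ P(wᵢ | w₁, w₂, …, wᵢ₋₁)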

For each token wᵢ in the sequence, P(wᵢ | w₁, w₂, …, wᵢ₋₁) represents the conditional probability of wᵢ given all the preceding tokens (w₁, w₂, …, wᵢ₋₁). GPT-2 calculates this conditional probability for each of the 50,257 tokens in its vocabulary.
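
The chain-rule decomposition described above can be checked on a toy example. The sketch below multiplies a handful of conditional probabilities to obtain the probability of a hypothetical continuation of “I have a dream”, then repeats the computation as a sum of log probabilities. Only the 17% for “of” comes from the example above; every other value is invented for illustration.

```python
import math

# Conditional probabilities P(w_i | w_1, ..., w_{i-1}) for each continuation token.
# Only the 0.17 for "of" comes from the article; the rest are made-up values.
steps = [
    ("of", 0.17),
    ("being", 0.10),
    ("a", 0.40),
    ("doctor", 0.05),
    (".", 0.30),
]

# Chain rule: the probability of the whole continuation is the product
# of the per-token conditional probabilities.
joint_prob = 1.0
for _, p in steps:
    joint_prob *= p

# Equivalent computation in log space: a sum replaces the product,
# which avoids numerical underflow on long sequences.
joint_log_prob = sum(math.log(p) for _, p in steps)

print(f"Joint probability:     {joint_prob:.6e}")
print(f"Joint log probability: {joint_log_prob:.4f}")
print(f"exp(log probability):  {math.exp(joint_log_prob):.6e}")
```

Working in log space is a common design choice because products of many small probabilities quickly shrink toward zero, while their logarithms remain easy to add and compare.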

This leads to the question: how do we use these probabilities to generate text? This is where decoding strategies such as greedy search and beam search come into play.

Greedy search is a decoding method that takes the most likely token at each step as the next in the sequence. Simply put, it only keeps the most likely token at each stage, discarding all other potential options. Using our example:

  • Step 1: Input: “I have a dream” → Most likely token: “of”
  • Step 2: Input: “I have a dream of” → Most likely token: “being”
  • Step 3: Input: “I have a dream of being” → Most likely token: “a”
  • Step 4: Input: “I have a dream of being a” → Most likely token: “doctor”
  • Step 5: Input: “I have a dream of being a doctor” → Most likely token: “.”

Although this approach may seem intuitive, greedy search is short-sighted: it only considers the most likely token at each step, without considering the overall effect on the sequence. This property makes it fast and efficient, since it does not need to keep track of multiple sequences, but it also means that it can miss better sequences that would have started with slightly less likely tokens.
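
The short-sightedness of greedy search can be shown with a small sketch. The `TOY_MODEL` table below is a hypothetical stand-in for a language model: it maps the tokens generated so far to a next-token distribution, and all of its probabilities are invented. The sketch compares the greedy choice with an exhaustive search over every two-token sequence.

```python
# Hypothetical next-token distributions keyed by the tokens generated so far.
# All probabilities are invented for illustration; this is not a real model.
TOY_MODEL = {
    (): {"of": 0.6, "that": 0.4},
    ("of",): {"being": 0.3, "a": 0.3, "the": 0.4},
    ("that",): {"one": 0.9, "we": 0.1},
}

def greedy_search(model, length):
    """Pick the single most likely token at every step."""
    sequence, prob = (), 1.0
    for _ in range(length):
        dist = model[sequence]
        token = max(dist, key=dist.get)
        prob *= dist[token]
        sequence += (token,)
    return sequence, prob

def exhaustive_search(model, length, sequence=(), prob=1.0):
    """Score every possible sequence and return the most likely one."""
    if length == 0:
        return sequence, prob
    candidates = [
        exhaustive_search(model, length - 1, sequence + (token,), prob * p)
        for token, p in model[sequence].items()
    ]
    return max(candidates, key=lambda pair: pair[1])

greedy_seq, greedy_prob = greedy_search(TOY_MODEL, length=2)
best_seq, best_prob = exhaustive_search(TOY_MODEL, length=2)

print(f"Greedy:     {' '.join(greedy_seq)} (p = {greedy_prob:.2f})")
print(f"Exhaustive: {' '.join(best_seq)} (p = {best_prob:.2f})")
```

In this toy setup, greedy search commits to “of” because it is the most likely first token and ends with a sequence probability of 0.24, while the sequence “that one” reaches 0.36 even though its first token was less likely. Exhaustive search is only feasible here because the table is tiny; with a vocabulary of 50,257 tokens it quickly becomes intractable.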

Next, we illustrate the greedy search implementation using graphviz and networkx. At each step, we select the token ID with the highest score, compute its log probability (we take the log to simplify calculations), and add it to the tree. We repeat this process for five tokens.

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import time
import torch

# NOTE: `model`, `tokenizer`, `text`, and `input_ids` are assumed to be
# defined earlier (the GPT-2 setup is not shown in this section).

def get_log_prob(logits, token_id):
    # Compute the softmax of the logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1)
    log_probabilities = torch.log(probabilities)

    # Get the log probability of the token
    token_log_probability = log_probabilities[token_id].item()
    return token_log_probability

def greedy_search(input_ids, node, length=5):
    # Stop once the requested number of tokens has been generated
    if length == 0:
        return input_ids

    outputs = model(input_ids)
    predictions = outputs.logits

    # Get the predicted next sub-word (greedy search: take the argmax)
    logits = predictions[0, -1, :]
    token_id = torch.argmax(logits).unsqueeze(0)

    # Compute the score of the predicted token
    token_score = get_log_prob(logits, token_id)

    # Add the predicted token to the list of input ids
    new_input_ids = torch.cat([input_ids, token_id.unsqueeze(0)], dim=-1)

    # Add node and edge to graph
    next_token = tokenizer.decode(token_id, skip_special_tokens=True)
    current_node = list(graph.successors(node))[0]
    graph.nodes[current_node]['tokenscore'] = np.exp(token_score) * 100
    graph.nodes[current_node]['token'] = next_token + f"_{length}"

    # Recursive call
    input_ids = greedy_search(new_input_ids, current_node, length-1)

    return input_ids

# Parameters
length = 5
beams = 1

# Create a balanced tree with height 'length'
graph = nx.balanced_tree(1, length, create_using=nx.DiGraph())

# Add 'tokenscore' and 'token' attributes to each node
for node in graph.nodes:
    graph.nodes[node]['tokenscore'] = 100
    graph.nodes[node]['token'] = text

# Start generating text
output_ids = greedy_search(input_ids, 0, length=length)
output = tokenizer.decode(output_ids.squeeze().tolist(), skip_special_tokens=True)
print(f"Generated text: {output}")

Generated text: I have a dream of being a doctor.

Our greedy search generates the same text as the one from the transformers library: “I have a dream of being a doctor.” Let’s visualize the tree we created.

import matplotlib.pyplot as plt
import networkx as nx
import matplotlib.colors as mcolors
from matplotlib.colors import LinearSegmentedColormap

def plot_graph(graph, length, beams, score):
    fig, ax = plt.subplots(figsize=(3+1.2*beams**length, max(5, 2+length)), dpi=300, facecolor="white")

    # Create positions for each node (requires pygraphviz)
    pos = nx.nx_agraph.graphviz_layout(graph, prog="dot")

    # Normalize the colors along the range of token scores
    if score == 'token':
        scores = [data['tokenscore'] for _, data in graph.nodes(data=True) if data['token'] is not None]
    elif score == 'sequence':
        scores = [data['sequencescore'] for _, data in graph.nodes(data=True) if data['token'] is not None]
    vmin = min(scores)
    vmax = max(scores)
    norm = mcolors.Normalize(vmin=vmin, vmax=vmax)
    cmap = LinearSegmentedColormap.from_list('rg', ["r", "y", "g"], N=256)

    # Draw the nodes
    nx.draw_networkx_nodes(graph, pos, node_size=2000, node_shape="o", alpha=1, linewidths=4,
                           node_color=scores, cmap=cmap)

    # Draw the edges
    nx.draw_networkx_edges(graph, pos)

    # Draw the labels: token text on the first line, score on the second
    if score == 'token':
        labels = {node: data['token'].split('_')[0] + f"\n{data['tokenscore']:.2f}%" for node, data in graph.nodes(data=True) if data['token'] is not None}
    elif score == 'sequence':
        labels = {node: data['token'].split('_')[0] + f"\n{data['sequencescore']:.2f}" for node, data in graph.nodes(data=True) if data['token'] is not None}
    nx.draw_networkx_labels(graph, pos, labels=labels, font_size=10)

    # Add a colorbar
    sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)
    sm.set_array([])
    if score == 'token':
        fig.colorbar(sm, ax=ax, orientation='vertical', pad=0, label="Token probability (%)")
    elif score == 'sequence':
        fig.colorbar(sm, ax=ax, orientation='vertical', pad=0, label="Sequence score")
    plt.show()

# Plot graph
plot_graph(graph, length, 1.5, 'token')


