autograd.text package

Submodules

autograd.text.tokenizer module

class autograd.text.tokenizer.BytePairEncoder(num_merges: int = 500, vocab_file_path: str = 'vocab.pkl', encoded_data_path: str = 'bpe_encoded_data.npz', n_workers: int | None = None, min_word_freq: int = 10)

Bases: object

Byte Pair Encoder (BPE) for tokenizing text.

This class implements BPE, which merges the most frequent pairs of tokens iteratively to learn subword units. It allows encoding raw text into a list of integer token IDs and decoding token IDs back into text.

Examples

>>> raw_text = "Hello world! This is a test."
>>> bpe = BytePairEncoder(num_merges=50)
>>> encoded_array = bpe.prepare_data([raw_text]) # Outputs an array of token IDs
SPECIAL_TOKENS = ['<|endoftext|>', '<PAD>', '<SOS>', '<UNK>', '<|USER|>', '<|ASSISTANT|>', '<|SYSTEM|>', '<|END_OF_TURN|>', '<|TOOL|>', '<|TOOL_CALL|>', '<|TOOL_RESULT|>']
ENCODE_CHUNK_CACHE_MAX_SIZE = 50000
WORD_FREQ_BATCH_SIZE = 10000
WORD_FREQ_LOG_INTERVAL = 500000
property n_vocab: int

Number of tokens (including special tokens) in the vocabulary.

Examples

>>> bpe = BytePairEncoder(num_merges=50)
>>> print(bpe.n_vocab)  # Outputs the size of the vocabulary
Type:

int

prepare_data(texts: Iterable[str], overwrite_vocabulary_file: bool = False, overwrite_encoded_data: bool = False) ndarray

Trains and applies BPE on the given texts, returning encoded token IDs.

Convenience wrapper that trains the vocabulary then writes token IDs to the configured memory-mapped file. texts is consumed twice when both passes run — once to build word frequencies during training, once to encode — so it must be re-iterable (e.g. a list or an object whose __iter__ returns a fresh iterator each time). A bare generator will be exhausted before encoding starts.
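
The re-iterability requirement can be illustrated with the standard library alone. This is a minimal sketch: `ReiterableDocs` and `read_docs` are hypothetical names introduced here for illustration and are not part of autograd.text.

```python
from typing import Callable, Iterable, Iterator


class ReiterableDocs:
    """Wraps a generator factory so each iteration starts from the beginning."""

    def __init__(self, factory: Callable[[], Iterable[str]]) -> None:
        self._factory = factory

    def __iter__(self) -> Iterator[str]:
        # Calling the factory creates a fresh generator on every pass.
        return iter(self._factory())


def read_docs() -> Iterator[str]:
    # Hypothetical document source; in practice this might read files lazily.
    yield "first document"
    yield "second document"


bare = read_docs()
first_pass_bare = list(bare)
second_pass_bare = list(bare)  # the bare generator is already exhausted here

docs = ReiterableDocs(read_docs)
first_pass = list(docs)
second_pass = list(docs)  # a fresh iterator, so both passes see every document
```

An object like `docs` could presumably be passed as texts, since its __iter__ returns a fresh iterator each time, whereas `bare` would yield nothing on the second pass.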

Parameters:
  • texts (Iterable[str]) – Documents to train on and encode.

  • overwrite_vocabulary_file (bool) – If True, re-trains and overwrites the BPE vocabulary.

  • overwrite_encoded_data (bool) – If True, overwrites an existing encoded file.

Returns:

np.ndarray – A memory-mapped array of token IDs.

Example

>>> bpe = BytePairEncoder(num_merges=50)
>>> encoded_array = bpe.prepare_data(["Hello world! This is a test."])
>>> print(encoded_array)
[ ...some token IDs... ]
train_vocabulary(texts: Iterable[str], overwrite_saved_file: bool = False) Tuple[Dict[bytes, int], Dict[int, bytes]]

Trains the BPE vocabulary on the given text documents.

Streams text one document at a time, so the full corpus never needs to be held in memory. The learned merges and vocabulary are saved to vocab_file_path.

Parameters:
  • texts (Iterable[str]) – Documents to train on (e.g. ["doc1", "doc2"] or a lazy generator). Pass a single document as [raw_text]. A bare str is rejected to prevent silent per-character iteration.

  • overwrite_saved_file (bool) – If True, re-trains and overwrites any existing vocabulary file.

Returns:

Tuple[Dict[bytes, int], Dict[int, bytes]] – A tuple containing:
  • A dictionary mapping byte sequences or special tokens to integer IDs.

  • A dictionary mapping integer IDs back to byte sequences.

Examples

>>> bpe = BytePairEncoder(num_merges=50)
>>> vocab, rev_vocab = bpe.train_vocabulary(["Hello world! Hello again!"])
>>> print(vocab)  # Prints the vocabulary mapping
encode(input_text: str) List[int]

Encodes a raw input string into a list of BPE token IDs.

The process is:
  1. Pre-tokenize the input into chunks (words, punctuation, special tokens).

  2. Convert each chunk to base (byte-level) token IDs.

  3. Greedily apply the highest-priority merge until no more merges apply.
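
Steps 2 and 3 can be sketched on a single chunk as follows. This is a toy illustration under stated assumptions, not the class's actual implementation: the `merges` table layout, the `apply_merges` function, and the convention that a lower rank means a higher priority are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical merge table: (left_id, right_id) -> (rank, merged_id).
# Assumption: a lower rank means the merge has higher priority.
Merges = Dict[Tuple[int, int], Tuple[int, int]]


def apply_merges(token_ids: List[int], merges: Merges) -> List[int]:
    ids = list(token_ids)
    while len(ids) >= 2:
        # Collect every adjacent pair that has a known merge, with its rank.
        candidates = [
            (merges[pair][0], pair)
            for pair in zip(ids, ids[1:])
            if pair in merges
        ]
        if not candidates:
            break
        _, best = min(candidates)
        merged_id = merges[best][1]
        # Replace each non-overlapping occurrence of the best pair, left to right.
        out: List[int] = []
        i = 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(merged_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


# Step 2: a chunk becomes byte-level IDs. "hello" -> [104, 101, 108, 108, 111]
chunk = list("hello".encode("utf-8"))
merges: Merges = {
    (108, 108): (0, 256),  # "l" + "l"   -> 256
    (104, 101): (1, 257),  # "h" + "e"   -> 257
    (257, 256): (2, 258),  # "he" + "ll" -> 258
}
# Step 3: merge until nothing applies.
merged = apply_merges(chunk, merges)  # [258, 111]
```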

Parameters:

input_text (str) – The text to encode.

Returns:

List[int] – The list of integer token IDs representing the encoded text.

Example

>>> bpe = BytePairEncoder(num_merges=50)
>>> bpe.train_vocabulary(["Hello world!"])
>>> token_ids = bpe.encode("Hello world!")
>>> print(token_ids)  # Outputs a list of token IDs
decode(encoded_tokens: List[int]) str

Decodes a sequence of BPE token IDs back into a string.

Parameters:

encoded_tokens (List[int]) – The list of integer token IDs to decode. Numpy arrays must be converted with .tolist() first; this keeps the dict lookup on the hot path free of scalar coercion.

Returns:

str – The decoded string.

Example

>>> bpe = BytePairEncoder(num_merges=50)
>>> bpe.train_vocabulary(["Hello world!"])
>>> token_ids = bpe.encode("Hello world!")
>>> text = bpe.decode(token_ids) # Expected: "Hello world!" (or similar)
static load_encoded(path: str) ndarray

Memory-maps a raw int32 token stream written by _encode_to_mmap.

The file has no header, so the token count is the file size divided by int32.itemsize (4 bytes per token).
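
The token-count arithmetic can be sketched with NumPy directly. This is an illustration of the headerless layout described above, assuming native byte order; it does not call load_encoded, and the file name is made up for the example.

```python
import os
import tempfile

import numpy as np

# Write a small headerless int32 token stream to a temporary file.
tokens = np.array([5, 17, 42, 7], dtype=np.int32)
path = os.path.join(tempfile.mkdtemp(), "tokens.bin")  # hypothetical file name
tokens.tofile(path)

# With no header, the token count follows from the file size alone.
n_tokens = os.path.getsize(path) // np.dtype(np.int32).itemsize

# Map the file read-only; pages are read on access rather than loaded eagerly.
mapped = np.memmap(path, dtype=np.int32, mode="r", shape=(n_tokens,))
```

Here the file is 16 bytes long, so `n_tokens` is 4.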

autograd.text.utils module

class autograd.text.utils.GenerationResult(completion_tokens: list[int], logprobs: list[float], stop_reason: str)

Bases: object

Token-level output from autoregressive generation.

completion_tokens: list[int]
logprobs: list[float]
stop_reason: str
class autograd.text.utils.OpenWebTextSource(parquet_files: 'list[dict[str, Any]]', parquet_dir: 'str', start_token: 'str', split_token: 'str', parquet_shards_per_batch: 'int')

Bases: object

parquet_files: list[dict[str, Any]]
parquet_dir: str
start_token: str
split_token: str
parquet_shards_per_batch: int
autograd.text.utils.format_document_for_causal_lm(doc: str, *, start_token: str, split_token: str) str
autograd.text.utils.generate(model: nn.Module, prediction_func: nn.AbstractLLMForwardFn, prompt_tokens: List[int], max_new_tokens: int, temperature: float, top_k: int | None, eos_token_id: int, *, show_progress: bool = True, num_generations: int) list[GenerationResult]

Generate token ids autoregressively and record sampled-token logprobs.

This is the structured generation primitive: callers pass already-tokenized prompt ids, and the function owns the forward/sample loop. It returns only completion tokens, so callers can keep prompt and completion boundaries exact without decoding and re-encoding text.

Parameters:
  • model – Language model used for the forward pass.

  • prediction_func – Forward function called with mode="sample".

  • prompt_tokens – Token ids that seed generation.

  • max_new_tokens – Maximum number of completion tokens to generate.

  • temperature – Sampling temperature. Values <= 0 use argmax.

  • top_k – Optional top-k filter applied before sampling.

  • eos_token_id – Token id that stops generation when sampled.

  • show_progress – Whether to show token-level inference progress.

  • num_generations – Number of independent completions to generate in parallel for the same prompt.

Returns:

One result per generated completion.
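
How temperature and top_k interact in a single decoding step can be sketched with NumPy. This is a hedged illustration, not the function's actual sampling code: `sample_token` is a hypothetical helper, and which distribution the recorded logprob is taken from (scaled and filtered here, unscaled in the argmax branch) is an assumption.

```python
from typing import Optional, Tuple

import numpy as np


def sample_token(
    logits: np.ndarray,
    temperature: float,
    top_k: Optional[int],
    rng: np.random.Generator,
) -> Tuple[int, float]:
    """Returns (token_id, logprob of that token) for one decoding step."""
    logits = logits.astype(np.float64)
    if temperature <= 0:
        # Greedy decoding: pick the argmax; logprob from the unscaled softmax.
        logprobs = logits - np.logaddexp.reduce(logits)
        token = int(np.argmax(logits))
        return token, float(logprobs[token])

    scaled = logits / temperature
    if top_k is not None and 0 < top_k < scaled.size:
        # Keep the k largest logits (ties at the cutoff are also kept);
        # everything else is assigned zero probability.
        kth_largest = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= kth_largest, scaled, -np.inf)

    # Log-softmax over the surviving logits, then sample from it.
    logprobs = scaled - np.logaddexp.reduce(scaled)
    token = int(rng.choice(scaled.size, p=np.exp(logprobs)))
    return token, float(logprobs[token])


rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0, 0.1])
greedy_token, greedy_logprob = sample_token(logits, temperature=0.0, top_k=None, rng=rng)
top1_token, top1_logprob = sample_token(logits, temperature=1.0, top_k=1, rng=rng)
```

With top_k=1 only the largest logit survives the filter, so sampling becomes deterministic and matches the argmax branch.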

autograd.text.utils.generate_text(model: nn.Module, prediction_func: nn.AbstractLLMForwardFn, bpe: BytePairEncoder, start_tokens: str | None, max_length: int = 50, temperature: float = 1.0, top_k: int | None = None) str

Generate and print text from a string prompt.

This is a convenience wrapper around generate: it handles tokenizer encode/decode, switches the model to eval mode for generation, restores the prior training mode, and streams decoded completion tokens to stdout.

Parameters:
  • model – Language model used for generation.

  • prediction_func – Forward function passed through to generate.

  • bpe – Tokenizer used to encode the prompt and decode generated tokens.

  • start_tokens – Prompt text. Defaults to "<SOS>" when omitted.

  • max_length – Maximum total token length, including prompt tokens.

  • temperature – Sampling temperature passed through to generate.

  • top_k – Optional top-k filter passed through to generate.

Returns:

Decoded prompt plus generated completion text.

autograd.text.utils.teacher_force(model: nn.Module, prediction_func: nn.AbstractLLMForwardFn, bpe: BytePairEncoder, groundtruth_data: Array, max_length: int = 50) str

Run teacher forcing over ground-truth token ids and print predictions.

At each step the model receives the ground-truth prefix and the decoded argmax prediction is appended to the returned text. This is intentionally separate from generate, because the model input is fixed by the dataset rather than by previously sampled tokens.

Parameters:
  • model – Language model used for the forward pass.

  • prediction_func – Forward function called with mode="sample".

  • bpe – Tokenizer used to decode predicted token ids.

  • groundtruth_data – Ground-truth token ids used as model inputs.

  • max_length – Maximum number of ground-truth tokens to evaluate.

Returns:

Decoded text from the model’s argmax predictions.

autograd.text.utils.create_vocabulary(texts: List[str], max_features: int | None, custom_tokenizer: Callable[[str], List[str]] | None = None, special_tokens: List[str] = ['<PAD>', '<SOS>', '<UNK>']) Dict[str, int]

Create a vocabulary (word->index) from given texts, keeping up to max_features most common words.

Examples

>>> texts = ["Hello world", "Hello there", "World peace"]
>>> vocab = create_vocabulary(texts, max_features=5)
>>> print(vocab)
{'hello': 0, 'world': 1, 'there': 2, 'peace': 3}  # Order and exact indices may vary.
autograd.text.utils.text_to_one_hot_and_sparse(texts: List[str], vocabulary: Dict[str, int], max_sequence_length: int, pad_str: str = '<PAD>') Tuple[Any, Any]

Convert a list of texts into a sequential feature matrix using the vocabulary. Each text is padded or truncated to max_sequence_length, then converted to a one-hot encoding of shape (batch_size, sequence_length, vocab_size).

Parameters:
  • texts (list of str) – The input sentences or documents.

  • vocabulary (dict) – A mapping of word -> index. "<PAD>" is added if it is not already present.

  • max_sequence_length (int) – The maximum sequence length for truncation/padding.

  • pad_str (str) – The padding string.

Returns:

  • one_hot (Array) – shape (batch_size, max_sequence_length, vocab_size)

  • matrix (Array) – shape (batch_size, max_sequence_length) of integer IDs

Examples

>>> texts = ["Hello world", "Hello there"]
>>> vocab = {"hello": 0, "world": 1, "there": 2, "<PAD>": 3, "<UNK>": 4}
>>> one_hot, matrix = text_to_one_hot_and_sparse(texts, vocabulary=vocab, max_sequence_length=4)
>>> print(matrix)
[[0, 1, 3, 3],
 [0, 2, 3, 3]]
>>> print(one_hot.shape)
(2, 4, 5)
autograd.text.utils.create_causal_mask(seq_len: int, batch_size: int, lookback: bool = False, mask_diagonal: bool = False) Any

Creates a causal mask that prevents positions from attending to future (lookforward) or past (lookback) positions. 1.0 => masked.

Parameters:
  • seq_len (int) – Length of the sequence

  • batch_size (int) – Size of the batch

  • lookback (bool) – If True, masks “past” (i>j). If False, masks “future” (i<j).

  • mask_diagonal (bool) – If True, the main diagonal is also masked.

Returns:

Array – shape (batch_size, 1, seq_len, seq_len) with 1.0 in masked positions.

Examples

>>> mask = create_causal_mask(seq_len=5, batch_size=2)
>>> print(mask.shape)
(2, 1, 5, 5)
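
The effect of lookback and mask_diagonal can be sketched with NumPy. This is a minimal stand-in written from the parameter descriptions above, not the library's implementation; `causal_mask_sketch` is a hypothetical name and returns a plain NumPy array rather than the library's Array type.

```python
import numpy as np


def causal_mask_sketch(
    seq_len: int,
    batch_size: int,
    lookback: bool = False,
    mask_diagonal: bool = False,
) -> np.ndarray:
    # k=1 keeps entries strictly above the diagonal (i < j);
    # k=0 keeps the diagonal as well (i <= j).
    k = 0 if mask_diagonal else 1
    future = np.triu(np.ones((seq_len, seq_len), dtype=np.float32), k=k)
    # Transposing turns the "future" mask (i < j) into the "past" mask (i > j).
    mask = future.T if lookback else future
    # Repeat the same mask for every batch element, with a singleton head axis.
    return np.broadcast_to(mask, (batch_size, 1, seq_len, seq_len)).copy()


mask = causal_mask_sketch(seq_len=3, batch_size=2)
# mask[0, 0] is:
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]
past_mask = causal_mask_sketch(seq_len=3, batch_size=2, lookback=True)
diag_mask = causal_mask_sketch(seq_len=3, batch_size=2, mask_diagonal=True)
```

Row i is the query position and column j the key position, so with the defaults each position may attend to itself and to earlier positions only.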
autograd.text.utils.prepare_mlx_attention_mask(mask: Tensor | Any | Sequence[Any] | int | float | bool | None, *, query_shape: Tuple[int, ...], key_shape: Tuple[int, ...]) Tuple[Literal['none', 'causal', 'explicit_bool', 'explicit_additive', 'dense_fallback'], Any | None]

Translate repo mask semantics into the narrower MLX attention contracts.

The MLX custom attention path uses this classification to decide whether it can take the optimized causal self-attention fast path or must fall back to the dense contract implementation.

Mask inputs intentionally accept either:
  • repo Tensor masks that already follow the dense additive-mask contract (1.0 == forbidden, 0.0 == allowed)

  • raw backend arrays for bool-mask cases (True == keep, False == masked)

TODO: revisit this mixed Tensor/raw-array mask contract only if the repo adopts dtype-preserving Tensor semantics. Today, Tensor construction coerces data to float32, which would erase explicit-bool mask intent.

autograd.text.utils.clean_and_tokenize(text: str, pattern: str = '\\w+|[^\\w\\s]|[\\n\\s]', lowercase: bool = True) List[str]

Naive tokenizer that splits text by words.

Parameters:
  • text (str) – The entire input text to be tokenized.

  • pattern (str) – Regular expression pattern used for tokenization. The default splits on words, punctuation and whitespace.

  • lowercase (bool) – Whether to convert tokens to lowercase. Default True.

Returns:

List[str] – The list of tokens.

Examples

>>> text = "Hello, world!\nNew line."
>>> tokens = clean_and_tokenize(text)
>>> print(tokens)
['hello', ',', 'world', '!', 'new', 'line', '.']
autograd.text.utils.validate_batches(x: Any, y: Any) None
autograd.text.utils.token_batch_to_indices(token_batch: List[List[str]], vocab: Dict[str | bytes, int]) Any

Convert a batch of token lists to a matrix of token indices using a given vocabulary.

Parameters:
  • token_batch (List[List[str]]) – A list of tokenized sentences (each a list of strings).

  • vocab (Dict[Union[str, bytes], int]) – A vocabulary mapping tokens to integer indices.

Returns:

Array – A matrix of shape (batch_size, sequence_length) containing token indices.

Examples

>>> token_batch = [["hello", "world"], ["this", "test"]]
>>> vocab = {"hello": 0, "world": 1, "this": 2, "test": 3, b"<UNK>": 4}
>>> indices = token_batch_to_indices(token_batch, vocab)
>>> print(indices)
[[0, 1],
 [2, 3]]
autograd.text.utils.load_wiki_simple() str
autograd.text.utils.load_shakespeare_mini() str
autograd.text.utils.load_openwebtext(parquet_shards_per_batch: int = 1, *, start_token: str, split_token: str) OpenWebTextSource

Return a streaming OpenWebText source backed by public parquet shards.

Module contents