autograd package¶

autograd.tensor module¶

autograd.tensor.is_grad_enabled() → bool¶

autograd.tensor.no_grad()¶

class autograd.tensor.Function(*tensors: Tensor)¶

Bases: object

Base class for differentiable operations.

Subclasses of Function should implement the forward and backward methods to define the forward and backward passes of a particular operation. Some subclasses can be found in the functional.py module.

Examples

>>> # Example of a subclass (dummy function for demonstration)
>>> import numpy as xp  # NumPy stands in for the xp array backend here
>>> from autograd.tensor import Function, Tensor
>>> class DummyFunction(Function):
...     def forward(self, x):
...         return x + 1
...     def backward(self, grad):
...         return grad
>>> x = Tensor(xp.array([1, 2, 3]))
>>> y = DummyFunction.apply(x)  # Expected output: [2, 3, 4]

__init__(*tensors: Tensor)¶

Initialize a Function with a set of input tensors.

Parameters:: *tensors (Tensor) – The input tensors for this operation.

Examples

>>> from autograd.tensor import Function, Tensor
>>> x = Tensor([1, 2, 3])
>>> f = Function(x)  # Although Function is abstract, this demonstrates the initializer.

abstractmethod forward(*args: Any, **kwargs: Any) → Any¶

Perform the forward pass of this operation.

This method should be overridden by subclasses to define the specific behavior of the operation. It receives NumPy arrays corresponding to the data of the input tensors.

Parameters:

*args (xp.ndarray) – Data arrays for the input tensors.
**kwargs (Any) – Additional keyword arguments.

Returns:

xp.ndarray – The result of the forward pass as a NumPy array.

Raises:

NotImplementedError – If this method is not implemented in a subclass.

abstractmethod backward(grad: Tensor) → Any¶

Perform the backward pass of this operation.

This method should be overridden by subclasses to define how gradients are computed and propagated back to the input tensors.

In this context:

- “grad” (the method argument) is the gradient of the loss function with respect to the output of this operation (dL/d[out]).
- The return value should be the gradient of the loss function with respect to the input of this operation (dL/d[input]), so it can be passed further back along the computational graph.

Parameters:: grad (Tensor) – The gradient with respect to the output of this operation.
Returns:: xp.ndarray – The gradient with respect to the input(s).
Raises:: NotImplementedError – If this method is not implemented in a subclass.

classmethod apply(*tensors: Tensor, **kwargs: Any) → Tensor¶

Construct and apply this function to the given tensors.

This method:

1. Creates an instance of the function.
2. Extracts the .data from the input tensors to pass into the function’s forward method.
3. Wraps the result in a new Tensor that references this function (for backprop).

Parameters:

*tensors (Tensor) – Input tensors to the operation.
**kwargs (Any) – Additional keyword arguments passed to the forward method.

Returns:

Tensor – The resulting tensor after the forward operation.
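The three steps of apply can be sketched with stand-in classes. MiniTensor, MiniFunction, and AddOne are hypothetical names used only for illustration, and NumPy is assumed as the array backend; the library's actual Function.apply may differ in detail (for example in how it treats requires_grad or no_grad()).

```python
import numpy as np


class MiniTensor:
    """Hypothetical stand-in for Tensor: data plus an optional creator."""

    def __init__(self, data, creator=None):
        self.data = np.asarray(data, dtype=np.float32)
        self.creator = creator


class MiniFunction:
    """Hypothetical stand-in for Function."""

    def __init__(self, *tensors):
        self.tensors = tensors  # remember the inputs for the backward pass

    def forward(self, *args, **kwargs):
        raise NotImplementedError

    @classmethod
    def apply(cls, *tensors, **kwargs):
        fn = cls(*tensors)                       # 1) create an instance
        raw = [t.data for t in tensors]          # 2) extract the raw arrays
        out = fn.forward(*raw, **kwargs)         #    and run the forward pass
        return MiniTensor(out, creator=fn)       # 3) wrap, keeping a link to fn


class AddOne(MiniFunction):
    def forward(self, x):
        return x + 1


y = AddOne.apply(MiniTensor([1, 2, 3]))
print(y.data, type(y.creator).__name__)
```

Keeping the creator on the output is what later lets a backward traversal find the operation, and through it the inputs, that produced each tensor.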

static unbroadcast(grad_arr: Any, to_shape: Tuple[int, ...]) → Any¶

Sum out broadcasted dimensions so that grad_arr can match to_shape. Essentially the inverse of numpy’s broadcasting.

Parameters:

grad_arr (xp.ndarray) – Gradient array to unbroadcast.
to_shape (Tuple[int, ...]) – Shape to unbroadcast to.

Returns:

xp.ndarray – Unbroadcasted gradient array.

Examples

>>> import numpy as xp  # NumPy stands in for the xp array backend here
>>> grad_arr = xp.array([[1, 1], [1, 1]])
>>> unb = Function.unbroadcast(grad_arr, (1, 2))  # Expected output: [[2, 2]] (summing over the broadcasted dim)
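A minimal NumPy sketch of how such an unbroadcast step can work is given here. The function name unbroadcast_sketch is hypothetical and the library's own implementation may be organized differently; the sketch only illustrates the idea of summing over every axis that broadcasting added or stretched.

```python
import numpy as np


def unbroadcast_sketch(grad_arr, to_shape):
    """Reduce grad_arr so that its shape equals to_shape (illustrative only)."""
    # Broadcasting prepends axes on the left; sum those away first.
    while grad_arr.ndim > len(to_shape):
        grad_arr = grad_arr.sum(axis=0)
    # Axes that had size 1 in the target were stretched; sum them back to 1.
    for axis, size in enumerate(to_shape):
        if size == 1 and grad_arr.shape[axis] != 1:
            grad_arr = grad_arr.sum(axis=axis, keepdims=True)
    return grad_arr


row = unbroadcast_sketch(np.ones((2, 2)), (1, 2))
vec = unbroadcast_sketch(np.ones((4, 3, 2)), (2,))
print(row.shape, vec.shape)
```

The first call mirrors the example above: a (2, 2) gradient reduced to shape (1, 2) sums the two rows.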


__module__ = 'autograd.tensor'¶

__weakref__¶: list of weak references to the object

class autograd.tensor.Tensor(data, creator=None, requires_grad=True)¶

Bases: object

A Tensor is the core data structure of this autograd engine.

It holds the data, an optional reference to a creator function, and gradient information.

Examples

>>> import numpy as xp  # NumPy stands in for the xp array backend here
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.array([1.0, 2.0, 3.0])) # Expected: xp.array([1., 2., 3.], dtype=float32)

Initialize a Tensor.

Parameters:

data (ArrayLike) – The data for this tensor. Python scalars/sequences are materialized as float32 by default, while explicit backend arrays keep their dtype.
creator (Optional[Function], optional) – The function that created this tensor. Defaults to None if this tensor is a leaf.
requires_grad (bool, optional) – Whether this tensor requires gradients. Defaults to True. Note that this default is independent of the no_grad() context; callers constructing leaf tensors inside no_grad() must still pass requires_grad=False explicitly if they want a non-grad leaf.

Examples

>>> x = Tensor([1, 2, 3]) # Expected: xp.array([1., 2., 3.], dtype=float32)

property data: Any¶

property grad: Tensor | None¶

Getter method of the gradient of this tensor.

The internal _grad is stored either as a Tensor or None. If it is stored as a NumPy array, it is wrapped in a Tensor before being returned.

Returns:: Optional[Tensor] – The gradient if it exists, or None.

Examples

>>> x = Tensor([1, 2, 3])
>>> print(x.grad)  # Expected: None

view(*shape: int | Tuple[int, ...]) → Tensor¶

Create a view of the tensor with the specified shape without copying the underlying data.

The new shape must be compatible with the total number of elements in the input tensor.

Raises:: ValueError – If more than one -1 is specified in the new shape or if the new shape does not match the input tensor’s total size.
Parameters:: *shape (int) – The desired shape. If -1 is present, it is inferred based on the remaining dimensions.
Returns:: Tensor – A new tensor that shares data with the original but is shaped differently.

Examples

>>> x = Tensor(xp.array([1, 2, 3, 4]))
>>> y = x.view(2, 2)
>>> print(y.data.shape)  # Expected: (2, 2)
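The -1 inference and the two ValueError conditions described for view can be sketched as a small shape-resolution helper. The name infer_view_shape is hypothetical and this is not the library's code; it only shows one way the rule can be implemented.

```python
import math


def infer_view_shape(shape, total):
    """Resolve at most one -1 in shape so that it holds `total` elements."""
    if shape.count(-1) > 1:
        raise ValueError("only one dimension can be -1")
    # Product of the explicitly given dimensions.
    known = math.prod(d for d in shape if d != -1)
    if -1 in shape:
        if known == 0 or total % known != 0:
            raise ValueError("shape is incompatible with the number of elements")
        return tuple(total // known if d == -1 else d for d in shape)
    if known != total:
        raise ValueError("shape is incompatible with the number of elements")
    return tuple(shape)


resolved = infer_view_shape((2, -1), 4)
print(resolved)
```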

static stack(tensors: List[Tensor], axis: int = 0) → Tensor¶

Stack a list of tensors along a new dimension. This operation joins a sequence of tensors by inserting a new axis at the specified position and concatenating along that axis.

Parameters:

tensors (List[Tensor]) – The list of tensors to stack.
axis (int, optional) – The dimension along which to stack. Defaults to 0.

Returns:

Tensor – A new tensor created by stacking.

Examples

>>> t1 = Tensor(xp.array([1, 2]))
>>> t2 = Tensor(xp.array([3, 4]))
>>> result = Tensor.stack([t1, t2], axis=0)  # Expected: [[1, 2], [3, 4]]

static cat(tensors: List[Tensor], axis: int = 0) → Tensor¶

Concatenate a list of tensors along the specified, already existing dimension.

Parameters:

tensors (List[Tensor]) – The list of tensors to concatenate.
axis (int, optional) – The dimension along which to concatenate. Defaults to 0.

Returns:

Tensor – The concatenated tensor.

Examples

>>> t1 = Tensor(xp.array([[1, 2]]))
>>> t2 = Tensor(xp.array([[3, 4]]))
>>> result = Tensor.cat([t1, t2], axis=0)  # Expected: [[1, 2], [3, 4]]

__add__(other: Tensor | float | int) → Tensor¶

Element-wise addition of two tensors (or a tensor and a scalar).

Parameters:: other (Union[Tensor, float, int]) – The tensor or scalar to add.
Returns:: Tensor – The result of addition.

__mul__(other: Tensor | float | int) → Tensor¶

Element-wise multiplication of two tensors (or a tensor and a scalar).

\[ z = x \cdot y \]

Parameters:: other (Union[Tensor, float, int]) – The tensor or scalar to multiply with.
Returns:: Tensor – The result of multiplication.

__matmul__(other: Tensor | float | int) → Tensor¶

Perform matrix multiplication (dot product) with another tensor.

For higher-dimensional tensors, xp.matmul broadcasting rules are followed.

Parameters:: other (Union[Tensor, float, int]) – The tensor or scalar to matmul with.
Returns:: Tensor – The result of matrix multiplication.

__pow__(other: Tensor | float | int) → Tensor¶

Compute the power operation $z = x^y$ with another tensor or scalar.

Parameters:: other (Union[Tensor, float, int]) – The exponent.
Returns:: Tensor – The result of the power operation.

Examples

>>> x = Tensor(xp.array([2, 3]))
>>> y = x ** 3 # Expected: [8, 27]

__iadd__(other: Tensor | float | int) → Tensor¶

In-place addition (self += other).

Broadcasting rules apply if shapes differ. This should maintain the computational graph while modifying the tensor in-place.

Parameters:: other (Union[Tensor, float, int]) – The tensor or scalar to add.
Returns:: Tensor – This tensor, after in-place addition.

__getitem__(idx: int | slice | tuple) → Tensor¶

Get a sliced or indexed view of the tensor.

Parameters:: idx (Union[int, slice, tuple]) – The index or slice.
Returns:: Tensor – A new tensor that shares data with the original.

__setitem__(idx: int | slice | tuple, value: Tensor | float | int) → Tensor¶

Set a portion of the tensor to a given value.

Parameters:

idx (Union[int, slice, tuple]) – The index or slice.
value (Union[Tensor, float, int]) – The value to set.

Returns:

Tensor – The same tensor after the in-place assignment.

astype(dtype: Any) → Tensor¶

sum(axis: int | Tuple[int, ...] | None = None, keepdims: bool = False) → Tensor¶

Compute the sum of all elements (or along specified axis).

This function computes the sum of the input tensor elements along a specified axis or axes. If no axis is specified, all elements of the tensor are summed. Optionally, the reduced dimensions can be kept in the output tensor.

The summation is mathematically represented as:

\[ y = \sum_{i \in A} x_i \]

where A represents the specified axis or axes.

For example:

Original tensor shape (3, 4, 5), axis (1, 2), keepdims True → result shape (3, 1, 1)
Original tensor shape (3, 4, 5), axis (1, 2), keepdims False → result shape (3,)
Original tensor shape (3, 4, 5), axis None, keepdims True → result shape (1,)
Original tensor shape (3, 4, 5), axis None, keepdims False → result shape ()

Parameters:

axis (Optional[Union[int, Tuple[int, ...]]], optional) – Axis or axes along which the sum is performed. If None, all elements are summed. Defaults to None.
keepdims (bool, optional) – If True, the reduced axes are kept in the result as dimensions of size one, so that the result broadcasts correctly against the input tensor. Defaults to False.

Returns:

Tensor – The tensor with summed values.

Examples

>>> x = Tensor(xp.array([[1, 2], [3, 4]]))
>>> s = x.sum(axis=0, keepdims=True) # Expected: [[4, 6]]

mean(axis: int | Tuple[int, ...] | None = None, keepdims: bool = False) → Tensor¶

Compute the mean of elements (or along specified axis).

This function computes the mean of the input tensor elements along a specified axis or axes. If no axis is specified, the mean of all elements is computed. Optionally, the reduced dimensions can be kept in the output tensor.

The mean is mathematically defined as:

\[ y = \frac{1}{N} \sum_{i \in A} x_i \]

where A represents the specified axis or axes and N is the number of elements summed.

For example:

Original tensor shape (3, 4, 5), axis (1, 2), keepdims True → result shape (3, 1, 1)
Original tensor shape (3, 4, 5), axis (1, 2), keepdims False → result shape (3,)
Original tensor shape (3, 4, 5), axis None, keepdims True → result shape (1,)
Original tensor shape (3, 4, 5), axis None, keepdims False → result shape ()

Parameters:

axis (Optional[Union[int, Tuple[int, ...]]], optional) – Axis or axes to average over. If None, averages over all elements. Defaults to None.
keepdims (bool, optional) – Keep the reduced dimensions as size 1. Defaults to False.

Returns:

Tensor – The tensor with mean values.

Examples

>>> x = Tensor(xp.array([[1, 2], [3, 4]]))
>>> m = x.mean(axis=0) # Expected: [2, 3]

max(axis: int | Tuple[int, ...] | None = None, keepdims: bool = False) → Tensor¶

Compute the maximum value of elements (or along specified axis).

This function computes the maximum value of the input tensor along a specified axis or axes. If no axis is specified, the maximum over all elements is computed. Optionally, the reduced dimensions can be kept in the output tensor.

Mathematically, the maximum is computed as:

\[ y = \max_{i \in A} \; x_i \]

where A represents the specified axis or axes.

Parameters:

axis (Optional[Union[int, Tuple[int, ...]]], optional) – Axis or axes to compute max over. If None, computes global max. Defaults to None.
keepdims (bool, optional) – Keep the reduced dimensions as size 1. Defaults to False.

Returns:

Tensor – The tensor with maximum values.

Examples

>>> x = Tensor(xp.array([[1, 5], [3, 4]]))
>>> m = x.max(axis=0) # Expected: [3, 5]

gather(index: Any = 0) → Tensor¶

Gather rows from a 2D tensor using specified row indices.

This operation extracts rows from the input tensor corresponding to the given index or indices. It is particularly useful for selecting specific rows from a matrix, such as picking particular examples from a batch of data. When a single index is provided, it returns the corresponding row; when multiple indices are provided (e.g., as a list or tuple), it returns a new tensor composed of rows at those positions.

Parameters:: index (int or list/tuple of ints) – The row index or indices to gather from the tensor. Defaults to 0.
Returns:: Tensor – A new tensor containing the gathered rows.

Example

>>> tensor = Tensor([[10, 20], [30, 40], [50, 60]])
>>> gathered = tensor.gather([0, 2])
>>> print(gathered)
Tensor([[10, 20],
        [50, 60]])

sqrt() → Tensor¶

Compute the element-wise square root of the tensor.

Returns:: Tensor – The result of the sqrt operation.

Examples

>>> x = Tensor(xp.array([4, 9, 16]))
>>> y = x.sqrt() # Expected: [2, 3, 4]

maximum(other: Tensor | float | int) → Tensor¶

Element-wise maximum between two tensors or a tensor and a scalar.

This function performs an element-wise comparison between two input tensors and returns a new tensor containing the maximum value from each pair of elements. When both inputs are equal, the gradient is split equally between them.

Parameters:: other (Union[Tensor, float, int]) – The tensor or scalar to compare.
Returns:: Tensor – Element-wise maximum.

Examples

>>> x = Tensor(xp.array([1, 5, 3]))
>>> y = Tensor(xp.array([2, 4, 3]))
>>> z = x.maximum(y) # Expected: [2, 5, 3]
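The tie-splitting rule for the gradient can be illustrated in plain NumPy. This is a sketch of the rule as described, not the library's backward code; the variable names are chosen here for illustration.

```python
import numpy as np

x = np.array([1.0, 5.0, 3.0])
y = np.array([2.0, 4.0, 3.0])
grad = np.ones_like(x)  # incoming gradient dL/dz

# The larger input receives the whole gradient; on ties each input gets half.
tie = (x == y)
grad_x = grad * ((x > y) + 0.5 * tie)
grad_y = grad * ((y > x) + 0.5 * tie)

print(grad_x, grad_y)
```

Whatever the inputs, grad_x + grad_y equals the incoming gradient, so no gradient is lost or duplicated at tied positions.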

pad(pad_width: int | Tuple[int, int] | Tuple[int, int, int, int] | Tuple[Tuple[int, int], ...], mode: str = 'constant', constant_values: int | float = 0) → Tensor¶

Pad the tensor according to specified widths in each dimension.

This operation pads the input tensor using the given padding widths and mode. The interpretation of the pad_width argument is as follows:

If an int is provided, every dimension is padded by that amount on both sides.
If a tuple of 2 values is provided, it is interpreted as padding for the last dimension (PyTorch style): (pad_left, pad_right).
If a tuple of 4 values is provided, it is interpreted as padding for the last two dimensions: (pad_left, pad_right, pad_top, pad_bottom).
If a tuple of tuples is provided, each inner tuple specifies (pad_before, pad_after) for each dimension.

The padded values are determined by the specified mode (default is “constant”) and the constant value provided.

Parameters:

pad_width (int or tuple) – Specifies how much padding to add on each dimension.
mode (str, optional) – Padding mode. Defaults to “constant”.
constant_values (int or float, optional) – Fill value for constant padding. Defaults to 0.

Returns:

Tensor – The padded tensor.

Example

>>> tensor = Tensor([[1, 2], [3, 4]])
>>> padded_tensor = tensor.pad(pad_width=1, mode="constant", constant_values=0)
>>> print(padded_tensor)
Tensor([[0, 0, 0, 0],
        [0, 1, 2, 0],
        [0, 3, 4, 0],
        [0, 0, 0, 0]])
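The four pad_width interpretations listed above can be sketched as a normalization step that produces the per-dimension ((before, after), ...) form accepted by numpy.pad. The helper name normalize_pad_width is hypothetical, and the library may translate these forms differently internally.

```python
import numpy as np


def normalize_pad_width(pad_width, ndim):
    """Convert the accepted pad_width forms to ((before, after), ...) per dim."""
    if isinstance(pad_width, int):
        # Same padding on both sides of every dimension.
        return ((pad_width, pad_width),) * ndim
    if all(isinstance(p, int) for p in pad_width):
        if len(pad_width) == 2:
            left, right = pad_width
            return ((0, 0),) * (ndim - 1) + ((left, right),)
        if len(pad_width) == 4:
            left, right, top, bottom = pad_width
            # The first pair pads the last dim, the second pair the one before it.
            return ((0, 0),) * (ndim - 2) + ((top, bottom), (left, right))
        raise ValueError("flat pad_width must have 2 or 4 entries")
    # Already a tuple of (pad_before, pad_after) pairs.
    return tuple(tuple(p) for p in pad_width)


x = np.array([[1, 2], [3, 4]])
all_sides = np.pad(x, normalize_pad_width(1, x.ndim), mode="constant", constant_values=0)
last_dim = np.pad(x, normalize_pad_width((1, 0), x.ndim), mode="constant", constant_values=0)
last_two = np.pad(x, normalize_pad_width((1, 0, 0, 1), x.ndim), mode="constant", constant_values=0)
print(all_sides)
```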

forward(data: Any) → None¶: Placeholder for forward logic if needed. Currently unused.

backward(grad=None)¶

Compute gradients for all upstream nodes in the graph via backpropagation.

1. If grad is None, the gradient is treated as ones (like d(self)/d(self) = 1).
2. A post-order traversal of the graph gathers all nodes that lead to this tensor and stores them in a topologically sorted list.
3. That list is traversed in reverse order, applying each node’s .backward(…) and passing gradients back to the node’s inputs.

As a side effect, each ancestor Tensor accumulates its .grad field.

Parameters:: grad (Optional[Union[Tensor, ArrayLike]]) – The gradient w.r.t. this tensor’s output.

Examples

>>> # In a typical usage, backward() is invoked on the loss tensor.
>>> loss.backward()
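The traversal described for backward can be sketched on a toy scalar graph. Node and backward_sketch are hypothetical names, and each node stores its local derivatives directly instead of delegating to a Function; the point is only the post-order sort followed by a reverse sweep with gradient accumulation.

```python
class Node:
    """Hypothetical scalar graph node; not the library's Tensor."""

    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # input nodes of the op that made this node
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0


def backward_sketch(root):
    order, seen = [], set()

    def visit(node):                    # post-order depth-first traversal
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)              # appended only after all of its inputs

    visit(root)
    root.grad = 1.0                     # d(root)/d(root) = 1
    for node in reversed(order):        # every consumer runs before its inputs
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local  # accumulate, do not overwrite


a, b = Node(2.0), Node(3.0)
c = Node(a.value * b.value, (a, b), (b.value, a.value))  # c = a * b
d = Node(c.value + a.value, (c, a), (1.0, 1.0))          # d = c + a
backward_sketch(d)
print(a.grad, b.grad)
```

Because a feeds into both c and d, its gradient is the sum of two contributions, which is why accumulation rather than assignment is used.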

property shape: Tuple[int, ...]¶

Return the shape of the underlying NumPy data.

Returns:: Tuple[int, …] – The shape of this tensor.

Examples

>>> x = Tensor(xp.array([[1,2],[3,4]]))
>>> print(x.shape)  # Expected: (2, 2)

reshape(*shape: int) → Tensor¶

Return a new tensor with the same data but a different shape. It is functionally similar to numpy’s reshape.

Parameters:: *shape (int) – The desired new shape.
Returns:: Tensor – A reshaped tensor.

Examples

>>> x = Tensor(xp.array([1, 2, 3, 4]))
>>> y = x.reshape(2,2)
>>> print(y.data.shape)  # Expected: (2, 2)

expand(*shape: int | Sequence[int]) → Tensor¶

Broadcast the tensor to a new shape without copying data.

This operation broadcasts the input tensor to a new shape. The forward pass creates a new array with the specified shape (via broadcasting), and the backward pass reduces the gradient back to the shape of the original tensor.

Parameters:: *shape (Union[int, Sequence[int]]) – The target shape, which can be specified as multiple int arguments or as a single tuple/list.
Returns:: Tensor – A new tensor broadcast to the specified shape.

Example

>>> tensor = Tensor([1, 2, 3])
>>> expanded_tensor = tensor.expand(3, 3)
>>> print(expanded_tensor)
Tensor([[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]])
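The forward and backward behavior of expand can be illustrated with NumPy alone. This is a sketch under the assumption that the backend broadcasts like numpy.broadcast_to; the variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Forward: broadcast to (3, 3) without copying; the result is a read-only view.
expanded = np.broadcast_to(x, (3, 3))

# Backward: each element of x was reused in three rows, so its gradient is
# the sum of the incoming gradient over the broadcast axis.
grad_out = np.ones((3, 3))
grad_x = grad_out.sum(axis=0)

print(expanded.shape, grad_x)
```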

permute(*dims: int) → Tensor¶

Reorder (permute) the dimensions of this tensor.

Examples

>>> import numpy as xp  # NumPy stands in for the xp array backend here
>>> from your_module import Tensor, Permute
>>> t = Tensor(xp.array([[1, 2], [3, 4]]))
>>> op = Permute()
>>> result = op.forward(t.data, dims=[1, 0])
>>> print(result)
[[1, 3],
 [2, 4]]

Parameters:: *dims (int) – A sequence of dimension indices indicating the new order.
Returns:: Tensor – A new tensor with permuted dimensions.

Example

>>> tensor = Tensor([[1, 2], [3, 4]])
>>> permuted_tensor = tensor.permute(1, 0)
>>> print(permuted_tensor)
Tensor([[1, 3],
        [2, 4]])

transpose(dim0: int = 0, dim1: int = 1) → Tensor¶

Swap two dimensions of this tensor.

This operation swaps the positions of two specified dimensions of the input tensor. The backward pass applies the same transposition to the gradient, restoring the original dimension order.

Parameters:

dim0 (int, optional) – First dimension to swap. Defaults to 0.
dim1 (int, optional) – Second dimension to swap. Defaults to 1.

Returns:

Tensor – A new tensor with the specified dimensions swapped.

Examples

>>> tensor = Tensor(xp.array([[1, 2], [3, 4]]))
>>> t = tensor.transpose(0, 1)
>>> print(t.data)
[[1, 3],
 [2, 4]]

strided_windows(kernel_size: int, stride: int) → Tensor¶

Extract sliding windows of size kernel_size with stride stride.

This operation generates overlapping windows from the input tensor using the specified kernel size and stride. The output shape is given by:

\[ (H_{out}, W_{out}, batch\_size, channels, kernel\_size, kernel\_size) \]

where

\[\begin{split} \begin{align} H_{out} = \frac{height - kernel\_size}{stride} + 1 \\ W_{out} = \frac{width - kernel\_size}{stride} + 1 \end{align} \end{split}\]

Examples

>>> import numpy as xp  # NumPy stands in for the xp array backend here
>>> x = Tensor(xp.random.rand(2, 3, 10, 10))  # shape: (batch, channels, height, width)
>>> windows = x.strided_windows(kernel_size=3, stride=1)
>>> print(windows.shape)
(8, 8, 2, 3, 3, 3)

Parameters:

kernel_size (int) – The size of each window.
stride (int) – The stride between windows.

Returns:

Tensor – A tensor representing the strided windows.
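The output layout can be reproduced in plain NumPy with sliding_window_view. This is a sketch only: strided_windows_sketch is a hypothetical name, and the library may build its windows differently (for example with explicit strides).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def strided_windows_sketch(x, kernel_size, stride):
    """x has shape (batch, channels, height, width)."""
    # All k-by-k windows at stride 1: (B, C, H - k + 1, W - k + 1, k, k).
    win = sliding_window_view(x, (kernel_size, kernel_size), axis=(2, 3))
    # Keep every stride-th window along both spatial axes.
    win = win[:, :, ::stride, ::stride]
    # Reorder to (H_out, W_out, batch, channels, k, k).
    return win.transpose(2, 3, 0, 1, 4, 5)


x = np.arange(2 * 3 * 10 * 10, dtype=np.float32).reshape(2, 3, 10, 10)
w1 = strided_windows_sketch(x, kernel_size=3, stride=1)
w2 = strided_windows_sketch(x, kernel_size=3, stride=2)
print(w1.shape, w2.shape)
```

With height = width = 10 and kernel_size = 3, stride 1 gives 8 windows per axis and stride 2 gives 4, in line with the formula above when the division is rounded down.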

roll(shifts: int, dims: int) → Tensor¶

Roll tensor elements along a given dimension. This operation shifts the elements of the input tensor along the given dimension by the specified number of positions. Elements that roll beyond the last position reappear at the beginning.

Parameters:

shifts (int) – Number of places by which to shift.
dims (int) – Dimension along which to roll.

Returns:

Tensor – The rolled tensor.

Example

>>> tensor = Tensor(xp.array([1, 2, 3, 4, 5]))
>>> rolled_tensor = tensor.roll(shifts=2, dims=0)
>>> print(rolled_tensor.data)
[4, 5, 1, 2, 3]

detach() → Tensor¶

Detach this tensor from the computational graph, returning a new tensor with the same data but no gradient.

Returns:: Tensor – A new tensor that does not track gradients.

Examples

>>> x = Tensor(xp.array([1, 2, 3]))
>>> y = x.detach()
>>> y.requires_grad
False

item() → Any¶: Return this tensor as a host scalar.

numpy() → Any¶: Return this tensor as a host NumPy array.

property ndim: int¶

Return the number of dimensions of this tensor.

Returns:: int – The number of dimensions.

Examples

>>> x = Tensor(xp.array([[1,2],[3,4]]))
>>> print(x.ndim)  # Expected: 2

property T: Tensor¶

Convenience property to transpose a 2D tensor. For higher dimensions, use transpose() with explicit dims.

Returns:: Tensor – Transposed tensor.
Raises:: ValueError – If the tensor is not 2D.

Examples

>>> x = Tensor(xp.array([[1,2],[3,4]]))
>>> print(x.T.data)
[[1, 3],
 [2, 4]]

__radd__(other: Tensor | float | int) → Tensor¶

__rmul__(other: Tensor | float | int) → Tensor¶

__sub__(other: Tensor | float | int) → Tensor¶

__rsub__(other: Tensor | float | int) → Tensor¶

__truediv__(other: Tensor | float | int) → Tensor¶

__neg__() → Tensor¶

__repr__() → str¶

Return a string representation of the tensor, showing its data and gradient.

Examples

>>> x = Tensor(xp.array([1,2,3]))
>>> print(x)
Tensor(data=[1. 2. 3.], grad=None)

__lt__(other: Tensor | float | int) → Any | bool¶: Return self<value.

__le__(other: Tensor | float | int) → Any | bool¶: Return self<=value.

__gt__(other: Tensor | float | int) → Any | bool¶: Return self>value.

__ge__(other: Tensor | float | int) → Any | bool¶: Return self>=value.

__eq__(other: Tensor | float | int) → Any | bool¶: Return self==value.

__hash__() → int¶: Return hash(self).

__annotations__ = {'_grad': "Optional['Tensor']"}¶


__module__ = 'autograd.tensor'¶

__weakref__¶: list of weak references to the object

autograd.tensor.checkpoint(run_function: Any, *tensors: Tensor) → Tensor¶

class autograd.tensor.Add(*tensors: Tensor)¶

Bases: Function

Element-wise addition of two tensors. See autograd.tensor.Tensor.__add__().

Examples

>>> x = Tensor(xp.array([1,2,3]))
>>> y = Tensor(xp.array([4,5,6]))
>>> z = Add.apply(x, y) # Expected: [5, 7, 9]

forward(x: Any, y: Any) → Any¶

Compute the element-wise sum of two tensors.

Parameters:

x (xp.ndarray) – The first input tensor.
y (xp.ndarray) – The second input tensor.

Returns:

xp.ndarray – The element-wise sum of x and y.

backward(grad: Tensor) → Tuple[Any | None, Any | None]¶

Compute the gradient for the addition operation.

Since addition is linear, the gradient with respect to both inputs is the same as the incoming gradient.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output.
Returns:: Tuple[Optional[xp.ndarray], Optional[xp.ndarray]] – The gradients with respect to x and y.

Examples

>>> x = Tensor(xp.array([1,2,3]))
>>> y = Tensor(xp.array([4,5,6]))
>>> z = x + y
>>> # During backprop, the gradient for both x and y would be the same as grad.

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Mul(*tensors: Tensor)¶

Bases: Function

Element-wise multiplication of two tensors. See autograd.tensor.Tensor.__mul__().

Examples

>>> x = Tensor(xp.array([1,2,3]))
>>> y = Tensor(xp.array([4,5,6]))
>>> z = Mul.apply(x, y)  # Expected: [4, 10, 18]

forward(x: Any, y: Any) → Any¶

Compute the element-wise product of two tensors.

Parameters:

x (xp.ndarray) – The first input tensor.
y (xp.ndarray) – The second input tensor.

Returns:

xp.ndarray – The element-wise product of x and y.

backward(grad: Tensor) → Tuple[Any | None, Any | None]¶

Compute the gradient for the multiplication operation.

The gradients are computed as:

\[\begin{split} \begin{align} \frac{\partial z}{\partial x} = y \\ \frac{\partial z}{\partial y} = x \end{align} \end{split}\]

and then multiplied by the incoming gradient.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output.
Returns:: Tuple[Optional[xp.ndarray], Optional[xp.ndarray]] – The gradients with respect to x and y.

Examples

>>> x = Tensor(xp.array([1,2,3]))
>>> y = Tensor(xp.array([4,5,6]))
>>> z = x * y
>>> # The backpropagated gradient for x is grad * y, and for y it is grad * x.

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Pow(*tensors: Tensor)¶

Bases: Function

Element-wise power operation. See autograd.tensor.Tensor.__pow__().

Examples

>>> x = Tensor(xp.array([2, 3]))
>>> y = x ** 3 # Expected: [8, 27]

forward(x: Any, y: Any) → Any¶

Compute the element-wise power operation.

Parameters:

x (xp.ndarray) – The base tensor.
y (xp.ndarray) – The exponent tensor.

Returns:

xp.ndarray – The result of raising x to the power y.

backward(grad: Tensor) → Tuple[Any | None, Any | None]¶

Compute the gradient for the power operation.

The derivatives are given by:

\[\begin{split} \begin{align} \frac{\partial (x^y)}{\partial x} = y \cdot x^{y-1} \\ \frac{\partial (x^y)}{\partial y} = x^y \cdot \ln(x) \end{align} \end{split}\]

These derivatives are multiplied by the incoming gradient.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output.
Returns:: Tuple[Optional[xp.ndarray], Optional[xp.ndarray]] – The gradients with respect to x and y.

Examples

>>> x = Tensor(xp.array([2, 3]))
>>> y = Tensor(xp.array([3, 2]))
>>> z = x ** y
>>> # Gradients for x: y * x^(y-1) and for y: x^y * ln(x)
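The two derivative formulas can be checked numerically in plain NumPy. This is an illustrative sketch, not the library's backward code; the finite-difference step eps is an arbitrary choice.

```python
import numpy as np

x = np.array([2.0, 3.0])
y = np.array([3.0, 2.0])
grad = np.ones_like(x)  # incoming gradient dL/dz

# Analytic gradients from the formulas above, scaled by the incoming gradient.
grad_x = grad * y * x ** (y - 1)
grad_y = grad * x ** y * np.log(x)

# Central finite differences as an independent check.
eps = 1e-6
num_x = ((x + eps) ** y - (x - eps) ** y) / (2 * eps)
num_y = (x ** (y + eps) - x ** (y - eps)) / (2 * eps)

print(grad_x, grad_y)
```

The ln(x) factor in the gradient with respect to y is only defined for positive bases, so an implementation has to decide how to treat x <= 0.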

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Matmul(*tensors: Tensor)¶

Bases: Function

Matrix multiplication of two tensors. See autograd.tensor.Tensor.__matmul__().

Examples

>>> x = Tensor(xp.array([[1, 2], [3, 4]]))
>>> y = Tensor(xp.array([[5, 6], [7, 8]]))
>>> z = Matmul.apply(x, y)  # Expected: [[19, 22], [43, 50]]

forward(x: Any, y: Any) → Any¶

Compute the matrix multiplication of two tensors.

The operation uses xp.matmul, which handles broadcasting and batching.

Parameters:

x (xp.ndarray) – The first tensor.
y (xp.ndarray) – The second tensor.

Returns:

xp.ndarray – The result of matrix multiplying x and y.

backward(grad: Tensor) → Tuple[Any | None, Any | None]¶

Compute the gradient for the matrix multiplication operation.

For matrix multiplication:

\[ z = x \cdot y \]

the gradients are computed as:

\[\begin{split} \begin{align} \text{grad}_x = \text{grad} \cdot y^T \\ \text{grad}_y = x^T \cdot \text{grad} \end{align} \end{split}\]

Special handling is provided for the vector @ vector case and for batched multiplications.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output.
Returns:: Tuple[Optional[xp.ndarray], Optional[xp.ndarray]] – The gradients with respect to x and y.

Examples

>>> x = Tensor(xp.array([[1, 2], [3, 4]]))
>>> y = Tensor(xp.array([[5, 6], [7, 8]]))
>>> z = x @ y
>>> # Backpropagation would compute gradients:
>>> grad_x, grad_y = Matmul.apply(x, y).creator.backward(Tensor(xp.ones_like(z.data)))
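For the plain 2D case, the two gradient formulas can be verified in NumPy with the same matrices as in the example. This sketch does not cover the vector @ vector or batched cases mentioned above.

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[5.0, 6.0], [7.0, 8.0]])
grad = np.ones((2, 2))  # incoming gradient dL/dz for z = x @ y

grad_x = grad @ y.T     # same shape as x
grad_y = x.T @ grad     # same shape as y

print(grad_x)
print(grad_y)
```

With an all-ones incoming gradient the loss is the sum of all entries of z, so each entry of grad_x is a row sum of y and each entry of grad_y is a column sum of x.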

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.IAdd(*tensors: Tensor)¶

Bases: Function

In-place addition of two tensors. See autograd.tensor.Tensor.__iadd__().

Examples

>>> x = Tensor(xp.array([1, 2, 3]))
>>> y = Tensor(xp.array([4, 5, 6]))
>>> IAdd.apply(x, y)  # Expected: [5, 7, 9]

forward(x: Any, y: Any) → Any¶

Perform in-place addition on the input tensor.

Parameters:

x (xp.ndarray) – The tensor to be updated.
y (xp.ndarray) – The tensor to add.

Returns:

xp.ndarray – The updated tensor x after addition.

backward(grad: Tensor) → Tuple[Any, Any]¶

Compute the gradient for the in-place addition operation.

Both inputs receive the same gradient as in the standard addition.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output.
Returns:: Tuple[xp.ndarray, xp.ndarray] – The gradients with respect to x and y.

Examples

>>> x = Tensor(xp.array([1,2,3]))
>>> y = Tensor(xp.array([4,5,6]))
>>> z = IAdd.apply(x, y)  # Equivalent to the in-place addition x += y
>>> # Gradients for both x and y would be identical to grad.

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.GetItem(*tensors: Tensor)¶

Bases: Function

Retrieve an item from a tensor using numpy-style indexing. See autograd.tensor.Tensor.__getitem__() function

Examples

>>> x = Tensor(xp.array([10, 20, 30]))
>>> y = GetItem.apply(x, idx=1)  # Expected: 20

forward(x: Any, idx: Any) → Any¶

Return a subset of the tensor based on the specified index.

Parameters:

x (xp.ndarray) – The input tensor.
idx (Any) – The index used to retrieve a subset of x (e.g., slices, integers).

Returns:

xp.ndarray – The indexed subset of the tensor.

backward(grad: Tensor) → Any¶

Propagate gradients through the indexing operation.

A zero tensor of the original shape is created and the gradient is placed in the correct location corresponding to the index.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output.
Returns:: xp.ndarray – The gradient with respect to the input tensor.

Examples

>>> x = Tensor(xp.array([10, 20, 30]))
>>> y = x[1]
>>> # During backprop, only index 1 receives the gradient.
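
The zero-fill-and-place step described above can be sketched in plain NumPy. getitem_backward is a hypothetical helper, not the library's implementation, and it assumes the index has no repeated entries (repeated indices would need accumulation rather than assignment).

```python
import numpy as np

def getitem_backward(input_shape, idx, grad):
    """Hypothetical helper: scatter grad into zeros shaped like the input."""
    grad_x = np.zeros(input_shape, dtype=float)
    grad_x[idx] = grad  # only the indexed region receives gradient
    return grad_x

g_int = getitem_backward((3,), 1, 1.0)
g_slice = getitem_backward((4,), slice(1, 3), np.array([5.0, 7.0]))
print(g_int)    # [0. 1. 0.]
print(g_slice)  # [0. 5. 7. 0.]
```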

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.SetItem(*tensors: Tensor)¶

Bases: Function

In-place assignment to a tensor using numpy-style indexing. See autograd.tensor.Tensor.__setitem__() function

Examples

>>> x = Tensor(xp.array([1, 2, 3]))
>>> SetItem.apply(x, Tensor(10, requires_grad=False), idx=1) # Expected: [1, 10, 3]

forward(x: Any, value: Any, idx: Any) → Any¶

Perform in-place assignment on the input tensor.

Parameters:

x (xp.ndarray) – The input tensor.
value (xp.ndarray) – The value to assign.
idx (Any) – The indices at which to assign the new value.

Returns:

xp.ndarray – The tensor after assignment.

backward(grad: Tensor) → Any¶

Compute the gradient for the in-place assignment operation.

The gradient is extracted only from the region specified by the index.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output.
Returns:: xp.ndarray – The gradient with respect to the input tensor.

Examples

>>> x = Tensor(xp.array([1, 2, 3]))
>>> _ = SetItem.apply(x, Tensor(10, requires_grad=False), idx=1)
>>> # During backprop, only index 1 will contribute to the gradient.

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Cast(*tensors: Tensor)¶

Bases: Function

forward(x: Any, dtype: Any) → Any¶

Perform the forward pass of this operation.

This method should be overridden by subclasses to define the specific behavior of the operation. It receives NumPy arrays corresponding to the data of the input tensors.

Parameters:

*args (xp.ndarray) – Data arrays for the input tensors.
**kwargs (Any) – Additional keyword arguments.

Returns:

xp.ndarray – The result of the forward pass as a NumPy array.

Raises:

NotImplementedError – If this method is not implemented in a subclass.

backward(grad: Tensor) → Any¶

Perform the backward pass of this operation.

This method should be overridden by subclasses to define how gradients are computed and propagated back to the input tensors.

In this context: - “grad” (the method argument) is the gradient of the loss function with respect to the output of this operation (dL/d[out]). - The return value should be the gradient of the loss function with respect to the input of this operation (dL/d[input]), so it can be passed further back along the computational graph.

Parameters:: grad (Tensor) – The gradient with respect to the output of this operation.
Returns:: xp.ndarray – The gradient with respect to the input(s).
Raises:: NotImplementedError – If this method is not implemented in a subclass.

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Sqrt(*tensors: Tensor)¶

Bases: Function

Compute the element-wise square root of a tensor. See autograd.tensor.Tensor.sqrt() function

Examples

>>> x = Tensor(xp.array([4, 9, 16]))
>>> y = Sqrt.apply(x) # Expected: [2, 3, 4]

forward(x: Any) → Any¶

Compute the square root of each element in the input tensor.

Parameters:: x (xp.ndarray) – The input tensor.
Returns:: xp.ndarray – The element-wise square root of x.

backward(grad: Tensor) → Any¶

Compute the gradient for the square root operation.

The derivative of the square root is given by:

\[ \frac{d}{dx}\sqrt{x} = \frac{1}{2\sqrt{x}} \]

The gradient is computed by multiplying the incoming gradient grad by this derivative.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output.
Returns:: xp.ndarray – The gradient of the loss with respect to the input tensor.

Examples

>>> x = Tensor(xp.array([4, 9, 16]))
>>> y = x.sqrt()
>>> # If grad is ones, the gradient should be 0.5 / sqrt(x)
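
A small NumPy check of the derivative above, assuming an upstream gradient of ones. It compares the analytic value with a central finite difference; nothing here uses the autograd package itself.

```python
import numpy as np

x = np.array([4.0, 9.0, 16.0])
grad = np.ones_like(x)            # upstream gradient dL/d[out]

# Chain rule: incoming gradient times 1 / (2 * sqrt(x)).
grad_x = grad * 0.5 / np.sqrt(x)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (np.sqrt(x + eps) - np.sqrt(x - eps)) / (2 * eps)

print(grad_x)   # approximately [0.25, 0.1667, 0.125]
print(numeric)  # should agree to several decimal places
```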

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Sum(*tensors: Tensor)¶

Bases: Function

Compute the sum of tensor elements. See autograd.tensor.Tensor.sum() function

Examples

>>> x = Tensor(xp.array([[1, 2], [3, 4]]))
>>> s = Sum.apply(x, axis=0, keepdims=True)  # Expected: Tensor with data [[4, 6]]

forward(x: Any, axis: int | Tuple[int, ...] | None = None, keepdims: bool = False) → Any¶

Compute the forward pass for the sum operation.

Parameters:

x (xp.ndarray) – Input tensor.
axis (int or tuple of ints, optional) – Axis or axes along which the sum is performed. If None, the sum of all elements is computed.
keepdims (bool, optional) – If True, the reduced axes are kept in the output as dimensions with size one.

Returns:

xp.ndarray – The sum of the tensor elements.

backward(grad: Tensor) → Any¶

Compute the backward pass for the sum operation.

This method computes the gradient of the sum operation by broadcasting the gradient to the shape of the input tensor.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output of the sum operation.
Returns:: xp.ndarray – The gradient of the loss with respect to the input tensor.

Examples

>>> x = Tensor(xp.array([[1, 2], [3, 4]]))
>>> s = x.sum(axis=0)
>>> # During backprop, the gradient is broadcast back to shape (2,2)
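
One way to picture the broadcast is the NumPy sketch below. sum_backward is a hypothetical helper: it re-inserts the reduced axes when keepdims was False and then broadcasts to the input shape. The library may organise this differently.

```python
import numpy as np

def sum_backward(x_shape, grad, axis=None, keepdims=False):
    """Hypothetical helper: broadcast grad back to the shape of the input."""
    grad = np.asarray(grad, dtype=float)
    if axis is not None and not keepdims:
        # Restore the reduced axes as size-1 dimensions.
        grad = np.expand_dims(grad, axis)
    return np.broadcast_to(grad, x_shape).copy()

gx = sum_backward((2, 2), np.array([10.0, 20.0]), axis=0)
print(gx)  # both rows are [10, 20]
```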

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Max(*tensors: Tensor)¶

Bases: Function

See autograd.tensor.Tensor.max() function

Examples

>>> x = Tensor(xp.array([[1, 5], [3, 4]]))
>>> m = Max.apply(x, axis=0, keepdims=True) # Expected: [[3, 5]]

forward(x: Any, axis: int | Tuple[int, ...] | None = None, keepdims: bool = False) → Any¶

Compute the maximum of tensor elements.

Parameters:

x (xp.ndarray) – Input tensor.
axis (int or tuple of ints, optional) – Axis or axes along which the maximum is computed. If None (default), the maximum of all elements is computed.
keepdims (bool, optional) – If True, the reduced axes are kept in the output as dimensions with size one.

Returns:: xp.ndarray – The maximum values computed along the specified axis.

backward(grad: Tensor) → Any¶

Compute the gradient of the maximum operation.

The backward pass for the maximum operation is computed using the chain rule:

\[ \frac{\partial \text{loss}}{\partial x} = \frac{\partial \text{loss}}{\partial \max(x)} \cdot \frac{\partial \max(x)}{\partial x} \]

where

\[\begin{split} \frac{\partial \max(x)}{\partial x} = \begin{cases} 1, & \text{if } x = \max(x) \\ 0, & \text{otherwise} \end{cases} \end{split}\]

In cases where multiple elements are equal to the maximum, the gradient is distributed equally or assigned to the first occurrence along the specified axis.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output of the maximum operation.
Returns:: xp.ndarray – The gradient of the loss with respect to the input tensor.

Examples

>>> x = Tensor(xp.array([[1, 5], [3, 4]]))
>>> m = x.max(axis=0, keepdims=True)
>>> # During backprop, gradient is distributed to the positions where x equals the max.
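
The mask-based rule above might look like this in NumPy. max_backward is a hypothetical helper, and splitting ties equally is an assumption chosen for the sketch; the text above leaves open whether ties are split or go to the first occurrence.

```python
import numpy as np

def max_backward(x, grad, axis=None):
    """Hypothetical helper; grad is expected in keepdims=True layout."""
    max_vals = x.max(axis=axis, keepdims=True)
    mask = (x == max_vals).astype(float)         # 1 where x equals the max
    mask /= mask.sum(axis=axis, keepdims=True)   # assumption: split ties equally
    return mask * grad

x = np.array([[1.0, 5.0], [3.0, 4.0]])
gx = max_backward(x, np.ones((1, 2)), axis=0)
print(gx)  # [[0, 1], [1, 0]]: the column maxima are 3 and 5

tied = max_backward(np.array([2.0, 2.0, 1.0]), np.ones(1), axis=0)
print(tied)  # [0.5, 0.5, 0]
```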

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Maximum(*tensors: Tensor)¶

Bases: Function

Compute the element-wise maximum between two tensors. See autograd.tensor.Tensor.maximum() function

Examples

>>> x = Tensor(xp.array([1, 5, 3]))
>>> y = Tensor(xp.array([2, 4, 3]))
>>> z = Maximum.apply(x, y)  # Expected: [2, 5, 3]

forward(x: Any, y: Any) → Any¶

Compute the element-wise maximum of two tensors.

Parameters:

x (xp.ndarray) – First input tensor.
y (xp.ndarray) – Second input tensor.

Returns:

xp.ndarray – The element-wise maximum of the two input tensors.

backward(grad: Tensor) → Tuple[Any | None, Any | None]¶

Compute the gradient of the element-wise maximum operation.

During the backward pass, the gradient is distributed to the inputs according to the rule:

\[\begin{split} \frac{\partial \text{loss}}{\partial x} = \text{grad} \times \begin{cases} 1, & \text{if } x > y \\ 0.5, & \text{if } x = y \\ 0, & \text{otherwise} \end{cases} \end{split}\]

and similarly for $y$.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output of the maximum operation.
Returns:: Tuple[Optional[xp.ndarray], Optional[xp.ndarray]] – Gradients of the loss with respect to the input tensors x and y.

Examples

>>> x = Tensor(xp.array([1, 5, 3]))
>>> y = Tensor(xp.array([2, 4, 3]))
>>> z = x.maximum(y)
>>> # Backpropagation would distribute the gradient to x and y based on the maximum rule.
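
The piecewise rule above translates almost directly into boolean masks. The sketch below is NumPy-only, and maximum_backward is a hypothetical name used for illustration.

```python
import numpy as np

def maximum_backward(x, y, grad):
    """Hypothetical helper: 1 where an input wins, 0.5 on ties, 0 otherwise."""
    tie = 0.5 * (x == y)
    grad_x = grad * ((x > y) + tie)
    grad_y = grad * ((y > x) + tie)
    return grad_x, grad_y

x = np.array([1.0, 5.0, 3.0])
y = np.array([2.0, 4.0, 3.0])
gx, gy = maximum_backward(x, y, np.ones(3))
print(gx)  # [0, 1, 0.5]
print(gy)  # [1, 0, 0.5]
```

Because the two weights always sum to one at every position, the total gradient flowing back equals the incoming gradient.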

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Mean(*tensors: Tensor)¶

Bases: Function

Compute the mean of tensor elements. See autograd.tensor.Tensor.mean() function

Examples

>>> x = Tensor(xp.array([[1, 2], [3, 4]]))
>>> m = Mean.apply(x, axis=0) # Expected: [2, 3]

forward(x: Any, axis: int | Tuple[int, ...] | None = None, keepdims: bool = False) → Any¶

Compute the forward pass for the mean operation.

Parameters:

x (xp.ndarray) – Input tensor.
axis (int or tuple of ints, optional) – Axis or axes along which the mean is computed. If None, the mean of all elements is computed.
keepdims (bool, optional) – If True, the reduced axes are retained in the output as dimensions with size one.

Returns:

xp.ndarray – The mean of the tensor elements.

backward(grad: Tensor) → Any¶

Compute the gradient of the mean operation.

The gradient is computed by broadcasting the gradient to the shape of the input tensor and scaling it by the number of elements that were averaged:

\[ \frac{\partial \text{loss}}{\partial x} = \frac{\text{grad}}{N} \]

where N is the number of elements over which the mean was computed.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output of the mean operation.
Returns:: xp.ndarray – The gradient of the loss with respect to the input tensor.

Examples

>>> x = Tensor(xp.array([[1, 2], [3, 4]]))
>>> m = x.mean()
>>> # During backprop, the gradient is divided by the number of elements.
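
A minimal NumPy sketch of the grad / N rule, where N is derived as the input size divided by the output size. mean_backward is a hypothetical helper and not the library's code.

```python
import numpy as np

def mean_backward(x_shape, grad, axis=None, keepdims=False):
    """Hypothetical helper: broadcast grad to x_shape and divide by N."""
    grad = np.asarray(grad, dtype=float)
    if axis is not None and not keepdims:
        grad = np.expand_dims(grad, axis)      # restore reduced axes
    n = int(np.prod(x_shape)) // grad.size     # elements averaged per output
    return np.broadcast_to(grad, x_shape) / n

g_all = mean_backward((2, 2), 1.0)                  # mean over all 4 elements
g_axis = mean_backward((2, 2), np.ones(2), axis=0)  # mean over 2 rows
print(g_all)   # every entry is 0.25
print(g_axis)  # every entry is 0.5
```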

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Gather(*tensors: Tensor)¶

Bases: Function

Gather operation for 2D tensors along axis 0 using integer indices. See autograd.tensor.Tensor.gather() function

Examples

>>> x = Tensor(xp.array([[10, 20], [30, 40], [50, 60]]))
>>> g = Gather.apply(x, index=xp.array([0, 2]))
>>> print(g.data)
[[10, 20],
 [50, 60]]

forward(x: Any, index: Any) → Any¶

Perform the forward pass of the gather operation.

Parameters:

x (xp.ndarray) – The input 2D tensor.
index (xp.ndarray) – An array of integer indices specifying the rows to gather.

Returns:

xp.ndarray – A tensor containing the gathered rows.

backward(grad: Tensor) → Tuple[Any, None]¶

Perform the backward pass of the gather operation.

The backward pass accumulates the gradients from the output back into the corresponding rows of the input tensor using numpy’s in-place addition.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output.
Returns:: Tuple[xp.ndarray, None] – A tuple where the first element is the gradient with respect to the input tensor and the second element is None, since indices are not differentiable.

Examples

>>> x = Tensor(xp.array([[10, 20], [30, 40], [50, 60]]))
>>> g = x.gather(xp.array([0, 2]))
>>> # Backpropagated gradient will be placed in rows 0 and 2 of a zero tensor.
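
The accumulation matters when the same row is gathered more than once. The sketch below uses np.add.at for the unbuffered in-place addition; that choice is an assumption about how the accumulation could be done, and gather_backward is a hypothetical helper.

```python
import numpy as np

def gather_backward(x_shape, index, grad):
    """Hypothetical helper: accumulate row gradients back into the input."""
    grad_x = np.zeros(x_shape, dtype=float)
    np.add.at(grad_x, index, grad)  # repeated indices are summed, not overwritten
    return grad_x, None             # indices receive no gradient

index = np.array([0, 2, 0])         # row 0 is gathered twice
gx, g_index = gather_backward((3, 2), index, np.ones((3, 2)))
print(gx)  # rows: [2, 2], [0, 0], [1, 1]
```

A plain `grad_x[index] += grad` would write row 0 only once here, which is why an unbuffered add is used in this sketch.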

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.View(*tensors: Tensor)¶

Bases: Function

View the tensor with a new shape without copying data. See autograd.tensor.Tensor.view() function

Examples

>>> x = Tensor(xp.array([1, 2, 3, 4]))
>>> v = View.apply(x, new_shape=(2,2))
>>> print(v.data.shape)  # Expected: (2, 2)

forward(x: Any, new_shape: Tuple[int, ...] | List[int] = (1,)) → Any¶

Reshape the input tensor to a new view with the specified shape.

Parameters:

x (xp.ndarray) – The input tensor.
new_shape (Union[Tuple[int, ...], List[int]], optional) – The desired new shape. If a -1 is present, that dimension is inferred from the size of the input tensor. Defaults to (1,).

Returns:

xp.ndarray – A view of the tensor with the specified new shape.

Raises:

ValueError – If more than one -1 is specified or if the new shape is incompatible with the total number of elements in x.

backward(grad: Tensor | None) → Any | None¶

Reshape the gradient to match the original tensor shape.

Parameters:: grad (Tensor, optional) – The gradient of the loss with respect to the output.
Returns:: Optional[xp.ndarray] – The gradient reshaped to the original tensor shape, or None if grad is None.

Examples

>>> x = Tensor(xp.array([1,2,3,4]))
>>> v = x.view(2,2)
>>> # Backward pass would reshape grad from (2,2) back to original shape.

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Expand(*tensors: Tensor)¶

Bases: Function

Expand the tensor to a given shape without copying data (broadcasting). See autograd.tensor.Tensor.expand() function

Examples

>>> x = Tensor(xp.array([1,2,3]))
>>> e = Expand.apply(x, shape=(3,3)) # Expected: array where each row is [1,2,3]

forward(x: Any, shape: Tuple[int, ...] | List[int] = (1,)) → Any¶

Broadcast the input tensor to the specified shape.

Parameters:

x (xp.ndarray) – The input tensor.
shape (Union[Tuple[int, ...], List[int]], optional) – The target shape for broadcasting. Defaults to (1,).

Returns:

xp.ndarray – A new tensor broadcast to the specified shape.

backward(grad: Tensor) → Any¶

Compute the gradient of the expand operation.

The gradient is reduced by summing over the broadcast dimensions so that its shape matches the original input tensor.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output.
Returns:: xp.ndarray – The gradient of the loss with respect to the input tensor.

Examples

>>> x = Tensor(xp.array([1,2,3]))
>>> e = x.expand(3, 3)
>>> # Backward pass would sum gradients over broadcasted dimensions.
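
Summing over the broadcast dimensions can be sketched as two steps: first remove leading axes that broadcasting added, then collapse axes that had size one in the input. expand_backward is a hypothetical NumPy helper, not the library's implementation.

```python
import numpy as np

def expand_backward(x_shape, grad):
    """Hypothetical helper: reduce grad so its shape matches x_shape."""
    extra = grad.ndim - len(x_shape)
    if extra > 0:
        # Axes prepended by broadcasting are summed away entirely.
        grad = grad.sum(axis=tuple(range(extra)))
    # Axes that were size 1 in the input are summed but kept.
    axes = tuple(i for i, n in enumerate(x_shape) if n == 1 and grad.shape[i] != 1)
    if axes:
        grad = grad.sum(axis=axes, keepdims=True)
    return grad

g1 = expand_backward((3,), np.ones((3, 3)))    # (3,) was expanded to (3, 3)
g2 = expand_backward((1, 3), np.ones((4, 3)))  # (1, 3) was expanded to (4, 3)
print(g1)  # [3, 3, 3]
print(g2)  # [[4, 4, 4]]
```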

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Reshape(*tensors: Tensor)¶

Bases: Function

Reshape the tensor to a new shape without changing its data content. See autograd.tensor.Tensor.reshape() function

Examples

>>> x = Tensor(xp.array([[1,2],[3,4]]))
>>> r = Reshape.apply(x, shape=(4,))
>>> print(r.data)  # Expected: [1, 2, 3, 4]

forward(x: Any, shape: Tuple[int, ...] | List[int] = (1,)) → Any¶

Reshape the input tensor to the specified new shape.

Parameters:

x (xp.ndarray) – The input tensor.
shape (Union[Tuple[int, ...], List[int]], optional) – The new shape for the tensor. If a nested tuple or list is provided, it will be flattened. Defaults to (1,).

Returns:

xp.ndarray – The reshaped tensor.

backward(grad: Tensor | None) → Any | None¶

Reshape the gradient to match the original tensor shape.

Parameters:: grad (Tensor, optional) – The gradient of the loss with respect to the reshaped output.
Returns:: Optional[xp.ndarray] – The gradient reshaped to the original tensor shape, or None if grad is None.

Examples

>>> x = Tensor(xp.array([[1,2],[3,4]]))
>>> r = x.reshape(4)
>>> # Backward would reshape grad from shape (4,) to (2,2)

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Transpose(*tensors: Tensor)¶

Bases: Function

Transpose operation for swapping any two dimensions of a tensor. See autograd.tensor.Tensor.transpose() function

Examples

>>> x = Tensor(xp.array([[1, 2], [3, 4]]))
>>> t = Transpose.apply(x, dim0=0, dim1=1)
>>> print(t.data)  # Expected: [[1, 3], [2, 4]]

forward(x: Any, dim0: int = 0, dim1: int = 1) → Any¶

Transpose the input tensor by swapping two specified dimensions.

Parameters:

x (xp.ndarray) – The input tensor.
dim0 (int, optional) – The first dimension to swap. Defaults to 0.
dim1 (int, optional) – The second dimension to swap. Defaults to 1.

Returns:

xp.ndarray – The transposed tensor.

Raises:

ValueError – If the specified dimensions are out of range for the input tensor.

backward(grad: Tensor) → Any¶

Transpose the gradient tensor to match the original input tensor’s dimension order.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the output tensor.
Returns:: xp.ndarray – The gradient with dimensions swapped back to their original order.

Examples

>>> x = Tensor(xp.array([[1,2],[3,4]]))
>>> t = x.transpose(0,1)
>>> # Backward would apply the inverse transpose.

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Pad(*tensors: Tensor)¶

Bases: Function

Pad the tensor with a specified padding. See autograd.tensor.Tensor.pad() function

Examples

>>> x = Tensor(xp.array([[1, 2], [3, 4]]))
>>> p = Pad.apply(x, pad_width=1, mode="constant", constant_values=0) # Expected padded tensor with zeros around the original data.

forward(x: Any, pad_width: int | Tuple[int, int] | Tuple[int, int, int, int] | Tuple[Tuple[int, int], ...], mode: str = 'constant', constant_values: int | float = 0) → Any¶

Pad the input tensor according to the specified pad width and mode.

Parameters:

x (xp.ndarray) – The input tensor.
pad_width (int or tuple) – Padding specification. See class docstring for details.
mode (str, optional) – Padding mode. Defaults to “constant”.
constant_values (int or float, optional) – Value for constant padding. Defaults to 0.

Returns:

xp.ndarray – The padded tensor.

backward(grad: Tensor) → Any¶

Extract the unpadded region from the gradient.

This method removes the padding from the gradient tensor, returning only the region corresponding to the original input tensor.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the padded output.
Returns:: xp.ndarray – The gradient corresponding to the unpadded input.

Examples

>>> x = Tensor(xp.array([[1,2],[3,4]]))
>>> p = x.pad(1, mode="constant", constant_values=0)
>>> # During backprop, the gradient will be unpadded to match x.
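
For a uniform integer pad width, removing the padding is a single slice. The NumPy sketch below assumes pad >= 1 (a pad of 0 would need special handling because of the negative slice bound) and does not use the autograd package.

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
pad = 1
padded = np.pad(x, pad_width=pad, mode="constant", constant_values=0)

# Pretend this is the upstream gradient for the padded (4, 4) output.
grad = np.arange(16.0).reshape(padded.shape)

# Keep only the region that corresponds to the original input.
grad_x = grad[pad:-pad, pad:-pad]
print(grad_x)  # [[5, 6], [9, 10]]
```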

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Cat(*tensors: Tensor)¶

Bases: Function

Concatenate a sequence of tensors along a specified axis. See autograd.tensor.Tensor.cat() function

Examples

>>> x = Tensor(xp.array([[1, 2]]))
>>> y = Tensor(xp.array([[3, 4]]))
>>> c = Cat.apply(x, y, axis=0) # Expected: [[1,2],[3,4]]

forward(*tensors: Any, axis: int = 0) → Any¶

Concatenate input tensors along the specified axis.

Parameters:

*tensors (Tensor) – A sequence of tensors to concatenate.
axis (int, optional) – The axis along which to concatenate. Defaults to 0.

Returns:

xp.ndarray – The concatenated tensor.

backward(grad: Tensor) → Tuple[Any | None, ...]¶

Split the gradient among the concatenated tensors.

The gradient is divided along the concatenation axis based on the original shapes of the input tensors.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the concatenated output.
Returns:: Tuple[Optional[xp.ndarray], …] – A tuple of gradients corresponding to each input tensor.

Examples

>>> x = Tensor(xp.array([[1, 2]]))
>>> y = Tensor(xp.array([[3, 4]]))
>>> c = Tensor.cat([x, y], axis=0)
>>> # Backward pass would split the gradient along axis 0.
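
Splitting by the original sizes along the concatenation axis can be sketched with np.split. cat_backward is a hypothetical helper that takes the input shapes explicitly; how the library records those shapes is not shown here.

```python
import numpy as np

def cat_backward(shapes, grad, axis=0):
    """Hypothetical helper: split grad into one piece per input tensor."""
    sizes = [shape[axis] for shape in shapes]
    split_points = np.cumsum(sizes)[:-1]   # boundaries between the inputs
    return tuple(np.split(grad, split_points, axis=axis))

grad = np.arange(6.0).reshape(3, 2)
g_a, g_b = cat_backward([(1, 2), (2, 2)], grad, axis=0)
print(g_a)  # [[0, 1]]
print(g_b)  # [[2, 3], [4, 5]]
```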

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Permute(*tensors: Tensor)¶

Bases: Function

Reorder the dimensions of a tensor. See autograd.tensor.Tensor.permute() function

Examples

>>> x = Tensor(xp.array([[[1,2],[3,4]], [[5,6],[7,8]]]))  # shape (2,2,2)
>>> p = Permute.apply(x, dims=(1,0,2))
>>> print(p.data.shape)  # Expected shape: (2,2,2) with dimensions permuted.

forward(x: Any, dims: Sequence[int]) → Any¶

Permute the dimensions of the input tensor.

Parameters:

x (xp.ndarray) – The input tensor.
dims (Sequence[int]) – The new order of dimensions. If a single element that is a tuple or list is provided, it will be unpacked.

Returns:

xp.ndarray – The tensor with permuted dimensions.

backward(grad: Tensor) → Any¶

Compute the gradient for the permutation operation.

The gradient is transposed using the inverse permutation of the forward pass.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the permuted output.
Returns:: xp.ndarray – The gradient with dimensions restored to their original order.

Examples

>>> x = Tensor(xp.array([[1,2],[3,4]]))
>>> p = x.permute(1,0)
>>> # Backward would transpose the gradient back.
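
The inverse permutation can be obtained with argsort: if dims sends axis dims[i] to position i, then argsort(dims) sends it back. A NumPy-only sketch, with variable names chosen for illustration:

```python
import numpy as np

dims = (2, 0, 1)
inverse = tuple(int(i) for i in np.argsort(dims))  # (1, 2, 0)

x = np.arange(24).reshape(2, 3, 4)
p = x.transpose(dims)            # forward permute: shape (4, 2, 3)
restored = p.transpose(inverse)  # what backward applies to the gradient

print(p.shape)         # (4, 2, 3)
print(restored.shape)  # (2, 3, 4)
```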

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Stack(*tensors: Tensor)¶

Bases: Function

Stack a sequence of tensors along a new axis. See autograd.tensor.Tensor.stack() function

Examples

>>> x = Tensor(xp.array([1,2]))
>>> y = Tensor(xp.array([3,4]))
>>> s = Stack.apply(x, y, axis=0) # Expected: [[1,2],[3,4]]

forward(*tensors: Any, axis: int = 0) → Any¶

Stack input tensors along a new axis.

This method expands the dimensions of each input tensor along the specified axis and concatenates them.

Parameters:

*tensors (Tensor) – A sequence of tensors to be stacked.
axis (int, optional) – The axis along which to stack the tensors. Defaults to 0.

Returns:

xp.ndarray – The stacked tensor.

Raises:

ValueError – If no tensors are provided.

backward(grad: Tensor) → Tuple[Any | None, ...]¶

Split the gradient among the stacked tensors.

The gradient is divided along the stacking axis and reshaped to match each input tensor’s original shape.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the stacked tensor.
Returns:: Tuple[Optional[xp.ndarray], …] – A tuple of gradients corresponding to each input tensor.

Examples

>>> x = Tensor(xp.array([1,2]))
>>> y = Tensor(xp.array([3,4]))
>>> s = Tensor.stack([x, y], axis=0)
>>> # Backward pass would split grad along the new axis.
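
Because stacking adds a new axis, each input's gradient is simply one slice along that axis. stack_backward below is a hypothetical NumPy helper illustrating the idea.

```python
import numpy as np

def stack_backward(grad, axis=0):
    """Hypothetical helper: one slice of grad per stacked input."""
    count = grad.shape[axis]
    # np.take with a scalar index drops the stacking axis again.
    return tuple(np.take(grad, i, axis=axis) for i in range(count))

grad = np.array([[1.0, 2.0], [3.0, 4.0]])
g_x, g_y = stack_backward(grad, axis=0)
print(g_x)  # [1, 2]
print(g_y)  # [3, 4]
```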

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.StridedWindows(*tensors: Tensor)¶

Bases: Function

Create a strided windows view of a tensor. See autograd.tensor.Tensor.strided_windows() function

Examples

>>> x = Tensor(xp.random.rand(2, 3, 10, 10))
>>> windows = StridedWindows.apply(x, kernel_size=3, stride=1)
>>> print(windows.shape)
(8, 8, 2, 3, 3, 3)

forward(x: Any, kernel_size: int, stride: int) → Any¶

Create a strided windows view of the input tensor.

Parameters:

x (xp.ndarray) – The input tensor of shape (batch_size, channels, height, width).
kernel_size (int) – The size of each window.
stride (int) – The stride between windows.

Returns:

xp.ndarray – A view of the tensor with shape

$(H_{out}, W_{out}, batch\_size, channels, kernel\_size, kernel\_size)$, where $H_{out} = \frac{height - kernel\_size}{stride} + 1$ and $W_{out} = \frac{width - kernel\_size}{stride} + 1$.

backward(grad: Tensor) → Any¶

Reconstruct the gradient for the input tensor from the strided windows gradient.

This method reshapes and transposes the gradient of the strided windows view back to the original input tensor shape by accumulating overlapping gradients.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the strided windows output.
Returns:: xp.ndarray – The gradient of the loss with respect to the original input tensor.

Examples

>>> x = Tensor(xp.random.rand(2, 3, 10, 10))
>>> windows = x.strided_windows(kernel_size=3, stride=1)
>>> # Backward pass would accumulate gradients from overlapping windows.
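
To make the accumulation of overlapping gradients concrete, here is a NumPy sketch. It builds the windows with numpy.lib.stride_tricks.sliding_window_view (NumPy 1.20 or later) and accumulates with an explicit loop; both function names are hypothetical, and the library's own implementation may use a different stride trick.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def strided_windows(x, kernel_size, stride):
    """Hypothetical forward: (N, C, H, W) -> (H_out, W_out, N, C, k, k)."""
    w = sliding_window_view(x, (kernel_size, kernel_size), axis=(2, 3))
    w = w[:, :, ::stride, ::stride]       # (N, C, H_out, W_out, k, k)
    return w.transpose(2, 3, 0, 1, 4, 5)

def strided_windows_backward(x_shape, grad, kernel_size, stride):
    """Hypothetical backward: add each window's gradient back in place."""
    grad_x = np.zeros(x_shape, dtype=float)
    h_out, w_out = grad.shape[:2]
    for i in range(h_out):
        for j in range(w_out):
            r, c = i * stride, j * stride
            grad_x[:, :, r:r + kernel_size, c:c + kernel_size] += grad[i, j]
    return grad_x

x = np.random.rand(2, 3, 10, 10)
windows = strided_windows(x, kernel_size=3, stride=1)
grad_x = strided_windows_backward(x.shape, np.ones(windows.shape), 3, 1)

print(windows.shape)       # (8, 8, 2, 3, 3, 3)
print(grad_x[0, 0, 0, 0])  # 1.0: a corner pixel lies in one window
print(grad_x[0, 0, 5, 5])  # 9.0: an interior pixel lies in 3 x 3 windows
```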

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

class autograd.tensor.Roll(*tensors: Tensor)¶

Bases: Function

Roll tensor elements along a specified dimension. See autograd.tensor.Tensor.roll() function

Examples

>>> x = Tensor(xp.array([1,2,3,4,5]))
>>> r = Roll.apply(x, shifts=2, dims=0)  # Expected: [4,5,1,2,3]

__annotations__ = {}¶

__module__ = 'autograd.tensor'¶

forward(x: Any, shifts: int, dims: int | None = None) → Any¶

Roll the elements of the input tensor.

Parameters:

x (xp.ndarray) – The input tensor.
shifts (int) – The number of positions to shift the elements.
dims (int, optional) – The axis along which to roll the elements. If None, the tensor is flattened before rolling.

Returns:

xp.ndarray – The tensor with its elements rolled along the specified dimension.

backward(grad: Tensor) → Any¶

Perform the backward pass for the roll operation.

The gradient is rolled in the opposite direction (by negating the shift) to reverse the forward roll.

Parameters:: grad (Tensor) – The gradient of the loss with respect to the rolled output.
Returns:: xp.ndarray – The gradient of the loss with respect to the input tensor.

Examples

>>> x = Tensor(xp.array([1,2,3,4,5]))
>>> r = x.roll(shifts=2, dims=0)
>>> # Backward would roll grad in the opposite direction.
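
The negated shift can be checked directly with np.roll. This sketch is NumPy-only and assumes a single integer shift along one axis.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
rolled = np.roll(x, 2, axis=0)       # forward pass

grad = np.array([10, 20, 30, 40, 50])  # upstream gradient for `rolled`
grad_x = np.roll(grad, -2, axis=0)     # backward pass: opposite shift

print(rolled)  # [4 5 1 2 3]
print(grad_x)  # [30 40 50 10 20]
```

Element x[0] ends up at position 2 after the forward roll, so its gradient is grad[2], which is what the reverse roll delivers.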

autograd.nn module¶

class autograd.nn.Module(*args, **kwargs)¶

Bases: object

Base class for all neural network modules.

This class provides mechanisms for registering parameters, submodules, and states, and implements common functionality such as zero_grad, forward, and state dict management.

Note that Module does not implement backward(): every backward pass is implemented by the tensor-level operations. A module's forward method simply composes those tensor-level operations, like Lego bricks.

_parameters¶

Dictionary of trainable parameters.

Type:: Dict[str, Tensor]

_modules¶

Dictionary of submodules.

Type:: Dict[str, Module]

_states¶

Dictionary of non-trainable states/buffers.

Type:: Dict[str, Any]

_is_training¶

Flag indicating training mode.

Type:: Optional[bool]

Examples

>>> # Define a simple custom module by subclassing Module.
>>> class MyModule(Module):
...     def forward(self, x):
...         return x * 2
>>> module = MyModule()
>>> import cupy as xp  # or: import numpy as xp
>>> from autograd.tensor import Tensor
>>> input_tensor = Tensor(xp.array([1, 2, 3]))
>>> output = module(input_tensor) # Expected output: [2, 4, 6]

zero_grad() → None¶

Zero the gradients for all parameters in the module and its submodules.

Examples

>>> # Assuming module has trainable parameters with gradients.
>>> module.zero_grad()

abstractmethod forward(x: Any) → Tensor¶

Perform the forward pass.

Parameters:: x (Any) – Input data.
Returns:: Tensor – The output tensor produced by the forward pass.
Raises:: NotImplementedError – If the method is not overridden by a subclass.

Examples

>>> class MyModule(Module):
...     def forward(self, x):
...         return x + 1
>>> module = MyModule()
>>> from autograd.tensor import Tensor
>>> import cupy as xp  # or: import numpy as xp
>>> x = Tensor(xp.array([1, 2, 3]))
>>> y = module(x) # Expected: [2, 3, 4]

apply(func: Callable) → None¶

Apply a function recursively to every submodule in-place. This can be useful for dynamically adjusting the gradients or parameters of the model, for example clipping gradient norms or setting parameters to a specific value.

Parameters:: func (Callable) – A function that takes a Module and applies some operation.

Examples

>>> # Example: print the type of each module.
>>> def print_module(m):
...     print(type(m))
>>> module.apply(print_module)

property parameters: Dict[str, Any]¶

Get a flattened dictionary of all trainable parameters from the module and its submodules.

{
    "weight": "Tensor",
    "submodule1.weight": "Tensor"
}

Returns:: Dict[str, Any] – A dictionary mapping parameter names to Tensor objects.

Examples

>>> # Assuming module has parameters 'weight' and a submodule with 'bias'
>>> params = module.parameters
>>> print(params.keys())

property states: Dict[str, Any]¶

Get a flattened dictionary of all non-trainable states or buffers from the module and its submodules.

{
    "some_state": "array",
    "submodule1.running_var": "array"
}

Returns:: Dict[str, Any] – A dictionary mapping state names to their values.

Examples

>>> # Assuming module has a state 'running_mean' in a BatchNorm submodule.
>>> states = module.states
>>> print(states)

state_dict() → Dict[str, Dict[str, Any]]¶

Return a state dictionary of the module.

The state dictionary contains two keys:

“parameters”: A dictionary mapping parameter names to their raw arrays.
“states”: A dictionary mapping state names to their values.

Returns:: Dict[str, Dict[str, Any]] – The state dictionary.

{
    "parameters": { "weight": "array", "bias": "array" },
    "states": { "stateful_states": "array" }
}

Examples

>>> state = module.state_dict()
>>> print(state.keys())  # Expected output: dict_keys(['parameters', 'states'])

load_state_dict(state_dict: Dict[str, Any], strict: bool = True) → None¶

Load the module’s state from a state dictionary.

Expects a dict of the form:

{
    "parameters": { "weight": "array", "bias": "array" },
    "states": { "stateful_states": "array" }
}

Parameters:

state_dict (Dict[str, Any]) – A dictionary containing the module’s parameters and states.
strict (bool) – If True, parameter/state keys must match the current module.

Examples

>>> # Save a state dictionary and later load it into the module.
>>> state = module.state_dict()
>>> module.load_state_dict(state)

num_parameters() → int¶

Calculate the total number of trainable parameters in the module and its submodules.

Returns:: int – The total number of parameters.

Examples

>>> print(module.num_parameters())

train() → None¶

Set the module and all its submodules to training mode.

Examples

>>> module.train()

eval() → None¶

Set the module and all its submodules to evaluation mode.

Examples

>>> module.eval()

class autograd.nn.ModuleList(modules=None)¶

Bases: Module

A container for holding submodules in a list-like structure.

This container registers each submodule so that they are included in the module’s parameters and state dictionaries.

Examples

>>> # Create a ModuleList with two simple modules.
>>> class MyModule(Module):
...     def forward(self, x):
...         return x + 1
>>> ml = ModuleList([MyModule(), MyModule()])
>>> for m in ml:
...     print(m.forward(Tensor(xp.array([1]))).data)
[2]
[2]

append(module: Module) → None¶

Append a module to the ModuleList.

Parameters:: module (Module) – The module to append.

Examples

>>> ml = ModuleList()
>>> ml.append(MyModule())

class autograd.nn.Linear(input_size: int, output_size: int, **kwargs: Any)¶

Bases: Module

A linear (fully connected) layer.

This layer performs a linear transformation:

\[ y = xW + b \]

where $W$ is the weight matrix and $b$ is the bias.

Examples

>>> linear = Linear(4, 2)
>>> import cupy as xp  # or: import numpy as xp
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(3, 4))
>>> y = linear(x) # Expected shape: (3, 2)

Compute the forward pass of the Linear layer.

Parameters:: x (Union[Tensor, xp.ndarray]) – The input tensor.
Returns:: Tensor – The result of the linear transformation.

Examples

>>> linear = Linear(5, 3)
>>> import cupy as xp  # or: import numpy as xp
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(10, 5))
>>> y = linear(x) # Expected: (10, 3)
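
The shape arithmetic behind y = xW + b can be sketched in plain NumPy. The weight layout (input_size, output_size) is an assumption that follows from writing the product as xW; the layer's actual parameter layout and initialisation are not shown in this reference.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, input_size, output_size = 10, 5, 3

x = rng.standard_normal((batch, input_size))
W = rng.standard_normal((input_size, output_size))  # assumed layout
b = np.zeros(output_size)

y = x @ W + b   # b is broadcast across the batch dimension
print(y.shape)  # (10, 3)
```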

class autograd.nn.Conv2d(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding_mode: str = 'valid', bias: bool = True, **kwargs: Any)¶

Bases: Module

A 2D convolutional layer.

This layer applies a convolution operation over a 4D input tensor with shape (N, in_channels, H, W) and produces an output tensor with shape (N, out_channels, H_out, W_out).

Examples

>>> conv = Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding_mode="same")
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(2, 3, 32, 32))
>>> y = conv(x) # Expected shape: (2, 8, 32, 32)

Compute the forward pass of the Conv2d layer.

Parameters:: x (Union[Tensor, xp.ndarray]) – Input tensor of shape (N, in_channels, H, W).
Returns:: Tensor – Output tensor after applying the convolution and bias addition.

Examples

>>> conv = Conv2d(3, 8, kernel_size=3, stride=1, padding_mode="same")
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(2, 3, 32, 32))
>>> y = conv(x) # Expected: (2, 8, 32, 32)

class autograd.nn.MaxPool2d(kernel_size: int, stride: int | None = None, padding_mode: str = 'valid', **kwargs: Any)¶

Bases: Module

A 2D max pooling layer.

This layer performs max pooling over a sliding window of the input tensor.

Examples

>>> pool = MaxPool2d(kernel_size=2, stride=2, padding_mode="valid")
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(1, 3, 32, 32))
>>> y = pool(x) # Expected: (1, 3, 16, 16)

Compute the forward pass of the MaxPool2d layer.

Parameters:: x (Union[Tensor, xp.ndarray]) – Input tensor.
Returns:: Tensor – Tensor after applying max pooling.

Examples

>>> pool = MaxPool2d(kernel_size=2, stride=2)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(1, 3, 32, 32))
>>> y = pool(x) # Expected: (1, 3, 16, 16)

class autograd.nn.ResidualBlock(in_channels: int, out_channels: int, stride: int = 1)¶

Bases: Module

Residual Block.

Implements a residual block that computes:

\[ H(x) = F(x) + x \]

where F(x) is the residual function learned by the block and x is carried through the identity (skip) connection. Paper: https://arxiv.org/abs/1512.03385

Currently this wraps the Convolution block inside. TODO: Remove the convolutional block

Examples

>>> res_block = ResidualBlock(16, 16, stride=1)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(1, 16, 32, 32))
>>> y = res_block(x) # Expected: (1, 16, 32, 32)

Compute the forward pass of the ResidualBlock.

Parameters:: x (Union[Tensor, xp.ndarray]) – Input tensor.
Returns:: Tensor – Output tensor after applying the residual block.

Examples

>>> res_block = ResidualBlock(16, 16)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(1, 16, 32, 32))
>>> y = res_block(x) # Expected: (1, 16, 32, 32)

class autograd.nn.RecurrentBlock(input_size: int, hidden_size: int, output_size: int | None = None, dropout_prob: float | None = None)¶

Bases: Module

Recurrent Neural Network (RNN) block.

Implements a simple RNN that processes a sequence and returns either the final hidden state or an output computed from the final hidden state if output_size is specified. Paper: https://arxiv.org/abs/1308.0850

Examples

>>> rnn = RecurrentBlock(input_size=4, hidden_size=8, output_size=2)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> # Create a random sequence: batch_size=3, sequence_length=5, input_size=4
>>> x = Tensor(xp.random.randn(3, 5, 4))
>>> y = rnn(x) # Expected: (3, 2)

Perform the forward pass of the RNN.

Parameters:: x (Union[Tensor, xp.ndarray]) – Input tensor of shape (batch_size, sequence_length, input_size).
Returns:: Tensor – Output tensor computed from the final hidden state or the hidden state itself.

Examples

>>> rnn = RecurrentBlock(input_size=4, hidden_size=8, output_size=3)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(3, 5, 4))
>>> y = rnn(x) # Expected: (3, 3)

class autograd.nn.LongShortTermMemoryBlock(input_size: int, hidden_size: int, output_size: int | None = None, dropout_prob: float | None = None)¶

Bases: Module

Long Short-Term Memory (LSTM) block.

Implements an LSTM that processes a sequence and returns the final output and cell state.

Paper: https://www.bioinf.jku.at/publications/older/2604.pdf

Examples

>>> lstm = LongShortTermMemoryBlock(input_size=4, hidden_size=8, output_size=3)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(3, 5, 4))
>>> output, cell_state = lstm(x)
>>> print(output.data.shape)  # Expected: (3, 3)
>>> print(cell_state.data.shape)  # Expected: (3, 8)

Perform the forward pass of the LSTM.

Parameters:

x (Union[Tensor, xp.ndarray]) – Input tensor of shape (batch_size, sequence_length, input_size).
hidden_state (Optional[Tensor], optional) – Initial hidden state. Defaults to zeros.
C_t (Optional[Tensor], optional) – Initial cell state. Defaults to zeros.

Returns:

Tuple[Tensor, Tensor] – A tuple containing the output and the final cell state.

Examples

>>> lstm = LongShortTermMemoryBlock(input_size=4, hidden_size=8, output_size=3)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(3, 5, 4))
>>> output, cell_state = lstm(x)
>>> print(output.data.shape)  # Expected: (3, 3)
>>> print(cell_state.data.shape)  # Expected: (3, 8)

class autograd.nn.Embedding(input_size: int, embedding_size: int)¶

Bases: Module

Embedding layer that projects an arbitrary input_size down to embedding_size.

Examples

>>> embed = Embedding(input_size=100, embedding_size=16)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> # Create a batch of indices with shape (batch_size, seq_len)
>>> x = Tensor(xp.array([[1, 5, 20], [2, 10, 30]]))
>>> y = embed(x) # Expected: (2, 3, 16)

Perform the forward pass of the Embedding layer.

Parameters:: x (Union[Tensor, xp.ndarray]) – Input tensor of shape (batch_size, seq_len).
Returns:: Tensor – Output tensor of shape (batch_size, seq_len, embedding_size).

Examples

>>> embed = Embedding(input_size=50, embedding_size=8)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.array([[0, 1, 2], [3, 4, 5]]))
>>> y = embed(x)  # Expected: (2, 3, 8)

class autograd.nn.LayerNorm(input_size: int, epsilon: float = 1e-05, **kwargs: Any)¶

Bases: Module

Layer Normalization.

Normalizes the summed inputs to neurons for each training example. Paper: https://arxiv.org/abs/1607.06450
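
A minimal NumPy sketch of the normalization, assuming it is taken over the last (feature) axis and followed by a learnable gain and bias as in the paper. The names `layer_norm`, `gain`, and `bias` are illustrative and not the library's own.

```python
import numpy as np

def layer_norm(x, gain, bias, epsilon=1e-5):
    """Normalize each example over its feature axis, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + epsilon) + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))
y = layer_norm(x, gain=np.ones(10), bias=np.zeros(10))
print(y.shape)  # (4, 10)
```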

Examples

>>> ln = LayerNorm(input_size=10)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(4, 10))
>>> y = ln(x) # Expected: (4, 10)

forward(x: Tensor) → Tensor¶

Perform the forward pass of LayerNorm.

Parameters:: x (Tensor) – Input tensor.
Returns:: Tensor – Normalized tensor scaled and shifted by learnable parameters.

Examples

>>> ln = LayerNorm(input_size=10)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(2, 10))
>>> y = ln(x)  # Expected: (2, 10)

class autograd.nn.BatchNorm(input_size: int, momentum: float = 0.1, epsilon: float = 1e-05, **kwargs: Any)¶

Bases: Module

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

Normalizes the input tensor by subtracting the batch mean and dividing by the batch standard deviation. Paper: http://arxiv.org/abs/1502.03167
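
The training-mode computation can be sketched in NumPy as follows. The running-statistics update rule (`(1 - momentum) * running + momentum * batch`) is an assumption based on the common convention for a `momentum` of 0.1, and all function and variable names here are hypothetical.

```python
import numpy as np

def batch_norm_train(x, gain, bias, running_mean, running_var,
                     momentum=0.1, epsilon=1e-5):
    """Normalize over the batch axis and update running statistics."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    x_hat = (x - batch_mean) / np.sqrt(batch_var + epsilon)
    # Assumed convention: exponential moving average weighted by momentum.
    new_mean = (1 - momentum) * running_mean + momentum * batch_mean
    new_var = (1 - momentum) * running_var + momentum * batch_var
    return gain * x_hat + bias, new_mean, new_var

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 10))
y, new_mean, new_var = batch_norm_train(
    x, np.ones(10), np.zeros(10), np.zeros(10), np.ones(10)
)
print(y.shape)  # (8, 10)
```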

Examples

>>> bn = BatchNorm(input_size=10)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(4, 10))
>>> y = bn(x) # Expected: (4, 10)

forward(x: Tensor) → Tensor¶

Perform the forward pass of BatchNorm.

Note that the backward pass is implemented via primitive operations in the Tensor class. The operations in the forward pass have all been implemented as Tensor-level operations.

Parameters:: x (Tensor) – Input tensor.
Returns:: Tensor – Normalized tensor with learnable scaling and shifting.

Examples

>>> bn = BatchNorm(input_size=10)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(4, 10))
>>> y = bn(x) # Expected: (4, 10)

class autograd.nn.Dropout(p: float = 0.5, **kwargs: Any)¶

Bases: Module

Dropout layer.

Randomly sets a fraction of input units to 0 during training to prevent overfitting. “It prevents overfitting and provides a way of approximately combining exponentially many different neural network architectures efficiently.” Paper: https://arxiv.org/abs/1207.0580
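
One common way to implement this is "inverted" dropout, where the surviving units are rescaled by 1 / (1 - p) during training so that no rescaling is needed at evaluation time. The sketch below assumes that variant; whether the library rescales in exactly this way is not stated here, and the `dropout` helper is illustrative.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero units with probability p, rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return x  # evaluation mode is the identity
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((4, 4))
y_train = dropout(x, p=0.5, training=True, rng=rng)   # entries are 0.0 or 2.0
y_eval = dropout(x, p=0.5, training=False, rng=rng)   # unchanged
```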

Examples

>>> dropout = Dropout(p=0.5)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.ones((4, 4)))
>>> dropout.train()  # Set to training mode to apply dropout
>>> y = dropout(x) # Approximately half of the elements should be 0

forward(x: Tensor) → Tensor¶

Perform the forward pass of Dropout.

Parameters:: x (Tensor) – Input tensor.
Returns:: Tensor – Tensor after applying dropout (only during training).

Examples

>>> dropout = Dropout(p=0.5)
>>> dropout.train()
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.ones((2, 2)))
>>> y = dropout(x) # Approximately half of the values in y should be zero

class autograd.nn.ScaledDotProductAttention(dropout_prob: float = 0.1)¶

Bases: Module

Scaled Dot-Product Attention layer.

Computes attention scores as:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{key\_dim}}\right) V \]

Implements the Scaled Dot-Product Attention in Section 3.2.1 in paper: https://arxiv.org/abs/1706.03762
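
The formula can be sketched directly in NumPy. This version omits masking and dropout, and the helper names are illustrative rather than the library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(key_dim)) V, without mask or dropout."""
    key_dim = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(key_dim)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 2, 4, 8))  # (batch, heads, seq_len, key_dim)
k = rng.standard_normal((2, 2, 4, 8))
v = rng.standard_normal((2, 2, 4, 8))
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (2, 2, 4, 8)
```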

Examples

>>> attn = ScaledDotProductAttention(dropout_prob=0.1)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> # Create dummy query, key, value tensors with shape (batch_size, num_heads, seq_len, key_dim)
>>> query = Tensor(xp.random.randn(2, 2, 4, 8))
>>> key = Tensor(xp.random.randn(2, 2, 4, 8))
>>> value = Tensor(xp.random.randn(2, 2, 4, 8))
>>> y = attn(query, key, value) # Expected shape: (2, 2, 4, 8)

Compute the scaled dot-product attention.

Parameters:

query (Tensor) – Query tensor.
key (Tensor) – Key tensor.
value (Tensor) – Value tensor.
mask (Optional[Union[Tensor, ArrayLike]], optional) – Attention mask. The dense repo contract uses additive float masks where 1.0 means forbidden and 0.0 means allowed. Raw backend bool masks are also accepted and routed through the implementation-specific gating logic before falling back to dense when unsupported. Defaults to None.
is_causal (bool, optional) – Whether to apply structural causal masking when no explicit mask is supplied. Defaults to False.

Returns:

Tensor – The attended output.

Examples

>>> attn = ScaledDotProductAttention()
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> query = Tensor(xp.random.randn(2, 2, 4, 8))
>>> key = Tensor(xp.random.randn(2, 2, 4, 8))
>>> value = Tensor(xp.random.randn(2, 2, 4, 8))
>>> output = attn(query, key, value) # Expected: (2, 2, 4, 8)

class autograd.nn.MultiHeadAttention(num_heads: int, hidden_size: int, dropout_prob: float = 0.1)¶

Bases: Module

Multi-Head Attention layer.

Instead of performing a single attention function with hidden_size-dimensional keys, queries, and values, this layer projects them num_heads times with different learned linear projections and attends over each projection in parallel. Implements the Multi-Head Attention in Section 3.2.2 of the paper: https://arxiv.org/abs/1706.03762

Examples

>>> mha = MultiHeadAttention(num_heads=2, hidden_size=16, dropout_prob=0.1)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> # Create dummy input tensors with shape (batch_size, seq_len, hidden_size)
>>> x = Tensor(xp.random.randn(2, 5, 16))
>>> output = mha(x, x, x) # Expected: (2, 5, 16)

forward(query: Tensor, key: Tensor, value: Tensor, mask: Tensor | None = None, is_causal: bool = False) → Tensor¶

Compute the forward pass of the MultiHeadAttention layer.

Parameters:

query (Tensor) – Query tensor.
key (Tensor) – Key tensor.
value (Tensor) – Value tensor.
mask (Optional[Tensor], optional) – Mask tensor. Defaults to None.
is_causal (bool, optional) – Whether to apply structural causal masking when no explicit mask is supplied. Defaults to False.

Returns:

Tensor – Output tensor after multi-head attention.

Examples

>>> mha = MultiHeadAttention(num_heads=2, hidden_size=16)
>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(2, 5, 16))
>>> output = mha(x, x, x) # Expected: (2, 5, 16)

class autograd.nn.AbstractLLMForwardFn¶

Bases: ABC

Abstract interface for a language modeling forward function.

Subclasses should implement the sample and train methods.

The trainer treats batch_data as opaque and relies on the forward function to interpret architecture-specific batch objects. This keeps the LM training loop shared across decoder-only and encoder-decoder models without forcing a universal batch dict or tuple shape.

Examples

>>> # Example subclass implementing the abstract methods.
>>> class DummyLLMForward(AbstractLLMForwardFn):
...     def sample(self, model, batch_data):
...         return model(batch_data)
...     def train(self, model, batch_data):
...         return model(batch_data.input_ids)
>>> forward_fn = DummyLLMForward()
>>> # Now forward_fn can be used as: forward_fn(model, data, mode="train")

abstractmethod sample(model: Any, batch_data: Any) → Any¶

Generate samples from the model.

Parameters:

model (Any) – The model to sample from.
batch_data (Any) – Data for the current batch.

Returns:

Any – Model outputs for sampling mode.

abstractmethod train(model: Any, batch_data: Any) → Any¶

Compute the forward pass for training.

Parameters:

model (Any) – The model to train.
batch_data (Any) – Data for the current batch.

Returns:

Any – Training-mode model outputs, typically logits.

Extract sliding windows from the input tensor while maintaining the computational graph.

Parameters:

x (Union[Tensor, xp.ndarray]) – Input tensor of shape (batch_size, channels, height, width).
kernel_size (int) – Size of the sliding window.
stride (int) – Step size between windows.
padding_mode (str, optional) – Padding mode (“valid” or “same”). Defaults to “valid”.

Returns:

Tuple[Tensor, Tuple[int, int]] – A tuple (windows, output_shape):

windows – Stacked tensor of windows with shape (H_out, W_out, batch_size, channels, kernel_size, kernel_size).
output_shape – A tuple (H_out, W_out) representing the spatial dimensions of the output.

Examples

>>> import cupy as np
>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.random.randn(2, 3, 32, 32))
>>> windows, output_shape = extract_windows(x, kernel_size=3, stride=1, padding_mode="same")
>>> print(windows.data.shape)  # Expected: (H_out, W_out, 2, 3, 3, 3)
>>> print(output_shape)        # Expected: (32, 32) when padding_mode is "same"

autograd.optim module¶

class autograd.optim.LRScheduler¶

Bases: object

Interface for a learning rate scheduler.

Subclasses should implement the __call__ method.

class autograd.optim.CosineScheduler(warmup_steps: int = 100, lr_decay_iters: int = 200, min_lr: float = 0.0001, **kwargs)¶

Bases: LRScheduler

Cosine learning rate scheduler with warmup.

Implements Section 3 in “SGDR: Stochastic Gradient Descent with Warm Restarts”. Paper: https://arxiv.org/abs/1608.03983
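
A schedule of this kind is often written as a linear warmup followed by a cosine decay down to `min_lr`. The sketch below assumes that shape; the `cosine_lr` function, its `max_lr` argument, and the exact warmup rule are hypothetical and may differ from what the class computes.

```python
import math

def cosine_lr(step, max_lr, warmup_steps=100, lr_decay_iters=200, min_lr=1e-4):
    """Linear warmup, then cosine decay from max_lr to min_lr (assumed form)."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= lr_decay_iters:
        return min_lr
    progress = (step - warmup_steps) / (lr_decay_iters - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

for step in (0, 99, 150, 200):
    print(step, cosine_lr(step, max_lr=0.01))
```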

class autograd.optim.Optimizer(model_parameters: Dict[str, Tensor], lr: float, lr_scheduler_kwargs: dict | None = None, **kwargs: Any)¶

Bases: object

Base Optimizer Class.

Usage Example:

optimizer = Optimizer(model.parameters(), lr=0.01, lr_scheduler_kwargs={
    "lr_scheduler_cls": CosineScheduler,
    "warmup_steps": 100,
    "lr_decay_iters": 5000,
    "min_lr": 1e-4
})
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()

property lr: float¶

Get the current learning rate.

Returns:: float – The current learning rate.

property timestep: int¶

Get the current global timestep.

Returns:: int – The current timestep.

update_lr() → None¶: Update the learning rate using the provided scheduler and the global step.

clip_grad_norm(max_norm: float, norm_type: float = 2.0) → Any¶

Scale the gradients of all parameters in-place so that their norm is at most max_norm.

Implements Section 10.11.1 “Clipping Gradients” in the Deep Learning Book by Goodfellow et al.

The scaling is done according to:

\[ \frac{\text{max\_norm} \cdot g}{\|g\|_n} \]

where $n$ is the norm type.

Parameters:

max_norm (float) – The maximum allowed norm of the gradients.
norm_type (float, optional) – The type of norm to use (default is 2, Euclidean norm a.k.a. L2 norm).
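
The scaling rule can be sketched on a plain dictionary of NumPy gradient arrays. This sketch assumes a single global norm over all parameters and assumes gradients are left untouched when the norm is already within `max_norm`; the standalone function and its return value are illustrative, not the method's actual implementation.

```python
import numpy as np

def clip_grad_norm(grads, max_norm, norm_type=2.0):
    """Rescale all gradients in place so their global norm is at most max_norm."""
    flat = np.concatenate([g.ravel() for g in grads.values()])
    total_norm = float(np.linalg.norm(flat, ord=norm_type))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for g in grads.values():
            g *= scale  # in-place update of each gradient array
    return total_norm

grads = {"w": np.array([3.0, 0.0]), "b": np.array([4.0])}
norm_before = clip_grad_norm(grads, max_norm=1.0)
print(norm_before)  # 5.0
```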

zero_grad() → None¶: Set the gradients of all optimized tensors to zero.

scale_gradients(scale: Any) → None¶: Scale all parameter gradients in-place by scale.

grad_l2_norm() → float¶: Return the L2 norm of all current parameter gradients.

gradient_arrays() → Dict[str, Any]¶: Current accumulated gradient arrays, keyed by parameter name.

state_dict() → Dict[str, Any]¶

Return a dictionary representing the optimizer’s state for checkpointing.

The returned dictionary has the following structure:

{
    "hyperparams": {
        "lr": 0.01,
    },
    "states": {
        "module1.weight": {"m": "...", "v": "..." },
        "module1.bias": "..."
    }
}

Returns:: Dict[str, Any] – The state dictionary of the optimizer.

load_state_dict(state_dict: Dict[str, Any]) → None¶

Load the optimizer state from a checkpoint.

This performs an in-place update of the optimizer’s internal state and hyperparameters.

Parameters:: state_dict (Dict[str, Any]) – The state dictionary read from a checkpoint.

step() → None¶: Perform a single optimization step.

class autograd.optim.SGD(model_parameters: Any, lr: float, **kwargs: Any)¶

Bases: Optimizer

Stochastic Gradient Descent (SGD) Optimizer.

step() → None¶

Perform a single optimization step using SGD.

This method updates each parameter by subtracting the product of the learning rate and the parameter’s gradient.

class autograd.optim.Adam(model_parameters: Any, lr: float, beta1: float = 0.9, beta2: float = 0.999, epsilon: float = 1e-07, weight_decay: float = 0.0, **kwargs: Any)¶

Bases: Optimizer

Adam Optimizer.

Implements stochastic gradient descent with adaptive estimates of the first and second moments of the gradient. Paper: https://arxiv.org/abs/1412.6980

The weight_decay parameter implements decoupled weight decay as described in: “Decoupled Weight Decay Regularization” (https://arxiv.org/abs/1711.05101).

When weight_decay is set to 0, AdamW is equivalent to Adam.

step() → None¶

Perform a single optimization step using the Adam algorithm.

This method updates the biased first- and second-moment estimates, applies bias correction, performs a decoupled weight decay step if specified, and updates the parameters accordingly.
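
The update for a single parameter can be sketched as below. The order in which weight decay and the Adam update are applied, and the placement of `epsilon` outside the square root, are assumptions; `adam_step` is a hypothetical standalone function, not the optimizer's method.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999,
              epsilon=1e-7, weight_decay=0.0):
    """One Adam update with decoupled weight decay; t is the 1-based step."""
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param * (1 - lr * weight_decay)   # decoupled weight decay
    param = param - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return param, m, v

param = np.array([1.0])
grad = np.array([0.5])
new_param, m, v = adam_step(param, grad, np.zeros(1), np.zeros(1), t=1, lr=0.1)
print(new_param)  # close to [0.9]: the first step moves by about lr * sign(grad)
```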

autograd.functional module¶

autograd.functional.relu(x: Tensor) → Tensor¶

Applies the Rectified Linear Unit (ReLU) activation function.

Parameters:: x (Tensor) – The input tensor.
Returns:: Tensor – The tensor after applying the ReLU function.

Examples

>>> from autograd.tensor import Tensor
>>> x = Tensor([-1, 0, 2])
>>> y = relu(x) # Expected output: [0, 0, 2]

autograd.functional.sigmoid(x: Tensor) → Tensor¶

Applies the sigmoid activation function.

Parameters:: x (Tensor) – The input tensor.
Returns:: Tensor – The tensor after applying the sigmoid function.

Examples

>>> from autograd.tensor import Tensor
>>> x = Tensor([0, 2])
>>> y = sigmoid(x) # Expected output: [0.5, ~0.88]

autograd.functional.softmax(x: Tensor) → Tensor¶

Applies the softmax activation function.

Parameters:: x (Tensor) – The input tensor containing logits.
Returns:: Tensor – The tensor with softmax probabilities.

Examples

>>> from autograd.tensor import Tensor
>>> import cupy as np
>>> x = Tensor(xp.array([2.0, 1.0, 0.1]))
>>> y = softmax(x) # Expected output: probabilities that sum to 1

autograd.functional.log_softmax(x: Tensor, dim: int = -1) → Tensor¶

Applies the log-softmax function.

Parameters:

x (Tensor) – The input tensor containing logits.
dim (int) – The dimension along which logprobs are computed.

Returns:

Tensor – The tensor with token logprobs.

Examples

>>> from autograd.tensor import Tensor
>>> x = Tensor(xp.array([2.0, 1.0, 0.1]))
>>> y = log_softmax(x) # Expected output: logprobs that exponentiate to probabilities summing to 1

autograd.functional.tanh(x: Tensor) → Tensor¶

Applies the hyperbolic tangent (tanh) activation function.

Parameters:: x (Tensor) – The input tensor.
Returns:: Tensor – The tensor after applying the tanh function.

Examples

>>> from autograd.tensor import Tensor
>>> x = Tensor([0, 1])
>>> y = tanh(x) # Expected output: [0, tanh(1)]

autograd.functional.gelu(x: Tensor) → Tensor¶

Applies the Gaussian Error Linear Unit (GELU) activation function.

This function uses the approximate formula:

\[ 0.5 * x * \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}} \, (x + 0.044715*x^3)\right)\right) \]

Parameters:: x (Tensor) – The input tensor.
Returns:: Tensor – The tensor after applying the GELU function.

Examples

>>> from autograd.tensor import Tensor
>>> x = Tensor([1.0, -1.0])
>>> y = gelu(x) # Expected output: approximate GELU values for the inputs

autograd.functional.scaled_dot_product_attention_mlx_custom(query: Tensor, key: Tensor, value: Tensor, *, is_training: bool | None = None, dropout_prob: float = 0.0) → Tensor¶: Causal attention computed with a custom MLX kernel.

class autograd.functional.Relu(*tensors: Tensor)¶

Bases: Function

Rectified Linear Unit (ReLU) activation function.

The ReLU function is defined as:

\[ ReLU(x) = \max(0, x) \]

Note

This class is used internally. For applying ReLU, use the relu function.

Examples

>>> from autograd.tensor import Tensor
>>> x = Tensor([-3, 0, 3])
>>> y = Relu.apply(x) # Expected output: [0, 0, 3]

forward(x: Any) → Any¶

Computes the forward pass of the ReLU activation function.

Parameters:: x (xp.ndarray) – Input array.
Returns:: xp.ndarray – The result of applying ReLU to the input.

backward(grad: Tensor) → Any¶

Computes the backward pass of the ReLU activation function.

Parameters:: grad (Tensor) – Upstream gradient.
Returns:: xp.ndarray – The gradient of the loss with respect to the input.

class autograd.functional.Gelu(*tensors: Tensor)¶

Bases: Function

Gaussian Error Linear Unit (GELU) activation function.

\[ GELU(x) = x \cdot P(X \le x), \quad X \sim N(0, 1) \]

This implementation uses the tanh approximation:

\[ 0.5 * x * \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}} \, (x + 0.044715*x^3)\right)\right) \]

Paper: https://arxiv.org/abs/1606.08415

Note

Use the gelu function to apply this activation.

Examples

>>> from autograd.tensor import Tensor
>>> x = Tensor([0.5, -0.5])
>>> y = Gelu.apply(x) # Expected output: approximate GELU values

forward(x: Any) → Any¶

Computes the forward pass of the GELU activation function.

Parameters:: x (xp.ndarray) – Input array.
Returns:: xp.ndarray – The output array after applying GELU.

backward(grad: Any) → Any¶

Computes the backward pass of the GELU activation function.

The gradient is computed as:

\[ \frac{d\,GELU}{dx} = 0.5 \left(1 + tanh(\alpha)\right) + 0.5 \, x \, \left(1 - tanh^2(\alpha)\right) \alpha' \]

where

\[ \alpha = \sqrt{\frac{2}{\pi}} (x + 0.044715*x^3) \]

and

\[ \alpha' = \sqrt{\frac{2}{\pi}} \left(1 + 3*0.044715*x^2\right) \]

Parameters:: grad (xp.ndarray) – Upstream gradient.
Returns:: xp.ndarray – The gradient of the loss with respect to the input.
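
The derivative above can be checked against a central finite difference. The following NumPy sketch implements the approximate GELU and its analytic gradient exactly as written in the formulas; the function names are illustrative.

```python
import numpy as np

C = np.sqrt(2.0 / np.pi)

def gelu(x):
    """Tanh approximation of GELU."""
    alpha = C * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + np.tanh(alpha))

def gelu_grad(x):
    """Analytic derivative of the tanh approximation."""
    alpha = C * (x + 0.044715 * x ** 3)
    alpha_prime = C * (1.0 + 3.0 * 0.044715 * x ** 2)
    t = np.tanh(alpha)
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t ** 2) * alpha_prime

x = np.linspace(-3.0, 3.0, 13)
h = 1e-6
numeric = (gelu(x + h) - gelu(x - h)) / (2 * h)  # central difference
max_err = np.max(np.abs(numeric - gelu_grad(x)))
print(max_err < 1e-6)  # True
```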

class autograd.functional.Sigmoid(*tensors: Tensor)¶

Bases: Function

Sigmoid activation function.

The sigmoid function is defined as:

\[ sigmoid(x) = \frac{1}{1 + e^{-x}} \]

Note

Use the sigmoid function to apply this activation.

Examples

>>> from autograd.tensor import Tensor
>>> x = Tensor([0, 2])
>>> y = Sigmoid.apply(x) # Expected output: [0.5, ~0.88]

forward(x: Any) → Any¶

Computes the forward pass of the sigmoid function.

Parameters:: x (xp.ndarray) – Input array.
Returns:: xp.ndarray – The output after applying the sigmoid function.

backward(grad: Tensor) → Any¶

Computes the backward pass of the sigmoid function.

Parameters:: grad (Tensor) – Upstream gradient.
Returns:: xp.ndarray – The gradient of the loss with respect to the input.

class autograd.functional.Softmax(*tensors: Tensor)¶

Bases: Function

Softmax activation function.

The softmax function is defined as:

\[ softmax(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} \]

Note

Use the softmax function to apply this activation.

Examples

>>> from autograd.tensor import Tensor
>>> import cupy as np
>>> x = Tensor(xp.array([1.0, 2.0, 3.0]))
>>> y = Softmax.apply(x) # Expected output: probabilities that sum to 1

forward(x: Any) → Any¶

Computes the forward pass of the softmax activation function.

Parameters:: x (xp.ndarray) – Input array of logits.
Returns:: xp.ndarray – The softmax probabilities.

backward(grad: Tensor) → Any¶

Computes the backward pass of the softmax activation function.

This method propagates the upstream gradient through the softmax Jacobian to obtain the gradient of the loss with respect to the input logits.

Parameters:: grad (Tensor) – Upstream gradient.
Returns:: xp.ndarray – The gradient of the loss with respect to the input logits.

class autograd.functional.LogSoftmax(*tensors: Tensor)¶

Bases: Function

Log-softmax activation function.

The log-softmax function is defined as:

\[ \log \pi_i = x_i - \log \sum_j e^{x_j} \]

Note

Use log_softmax when logprobs are needed; it is more stable than computing log(softmax(x)).
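
The stability difference is easy to see with large logits. The sketch below is a plain NumPy version of the formula using the usual max-subtraction trick; it is an illustration of the idea, not the library's implementation.

```python
import numpy as np

def log_softmax(x, dim=-1):
    """Stable log-softmax: subtract the max before exponentiating."""
    shifted = x - x.max(axis=dim, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=dim, keepdims=True))

logits = np.array([1000.0, 1001.0, 1002.0])
stable = log_softmax(logits)

# The naive route overflows: exp(1000) is inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.log(np.exp(logits) / np.exp(logits).sum())

print(np.isfinite(stable).all(), np.isnan(naive).all())  # True True
```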

forward(x: Any, *, dim: int = -1) → Any¶

Computes the forward pass of the log-softmax activation function.

Parameters:

x (xp.ndarray) – Input array of logits.
dim (int) – Dimension along which logprobs are computed.

Returns:

xp.ndarray – The logprobs.

backward(grad: Tensor) → Any¶

Computes the backward pass of the log-softmax activation function.

Parameters:: grad (Tensor) – Upstream gradient.
Returns:: xp.ndarray – The gradient of the loss with respect to the input logits.

class autograd.functional.Tanh(*tensors: Tensor)¶

Bases: Function

Hyperbolic tangent (tanh) activation function.

The tanh function is defined as:

\[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]

Note

Use the tanh function to apply this activation.

Examples

>>> from autograd.tensor import Tensor
>>> x = Tensor([0, 1])
>>> y = Tanh.apply(x) # Expected output: [0, tanh(1)]

forward(x: Any) → Any¶

Computes the forward pass of the tanh activation function.

Parameters:: x (xp.ndarray) – Input array.
Returns:: xp.ndarray – The output after applying the tanh function.

backward(grad: Tensor) → Any¶

Computes the backward pass of the tanh activation function.

\[ d(tanh(x))/dx = 1 - tanh(x)^2 \]

Parameters:: grad (Tensor) – Upstream gradient.
Returns:: xp.ndarray – The gradient of the loss with respect to the input.

class autograd.functional.ScaledDotProductAttentionMLXCustom(*tensors: Tensor)¶

Bases: Function

static build_dropout_scale_mask(*, is_training: bool | None, dropout_prob: float, query_shape: Tuple[int, ...], key_shape: Tuple[int, ...]) → Any | None¶

static validate(query: Any, key: Any, value: Any, *, dropout_scale_mask: Any | None = None) → None¶

forward(query: Any, key: Any, value: Any, *, dropout_scale_mask: Any | None = None) → Any¶

Perform the forward pass of the fused causal attention operation on the raw input arrays.

Parameters:

query (Any) – Query array.
key (Any) – Key array.
value (Any) – Value array.
dropout_scale_mask (Optional[Any], optional) – Already-scaled dropout mask applied to the post-softmax probabilities. Defaults to None, which disables dropout.

Returns:

Any – The attention output array.

backward(grad: Tensor) → Tuple[Any, Any, Any]¶

This backward function must be defined explicitly because the kernel’s forward pass is built from lower-level kernel ops rather than our Tensor ops, so the backward implementation does not come for free as it does for other Function classes.

Backpropagate through the fused causal attention path.

The no-dropout forward contract is:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{key\_dim}}\right) V \]

Train mode keeps the repo’s post-softmax dropout contract:

\[ \widetilde{P} = P \odot M \]

\[ \text{Attention}_{\text{dropout}}(Q, K, V) = \widetilde{P}V \]

where $M$ is the already-scaled Bernoulli mask sampled in Python.

We still recompute the causal softmax probabilities inside backward instead of saving dense logits/probabilities from the fused forward kernel. We will use these intermediate names:

\[ \text{attention\_scores} = \frac{QK^T}{\sqrt{key\_dim}} \]

\[ P = \operatorname{softmax}(\text{attention\_scores}_{\text{causal}}) \]

\[\begin{split} \widetilde{P} = \begin{cases} P & \text{if dropout is disabled} \\ P \odot M & \text{if dropout is enabled} \end{cases} \end{split}\]

\[ \text{attention\_output} = \widetilde{P}V \]

Let loss be the final scalar training objective, and let the upstream gradient be:

\[ \text{upstream\_grad} = \frac{\partial \text{loss}}{\partial \text{attention\_output}} \]

1. Derivative of loss with respect to V:

\[ \frac{\partial \text{loss}}{\partial V} = \frac{\partial \text{loss}}{\partial \text{attention\_output}} \frac{\partial \text{attention\_output}}{\partial V} = \widetilde{P}^T \, \text{upstream\_grad} \]

2. Derivative of loss with respect to the pre-dropout probabilities:

\[\begin{split} \frac{\partial \text{loss}}{\partial P} = \left(\text{upstream\_grad} \, V^T\right) \odot \begin{cases} 1 & \text{if dropout is disabled} \\ M & \text{if dropout is enabled} \end{cases} \end{split}\]

3. Derivative of loss with respect to attention_scores: Softmax is applied independently to each query row, so we apply the softmax derivative row by row along the key dimension.

\[ \frac{\partial \text{loss}}{\partial \text{attention\_scores}} = \frac{\partial \text{loss}}{\partial P} \frac{\partial P}{\partial \text{attention\_scores}} = P \odot \left( \frac{\partial \text{loss}}{\partial P} - \sum_k \left( \frac{\partial \text{loss}}{\partial P_k} P_k \right) \right) \]

4. Derivative of loss with respect to Q and K, using

\[ \text{attention\_scores} = \frac{QK^T}{\sqrt{key\_dim}} \]

\[ \frac{\partial \text{loss}}{\partial Q} = \frac{\partial \text{loss}}{\partial \text{attention\_scores}} \frac{\partial \text{attention\_scores}}{\partial Q} = \frac{\partial \text{loss}}{\partial \text{attention\_scores}} K \cdot \frac{1}{\sqrt{key\_dim}} \]

\[ \frac{\partial \text{loss}}{\partial K} = \frac{\partial \text{loss}}{\partial \text{attention\_scores}} \frac{\partial \text{attention\_scores}}{\partial K} = \left( \frac{\partial \text{loss}}{\partial \text{attention\_scores}} \right)^T Q \cdot \frac{1}{\sqrt{key\_dim}} \]
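
The four steps above can be sketched in NumPy for a single head without dropout and verified with finite differences. This is a dense reference sketch under those assumptions, not the fused kernel; all function names are hypothetical.

```python
import numpy as np

def causal_attention_forward(q, k, v):
    """Dense causal attention for one head; q, k, v have shape (seq_len, key_dim)."""
    seq_len, key_dim = q.shape
    scores = q @ k.T / np.sqrt(key_dim)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)       # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v, probs

def causal_attention_backward(q, k, v, probs, upstream_grad):
    """Steps 1 to 4 of the derivation, with dropout disabled."""
    key_dim = q.shape[-1]
    grad_v = probs.T @ upstream_grad                              # step 1
    grad_probs = upstream_grad @ v.T                              # step 2
    row_dot = (grad_probs * probs).sum(axis=-1, keepdims=True)
    grad_scores = probs * (grad_probs - row_dot)                  # step 3
    grad_q = grad_scores @ k / np.sqrt(key_dim)                   # step 4
    grad_k = grad_scores.T @ q / np.sqrt(key_dim)
    return grad_q, grad_k, grad_v

def numeric_grad(fn, x, h=1e-6):
    """Central-difference gradient of the scalar fn() with respect to x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        orig = x[idx]
        x[idx] = orig + h
        plus = fn()
        x[idx] = orig - h
        minus = fn()
        x[idx] = orig
        grad[idx] = (plus - minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((3, 2)) for _ in range(3))
upstream = rng.standard_normal((3, 2))
# Scalar loss whose gradient with respect to the output equals `upstream`.
loss = lambda: float((causal_attention_forward(q, k, v)[0] * upstream).sum())

_, probs = causal_attention_forward(q, k, v)
grad_q, grad_k, grad_v = causal_attention_backward(q, k, v, probs, upstream)
print(np.allclose(grad_q, numeric_grad(loss, q), atol=1e-6))  # True
```

Masked positions have zero probability, so the softmax step assigns them zero gradient without any extra handling.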

class autograd.functional.BinaryCrossEntropy(*tensors: Tensor)¶

Bases: Function

Binary Cross Entropy (BCE) Loss.

This loss function assumes that $y_{pred}$ contains probabilities rather than logits. If the input is logits, use binary_cross_entropy_with_logits().

The loss is computed as:

\[ BCE = -\left( y_{true} \cdot \log(y_{pred}) + (1 - y_{true}) \cdot \log(1 - y_{pred}) \right) \]

Examples

>>> from autograd.tensor import Tensor
>>> y_pred = Tensor([0.9, 0.2, 0.1])
>>> y_true = Tensor([1, 0, 0])
>>> loss = BinaryCrossEntropy.apply(y_pred, y_true) # Expected output: a small loss value

forward(y_pred: Any, y_true: Any, **kwargs: Any) → Any¶

Computes the binary cross entropy loss.

Parameters:

y_pred (xp.ndarray) – Predicted probabilities.
y_true (xp.ndarray) – True binary labels.
**kwargs – Additional keyword arguments.

Returns:

float – The computed binary cross entropy loss.

backward(grad: Tensor) → Tuple[Any, None]¶

Computes the gradient of the binary cross entropy loss with respect to $y_{pred}$.

The gradient is given by:

\[ \frac{\partial L}{\partial y_{pred}} = -\left(\frac{y_{true}}{y_{pred}} - \frac{1-y_{true}}{1-y_{pred}}\right) \]

Parameters:: grad (Tensor) – Upstream gradient.
Returns:: Tuple[xp.ndarray, None] – A tuple containing the gradient with respect to $y_{pred}$ and None for $y_{true}$.
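
The loss and gradient formulas can be sketched elementwise in NumPy. Any reduction (such as a mean over the batch) and any clipping of `y_pred` that the class may apply are omitted here, so this is only an illustration of the two equations.

```python
import numpy as np

def bce(y_pred, y_true):
    """Elementwise binary cross entropy on probabilities."""
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def bce_grad(y_pred, y_true):
    """Elementwise derivative of the loss with respect to y_pred."""
    return -(y_true / y_pred - (1 - y_true) / (1 - y_pred))

y_pred = np.array([0.9, 0.2, 0.1])
y_true = np.array([1.0, 0.0, 0.0])
h = 1e-6
numeric = (bce(y_pred + h, y_true) - bce(y_pred - h, y_true)) / (2 * h)
print(np.allclose(numeric, bce_grad(y_pred, y_true), atol=1e-5))  # True
```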

class autograd.functional.BinaryCrossEntropyWithLogits(*tensors: Tensor)¶

Bases: Function

Binary Cross Entropy Loss with logits.

This implementation is numerically stable for logits input.

Examples

>>> from autograd.tensor import Tensor
>>> from autograd.functional import BinaryCrossEntropyWithLogits
>>> y_pred = Tensor([2.0, -1.0, -2.0])  # logits
>>> y_true = Tensor([1, 0, 0])
>>> loss = BinaryCrossEntropyWithLogits.apply(y_pred, y_true) # Expected output: a loss value computed using logits

forward(y_pred: Any, y_true: Any) → Any¶

Computes the binary cross entropy loss with logits input.

Parameters:

y_pred (xp.ndarray) – shape (N, …) Unbounded real-valued logits.
y_true (xp.ndarray) – True binary labels (0 or 1), same shape as y_pred.

Returns:

float – The computed binary cross entropy loss.

backward(grad: Tensor) → Tuple[Any, None]¶

Computes the gradient of the binary cross entropy loss with logits with respect to $y_{pred}$.

The gradient is given by:

\[ \frac{\partial L}{\partial y_{pred}} = \text{sigmoid}(y_{pred}) - y_{true} \]

where

\[ \text{sigmoid}(y_{pred}) = \frac{1}{1 + \exp(-y_{pred})} \]

Parameters:: grad (Tensor) – Upstream gradient.
Returns:: Tuple[xp.ndarray, None] – A tuple containing the gradient with respect to $y_{pred}$ and None for $y_{true}$.
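
One common way to obtain a numerically stable loss from logits is the identity $\max(x, 0) - x \cdot y + \log(1 + e^{-|x|})$. The sketch below uses that identity in NumPy; whether the library uses the same formulation is an assumption, and the helper names `stable_sigmoid`, `bce_with_logits`, and `bce_with_logits_grad` as well as the mean reduction are hypothetical.

```python
import numpy as np


def stable_sigmoid(x):
    """Sigmoid that never exponentiates a large positive number."""
    e = np.exp(-np.abs(x))
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))


def bce_with_logits(logits, y_true):
    """Mean BCE computed directly from logits (mean reduction is an assumption)."""
    per_element = (
        np.maximum(logits, 0.0) - logits * y_true + np.log1p(np.exp(-np.abs(logits)))
    )
    return float(per_element.mean())


def bce_with_logits_grad(logits, y_true):
    """dL/dlogits = (sigmoid(logits) - y_true) / N for the mean-reduced loss."""
    return (stable_sigmoid(logits) - y_true) / logits.size


logits = np.array([2.0, -1.0, -2.0])
y_true = np.array([1.0, 0.0, 0.0])
loss = bce_with_logits(logits, y_true)
grad = bce_with_logits_grad(logits, y_true)

# A very large logit stays finite, unlike log(1 - sigmoid(x)) computed naively.
extreme = bce_with_logits(np.array([1000.0]), np.array([0.0]))
```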

class autograd.functional.CrossEntropy(*tensors: Tensor)¶

Bases: Function

Cross-Entropy Loss for multi-dimensional predictions with optional ignored targets and label smoothing.

This function accepts raw logits (not probabilities) and computes a stable log-softmax internally.

Examples

>>> from autograd.backend import xp
>>> from autograd.tensor import Tensor
>>> from autograd.functional import CrossEntropy
>>> y_pred = Tensor(xp.array([[2.0, 1.0, 0.1]]))
>>> y_true = Tensor(xp.array([0]))
>>> loss = CrossEntropy.apply(y_pred, y_true, ignore_index=-100, label_smoothing=0.1)

forward(y_pred: Any, y_true: Any, ignore_index: int = -100, label_smoothing: float = 0.0, reduction: str = 'mean') → Any | float¶

Computes the cross-entropy loss with optional ignored targets and label smoothing.

Parameters:

y_pred (xp.ndarray) – Raw logits. Shape can be $(batch\_size, feature\_dim)$ or $(batch\_size, seq\_len, feature\_dim)$.
y_true (Union[xp.ndarray, Tensor]) – True class indices. If $y_{pred}$ is 2D, shape is $(batch\_size,)$; if 3D, shape is $(batch\_size, seq\_len)$.
ignore_index (int, optional) – Target value to ignore in the loss and gradient. Defaults to -100.
label_smoothing (float, optional) – Label smoothing factor. Defaults to 0.0. Label smoothing is applied if $label\_smoothing > 0$
reduction (str, optional) – Either "mean" or "sum". "mean" divides by the total non-ignored target weight; "sum" returns the summed loss over non-ignored targets.

For label smoothing, we follow the notation of the Inception paper. Here:

- $x$ is the current training example
- $y$ is the ground-truth class index for $x$
- $k$ is a class index
- $K$ is the total number of classes
- $\delta_{k,y}$ is the Kronecker delta, equal to $1$ when $k=y$ and $0$ otherwise
- $u(k)$ is the prior distribution over classes

The smoothed target distribution is:

\[ q'(k \mid x) = (1 - \epsilon)\,\delta_{k,y} + \epsilon\,u(k) \]

For the uniform prior used in the paper,

\[ u(k) = \frac{1}{K} \]

so the smoothed target distribution becomes:

\[ q'(k \mid x) = (1 - \epsilon)\,\delta_{k,y} + \frac{\epsilon}{K} \]

In per-example / per-class notation, where $y_{i,j}$ is the one-hot target entry for example $i$ and class $j$, this is:

\[ T_{i,j} = (1 - \epsilon)\,y_{i,j} + \frac{\epsilon}{K} \]

(Ref: “Rethinking the Inception Architecture for Computer Vision”, https://arxiv.org/abs/1512.00567)

Returns:: Union[xp.array, float] – The cross-entropy loss over non-ignored positions, averaged or summed according to reduction.

backward(grad: Tensor) → Tuple[Any, None]¶

Computes the backward pass for the cross-entropy loss with label smoothing.

The gradient with respect to the logits is given by:

\[ \frac{\partial L}{\partial logits_{i,j}} = p_{i,j} - T_{i,j} \]

where $p_{i,j}$ is the stored model probability for example $i$ and class $j$, and the target distribution $T_{i,j}$ is defined as:

\[ T_{i,j} = (1 - \epsilon)\,y_{i,j} + \frac{\epsilon}{K} \]

The gradient is zeroed out for ignored positions and scaled by the upstream gradient divided by the number of non-ignored positions.

Parameters:: grad (Tensor) – Upstream gradient.
Returns:: Tuple[xp.ndarray, None] – A tuple containing the gradient with respect to the logits and None for $y_{true}$.
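
The relationship between the smoothed targets $T_{i,j}$ and the gradient $p_{i,j} - T_{i,j}$ can be sketched in NumPy as follows. This is an illustrative sketch for 2D logits with mean reduction only; the function name `smoothed_cross_entropy` is hypothetical and the code is not taken from the library.

```python
import numpy as np


def smoothed_cross_entropy(logits, targets, label_smoothing=0.0, ignore_index=-100):
    """Return (mean loss, gradient w.r.t. logits) for logits of shape (batch, K)."""
    batch, num_classes = logits.shape
    valid = targets != ignore_index
    n_valid = max(int(valid.sum()), 1)  # avoid division by zero if all are ignored

    # Stable log-softmax: subtract the row maximum before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    probs = np.exp(log_probs)

    # T_{i,j} = (1 - eps) * y_{i,j} + eps / K, built only for non-ignored rows.
    one_hot = np.zeros_like(logits)
    rows = np.arange(batch)[valid]
    one_hot[rows, targets[valid]] = 1.0
    target_dist = (1.0 - label_smoothing) * one_hot + label_smoothing / num_classes

    per_example = -(target_dist * log_probs).sum(axis=1)
    loss = float((per_example * valid).sum() / n_valid)

    # Gradient is p - T, zeroed on ignored rows and divided by the valid count.
    grad = (probs - target_dist) * valid[:, None] / n_valid
    return loss, grad


logits = np.array([[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]])
targets = np.array([0, -100])  # the second example is ignored
loss, grad = smoothed_cross_entropy(logits, targets, label_smoothing=0.1)
```

Because both the probabilities and the smoothed targets sum to one over the classes, each non-ignored gradient row sums to zero, which is a useful sanity check.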

class autograd.functional.HingeLoss(*tensors: Tensor)¶

Bases: Function

Hinge Loss.

The hinge loss is defined as:

\[ loss = \max(0, 1 - y_{true} \cdot y_{pred}) \]

For points that are classified correctly with a sufficient margin ($y_{true} \cdot y_{pred} \geq 1$), the loss is 0; otherwise, it is $1 - y_{true} \cdot y_{pred}$. Taking the maximum with 0 keeps the loss non-negative.

The objective function typically includes a regularization term:

\[ \|w\|^2 + C \sum \max(0, 1 - y_{true} \cdot y_{pred}) \]

where $w$ is the weight vector, $\|w\|^2$ is the regularization term, and $C$ is a hyperparameter controlling the trade-off between maximizing the margin (through regularization) and minimizing the loss.

Paper: https://ieeexplore.ieee.org/document/708428

Examples

>>> from autograd.tensor import Tensor
>>> from autograd.functional import HingeLoss
>>> y_pred = Tensor([0.8, -0.5, 0.3])
>>> y_true = Tensor([1, -1, 1])
>>> loss = HingeLoss.apply(y_pred, y_true, reduction="mean") # Expected output: average hinge loss

forward(y_pred: Any, y_true: Any, reduction: str = 'none', **kwargs: Any) → Any | float¶

Computes the hinge loss.

Parameters:

y_pred (xp.ndarray) – Predicted scores.
y_true (Union[xp.ndarray, Tensor]) – True labels.
reduction (str, optional) – “none”, “mean”, or “sum”. Defaults to “none”.
**kwargs – Additional keyword arguments.

Returns:

Union[xp.array, float] – The computed hinge loss.

backward(grad: Tensor) → Tuple[Any, None]¶

Computes the gradient of the hinge loss with respect to the predictions.

For each element, the gradient is:

\[\begin{split} \begin{align} \frac{\partial loss}{\partial y_{pred}} &= \frac{\partial \max(0, 1 - y_{true} \cdot y_{pred})}{\partial y_{pred}} \\ &=\begin{cases} -y_{true}, & \text{if } y_{true} \cdot y_{pred} < 1 \\ 0, & \text{otherwise} \end{cases} \end{align} \end{split}\]

For the regularization term, the gradient with respect to $w$ is

\[\frac{d(\frac{1}{2}\|w\|^2)}{dw} = w\]

(The factor of 1/2 is included because it cancels the 2 produced by differentiating $\|w\|^2$, which simplifies the gradient.)

If the reduction is “mean”, the gradient is averaged over the number of elements.

Parameters:: grad (Tensor) – Upstream gradient.
Returns:: Tuple[xp.ndarray, None] – A tuple containing the gradient with respect to $y_{pred}$ and None for $y_{true}$.
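
The piecewise gradient above maps directly onto an element-wise selection. The following NumPy sketch illustrates this; the helper names `hinge_forward` and `hinge_grad` are hypothetical, and the code is a sketch of the formulas rather than the library's implementation.

```python
import numpy as np


def hinge_forward(y_pred, y_true, reduction="none"):
    """Element-wise max(0, 1 - y_true * y_pred) with an optional reduction."""
    losses = np.maximum(0.0, 1.0 - y_true * y_pred)
    if reduction == "mean":
        return float(losses.mean())
    if reduction == "sum":
        return float(losses.sum())
    return losses


def hinge_grad(y_pred, y_true, reduction="none"):
    """-y_true where the margin is violated, 0 elsewhere."""
    grad = np.where(y_true * y_pred < 1.0, -y_true, 0.0)
    if reduction == "mean":
        grad = grad / y_pred.size  # average over the number of elements
    return grad


y_pred = np.array([0.8, -0.5, 0.3])
y_true = np.array([1.0, -1.0, 1.0])
losses = hinge_forward(y_pred, y_true)                # [0.2, 0.5, 0.7]
mean_loss = hinge_forward(y_pred, y_true, "mean")
grad = hinge_grad(y_pred, y_true)                     # [-1.0, 1.0, -1.0]
```

All three points in this example lie inside the margin, so each contributes a non-zero gradient; a prediction such as 2.0 with label 1 would contribute zero.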

class autograd.functional.MeanSquaredLoss(*tensors: Tensor)¶

Bases: Function

Mean Squared Error (MSE) Loss.

The MSE loss is defined as:

\[ MSE = \frac{1}{N} \sum (y_{pred} - y_{true})^2 \]

Examples

>>> from autograd.tensor import Tensor
>>> from autograd.functional import MeanSquaredLoss
>>> y_pred = Tensor([3.0, 5.0])
>>> y_true = Tensor([2.0, 5.0])
>>> loss = MeanSquaredLoss.apply(y_pred, y_true) # Expected output: 0.5

forward(y_pred: Any, y_true: Any, **kwargs: Any) → Any¶

Computes the Mean Squared Error loss.

Parameters:

y_pred (xp.ndarray) – Predicted values.
y_true (xp.ndarray) – True values.
**kwargs – Additional keyword arguments.

Returns:

float – The computed MSE loss.

backward(grad: Tensor) → Any¶

Computes the gradient of the Mean Squared Error loss with respect to the predictions.

The gradient is given by:

\[ \frac{\partial L}{\partial y_{pred}} = \frac{2}{N} (y_{pred} - y_{true}) \]

Parameters:: grad (Tensor) – Upstream gradient.
Returns:: xp.ndarray – The gradient with respect to y_pred.
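
As a worked instance of the two MSE formulas, the sketch below reproduces the documented example value of 0.5 in plain NumPy. The helper names `mse_forward` and `mse_grad` are hypothetical and stand in for the forward and backward passes.

```python
import numpy as np


def mse_forward(y_pred, y_true):
    """(1/N) * sum((y_pred - y_true)^2)."""
    return float(np.mean((y_pred - y_true) ** 2))


def mse_grad(y_pred, y_true):
    """(2/N) * (y_pred - y_true)."""
    return 2.0 * (y_pred - y_true) / y_pred.size


y_pred = np.array([3.0, 5.0])
y_true = np.array([2.0, 5.0])
loss = mse_forward(y_pred, y_true)  # (1^2 + 0^2) / 2 = 0.5
grad = mse_grad(y_pred, y_true)     # [2 * 1 / 2, 2 * 0 / 2] = [1.0, 0.0]
```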

Computes the binary cross entropy loss given predicted probabilities.

This function wraps the BinaryCrossEntropy operation.

Parameters:

y_pred (Tensor) – Predicted probabilities.
y_true (Union[Tensor, ArrayLike]) – True binary labels.
**kwargs – Additional keyword arguments.

Returns:

Tensor – The computed binary cross entropy loss.

Examples

>>> from autograd.tensor import Tensor
>>> y_pred = Tensor([0.9, 0.2, 0.1])
>>> y_true = [1, 0, 0]
>>> loss = binary_cross_entropy(y_pred, y_true)

Computes the binary cross entropy loss using logits input for improved numerical stability.

If $y_{pred}$ contains probabilities, use binary_cross_entropy() instead.

The loss is computed as follows, where $\sigma$ is the sigmoid function applied to the logits:

\[ -\left(y_{true} \cdot \log(\sigma(y_{pred})) + (1 - y_{true}) \cdot \log(1 - \sigma(y_{pred}))\right) \]

Parameters:

y_pred (Tensor) – Logits.
y_true (Union[Tensor, ArrayLike]) – True binary labels.
**kwargs – Additional keyword arguments.

Returns:

Tensor – The computed binary cross entropy loss.

Examples

>>> from autograd.tensor import Tensor
>>> y_pred = Tensor([2.0, -1.0, -2.0])
>>> y_true = [1, 0, 0]
>>> loss = binary_cross_entropy_with_logits(y_pred, y_true)

Computes the cross-entropy loss for multi-class classification with logits.

This function expects $y_{pred}$ to be raw logits and $y_{true}$ to be class indices (not one-hot vectors).

Parameters:

y_pred (Tensor) – Raw logits.
y_true (Union[Tensor, ArrayLike]) – True class indices.
ignore_index (int, optional) – Target value to ignore in the loss and gradient. Defaults to -100.
label_smoothing (float, optional) – Label smoothing factor. Defaults to 0.0.
reduction (str, optional) – Either "mean" or "sum". Defaults to "mean".

Returns:

Tensor – The computed cross-entropy loss.

Examples

>>> from autograd.backend import xp
>>> from autograd.tensor import Tensor
>>> y_pred = Tensor(xp.array([[2.0, 1.0, 0.1]]))
>>> y_true = Tensor(xp.array([0]))
>>> loss = cross_entropy(y_pred, y_true, ignore_index=-100, label_smoothing=0.1)

Computes the hinge loss for binary classification.

Parameters:

y_pred (Tensor) – Predicted scores.
y_true (Union[Tensor, ArrayLike]) – True labels.
reduction (str, optional) – Specifies the reduction to apply: “none”, “mean”, or “sum”. Defaults to “none”.
**kwargs – Additional keyword arguments.

Returns:

Tensor – The computed hinge loss.

Examples

>>> from autograd.tensor import Tensor
>>> y_pred = Tensor([0.8, -0.5, 0.3])
>>> y_true = [1, -1, 1]
>>> loss = hinge_loss(y_pred, y_true, reduction="mean")

Computes the Mean Squared Error (MSE) loss.

Parameters:

y_pred (Tensor) – Predicted values.
y_true (Union[Tensor, ArrayLike]) – True values.
**kwargs – Additional keyword arguments.

Returns:

Tensor – The computed MSE loss.

Examples

>>> from autograd.tensor import Tensor
>>> y_pred = Tensor([3.0, 5.0])
>>> y_true = [2.0, 5.0]
>>> loss = mean_squared_loss(y_pred, y_true) # Expected output: 0.5

autograd.init module¶

Weight initialization methods for neural network parameters.

autograd.init.xavier_uniform(tensor: Tensor) → Tensor¶

Applies in-place Xavier Uniform Initialization to the given tensor.

This method initializes the weights of a neural network using the Xavier (Glorot) uniform initialization technique, as described in this paper: https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf

The weight tensor is assumed to have the shape:: (input_size, output_size, additional_dimensions…)

The limit of the uniform distribution is computed from the input and output tensor counts of the given tensor (see compute_in_out_tensor_count()):

\[ \text{limit} = \sqrt{\frac{6}{\text{\# of input tensor count} + \text{\# of output tensor count}}} \]

Parameters:: tensor (Tensor) – The tensor to initialize. Its underlying data should be an array.
Returns:: Tensor – The same tensor after in-place initialization.

Examples

>>> from autograd.backend import xp
>>> from autograd.tensor import Tensor
>>> from autograd.init import xavier_uniform
>>> # Create an uninitialized tensor with shape (3, 4)
>>> tensor = Tensor(xp.empty((3, 4)))
>>> # Initialize the tensor using Xavier Uniform initialization
>>> tensor = xavier_uniform(tensor)
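
The limit formula can be illustrated on a plain NumPy array. This sketch assumes values are drawn from the symmetric range $[-\text{limit}, \text{limit}]$, as in the Glorot paper; the helper names `xavier_limit` and `xavier_uniform_array` and the `seed` parameter are hypothetical and not part of the library.

```python
import numpy as np


def xavier_limit(fan_in, fan_out):
    """limit = sqrt(6 / (input count + output count))."""
    return float(np.sqrt(6.0 / (fan_in + fan_out)))


def xavier_uniform_array(shape, fan_in, fan_out, seed=0):
    """Sample an array of the given shape uniformly from [-limit, limit]."""
    rng = np.random.default_rng(seed)  # fixed seed only for reproducibility here
    limit = xavier_limit(fan_in, fan_out)
    return rng.uniform(-limit, limit, size=shape)


limit = xavier_limit(3, 4)  # sqrt(6 / 7)
weights = xavier_uniform_array((3, 4), fan_in=3, fan_out=4)
```

Larger layers yield a smaller limit, which keeps the variance of activations roughly constant across layers.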

autograd.init.compute_in_out_tensor_count(tensor: Tensor) → tuple[int, int]¶

Computes the number of input and output tensor counts for the given tensor.

For convolution kernels:

tensor.shape[0] is assumed to represent the number of output channels.
tensor.shape[1] is assumed to represent the number of input channels.
The remaining dimensions correspond to the spatial kernel sizes (e.g., kernel height, kernel width).

The counts are computed as follows:

\[\begin{split} \begin{align} \text{\# of input tensor count} &= \text{tensor.shape}[0] \times \prod_{i=2}^{n} \text{tensor.shape}[i] \\ \text{\# of output tensor count} &= \text{tensor.shape}[-1] \times \prod_{i=2}^{n} \text{tensor.shape}[i] \end{align} \end{split}\]

where $ n $ is the number of dimensions in the tensor.

Parameters:: tensor (Tensor) – The tensor for which to compute the number of input and output tensor counts. The tensor must have at least 2 dimensions.
Returns:: Tuple[int, int] – A tuple containing (input_tensor_count, output_tensor_count).
Raises:: ValueError – If the tensor has fewer than 2 dimensions.

Examples

>>> from autograd.backend import xp
>>> from autograd.tensor import Tensor
>>> from autograd.init import compute_in_out_tensor_count
>>> # Example for a fully-connected layer weight matrix with shape (fan_in, fan_out)
>>> tensor_fc = Tensor(xp.empty((5, 10)))
>>> compute_in_out_tensor_count(tensor_fc.data)
(5, 10)
>>> # Example for a convolution kernel with shape
>>> # (output_channels, input_channels, kernel_height, kernel_width)
>>> tensor_conv = Tensor(xp.empty((16, 3, 3, 3)))
>>> # The receptive field size is 3*3 = 9
>>> # So, input tensor count = 16 * 9 = 144, output tensor count = 3 * 9 = 27
>>> compute_in_out_tensor_count(tensor_conv.data)
(144, 27)
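
The two example outputs above can be reproduced with a short shape-only sketch. The function name `in_out_counts` is hypothetical, and using the second dimension for the output count is an assumption chosen to match the documented outputs, not a statement about the library's code.

```python
import math


def in_out_counts(shape):
    """Return (input count, output count) for a weight shape with >= 2 dims."""
    if len(shape) < 2:
        raise ValueError("shape must have at least 2 dimensions")
    # Product of the trailing (spatial) dimensions; equals 1 for a 2D shape.
    receptive_field = math.prod(shape[2:])
    return shape[0] * receptive_field, shape[1] * receptive_field


fc_counts = in_out_counts((5, 10))           # (5, 10)
conv_counts = in_out_counts((16, 3, 3, 3))   # (16 * 9, 3 * 9) = (144, 27)
```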

autograd.logger module¶

class autograd.logger.ColorFormatter(fmt=None, datefmt=None, style='%', validate=True, *, defaults=None)¶

Bases: Formatter

A logging formatter that adds ANSI color codes to log messages based on their severity level.

This formatter maps each logging level to a specific color for improved readability in the console output.

Examples

>>> import logging
>>> from autograd.logger import ColorFormatter
>>> # Create a logger and attach a stream handler with the ColorFormatter.
>>> logger = logging.getLogger("example")
>>> handler = logging.StreamHandler()
>>> handler.setFormatter(ColorFormatter())
>>> logger.addHandler(handler)
>>> logger.setLevel(logging.DEBUG)
>>> logger.debug("This is a debug message.")
>>> logger.info("This is an info message.")
>>> logger.warning("This is a warning message.")
>>> logger.error("This is an error message.")
>>> logger.critical("This is a critical message.")
# Each message will appear in a color corresponding to its log level.

grey = '\x1b[38;20m'¶

yellow = '\x1b[33;20m'¶

red = '\x1b[31;20m'¶

bold_red = '\x1b[31;1m'¶

cyan = '\x1b[36;20m'¶

green = '\x1b[32;20m'¶

reset = '\x1b[0m'¶

FORMATS = {10: '\x1b[36;20m', 20: '\x1b[32;20m', 30: '\x1b[33;20m', 40: '\x1b[31;20m', 50: '\x1b[31;1m'}¶

format(record)¶

Format the specified record as a colored log message.

The method selects a color based on the log record’s level and applies it to the formatted message.

Parameters:: record (logging.LogRecord) – The log record to be formatted.
Returns:: str – The formatted log message with ANSI color codes.
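
A level-to-color lookup of this kind can be sketched with the standard `logging` module alone. The class name `SketchColorFormatter` is hypothetical; the color codes are the ones listed in FORMATS above, but the way the real `format` method wraps the message is an assumption.

```python
import logging


class SketchColorFormatter(logging.Formatter):
    """Wrap each formatted record in the ANSI color mapped to its level."""

    COLORS = {
        logging.DEBUG: "\x1b[36;20m",     # cyan
        logging.INFO: "\x1b[32;20m",      # green
        logging.WARNING: "\x1b[33;20m",   # yellow
        logging.ERROR: "\x1b[31;20m",     # red
        logging.CRITICAL: "\x1b[31;1m",   # bold red
    }
    RESET = "\x1b[0m"

    def format(self, record):
        message = super().format(record)
        color = self.COLORS.get(record.levelno, "")
        # Unknown levels are returned uncolored.
        return f"{color}{message}{self.RESET}" if color else message


# Build a record by hand so the formatter can be exercised without a handler.
record = logging.LogRecord(
    name="example", level=logging.WARNING, pathname="sketch.py", lineno=1,
    msg="disk almost full", args=(), exc_info=None,
)
formatted = SketchColorFormatter("%(levelname)s: %(message)s").format(record)
```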

autograd.logger.setup_logger(name=None)¶

Set up and configure a logger with colored console output.

This function creates a logger with the given name and sets its level to DEBUG if the environment variable DEBUG is set, or to INFO otherwise. It then attaches a console handler with a color formatter and returns the configured logger.

Parameters:: name (str, optional) – The name of the logger. Defaults to None.
Returns:: logging.Logger – The configured logger instance.

Examples

>>> import os
>>> # Optionally set the DEBUG environment variable to enable debug logging.
>>> os.environ["DEBUG"] = "1"
>>> from autograd.logger import setup_logger
>>> logger = setup_logger("my_logger")
>>> logger.info("This is an informational message.")
>>> logger.error("This is an error message.")
# The output in the console will display the messages in colors corresponding to their log levels.
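
The environment-driven level selection described above can be sketched with the standard library. The function name `sketch_setup_logger`, the plain (uncolored) format string, and the duplicate-handler guard are assumptions made for illustration, not the library's behavior.

```python
import logging
import os


def sketch_setup_logger(name=None):
    """Return a console logger at DEBUG if $DEBUG is set, else INFO."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if os.environ.get("DEBUG") else logging.INFO)
    # Assumption: attach the handler only once so repeated calls do not
    # produce duplicate output lines.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    return logger
```

Calling the function twice with the same name returns the same logger object, since `logging.getLogger` caches loggers by name.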

Module contents¶

Subpackages¶

autograd.text package
