Hyperbolic tangent activation function that maps inputs to outputs between -1 and 1, offering zero-centered outputs and smooth gradients.
Tanh (Hyperbolic Tangent)
Tanh (hyperbolic tangent) is an activation function that maps input values to outputs between -1 and 1. It’s essentially a scaled and shifted version of the sigmoid function, providing zero-centered outputs which can lead to more efficient learning in neural networks. Tanh was historically popular and remains useful in specific applications, particularly recurrent neural networks.
Mathematical Definition
Tanh Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Alternative Forms
- Exponential form: tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
- Sigmoid relation: tanh(x) = 2σ(2x) - 1
- Inverse relation: σ(x) = (1 + tanh(x/2)) / 2
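As a quick sanity check, the alternative forms above can be verified numerically against NumPy's built-in np.tanh (the sample points below are arbitrary):
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
exp_form = (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)  # exponential form
sig_form = 2 * sigmoid(2 * x) - 1                     # sigmoid relation
assert np.allclose(np.tanh(x), exp_form)
assert np.allclose(np.tanh(x), sig_form)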
Properties
- Range: (-1, 1)
- Zero-centered: Output centered around zero
- Odd function: tanh(-x) = -tanh(x)
- Smooth: Infinitely differentiable
Key Characteristics
S-Shaped Curve Similar to sigmoid but centered at zero:
- Steep transition: Around x = 0
- Saturation: Approaches ±1 for large |x|
- Zero-crossing: Passes through origin
- Symmetry: Odd function symmetry
Zero-Centered Output Advantage over sigmoid:
- Mean activation: Around zero
- Efficient gradients: Faster convergence
- Weight updates: More efficient learning
- No positive bias: Unlike sigmoid, activations are not all positive, so weight gradients are not systematically pushed in one direction
Derivative Properties Gradient characteristics:
- Formula: tanh’(x) = 1 - tanh²(x)
- Maximum: At x = 0 where tanh’(0) = 1
- Range: (0, 1]
- Self-referential: Derivative in terms of function value
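A minimal check of the self-referential derivative formula, comparing 1 - tanh²(x) against a centered finite-difference estimate (the step size h is arbitrary):
import numpy as np

x = np.linspace(-3, 3, 7)
h = 1e-5
analytic = 1 - np.tanh(x) ** 2                         # tanh'(x) = 1 - tanh^2(x)
numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)  # centered finite difference
assert np.allclose(analytic, numeric, atol=1e-8)
print(analytic[3])  # tanh'(0) = 1.0, the maximum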
Advantages
Zero-Centered Outputs Primary benefit over sigmoid:
- Balanced activation: Positive and negative outputs
- Gradient efficiency: More efficient weight updates
- Faster convergence: Often converges faster than sigmoid
- Reduced bias shift: Mean activations stay near zero from layer to layer
Strong Gradients Better gradient properties:
- Maximum gradient: 1 at origin (vs 0.25 for sigmoid)
- Stronger learning signal: Facilitates learning
- Less saturation near zero: Gradients remain strong for inputs close to the origin
- Better backpropagation: Stronger gradient flow
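For illustration, the maximum gradients quoted above can be computed directly; tanh'(0) = 1 versus σ'(0) = 0.25:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

print(1 - np.tanh(0.0) ** 2)              # 1.0  = tanh'(0)
print(sigmoid(0.0) * (1 - sigmoid(0.0)))  # 0.25 = sigmoid'(0)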
Smooth Non-Linearity Desirable mathematical properties:
- Differentiable: Smooth everywhere
- Bounded: Output range well-controlled
- Monotonic: Preserves input ordering
- Continuous: No discontinuities
Applications
Recurrent Neural Networks Traditional choice for RNNs:
- Hidden states: RNN hidden state activation
- Vanilla RNNs: Standard activation choice
- Memory: Bounded activation helps with stability
- Gradient flow: Better than sigmoid for sequences
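A minimal sketch of a vanilla RNN hidden-state update with tanh; the weight names (W_xh, W_hh, b_h) and sizes are illustrative, not taken from any particular library:
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h): bounded in (-1, 1)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a sequence of 5 inputs
    h = rnn_step(x_t, h)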
LSTM Components Specific uses in LSTM cells:
- Cell state: tanh for cell state candidates
- Output: tanh of the cell state is multiplied by the sigmoid output gate (see the sketch after this list)
- Hidden state: Final hidden state activation
- Information processing: Bounded information flow
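A schematic single LSTM step showing where sigmoid (gates) and tanh (cell candidate and hidden state) appear; this is a sketch of the standard LSTM equations with arbitrary weights, not code from a specific framework:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W: (4*hidden, input+hidden), b: (4*hidden,) -- all four gate blocks stacked
    z = W @ np.concatenate([x_t, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input/forget/output gates in (0, 1)
    g = np.tanh(g)                                # cell-state candidate: tanh
    c_t = f * c_prev + i * g                      # new cell state
    h_t = o * np.tanh(c_t)                        # hidden state: tanh of cell state, gated
    return h_t, c_t

hidden, inp = 16, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * hidden, inp + hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inp), h, c, W, b)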
Traditional Neural Networks Historical usage:
- Hidden layers: Before ReLU became standard
- Multi-layer perceptrons: Classic activation choice
- Feature learning: Non-linear feature transformations
- Universal approximation: Theoretical results
Normalization Contexts When bounded outputs are needed:
- Feature normalization: Bounded feature values
- Control signals: When outputs need bounds
- Regularization: Implicit regularization through bounds
- Stability: Preventing explosive activations
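One common pattern when bounds are required: squash an unbounded value through tanh and rescale it into a target range; the function name and the range (-5, 5) below are illustrative:
import numpy as np

def bounded_output(raw, lo=-5.0, hi=5.0):
    # tanh squashes raw into (-1, 1); an affine rescale maps that into (lo, hi)
    return lo + (np.tanh(raw) + 1.0) * (hi - lo) / 2.0

print(bounded_output(np.array([-100.0, 0.0, 100.0])))  # never exceeds the [-5, 5] bounds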
Disadvantages
Vanishing Gradients Primary limitation:
- Saturation: Gradients vanish for large |x|
- Deep networks: Problem compounds in deep architectures
- Slow learning: Deep layers learn very slowly
- Training difficulty: Harder to train very deep networks
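The vanishing-gradient effect can be illustrated by composing layer Jacobians through a stack of tanh layers; the depth, width, and weight scale below are arbitrary toy values:
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 16, 20
x = rng.normal(size=dim)
J = np.eye(dim)  # Jacobian of the current activation w.r.t. the network input
for k in range(depth):
    W = rng.normal(scale=1.0 / np.sqrt(dim), size=(dim, dim))
    x = np.tanh(W @ x)
    J = np.diag(1 - x ** 2) @ W @ J  # chain rule: diag(tanh'(z)) . W
    if (k + 1) % 5 == 0:
        print(f"depth {k + 1:2d}: Jacobian norm ~ {np.linalg.norm(J):.2e}")  # shrinks with depth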
Computational Cost More expensive than alternatives:
- Exponential functions: Requires exponential computations
- Hyperbolic functions: More complex than ReLU
- Hardware: Less optimized than ReLU
- Mobile deployment: Computational overhead
Saturation Problems Similar to sigmoid issues:
- Large inputs: Gradients approach zero
- Saturated neurons: Units pushed into the flat tails receive almost no gradient and stop learning
- Initialization sensitivity: Poor initialization problematic
- Learning stagnation: Training can plateau
Implementation
Standard Implementation Direct mathematical computation:
import numpy as np

def tanh(x):
    return np.tanh(x)  # Built-in function

# Or manual implementation (note: np.exp(2 * x) overflows for large x;
# see the numerically stable version below)
def tanh_manual(x):
    exp_2x = np.exp(2 * x)
    return (exp_2x - 1) / (exp_2x + 1)
Numerical Stability Avoiding overflow in the exponential form (np.tanh itself is already overflow-safe):
def tanh_stable(x):
    # Clip inputs before exponentiating; tanh has already saturated to ±1 by |x| ≈ 20
    x_clipped = np.clip(x, -20, 20)
    exp_2x = np.exp(2 * x_clipped)
    return (exp_2x - 1) / (exp_2x + 1)
Efficient Gradient Computation Using function value for gradient:
def tanh_gradient(x):
    tanh_x = np.tanh(x)
    return 1 - tanh_x**2
Training Considerations
Initialization Weight initialization with tanh:
- Xavier/Glorot: Derived for symmetric, saturating activations such as tanh
- Uniform distribution: [-√(6/(fan_in + fan_out)), √(6/(fan_in + fan_out))]
- Normal distribution: std = √(2/(fan_in + fan_out))
- Avoid poor initialization: Prevents saturation
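A minimal sketch of the Xavier/Glorot schemes above; glorot_uniform and glorot_normal are illustrative helper names, with fan_in and fan_out denoting the layer's input and output widths:
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    # U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def glorot_normal(fan_in, fan_out, rng=None):
    # N(0, std^2) with std = sqrt(2 / (fan_in + fan_out))
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = glorot_uniform(256, 128)  # keeps tanh pre-activations away from the saturated tails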
Learning Rates Rate selection for tanh networks:
- Moderate rates: Tanh networks usually need lower learning rates than ReLU networks
- Adaptive optimizers: Adam, RMSprop work well
- Scheduling: Learning rate decay helpful
- Gradient clipping: May be necessary for RNNs
Regularization Combining with regularization techniques:
- Dropout: Applied after tanh activation
- Batch normalization: Can help with saturation
- Weight decay: L2 regularization on weights
- Gradient clipping: For RNN applications
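A sketch of gradient clipping by global norm, as commonly used when training tanh RNNs; the function name and the max_norm value are illustrative:
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # Rescale all gradients together if their joint L2 norm exceeds max_norm
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.full((3, 3), 10.0), np.full(3, 10.0)]
clipped = clip_by_global_norm(grads)  # joint norm now at most 5.0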
Comparison with Other Activations
Tanh vs Sigmoid Key differences:
- Range: tanh (-1,1) vs sigmoid (0,1)
- Zero-centered: tanh yes, sigmoid no
- Gradients: tanh stronger maximum gradient
- Convergence: tanh often faster
Tanh vs ReLU Modern comparison:
- Vanishing gradients: ReLU solves, tanh suffers
- Computational cost: ReLU much faster
- Sparsity: ReLU promotes sparsity, tanh doesn’t
- Deep networks: ReLU enables deeper architectures
Tanh vs Modern Activations Against contemporary alternatives:
- Smoothness: Tanh is smooth everywhere; ReLU is not differentiable at zero
- Performance: Modern activations often better
- Specialized uses: Tanh still useful in specific contexts
- Historical: Tanh important historically
Modern Usage
Specialized Applications Current tanh usage:
- RNN gates: Combined with sigmoid in LSTM
- Output normalization: When bounded outputs needed
- Feature scaling: Explicit feature normalization
- Control systems: Bounded control signals
Hybrid Approaches Combining with other functions:
- ReLU-tanh: ReLU in hidden layers, tanh at the output layer
- Attention mechanisms: Tanh in some attention variants
- Generative models: Output layers in GANs
- Ensemble: Mixed activation strategies
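When a generator's output layer uses tanh (as in DCGAN-style models), training images are typically rescaled into [-1, 1] and generated samples mapped back; the helper names below are illustrative:
import numpy as np

def to_tanh_range(pixels):
    # uint8 pixels in [0, 255] -> floats in [-1, 1], matching a tanh output layer
    return pixels.astype(np.float32) / 127.5 - 1.0

def from_tanh_range(outputs):
    # tanh outputs in (-1, 1) -> uint8 pixels in [0, 255]
    return np.clip(np.rint((outputs + 1.0) * 127.5), 0, 255).astype(np.uint8)

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
assert np.array_equal(from_tanh_range(to_tanh_range(img)), img)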
Best Practices
When to Use Tanh Appropriate scenarios:
- RNN hidden states: Traditional choice
- Bounded outputs: When output range matters
- Zero-centered needs: When centering important
- Smooth gradients: When smoothness required
Implementation Guidelines
- Use proper initialization schemes
- Monitor for saturation during training
- Consider numerical stability
- Apply appropriate regularization
Training Strategies
- Use moderate learning rates
- Apply gradient clipping for RNNs
- Monitor activation distributions
- Consider batch normalization
Migration Strategies Moving from tanh to modern activations:
- Start simple: Replace with ReLU first
- Compare performance: Validate improvements
- Gradual replacement: Layer-by-layer migration
- Specialized uses: Keep tanh where appropriate
While tanh has been largely superseded by ReLU for general use, it remains valuable for specific applications requiring zero-centered, bounded activations, particularly in recurrent networks and output layers where its mathematical properties provide distinct advantages.