Tanh

Hyperbolic tangent activation function that maps inputs to outputs between -1 and 1, offering zero-centered outputs and smooth gradients.


Tanh (Hyperbolic Tangent)

Tanh (hyperbolic tangent) is an activation function that maps input values to outputs between -1 and 1. It’s essentially a scaled and shifted version of the sigmoid function, providing zero-centered outputs which can lead to more efficient learning in neural networks. Tanh was historically popular and remains useful in specific applications, particularly recurrent neural networks.

Mathematical Definition

Tanh Formula tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Alternative Forms

  • Exponential form: tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
  • Sigmoid relation: tanh(x) = 2σ(2x) - 1
  • Sigmoid via tanh: σ(x) = (1 + tanh(x/2)) / 2
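
These identities are easy to sanity-check numerically. A minimal sketch, assuming NumPy, comparing the exponential form and the sigmoid relation against np.tanh:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.linspace(-5, 5, 11)

exp_form = (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)   # exponential form
sig_form = 2 * sigmoid(2 * x) - 1                       # sigmoid relation

assert np.allclose(exp_form, np.tanh(x))
assert np.allclose(sig_form, np.tanh(x))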

Properties

  • Range: (-1, 1)
  • Zero-centered: Output centered around zero
  • Odd function: tanh(-x) = -tanh(x)
  • Smooth: Infinitely differentiable

Key Characteristics

S-Shaped Curve Similar to sigmoid but centered at the origin:

  • Steep transition: Around x = 0
  • Saturation: Approaches ±1 for large |x|
  • Zero-crossing: Passes through origin
  • Symmetry: Odd function symmetry

Zero-Centered Output Advantage over sigmoid (illustrated below):

  • Mean activation: Close to zero for roughly symmetric inputs
  • Balanced gradients: Weight gradients in the next layer are not all forced to share one sign
  • Weight updates: Less zig-zagging during gradient descent
  • Reduced shift: Keeps downstream layer inputs roughly centered
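
The effect is easy to see on symmetric inputs; in this minimal sketch (assuming NumPy), tanh activations average near 0 while sigmoid activations average near 0.5:

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)              # symmetric, zero-mean pre-activations

print(np.tanh(z).mean())                 # close to 0
print((1 / (1 + np.exp(-z))).mean())     # close to 0.5 (sigmoid)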

Derivative Properties Gradient characteristics (checked numerically below):

  • Formula: tanh’(x) = 1 - tanh²(x)
  • Maximum: At x = 0 where tanh’(0) = 1
  • Range: (0, 1]
  • Self-referential: Derivative in terms of function value
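
The derivative identity can be verified with a quick central-difference check (a minimal sketch, assuming NumPy):

import numpy as np

x = np.linspace(-3, 3, 13)
eps = 1e-5

numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
analytic = 1 - np.tanh(x)**2             # tanh'(x) = 1 - tanh(x)^2

assert np.allclose(numeric, analytic, atol=1e-8)
print(analytic.max())                    # 1.0, attained at x = 0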

Advantages

Zero-Centered Outputs Primary benefit over sigmoid:

  • Balanced activation: Positive and negative outputs
  • Gradient efficiency: More efficient weight updates
  • Faster convergence: Often converges faster than sigmoid
  • Centered activations: Downstream layers see roughly zero-mean inputs

Strong Gradients Better gradient properties:

  • Maximum gradient: 1 at origin (vs 0.25 for sigmoid)
  • Stronger learning signal: Facilitates learning
  • Less saturation: Around zero region
  • Better backpropagation: Stronger gradient flow

Smooth Non-Linearity Desirable mathematical properties:

  • Differentiable: Smooth everywhere
  • Bounded: Output range well-controlled
  • Monotonic: Preserves input ordering
  • Continuous: No discontinuities

Applications

Recurrent Neural Networks Traditional choice for RNNs (a minimal cell sketch follows this list):

  • Hidden states: RNN hidden state activation
  • Vanilla RNNs: Standard activation choice
  • Memory: Bounded activation helps with stability
  • Gradient flow: Better than sigmoid for sequences
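
The sketch below shows the hidden-state role in a vanilla RNN step; the weight names and toy dimensions are illustrative, not tied to any particular library:

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # h_t = tanh(W_xh x_t + W_hh h_prev + b_h), bounded in (-1, 1)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):      # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)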

LSTM Components Specific uses in LSTM cells (sketched in code after the list):

  • Cell state: tanh for cell state candidates
  • Output gate: Combined with sigmoid gates
  • Hidden state: Final hidden state activation
  • Information processing: Bounded information flow
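
The division of labor can be sketched in a few lines of NumPy: sigmoid gates scale information in (0, 1) while tanh bounds the cell-state candidate and hidden state in (-1, 1). The stacked parameter layout below is an illustrative convention, not a library API:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)                  # input, forget, output gates + candidate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)                               # cell-state candidate
    c_t = f * c_prev + i * g                     # updated cell state
    h_t = o * np.tanh(c_t)                       # bounded hidden state
    return h_t, c_t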

Traditional Neural Networks Historical usage:

  • Hidden layers: Before ReLU became standard
  • Multi-layer perceptrons: Classic activation choice
  • Feature learning: Non-linear feature transformations
  • Universal approximation: Theoretical results

Normalization Contexts When bounded outputs are needed:

  • Feature normalization: Bounded feature values
  • Control signals: When outputs need bounds
  • Regularization: Implicit regularization through bounds
  • Stability: Preventing explosive activations

Disadvantages

Vanishing Gradients Primary limitation (illustrated numerically below):

  • Saturation: Gradients vanish for large |x|
  • Deep networks: Problem compounds in deep architectures
  • Slow learning: Deep layers learn very slowly
  • Training difficulty: Harder to train very deep networks
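
The compounding effect is easy to see numerically: the local derivative of a saturated unit is small, and such factors multiply along the backpropagation path (a minimal sketch, assuming NumPy; weights are ignored to isolate the activation factor):

import numpy as np

local_grad = 1 - np.tanh(2.5)**2         # ~0.027 for a moderately saturated unit
print(local_grad)
print(local_grad ** 10)                  # effectively zero after 10 such layers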

Computational Cost More expensive than alternatives:

  • Exponential functions: Requires exponential computations
  • Hyperbolic functions: More complex than ReLU
  • Hardware: Less optimized than ReLU
  • Mobile deployment: Computational overhead

Saturation Problems Similar to sigmoid issues:

  • Large inputs: Gradients approach zero
  • Saturated neurons: Units can get stuck in the flat regions of the curve
  • Initialization sensitivity: Poor initialization problematic
  • Learning stagnation: Training can plateau

Implementation

Standard Implementation Direct mathematical computation:

import numpy as np

def tanh(x):
    return np.tanh(x)  # Built-in function

# Or manual implementation (overflows for large positive x; see the stable form below)
def tanh_manual(x):
    exp_2x = np.exp(2 * x)
    return (exp_2x - 1) / (exp_2x + 1)

Numerical Stability Avoiding overflow in the manual form (np.tanh itself is already overflow-safe):

def tanh_stable(x):
    # exp(2x) overflows for large positive x, so use exp(-2|x|) instead,
    # which stays in (0, 1], and restore the sign at the end
    x = np.asarray(x, dtype=float)
    e = np.exp(-2 * np.abs(x))
    return np.sign(x) * (1 - e) / (1 + e)

Efficient Gradient Computation Using the function value for the gradient:

def tanh_gradient(x):
    tanh_x = np.tanh(x)      # in practice, reuse the activation saved from the forward pass
    return 1 - tanh_x**2     # tanh'(x) = 1 - tanh(x)^2

Training Considerations

Initialization Weight initialization with tanh (see the sketch after this list):

  • Xavier/Glorot: Designed specifically for tanh
  • Uniform distribution: [-√(6/(fan_in + fan_out)), √(6/(fan_in + fan_out))]
  • Normal distribution: std = √(2/(fan_in + fan_out))
  • Avoid poor initialization: Prevents saturation
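
A minimal NumPy sketch of Glorot-uniform initialization for a tanh layer, using the bound above (the helper name here is illustrative, not a library function):

import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W = glorot_uniform(fan_in=256, fan_out=128)    # weights for a 256 -> 128 tanh layer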

Learning Rates Rate selection for tanh networks:

  • Moderate rates: Usually requires lower rates than ReLU
  • Adaptive optimizers: Adam, RMSprop work well
  • Scheduling: Learning rate decay helpful
  • Gradient clipping: May be necessary for RNNs

Regularization Combining with regularization techniques:

  • Dropout: Applied after tanh activation
  • Batch normalization: Can help with saturation
  • Weight decay: L2 regularization on weights
  • Gradient clipping: For RNN applications

Comparison with Other Activations

Tanh vs Sigmoid Key differences:

  • Range: tanh (-1,1) vs sigmoid (0,1)
  • Zero-centered: tanh yes, sigmoid no
  • Gradients: tanh stronger maximum gradient
  • Convergence: tanh often faster

Tanh vs ReLU Modern comparison:

  • Vanishing gradients: ReLU solves, tanh suffers
  • Computational cost: ReLU much faster
  • Sparsity: ReLU promotes sparsity, tanh doesn’t
  • Deep networks: ReLU enables deeper architectures

Tanh vs Modern Activations Against contemporary alternatives:

  • Smoothness: Tanh smooth, ReLU not
  • Performance: Modern activations often better
  • Specialized uses: Tanh still useful in specific contexts
  • Historical: Tanh important historically

Modern Usage

Specialized Applications Current tanh usage (an example follows the list):

  • RNN gates: Combined with sigmoid in LSTM
  • Output normalization: When bounded outputs needed
  • Feature scaling: Explicit feature normalization
  • Control systems: Bounded control signals
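
When outputs must lie in an arbitrary range, the tanh output in (-1, 1) can be rescaled with an affine map; the bounds in this sketch are placeholders:

import numpy as np

def bounded_output(z, low, high):
    # squash to (-1, 1), then rescale to (low, high)
    return low + (np.tanh(z) + 1) * (high - low) / 2

u = bounded_output(np.array([-3.0, 0.0, 3.0]), low=-0.5, high=0.5)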

Hybrid Approaches Combining with other functions:

  • ReLU-tanh: ReLU hidden, tanh output
  • Attention mechanisms: Tanh in some attention variants
  • Generative models: Output layers in GANs
  • Ensemble: Mixed activation strategies

Best Practices

When to Use Tanh Appropriate scenarios:

  • RNN hidden states: Traditional choice
  • Bounded outputs: When output range matters
  • Zero-centered needs: When centering important
  • Smooth gradients: When smoothness required

Implementation Guidelines

  • Use proper initialization schemes
  • Monitor for saturation during training
  • Consider numerical stability
  • Apply appropriate regularization

Training Strategies

  • Use moderate learning rates
  • Apply gradient clipping for RNNs
  • Monitor activation distributions
  • Consider batch normalization

Migration Strategies Moving from tanh to modern activations:

  • Start simple: Replace with ReLU first
  • Compare performance: Validate improvements
  • Gradual replacement: Layer-by-layer migration
  • Specialized uses: Keep tanh where appropriate

While tanh has been largely superseded by ReLU for general use, it remains valuable for specific applications requiring zero-centered, bounded activations, particularly in recurrent networks and output layers where its mathematical properties provide distinct advantages.
