Hyperbolic tangent activation function that maps inputs to outputs between -1 and 1, offering zero-centered outputs and smooth gradients.
Tanh (Hyperbolic Tangent)
Tanh (hyperbolic tangent) is an activation function that maps input values to outputs between -1 and 1. It’s essentially a scaled and shifted version of the sigmoid function, providing zero-centered outputs which can lead to more efficient learning in neural networks. Tanh was historically popular and remains useful in specific applications, particularly recurrent neural networks.
Mathematical Definition
Tanh Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Alternative Forms
- Exponential form: tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
- Sigmoid relation: tanh(x) = 2σ(2x) - 1
- Inverse relation: σ(x) = (1 + tanh(x/2)) / 2
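As a quick sanity check, the alternative forms above can be verified numerically against NumPy's built-in np.tanh (the sample points below are arbitrary):
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
exp_form = (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)  # exponential form
sig_form = 2 * sigmoid(2 * x) - 1                     # sigmoid relation
assert np.allclose(np.tanh(x), exp_form)
assert np.allclose(np.tanh(x), sig_form)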
Properties
- Range: (-1, 1)
- Zero-centered: Output centered around zero
- Odd function: tanh(-x) = -tanh(x)
- Smooth: Infinitely differentiable
Key Characteristics
S-Shaped Curve Similar to sigmoid but centered at zero:
- Steep transition: Around x = 0
- Saturation: Approaches ±1 for large |x|
- Zero-crossing: Passes through origin
- Symmetry: Odd function symmetry
Zero-Centered Output Advantage over sigmoid:
- Mean activation: Around zero
- Efficient gradients: Faster convergence
- Weight updates: More efficient learning
- No positive bias: Unlike sigmoid, activations are not all positive, so weight gradients are not systematically pushed in one direction
Derivative Properties Gradient characteristics:
- Formula: tanh’(x) = 1 - tanh²(x)
- Maximum: At x = 0 where tanh’(0) = 1
- Range: (0, 1]
- Self-referential: Derivative in terms of function value
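A minimal check of the self-referential derivative formula, comparing 1 - tanh²(x) against a centered finite-difference estimate (the step size h is arbitrary):
import numpy as np

x = np.linspace(-3, 3, 7)
h = 1e-5
analytic = 1 - np.tanh(x) ** 2                         # tanh'(x) = 1 - tanh^2(x)
numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)  # centered finite difference
assert np.allclose(analytic, numeric, atol=1e-8)
print(analytic[3])  # tanh'(0) = 1.0, the maximum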
Advantages
Zero-Centered Outputs Primary benefit over sigmoid:
- Balanced activation: Positive and negative outputs
- Gradient efficiency: More efficient weight updates
- Faster convergence: Often converges faster than sigmoid
- Reduced bias shift: Mean activations stay near zero from layer to layer
Strong Gradients Better gradient properties:
- Maximum gradient: 1 at origin (vs 0.25 for sigmoid)
- Stronger learning signal: Facilitates learning
- Less saturation near zero: Gradients remain strong for inputs close to the origin
- Better backpropagation: Stronger gradient flow
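For illustration, the maximum gradients quoted above can be computed directly; tanh'(0) = 1 versus σ'(0) = 0.25:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

print(1 - np.tanh(0.0) ** 2)              # 1.0  = tanh'(0)
print(sigmoid(0.0) * (1 - sigmoid(0.0)))  # 0.25 = sigmoid'(0)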
Smooth Non-Linearity Desirable mathematical properties:
- Differentiable: Smooth everywhere
- Bounded: Output range well-controlled
- Monotonic: Preserves input ordering
- Continuous: No discontinuities
Applications
Recurrent Neural Networks Traditional choice for RNNs:
- Hidden states: RNN hidden state activation
- Vanilla RNNs: Standard activation choice
- Memory: Bounded activation helps with stability
- Gradient flow: Better than sigmoid for sequences
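A minimal sketch of a vanilla RNN hidden-state update with tanh; the weight names (W_xh, W_hh, b_h) and sizes are illustrative, not taken from any particular library:
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h): bounded in (-1, 1)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):  # a sequence of 5 inputs
    h = rnn_step(x_t, h)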
LSTM Components Specific uses in LSTM cells:
- Cell state: tanh for cell state candidates
- Output: tanh of the cell state is multiplied by the sigmoid output gate (see the sketch after this list)
- Hidden state: Final hidden state activation
- Information processing: Bounded information flow
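A schematic single LSTM step showing where sigmoid (gates) and tanh (cell candidate and hidden state) appear; this is a sketch of the standard LSTM equations with arbitrary weights, not code from a specific framework:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W: (4*hidden, input+hidden), b: (4*hidden,) -- all four gate blocks stacked
    z = W @ np.concatenate([x_t, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input/forget/output gates in (0, 1)
    g = np.tanh(g)                                # cell-state candidate: tanh
    c_t = f * c_prev + i * g                      # new cell state
    h_t = o * np.tanh(c_t)                        # hidden state: tanh of cell state, gated
    return h_t, c_t

hidden, inp = 16, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * hidden, inp + hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inp), h, c, W, b)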
Traditional Neural Networks Historical usage:
- Hidden layers: Before ReLU became standard
- Multi-layer perceptrons: Classic activation choice
- Feature learning: Non-linear feature transformations
- Universal approximation: Theoretical results
Normalization Contexts When bounded outputs are needed:
- Feature normalization: Bounded feature values
- Control signals: When outputs need bounds
- Regularization: Implicit regularization through bounds
- Stability: Preventing explosive activations
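One common pattern when bounds are required: squash an unbounded value through tanh and rescale it into a target range; the function name and the range (-5, 5) below are illustrative:
import numpy as np

def bounded_output(raw, lo=-5.0, hi=5.0):
    # tanh squashes raw into (-1, 1); an affine rescale maps that into (lo, hi)
    return lo + (np.tanh(raw) + 1.0) * (hi - lo) / 2.0

print(bounded_output(np.array([-100.0, 0.0, 100.0])))  # never exceeds the [-5, 5] bounds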
Disadvantages
Vanishing Gradients Primary limitation:
- Saturation: Gradients vanish for large |x|
- Deep networks: Problem compounds in deep architectures
- Slow learning: Deep layers learn very slowly
- Training difficulty: Harder to train very deep networks
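The vanishing-gradient effect can be illustrated by composing layer Jacobians through a stack of tanh layers; the depth, width, and weight scale below are arbitrary toy values:
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 16, 20
x = rng.normal(size=dim)
J = np.eye(dim)  # Jacobian of the current activation w.r.t. the network input
for k in range(depth):
    W = rng.normal(scale=1.0 / np.sqrt(dim), size=(dim, dim))
    x = np.tanh(W @ x)
    J = np.diag(1 - x ** 2) @ W @ J  # chain rule: diag(tanh'(z)) . W
    if (k + 1) % 5 == 0:
        print(f"depth {k + 1:2d}: Jacobian norm ~ {np.linalg.norm(J):.2e}")  # shrinks with depth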
Computational Cost More expensive than alternatives:
- Exponential functions: Requires exponential computations
- Hyperbolic functions: More complex than ReLU
- Hardware: Less optimized than ReLU
- Mobile deployment: Computational overhead
Saturation Problems Similar to sigmoid issues:
- Large inputs: Gradients approach zero
- Saturated neurons: Units pushed into the flat tails receive almost no gradient and stop learning
- Initialization sensitivity: Poor initialization problematic
- Learning stagnation: Training can plateau
Implementation
Standard Implementation Direct mathematical computation:
import numpy as np

def tanh(x):
    return np.tanh(x)  # Built-in function

# Or manual implementation (note: np.exp(2 * x) overflows for large x;
# see the numerically stable version below)
def tanh_manual(x):
    exp_2x = np.exp(2 * x)
    return (exp_2x - 1) / (exp_2x + 1)
Numerical Stability Avoiding overflow in the exponential form (np.tanh itself is already overflow-safe):
def tanh_stable(x):
    # Clip inputs before exponentiating; tanh has already saturated to ±1 by |x| ≈ 20
    x_clipped = np.clip(x, -20, 20)
    exp_2x = np.exp(2 * x_clipped)
    return (exp_2x - 1) / (exp_2x + 1)
Efficient Gradient Computation Using function value for gradient:
def tanh_gradient(x):
    tanh_x = np.tanh(x)
    return 1 - tanh_x**2
Training Considerations
Initialization Weight initialization with tanh:
- Xavier/Glorot: Derived for symmetric, saturating activations such as tanh
- Uniform distribution: [-√(6/(fan_in + fan_out)), √(6/(fan_in + fan_out))]
- Normal distribution: std = √(2/(fan_in + fan_out))
- Avoid poor initialization: Prevents saturation
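A minimal sketch of the Xavier/Glorot schemes above; glorot_uniform and glorot_normal are illustrative helper names, with fan_in and fan_out denoting the layer's input and output widths:
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    # U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def glorot_normal(fan_in, fan_out, rng=None):
    # N(0, std^2) with std = sqrt(2 / (fan_in + fan_out))
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = glorot_uniform(256, 128)  # keeps tanh pre-activations away from the saturated tails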
Learning Rates Rate selection for tanh networks:
- Moderate rates: Tanh networks usually need lower learning rates than ReLU networks
- Adaptive optimizers: Adam, RMSprop work well
- Scheduling: Learning rate decay helpful
- Gradient clipping: May be necessary for RNNs
Regularization Combining with regularization techniques:
- Dropout: Applied after tanh activation
- Batch normalization: Can help with saturation
- Weight decay: L2 regularization on weights
- Gradient clipping: For RNN applications
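A sketch of gradient clipping by global norm, as commonly used when training tanh RNNs; the function name and the max_norm value are illustrative:
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # Rescale all gradients together if their joint L2 norm exceeds max_norm
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.full((3, 3), 10.0), np.full(3, 10.0)]
clipped = clip_by_global_norm(grads)  # joint norm now at most 5.0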
Comparison with Other Activations
Tanh vs Sigmoid Key differences:
- Range: tanh (-1,1) vs sigmoid (0,1)
- Zero-centered: tanh yes, sigmoid no
- Gradients: tanh stronger maximum gradient
- Convergence: tanh often faster
Tanh vs ReLU Modern comparison:
- Vanishing gradients: ReLU solves, tanh suffers
- Computational cost: ReLU much faster
- Sparsity: ReLU promotes sparsity, tanh doesn’t
- Deep networks: ReLU enables deeper architectures
Tanh vs Modern Activations Against contemporary alternatives:
- Smoothness: Tanh is smooth everywhere; ReLU is not differentiable at zero
- Performance: Modern activations often better
- Specialized uses: Tanh still useful in specific contexts
- Historical: Tanh important historically
Modern Usage
Specialized Applications Current tanh usage:
- RNN gates: Combined with sigmoid in LSTM
- Output normalization: When bounded outputs needed
- Feature scaling: Explicit feature normalization
- Control systems: Bounded control signals
Hybrid Approaches Combining with other functions:
- ReLU-tanh: ReLU in hidden layers, tanh at the output layer
- Attention mechanisms: Tanh in some attention variants
- Generative models: Output layers in GANs
- Ensemble: Mixed activation strategies
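When a generator's output layer uses tanh (as in DCGAN-style models), training images are typically rescaled into [-1, 1] and generated samples mapped back; the helper names below are illustrative:
import numpy as np

def to_tanh_range(pixels):
    # uint8 pixels in [0, 255] -> floats in [-1, 1], matching a tanh output layer
    return pixels.astype(np.float32) / 127.5 - 1.0

def from_tanh_range(outputs):
    # tanh outputs in (-1, 1) -> uint8 pixels in [0, 255]
    return np.clip(np.rint((outputs + 1.0) * 127.5), 0, 255).astype(np.uint8)

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
assert np.array_equal(from_tanh_range(to_tanh_range(img)), img)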
Best Practices
When to Use Tanh Appropriate scenarios:
- RNN hidden states: Traditional choice
- Bounded outputs: When output range matters
- Zero-centered needs: When centering important
- Smooth gradients: When smoothness required
Implementation Guidelines
- Use proper initialization schemes
- Monitor for saturation during training
- Consider numerical stability
- Apply appropriate regularization
Training Strategies
- Use moderate learning rates
- Apply gradient clipping for RNNs
- Monitor activation distributions
- Consider batch normalization
Migration Strategies Moving from tanh to modern activations:
- Start simple: Replace with ReLU first
- Compare performance: Validate improvements
- Gradual replacement: Layer-by-layer migration
- Specialized uses: Keep tanh where appropriate
While tanh has been largely superseded by ReLU for general use, it remains valuable for specific applications requiring zero-centered, bounded activations, particularly in recurrent networks and output layers where its mathematical properties provide distinct advantages.