Distance

Distance is a mathematical concept that quantifies how far apart two objects, points, vectors, or data samples are in a given space. Distance metrics are fundamental to many machine learning algorithms, providing the basis for clustering, classification, similarity search, and optimization.

Mathematical Properties

Metric Properties A true distance metric must satisfy:

  • Non-negativity: d(x,y) ≥ 0
  • Identity: d(x,y) = 0 if and only if x = y
  • Symmetry: d(x,y) = d(y,x)
  • Triangle inequality: d(x,z) ≤ d(x,y) + d(y,z)
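
To make the four properties concrete, here is a minimal Python sketch (assuming NumPy is available) that spot-checks them for the Euclidean distance on a few random points:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean (L2) distance between two vectors."""
    return np.sqrt(np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))  # three random 5-dimensional points

assert euclidean(x, y) >= 0                                  # non-negativity
assert np.isclose(euclidean(x, x), 0.0)                      # identity
assert np.isclose(euclidean(x, y), euclidean(y, x))          # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)  # triangle inequality
```

Passing such checks on samples does not prove the properties in general, but it is a useful sanity test for a custom distance function.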

Pseudo-Metrics and Other Relaxations May violate some metric properties:

  • Pseudo-metrics: allow d(x,y) = 0 for distinct points x ≠ y
  • Semi-metrics: may not satisfy the triangle inequality
  • Quasi-metrics: may not be symmetric
  • Still useful in many applications
  • Examples include KL divergence, which is neither symmetric nor bound by the triangle inequality

Common Distance Metrics

Euclidean Distance (L2) Most common geometric distance:

  • d(x,y) = √(Σᵢ (xᵢ - yᵢ)²)
  • Natural intuitive interpretation
  • Sensitive to outliers and scale
  • Standard choice for continuous data

Manhattan Distance (L1) City block or taxicab distance:

  • d(x,y) = Σᵢ |xᵢ - yᵢ|
  • More robust to outliers
  • Good for high-dimensional sparse data
  • Used in LASSO regularization

Chebyshev Distance (L∞) Maximum coordinate difference:

  • d(x,y) = maxᵢ |xᵢ - yᵢ|
  • Useful for game theory and optimization
  • Less common in machine learning
  • Captures worst-case differences
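
The three distances above are the p = 2, p = 1, and p = ∞ members of the Minkowski family. A minimal NumPy sketch (illustrative values only) computing all three:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # L2: sqrt(9 + 4 + 0) ≈ 3.606
manhattan = np.sum(np.abs(x - y))          # L1: 3 + 2 + 0 = 5
chebyshev = np.max(np.abs(x - y))          # L∞: max(3, 2, 0) = 3

print(euclidean, manhattan, chebyshev)
```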

Specialized Distance Metrics

Cosine Distance Based on the angle between two vectors:

  • d(x,y) = 1 - cos(θ) = 1 - (x·y)/(||x|| × ||y||)
  • Ignores vector magnitude
  • Popular for text and high-dimensional data
  • Related to cosine similarity
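
A short NumPy sketch showing that cosine distance ignores magnitude (the vectors here are arbitrary examples):

```python
import numpy as np

def cosine_distance(x, y):
    """1 - cosine similarity; near 0 for vectors pointing the same way."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice the magnitude
c = np.array([0.0, 0.0, 1.0])   # orthogonal to a

print(cosine_distance(a, b))    # ≈ 0.0 (magnitude is ignored)
print(cosine_distance(a, c))    # 1.0 (90-degree angle)
```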

Hamming Distance For categorical or binary data:

  • d(x,y) = number of positions where x and y differ
  • Used for error-correcting codes
  • Good for binary feature vectors
  • Applications in genetics and strings
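
A minimal implementation for equal-length sequences (the strings and bit vectors below are arbitrary examples):

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))

print(hamming_distance("karolin", "kathrin"))        # 3
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```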

Mahalanobis Distance Accounts for data covariance:

  • d(x,y) = √((x-y)ᵀ S⁻¹ (x-y))
  • Where S is the covariance matrix of the data
  • Normalizes for feature correlations
  • Useful when features have different scales
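
A NumPy sketch of the computation on synthetic data (S is estimated from the sample; SciPy also offers scipy.spatial.distance.mahalanobis):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # synthetic data: 200 points in 3-D
S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse of the sample covariance matrix

def mahalanobis(x, y, S_inv):
    """sqrt((x - y)^T S^-1 (x - y)) for a given inverse covariance."""
    d = x - y
    return np.sqrt(d @ S_inv @ d)

print(mahalanobis(X[0], X[1], S_inv))
```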

String and Sequence Distances

Levenshtein Distance Edit distance for strings:

  • Minimum edits (insertions, deletions, substitutions)
  • Used in spell checking and DNA analysis
  • Dynamic programming computation
  • Extends to sequence alignment
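
A compact dynamic-programming implementation (row by row, O(m×n) time):

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string s into string t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))            # distances from "" to each prefix of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

print(levenshtein("kitten", "sitting"))  # 3
```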

Jaccard Distance For set-based data:

  • d(A,B) = 1 - |A ∩ B| / |A ∪ B|
  • Good for sparse binary features
  • Used in recommendation systems
  • Complement of Jaccard similarity
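
A small sketch on toy recommendation data (the item sets are made up for illustration):

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two Python sets."""
    if not a and not b:
        return 0.0                        # convention: two empty sets are identical
    return 1.0 - len(a & b) / len(a | b)

liked_by_user_1 = {"item_a", "item_b", "item_c"}
liked_by_user_2 = {"item_b", "item_c", "item_d"}
print(jaccard_distance(liked_by_user_1, liked_by_user_2))   # 1 - 2/4 = 0.5
```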

Applications

Clustering Algorithms Distance-based grouping:

  • K-means uses (squared) Euclidean distance
  • Hierarchical clustering with various metrics
  • DBSCAN with ε-neighborhoods
  • Distance determines cluster membership
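
A brief illustration of distance-driven clustering on synthetic blobs (assuming scikit-learn is installed; the data and parameters are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs in 2-D
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)   # ε-neighborhood based

print(sorted(set(kmeans_labels)), sorted(set(dbscan_labels)))
```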

Nearest Neighbor Methods Classification and regression:

  • k-NN uses distance to find neighbors
  • Instance-based learning
  • Anomaly detection via distance
  • Local prediction based on proximity
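
A minimal k-NN example (assuming scikit-learn; the metric argument swaps which distance defines "nearest"):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same classifier, two different distance metrics
for metric in ("euclidean", "manhattan"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    print(metric, knn.score(X_test, y_test))
```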

Similarity Search Information retrieval applications:

  • Document similarity via cosine distance on TF-IDF vectors
  • Image retrieval with feature distances
  • Music recommendation systems
  • Product recommendation engines

Computational Considerations

Curse of Dimensionality High-dimensional distance problems:

  • Pairwise distances concentrate, so points become nearly equidistant
  • Euclidean distance loses discriminative power
  • Need specialized techniques
  • Consider dimensionality reduction
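
A small NumPy experiment illustrates the concentration effect: as dimensionality grows, the ratio of nearest to farthest distance from a query point approaches 1 (exact numbers depend on the random seed):

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, dim))         # 500 random points in the unit hypercube
    q = rng.uniform(size=dim)                # one query point
    d = np.linalg.norm(X - q, axis=1)        # Euclidean distances to the query
    print(dim, round(d.min() / d.max(), 3))  # ratio creeps toward 1 as dim grows
```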

Efficiency Optimization Faster distance computation:

  • Approximate nearest neighbor algorithms
  • Locality-sensitive hashing (LSH)
  • Tree-based indexing (KD-trees, Ball trees)
  • GPU acceleration for parallel computation
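
A sketch of tree-based indexing with SciPy's KD-tree (assuming SciPy is available; the data is synthetic):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 8))   # 10,000 points in 8 dimensions
tree = cKDTree(data)                  # build the index once

query = rng.normal(size=8)
distances, indices = tree.query(query, k=5)   # 5 nearest neighbors without a full scan
print(indices, distances)
```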

Scalability Issues Large-scale distance computation:

  • O(n²) all-pairs distance complexity
  • Sparse data optimizations
  • Distributed computation strategies
  • Sampling and approximation methods
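
For moderate n, the full pairwise matrix can still be computed directly; the sketch below (assuming SciPy) makes the quadratic memory cost explicit, which is exactly what the strategies above try to avoid at scale:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 16))
B = rng.normal(size=(800, 16))

D = cdist(A, B, metric="euclidean")   # 1000 x 800 matrix of all pairwise distances
print(D.shape, D.nbytes / 1e6, "MB")  # memory grows with the product of the set sizes
```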

Distance in Different Domains

Computer Vision Image and feature comparison:

  • Pixel-wise Euclidean distance
  • Histogram intersection distance
  • Structural similarity metrics
  • Deep feature distances

Natural Language Processing Text and semantic distances:

  • Edit distances for string matching
  • Semantic distances using embeddings
  • Topic model distances
  • Syntactic tree distances

Bioinformatics Sequence and structure comparison:

  • DNA/protein sequence alignment
  • Phylogenetic distances
  • Structure-based distances
  • Evolutionary distances

Choosing Distance Metrics

Data Type Considerations

  • Continuous: Euclidean, Manhattan, Mahalanobis
  • Binary: Hamming, Jaccard
  • Categorical: Hamming with encoding
  • Mixed: Gower distance or preprocessing
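
For mixed data, a simplified hand-rolled sketch of a Gower-style distance (an illustration, not a library implementation): numeric features contribute range-normalized absolute differences, categorical features contribute a mismatch indicator, and the results are averaged:

```python
import numpy as np

def gower_like_distance(x, y, numeric_idx, categorical_idx, ranges):
    """Simplified Gower-style distance for mixed feature vectors.

    numeric_idx / categorical_idx: positions of each feature type
    ranges: per-numeric-feature range (max - min over the dataset)
    """
    parts = []
    for i, r in zip(numeric_idx, ranges):
        parts.append(abs(x[i] - y[i]) / r if r > 0 else 0.0)
    for i in categorical_idx:
        parts.append(0.0 if x[i] == y[i] else 1.0)
    return float(np.mean(parts))

# age (numeric, dataset range assumed to be 60 years) and color (categorical)
a = [25, "red"]
b = [40, "blue"]
print(gower_like_distance(a, b, numeric_idx=[0], categorical_idx=[1], ranges=[60]))
# (|25 - 40| / 60 + 1) / 2 = 0.625
```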

Problem-Specific Factors

  • Outlier sensitivity requirements
  • Scale invariance needs
  • Computational efficiency constraints
  • Interpretability importance

Domain Knowledge

  • Feature importance weighting
  • Custom distance functions
  • Metric learning approaches
  • Expert knowledge incorporation

Distance Learning

Metric Learning Learning optimal distance functions:

  • Supervised distance metric learning
  • Contrastive learning approaches
  • Triplet loss optimization
  • Large margin nearest neighbor (LMNN)
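
As an illustration of the triplet objective, a minimal NumPy version of the margin-based triplet loss (the loss only, not a full training loop; the embeddings and margin are toy values):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin).

    Encourages the positive to sit closer to the anchor than the
    negative does, by at least the margin, in the embedding space."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # same class: should be close to the anchor
negative = np.array([2.0, 0.0])   # different class: should be far away
print(triplet_loss(anchor, positive, negative))   # 0.0, the margin is already satisfied
```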

Deep Distance Learning Neural network-based distances:

  • Siamese networks for similarity
  • Triplet networks for ranking
  • Learned embeddings with distance
  • End-to-end distance optimization

Evaluation and Validation

Distance Quality Assessment

  • Clustering quality metrics
  • Nearest neighbor accuracy
  • Retrieval precision and recall
  • Human judgment correlation

Robustness Testing

  • Sensitivity to outliers
  • Stability across data subsets
  • Performance on noisy data
  • Generalization to new domains

Best Practices

Preprocessing Recommendations

  • Feature scaling and normalization
  • Outlier detection and handling
  • Missing value imputation
  • Dimensionality reduction when needed
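
For example, standardizing features before computing distances keeps one large-scale feature from dominating (a sketch assuming scikit-learn; the data is synthetic):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature 0 is in the thousands, feature 1 is in single digits
X = np.array([[1000.0, 1.0],
              [1010.0, 9.0],
              [2000.0, 1.5]])

raw_d = np.linalg.norm(X[0] - X[1])              # dominated by feature 0
X_std = StandardScaler().fit_transform(X)
scaled_d = np.linalg.norm(X_std[0] - X_std[1])   # both features now contribute
print(raw_d, scaled_d)
```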

Implementation Guidelines

  • Choose metrics appropriate for data type
  • Consider computational constraints
  • Validate against domain knowledge
  • Test multiple metrics when uncertain

Performance Monitoring

  • Track distance distribution changes
  • Monitor computational performance
  • Validate continued relevance
  • Update distance functions as needed

Understanding distance metrics is essential for machine learning practitioners: these metrics form the foundation for measuring relationships in data and enable algorithms to make proximity-based decisions across diverse applications.
