Cosine similarity in statistics.


In statistics, cosine similarity is the cosine of the angle between two non-zero vectors in an n-dimensional space. It is computed as the dot product of the two vectors divided by the product of their lengths (magnitudes). The smaller the angle between the vectors, the higher the cosine similarity: it is 1 for vectors pointing in the same direction, 0 for orthogonal vectors, and -1 for vectors pointing in opposite directions. The value depends only on the angle between the vectors, not on their magnitudes.
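The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper name cosine_similarity and the example vectors u, v and w are made up for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        # the angle is undefined if either vector has zero length
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / norm_product)

u = np.array([1.0, 2.0, 3.0])
v = 2 * u                        # same direction as u, twice the length
w = np.array([-3.0, 0.0, 1.0])   # orthogonal to u: 1*(-3) + 2*0 + 3*1 = 0

print(cosine_similarity(u, v))   # close to 1: same direction, length is ignored
print(cosine_similarity(u, w))   # close to 0: orthogonal vectors
print(cosine_similarity(u, -u))  # close to -1: opposite directions
```

Scaling either vector by a positive constant leaves the result unchanged, which is the sense in which the measure ignores magnitude.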

Cosine similarity with Python.


Cosine similarity can be used alongside the classical (Pearson) correlation to understand the relationship between two variables. The two measures coincide when both variables are mean-centered, are very close when the means are only approximately zero, and differ otherwise. Even when they differ, the calculated values are interpreted in much the same way as a classical correlation.
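The link between the two measures can be checked directly: the Pearson correlation is the cosine similarity of the mean-centered data. The snippet below is a small sketch under assumed settings (the seed, the sample size, the slope of 0.5 and the offsets are arbitrary choices for illustration, and cosine_similarity is a hypothetical helper).

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
x = rng.standard_normal(500) + 1.0                # mean near 1
y = 0.5 * x + rng.standard_normal(500) + 10.0     # mean near 10.5

def cosine_similarity(a, b):
    # dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

pearson_r = np.corrcoef(x, y)[0, 1]
cs_raw = cosine_similarity(x, y)                            # data as generated
cs_centered = cosine_similarity(x - x.mean(), y - y.mean()) # means removed

print(pearson_r, cs_centered)  # these two agree up to floating-point error
print(cs_raw)                  # this one differs because of the offsets
```

Subtracting the sample means first makes the cosine similarity match the correlation coefficient; without that step, the offsets pull the two vectors toward a common direction and change the result.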



Cosine similarity VS classical correlation:



import matplotlib.pyplot as plt
import numpy as np
from scipy import spatial

# range of requested correlation coefficients
rs = np.linspace(-1,1,100)

# sample size
N = 500


# initialize output matrix
corrs = np.zeros((len(rs),2))


# loop over a range of r values
for ri in range(len(rs)):
    
    # generate data
    x = np.random.randn(N)
    y = x*rs[ri] + np.random.randn(N)*np.sqrt(1-rs[ri]**2)
    
    # optional: shift the data away from zero mean
    # (comment out both lines to keep x and y mean-centered)
    x = x+1
    #y = y+10
    
    
    # compute correlation
    corrs[ri,0] = np.corrcoef(x,y)[0,1]
    
    # compute cosine similarity: dot product over the product of vector lengths
    cs_num = np.dot(x,y)
    cs_den = np.sqrt(np.dot(x,x)) * np.sqrt(np.dot(y,y))
    corrs[ri,1] = cs_num / cs_den
    
    # using built-in distance function
    #corrs[ri,1] = 1-spatial.distance.cosine(x,y)

# visualize the results

plt.plot(rs,corrs[:,0],'s-',label='Correlation')
plt.plot(rs,corrs[:,1],'s-',label='Cosine sim.')
plt.legend()
plt.xlabel('Requested correlation')
plt.ylabel('Empirical correlation')
plt.axis('square')
plt.show()


plt.plot(corrs[:,0],corrs[:,1],'ks')
plt.axis('square')
plt.xlabel('Correlation')
plt.ylabel('Cosine similarity')
plt.show()

Figure: cosine similarity vs. classical correlation.


Correlation between Cosine similarity and classical correlation:



# their empirical correlation
np.corrcoef(corrs.T)

OUT (exact values vary from run to run, since the data are random):
array([[1.        , 0.99772192],
       [0.99772192, 1.        ]])
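The near-perfect correlation between the two measures, despite x being shifted by 1, has a simple explanation. With x = z + 1, where z is standard normal, and y built to have correlation r with z, the average of x*y approaches r, the average of x*x approaches 2, and the average of y*y approaches 1. The cosine similarity therefore approaches r / sqrt(2): a straight line in r, only with a shallower slope. The sketch below checks this for one value of r; the seed, the large sample size and r = 0.6 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
N = 200_000                      # large sample so the estimate is stable
r = 0.6                          # requested correlation (illustrative value)

# same data-generating recipe as in the simulation, with x shifted by 1
z = rng.standard_normal(N)
y = r * z + np.sqrt(1 - r**2) * rng.standard_normal(N)
x = z + 1

cs = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
predicted = r / np.sqrt(2)       # large-sample value derived above

print(cs, predicted)             # the two should be close
```

Because the relationship is linear in r, the two columns of corrs are almost perfectly correlated even though their values differ. Shifting y as well would change this relationship.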




See also related topics: