Cosine similarity in statistics.
In statistics, cosine similarity is the cosine of the angle between two non-zero n-dimensional vectors. It is computed as the dot product of the two vectors divided by the product of their lengths (magnitudes). The smaller the angle between the vectors, the higher the cosine similarity: it ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). It depends only on the angle between the two vectors, not on their magnitudes.
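As a minimal sketch of the definition above (the helper name `cosine_similarity` and the example vectors are chosen here for illustration only), the value can be computed with NumPy, and rescaling either vector should leave it unchanged:

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# illustrative vectors (not from the original text)
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 1.0, 0.5])

# similarity of the original vectors
sim = cosine_similarity(u, v)

# rescaling changes the magnitudes but not the angle
sim_scaled = cosine_similarity(10 * u, 0.1 * v)

print(sim, sim_scaled)
```

The two printed values should agree up to floating-point error, since only the angle between the vectors enters the formula.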
Cosine similarity can be used together with classical (Pearson) correlation to understand the relation between data. The two give very close results when both variables are mean-centered and differ otherwise, because Pearson correlation is the cosine similarity of the mean-centered vectors. The intuition behind the values is much the same as for classical correlation.
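The link to mean-centering can be checked directly. The sketch below assumes synthetic data with non-zero means (the seed, sample size, and coefficients are arbitrary choices, not from the original) and compares Pearson correlation with cosine similarity before and after subtracting the means:

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# synthetic data; the seed and coefficients are arbitrary assumptions
rng = np.random.default_rng(0)
x = rng.standard_normal(500) + 1.0        # shifted away from zero mean
y = 0.5 * x + rng.standard_normal(500)

# classical (Pearson) correlation
pearson = np.corrcoef(x, y)[0, 1]

# cosine similarity on the raw, off-center data
cos_raw = cosine_similarity(x, y)

# cosine similarity after mean-centering both variables
cos_centered = cosine_similarity(x - x.mean(), y - y.mean())

print(pearson, cos_raw, cos_centered)
```

The centered cosine similarity should match the Pearson coefficient up to floating-point error, while the raw value is pulled away from it by the non-zero means.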
Cosine similarity vs. classical correlation:
import matplotlib.pyplot as plt
import numpy as np
from scipy import spatial

# range of requested correlation coefficients
rs = np.linspace(-1,1,100)

# sample size
N = 500

# initialize output matrix
corrs = np.zeros((len(rs),2))

# loop over a range of r values
for ri in range(len(rs)):

    # generate data
    x = np.random.randn(N)
    y = x*rs[ri] + np.random.randn(N)*np.sqrt(1-rs[ri]**2)

    # optional mean-off-centering
    x = x+1
    #y = y+10

    # compute correlation
    corrs[ri,0] = np.corrcoef(x,y)[0,1]

    # compute cosine similarity
    cs_num = sum(x*y)
    cs_den = np.sqrt(sum(x*x)) * np.sqrt(sum(y*y))
    corrs[ri,1] = cs_num / cs_den

    # using built-in distance function
    #corrs[ri,1] = 1-spatial.distance.cosine(x,y)

## visualize the results
plt.plot(rs,corrs[:,0],'s-',label='Correlation')
plt.plot(rs,corrs[:,1],'s-',label='Cosine sim.')
plt.legend()
plt.xlabel('Requested correlation')
plt.ylabel('Empirical correlation')
plt.axis('square')
plt.show()

plt.plot(corrs[:,0],corrs[:,1],'ks')
plt.axis('square')
plt.xlabel('Correlation')
plt.ylabel('Cosine similarity')
plt.show()
Correlation between cosine similarity and classical correlation:
# their empirical correlation
np.corrcoef(corrs.T)

array([[1.        , 0.99772192],
       [0.99772192, 1.        ]])