Data Science Starts with NumPy.

Why is NumPy important?


To start a data science journey with Python, we must be armed with the NumPy library. NumPy (sometimes written Numpy) is a linear algebra library and a fundamental package for data analysis in Python. It provides an efficient way to work with large arrays and matrices of numerical data. One of the main advantages of using NumPy for data science is its speed and efficiency: its core is written in C and it supports vectorized operations, which makes array computations much faster than equivalent pure-Python loops.
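
To make the speed claim concrete, here is a minimal sketch that computes the same sum of squares with a plain Python loop and with one vectorized NumPy call. The function name `loop_sum_of_squares` and the array size are illustrative assumptions, not part of NumPy.

```python
import numpy as np

# Illustrative input: 100,000 floats (the size is an arbitrary choice).
values = np.arange(100_000, dtype=np.float64)


def loop_sum_of_squares(xs):
    """Sum of squares using an explicit Python loop (the slow path)."""
    total = 0.0
    for x in xs:
        total += x * x
    return total


# Vectorized version: the loop runs inside NumPy's compiled code.
vectorized_result = float(np.dot(values, values))
loop_result = float(loop_sum_of_squares(values))

# Both approaches should agree on the result.
print(np.isclose(loop_result, vectorized_result))
```

Timing both versions, for example with the standard `timeit` module, typically shows the vectorized call to be much faster, although the exact ratio depends on the machine and the array size.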


Python Knowledge Base: Make coding great again.
- Updated: 2024-07-26 by Andrey BRATUS, Senior Data Analyst.




    Additionally, NumPy has a vast array of mathematical functions and operations, including linear algebra and Fourier transforms, making it a powerful tool for data analysis. NumPy also integrates seamlessly with other Python libraries, such as Pandas and Matplotlib, to provide a complete data analysis toolkit.
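
As a small illustration of the linear algebra and Fourier transform support mentioned above, the sketch below solves a 2x2 linear system and finds the dominant frequency of a sine wave. The system, the 5 Hz signal and the 64 Hz sampling rate are made-up example values.

```python
import numpy as np

# Linear algebra: solve 2x + y = 5 and x + 3y = 10.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])
solution = np.linalg.solve(A, b)

# Fourier transform: a 5 Hz sine sampled at 64 Hz for one second.
sample_rate = 64
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 5 * t)

spectrum = np.abs(np.fft.rfft(signal))               # magnitude per frequency bin
freqs = np.fft.rfftfreq(sample_rate, d=1 / sample_rate)
peak_hz = freqs[np.argmax(spectrum)]                 # frequency with the largest magnitude

print(solution, peak_hz)
```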

    NumPy is also highly adaptable, being used in a wide range of fields beyond data science, such as physics, engineering, and finance. The package is constantly evolving, with new features and functions being added regularly. NumPy's open-source nature and large community of developers ensure its continued growth and improvement. NumPy is also highly compatible with other data science tools, such as Jupyter Notebooks and Anaconda, making it a popular choice for data science projects.

    Looking into the future, NumPy is expected to continue to play a critical role in data science and analytics. As the amount of data generated by businesses and organizations continues to grow, the need for efficient and scalable data analysis tools will only increase. NumPy's speed and efficiency make it a top choice for handling large datasets and performing complex mathematical operations. Additionally, newer libraries designed for data science and machine learning, such as TensorFlow and PyTorch, interoperate with NumPy arrays and extend what can be built on top of them. Overall, NumPy's adaptability, efficiency, and versatility make it a reliable choice for data science projects both now and in the future.

    It is highly recommended that you install Python using the Anaconda distribution, so that all underlying dependencies stay in sync when you install packages with conda. Once NumPy is installed, you can import it as a library:


    
    # Run in a terminal (not inside Python):
    conda install numpy
    
    # Then, in Python:
    import numpy as np
    


  1. NumPy Input/Output:


  2. Saving and loading arrays.
    a = np.arange(0, 10)
    # Save an array to a binary file in NumPy ``.npy`` format
    # (the ``.npy`` extension is appended automatically).
    np.save('my_array', a)
    # Load arrays or pickled objects from ``.npy``, ``.npz`` or pickled files.
    np.load('my_array.npy')
    # Load data from a text file.
    np.loadtxt('myfile.txt')
    # Load data from a text file, with missing values handled as specified.
    np.genfromtxt('myfile.csv', delimiter=',')
    # Save an array to a text file.
    np.savetxt('myarray.txt', a, delimiter=' ')
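
The calls above assume the files already exist in the working directory. As a self-contained sketch, the round trip below writes to a temporary directory and reads the data back. The file names and the 2x5 table are illustrative assumptions.

```python
import os
import tempfile

import numpy as np

a = np.arange(0, 10)
table = a.reshape(2, 5)  # 2D data so the delimiter actually matters

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "my_array.npy")
    csv_path = os.path.join(tmp, "my_table.csv")

    # Binary round trip: exact dtype and shape are preserved.
    np.save(npy_path, a)
    from_npy = np.load(npy_path)

    # Text round trip: the format and delimiter must match on both sides.
    np.savetxt(csv_path, table, delimiter=",", fmt="%d")
    from_csv = np.loadtxt(csv_path, delimiter=",", dtype=int)

print(np.array_equal(from_npy, a), np.array_equal(from_csv, table))
```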
    

  3. Creating Arrays:


  4. Creating 1D array.

    
    d1 = np.array([2,3,5])
    d1
    

    OUT: array([2, 3, 5])


    Creating 2D array.

    
    d2 = np.array([(2.2,3,5), (1.4,6,5)], dtype=float)
    d2
    

    OUT: array([[2.2, 3. , 5. ],
    [1.4, 6. , 5. ]])


    Creating 3D array.

    
    d3 = np.array([[(2.2,3,5), (1.4,6,5)], [(2.2,3,5), (1.4,6,5)]], dtype=float)
    d3
    

    OUT: array([[[2.2, 3. , 5. ],
    [1.4, 6. , 5. ]],

    [[2.2, 3. , 5. ],
    [1.4, 6. , 5. ]]])



    Creating an array of zeros.

    
    z = np.zeros((3,3))
    z 
    

    OUT: array([[0., 0., 0.],
    [0., 0., 0.],
    [0., 0., 0.]])


    Creating an array of ones.

    
    one = np.ones((3,3))
    one
    

    OUT: array([[1., 1., 1.],
    [1., 1., 1.],
    [1., 1., 1.]])


    Creating an array of evenly spaced values with a given step.

    
    evenly = np.arange(5,50,5)
    evenly
    

    OUT: array([ 5, 10, 15, 20, 25, 30, 35, 40, 45])


    Creating an array of evenly spaced values by defining the number of samples.

    
    evenlyn = np.linspace(0,40,5)
    evenlyn
    

    OUT: array([ 0., 10., 20., 30., 40.])
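
The two functions differ in how the end point is treated, which is easy to miss. A short sketch with arbitrary example values: `np.arange` excludes the stop value and takes a step, while `np.linspace` includes the stop value by default and takes a sample count.

```python
import numpy as np

# arange: start, stop (excluded), step
by_step = np.arange(0, 40, 10)

# linspace: start, stop (included by default), number of samples
by_count = np.linspace(0, 40, 5)

# retstep=True also returns the spacing that linspace computed
samples, spacing = np.linspace(0, 40, 5, retstep=True)

print(by_step)   # stops at 30
print(by_count)  # reaches 40
print(spacing)
```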


    Creating identity (eye) matrix.

    
    np.eye(4)
    

    OUT: array([[ 1., 0., 0., 0.],
    [ 0., 1., 0., 0.],
    [ 0., 0., 1., 0.],
    [ 0., 0., 0., 1.]])


    Creating constant array.

    
    np.full((2,2), 3)
    

    OUT: array([[3, 3],
    [3, 3]])


    Creating empty array of given shape without initializing entries.

    
    empty1=np.empty([2,2])
    empty1
    

    OUT: array([[2.49707101e-316, 7.22619400e+165],
    [2.57217278e+151, 2.90154876e+183]])


    Creating an array of the given shape and populating it with random samples from a uniform distribution over [0, 1).

    
    np.random.rand(2)
    

    OUT: array([ 0.11570539, 0.35279769])


    Returning a sample (or samples) from the "standard normal" distribution, unlike rand, which is uniform:

    
    np.random.randn(5,5)
    

    OUT: array([[ 0.70154515, 0.22441999, 1.33563186, 0.82872577, -0.28247509],
    [ 0.64489788, 0.61815094, -0.81693168, -0.30102424, -0.29030574],
    [ 0.8695976 , 0.413755 , 2.20047208, 0.17955692, -0.82159344],
    [ 0.59264235, 1.29869894, -1.18870241, 0.11590888, -0.09181687],
    [-0.96924265, -1.62888685, -2.05787102, -0.29705576, 0.68915542]])


    Returning random integers from low (inclusive) to high (exclusive).

    
    np.random.randint(1,100,10)
    

    OUT: (an array of 10 random integers in [1, 100); the values differ on every run)
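
Because these outputs change on every run, it can help to make them reproducible. The sketch below uses NumPy's `Generator` interface (`np.random.default_rng`, available since NumPy 1.17) with a fixed seed. The seed value 42 is an arbitrary assumption.

```python
import numpy as np

# A seeded generator produces the same sequence on every run.
rng = np.random.default_rng(seed=42)

uniform = rng.random(2)                   # uniform over [0, 1), like rand
normal = rng.standard_normal((5, 5))      # standard normal, like randn
integers = rng.integers(1, 100, size=10)  # low inclusive, high exclusive

# A second generator with the same seed repeats the first draw exactly.
repeat = np.random.default_rng(seed=42).random(2)

print(np.array_equal(uniform, repeat))
```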


  5. Array Attributes and Methods:


  6. Let's set up the initial arrays first.

    
    arr = np.arange(25)
    ranarr = np.random.randint(0,50,10)
    arr
    

    OUT: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
    17, 18, 19, 20, 21, 22, 23, 24])


    
    ranarr
    

    OUT: array([10, 12, 41, 17, 49, 2, 46, 3, 19, 39])


    Reshape - returns an array containing the same data with a new shape.

    
    arr.reshape(5,5)
    

    OUT: array([[ 0, 1, 2, 3, 4],
    [ 5, 6, 7, 8, 9],
    [10, 11, 12, 13, 14],
    [15, 16, 17, 18, 19],
    [20, 21, 22, 23, 24]])
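
Two details of `reshape` are worth a quick sketch: one dimension can be given as -1 so NumPy infers it, and a shape whose size does not match the number of elements raises a `ValueError`. The variable names here are illustrative.

```python
import numpy as np

arr = np.arange(25)

# -1 asks NumPy to infer the missing dimension (25 / 5 = 5 columns).
grid = arr.reshape(5, -1)

# 4 * 6 = 24 does not match 25 elements, so this fails.
error_message = None
try:
    arr.reshape(4, 6)
except ValueError as exc:
    error_message = str(exc)

print(grid.shape)
print(error_message)
```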


    max, min, argmax, argmin: the largest and smallest values and their index positions.

    
    ranarr.max()
    

    OUT: 49


    
    ranarr.argmax()
    

    OUT: 4


    
    ranarr.min()
    

    OUT: 2


    
    ranarr.argmin()
    

    OUT: 5
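
To show how the value methods and the index methods relate, here is a small sketch on a fixed array, plus the `axis` argument for 2D data. The arrays `scores` and `table` are made-up examples.

```python
import numpy as np

scores = np.array([3, 9, 1, 7])

best_index = scores.argmax()   # position of the largest value
worst_index = scores.argmin()  # position of the smallest value

# Indexing with argmax/argmin gives back max/min.
assert scores[best_index] == scores.max()
assert scores[worst_index] == scores.min()

table = np.array([[3, 9, 1],
                  [8, 2, 6]])
col_max = table.max(axis=0)        # largest value in each column
row_argmax = table.argmax(axis=1)  # column index of the largest value in each row

print(best_index, worst_index, col_max, row_argmax)
```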


    Shape is an attribute that arrays have (not a method):

    
    # Vector
    arr.shape
    

    OUT: (25,)


    
    arr.reshape(1,25)
    

    OUT: array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
    17, 18, 19, 20, 21, 22, 23, 24]])


    
    arr.reshape(1,25).shape
    

    OUT: (1, 25)


    
    arr.reshape(25,1)
    

    OUT: array([[ 0],
    [ 1],
    [ 2],
    [ 3],
    [ 4],
    [ 5],
    [ 6],
    [ 7],
    [ 8],
    [ 9],
    [10],
    [11],
    [12],
    [13],
    [14],
    [15],
    [16],
    [17],
    [18],
    [19],
    [20],
    [21],
    [22],
    [23],
    [24]])


    
    arr.reshape(25,1).shape
    

    OUT: (25, 1)


    You can also get the data type of the elements in the array:


    
    arr.dtype
    

    OUT: dtype('int64')
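
The dtype is inferred from the data unless it is set explicitly, and it can be converted with `astype`. A minimal sketch with made-up values follows. Note that the default integer type can be platform dependent, so the example checks the kind of type and not an exact width.

```python
import numpy as np

ints = np.array([1, 2, 3])        # inferred as an integer dtype
floats = np.array([1, 2.5, 3])    # one float upcasts the whole array
as_float = ints.astype(np.float64)  # explicit conversion returns a new array

print(ints.dtype, floats.dtype, as_float.dtype)
```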


  7. NumPy Indexing and Selection:


  8. Let's set up the initial array first.

    
    arr = np.arange(0,11)
    arr
    

    OUT: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])


    The simplest way to pick one or several elements of an array looks very similar to indexing Python lists:

    
    arr[8]
    

    OUT: 8


    
    arr[1:5]
    

    OUT: array([1, 2, 3, 4])


    
    arr[0:5]
    

    OUT: array([0, 1, 2, 3, 4])


    Broadcasting: assigning a single value to a slice sets every element in that slice.

    
    arr[0:5]=100
    arr
    

    OUT: array([100, 100, 100, 100, 100, 5, 6, 7, 8, 9, 10])
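
One consequence worth sketching: a basic slice is a view onto the original data, not a copy, so broadcasting a value into the slice changes the original array. When an independent array is needed, `copy()` provides one. The variable names are illustrative.

```python
import numpy as np

arr = np.arange(0, 11)

# A slice is a view: writing through it modifies arr itself.
view = arr[0:5]
view[:] = 100

# copy() creates independent data that can be changed safely.
independent = arr.copy()
independent[:] = -1

print(arr)
print(np.shares_memory(view, arr), np.shares_memory(independent, arr))
```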


    Indexing a 2D array (matrices).

    
    arr_2d = np.array(([5,10,15],[20,25,30],[35,40,45]))
    arr_2d
    

    OUT: array([[ 5, 10, 15],
    [20, 25, 30],
    [35, 40, 45]])


    
    #Indexing row
    arr_2d[1]
    

    OUT: array([20, 25, 30])


    
    # Format is arr_2d[row][col] or arr_2d[row,col]
    # Getting individual element value
    arr_2d[1][0]
    

    OUT: 20


    
    # Getting individual element value
    arr_2d[1,0]
    

    OUT: 20


    
    # 2D array slicing
    #Shape (2,2) from top right corner
    arr_2d[:2,1:]
    

    OUT: array([[10, 15],
    [25, 30]])


    
    # Bottom row
    arr_2d[2]
    

    OUT: array([35, 40, 45])


    
    # Bottom row, using explicit slice notation
    arr_2d[2,:]
    

    OUT: array([35, 40, 45])
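
The same `[row, col]` notation also selects columns and handles negative indices. A short sketch on the same 3x3 matrix, where the result names are illustrative:

```python
import numpy as np

arr_2d = np.array([[5, 10, 15],
                   [20, 25, 30],
                   [35, 40, 45]])

middle_column = arr_2d[:, 1]   # every row, column 1
last_two_rows = arr_2d[1:, :]  # rows 1 and 2, every column
corner = arr_2d[-1, -1]        # negative indices count from the end

print(middle_column)
print(last_two_rows)
print(corner)
```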


    Fancy Indexing.

    
    # Set up a 10x10 matrix of zeros
    arr2d = np.zeros((10,10))
    
    # Number of rows
    n_rows = arr2d.shape[0]
    
    # Fill each row with its own row index
    
    for i in range(n_rows):
        arr2d[i] = i
        
    arr2d
    

    OUT: array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
    [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
    [ 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
    [ 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
    [ 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
    [ 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
    [ 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
    [ 7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
    [ 8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
    [ 9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])


    
    arr2d[[2,4,6,8]]
    

    OUT: array([[ 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
    [ 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
    [ 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
    [ 8., 8., 8., 8., 8., 8., 8., 8., 8., 8.]])


    
    # Rows can be requested in any order
    arr2d[[6,4,2,7]]
    

    OUT: array([[ 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
    [ 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
    [ 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
    [ 7., 7., 7., 7., 7., 7., 7., 7., 7., 7.]])
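
Fancy indexing also accepts index lists for both axes. As a hedged sketch on a made-up 4x4 grid: two lists are paired element by element, while `np.ix_` selects the full block where the listed rows and columns cross.

```python
import numpy as np

grid = np.arange(16).reshape(4, 4)

# One list: whole rows, in the requested order.
rows = grid[[3, 0]]

# Two lists: paired up, giving elements (0, 1) and (2, 3).
pairs = grid[[0, 2], [1, 3]]

# np.ix_: the sub-matrix of rows 0 and 2 crossed with columns 1 and 3.
block = grid[np.ix_([0, 2], [1, 3])]

print(rows)
print(pairs)
print(block)
```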


    Selection.

    
    arr = np.arange(1,11)
    arr
    

    OUT: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])


    
    arr > 4
    

    OUT: array([False, False, False, False, True, True, True, True, True, True], dtype=bool)


    
    bool_arr = arr>4
    bool_arr
    

    OUT: array([False, False, False, False, True, True, True, True, True, True], dtype=bool)


    
    arr[bool_arr]
    

    OUT: array([ 5, 6, 7, 8, 9, 10])


    
    arr[arr>2]
    

    OUT: array([ 3, 4, 5, 6, 7, 8, 9, 10])
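
Conditions can be combined with `&` (and) and `|` (or), with each comparison wrapped in parentheses, and `np.where` picks values element by element based on a condition. A minimal sketch with illustrative thresholds:

```python
import numpy as np

arr = np.arange(1, 11)

# Parentheses are required: & and | bind tighter than comparisons.
middle = arr[(arr > 2) & (arr < 8)]
even_or_big = arr[(arr % 2 == 0) | (arr > 8)]

# np.where(condition, value_if_true, value_if_false)
capped = np.where(arr > 5, 5, arr)

print(middle)
print(even_or_big)
print(capped)
```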


  9. NumPy Operations:


  10. You can easily perform array-with-array arithmetic or scalar-with-array arithmetic.

    
    arr = np.arange(0,10)
    arr + arr
    

    OUT: array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])


    
    arr * arr
    

    OUT: array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81])


    
    arr - arr
    

    OUT: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])


    
    # Dividing zero by zero gives a warning, not an error; the result is nan
    arr/arr
    

    OUT: /Users/user/ipykernel/__main__.py:1: RuntimeWarning: invalid value encountered in true_divide
    if __name__ == '__main__':
    array([ nan, 1., 1., 1., 1., 1., 1., 1., 1., 1.])


    
    # Likewise, dividing a nonzero number by zero gives a warning and the result is inf
    1/arr
    

    OUT: /Users/user/ipykernel/__main__.py:1: RuntimeWarning: divide by zero encountered in true_divide
    if __name__ == '__main__':
    array([ inf, 1. , 0.5 , 0.33333333, 0.25 ,
    0.2 , 0.16666667, 0.14285714, 0.125 , 0.11111111])
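
If the warnings are expected, they can be silenced locally with `np.errstate`, or the division by zero can be avoided with the `where` argument of `np.divide`. This is a sketch of two possible approaches. The fallback value 0.0 for the skipped position is an arbitrary assumption.

```python
import numpy as np

arr = np.arange(0, 10)

# Option 1: suppress the warnings only inside this block.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = arr / arr   # 0/0 -> nan
    inverse = 1 / arr   # 1/0 -> inf

# Option 2: divide only where arr is nonzero; other positions keep
# the value already stored in `out` (0.0 here).
safe_inverse = np.divide(1.0, arr, out=np.zeros(arr.shape), where=arr != 0)

print(ratio[0], inverse[0], safe_inverse[0])
```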


    
    arr**3
    

    OUT: array([ 0, 1, 8, 27, 64, 125, 216, 343, 512, 729])
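
The examples above are array-with-array operations. For completeness, here is a short sketch of scalar-with-array arithmetic, plus a 1D row combined with a 2D matrix, which NumPy handles by broadcasting. The arrays `matrix` and `row` are made-up examples.

```python
import numpy as np

arr = np.arange(0, 10)

# The scalar is applied to every element.
shifted = arr + 100
doubled = arr * 2

# A row of length 3 is broadcast across each row of a 2x3 matrix.
matrix = np.arange(6).reshape(2, 3)
row = np.array([10, 20, 30])
combined = matrix + row

print(shifted)
print(doubled)
print(combined)
```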


  11. Universal Array Functions:


  12. Taking Square Roots.

    
    np.sqrt(arr)
    

    OUT: array([ 0. , 1. , 1.41421356, 1.73205081, 2. ,
    2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])


    Calculating the exponential (e^x).

    
    np.exp(arr)
    

    OUT: array([ 1.00000000e+00, 2.71828183e+00, 7.38905610e+00,
    2.00855369e+01, 5.45981500e+01, 1.48413159e+02,
    4.03428793e+02, 1.09663316e+03, 2.98095799e+03, 8.10308393e+03])


    
    np.max(arr) 
    #same as arr.max()
    

    OUT: 9


    
    np.sin(arr)
    

    OUT: array([ 0. , 0.84147098, 0.90929743, 0.14112001, -0.7568025 ,
    -0.95892427, -0.2794155 , 0.6569866 , 0.98935825, 0.41211849])


    
    np.log(arr)
    

    OUT: /Users/user/ipykernel/__main__.py:1: RuntimeWarning: divide by zero encountered in log
    if __name__ == '__main__':
    array([ -inf, 0. , 0.69314718, 1.09861229, 1.38629436,
    1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458])
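
A universal function applies the same operation to every element without an explicit loop. The sketch below checks `np.sqrt` against a loop over `math.sqrt`, and shows one way to avoid the `log(0)` warning by selecting only the positive values first. This is an illustrative choice, not the only option.

```python
import math

import numpy as np

arr = np.arange(0, 10)

# Element-wise ufunc versus the equivalent explicit loop.
roots = np.sqrt(arr)
loop_roots = np.array([math.sqrt(x) for x in arr])

# Take the logarithm only where it is defined (arr > 0).
positive = arr[arr > 0]
logs = np.log(positive)

print(np.allclose(roots, loop_roots))
print(logs)
```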




