From Data to Discovery: The Scikit-Learn Advantage.

What is Scikit-Learn?


If you work with machine learning in Python, Scikit-learn is widely considered the gold standard.
Scikit-learn is an open-source Python library that provides a wide selection of supervised and unsupervised learning algorithms. It also offers tools for preprocessing, cross-validation, and visualization, all behind a unified interface.
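
To illustrate what a unified interface means in practice, here is a minimal sketch. The bundled iris dataset and the two chosen estimators are assumptions made for the example, not something this page prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Example data (assumption): the iris dataset shipped with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Different algorithms, same fit / score methods
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=3),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, round(scores[name], 3))
```

Swapping one algorithm for another usually only changes the line that constructs the estimator.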


Python Knowledge Base: Make coding great again.
- Updated: 2025-01-02 by Andrey BRATUS, Senior Data Analyst.




Scikit-learn's rich set of algorithms includes regression, clustering, decision trees, neural networks, SVMs, and Naive Bayes. Corresponding use cases are presented in other sections of this site.


  1. Initial Data load.


  2. Input data should be numeric and stored as NumPy arrays, pandas DataFrames, or SciPy sparse matrices.

    
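    import numpy as np
    import pandas as pd
    # 'file.xlsx' is a placeholder path; the name df avoids shadowing the built-in set
    df = pd.read_excel('file.xlsx')
    

  3. Training And Test Data.


  4. 
    from sklearn.model_selection import train_test_split
    # Feature columns (placeholder names)
    X = df[['feature1', 'feature2', 'feature3', #...
            ]]
    # Target column
    y = df['target']
    # Hold out 30% of the rows for testing; random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=103)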
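
For classification with imbalanced classes, it can help to keep the class proportions equal in both parts of the split. The sketch below uses the `stratify` argument of `train_test_split` on synthetic data; the toy arrays are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 samples, 3 features, 80/20 class imbalance
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 ratio in both the training and the test part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=103, stratify=y)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```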
    

    Preprocessing The Data - Standardization.


    
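    from sklearn.preprocessing import StandardScaler
    # Learn the mean and standard deviation from the training data only
    scaler = StandardScaler().fit(X_train)
    standardized_X = scaler.transform(X_train)
    # Reuse the training statistics for the test data
    standardized_X_test = scaler.transform(X_test)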
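
A quick way to check what standardization does is to inspect the column statistics afterwards. The following sketch uses synthetic data (an assumption) and shows that the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately centered because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): features centered around 50 with spread 10
rng = np.random.RandomState(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler().fit(X_train)
Z_train = scaler.transform(X_train)
Z_test = scaler.transform(X_test)

# Training columns: mean 0 and standard deviation 1 (up to rounding)
print(Z_train.mean(axis=0).round(6), Z_train.std(axis=0).round(6))
# Test columns: close to 0, but not exactly
print(Z_test.mean(axis=0).round(3))
```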
    
  5. Preprocessing The Data - Normalization.


  6. 
    from sklearn.preprocessing import Normalizer
    # Normalizer rescales each sample (row) to unit norm; fit() learns nothing from the data
    scaler = Normalizer().fit(X_train)
    normalized_X = scaler.transform(X_train)
    normalized_X_test = scaler.transform(X_test)
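
Unlike standardization, normalization works row by row. A small sketch on a hand-made array (an assumption for illustration) makes the effect visible: every row is divided by its own L2 norm.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [6.0, 8.0]])

# Each row is divided by its own Euclidean (L2) length
X_norm = Normalizer(norm="l2").fit_transform(X)
print(X_norm)
print(np.linalg.norm(X_norm, axis=1))
```

The first and third rows point in the same direction, so they become identical after normalization.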
    

  7. Preprocessing The Data - Binarization.


  8. 
    from sklearn.preprocessing import Binarizer
    binarizer = Binarizer(threshold=0.0).fit(X)
    binary_X = binarizer.transform(X)
    
  9. Preprocessing The Data - Encoding Categorical Features.


  10. 
    from sklearn.preprocessing import LabelEncoder
    # LabelEncoder is meant for the target y: it maps labels to integers 0..n_classes-1
    enc = LabelEncoder()
    y = enc.fit_transform(y)
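
`LabelEncoder` is designed for the target column. For categorical feature columns, one common option is `OneHotEncoder`; the sketch below contrasts the two on made-up values (the colors and animal labels are assumptions).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Categorical feature column (hypothetical values), shape (n_samples, 1)
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
onehot = OneHotEncoder(handle_unknown="ignore")
X_encoded = onehot.fit_transform(colors).toarray()
print(onehot.categories_[0])   # categories in sorted order
print(X_encoded)               # one 0/1 column per category

# Target labels (hypothetical values) become integer codes
labels = ["cat", "dog", "cat", "bird"]
enc = LabelEncoder()
y = enc.fit_transform(labels)
print(enc.classes_, y)
```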
    
  11. Preprocessing The Data - Imputing Missing Values.


  12. 
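    # Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
    from sklearn.impute import SimpleImputer
    # Treat 0 as the missing-value marker and replace it with the column mean
    imp = SimpleImputer(missing_values=0, strategy='mean')
    imp.fit_transform(X_train)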
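
Missing values are more often stored as `NaN` than as 0. A minimal sketch of mean imputation for that case, using `SimpleImputer` from `sklearn.impute` on a toy array (an assumption for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan]])

# Replace each NaN with the mean of the observed values in its column
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imp.fit_transform(X)
print(imp.statistics_)   # per-column means used as fill values
print(X_filled)
```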
    
  13. Preprocessing The Data - Generating Polynomial Features.


  14. 
    from sklearn.preprocessing import PolynomialFeatures
    poly = PolynomialFeatures(3)
    poly.fit_transform(X)
    
  15. Creating a Model - Supervised Learning.


  16. 
    #Linear Regression
    from sklearn.linear_model import LinearRegression
    # The normalize parameter was removed in scikit-learn 1.2; scale features with StandardScaler instead
    lr = LinearRegression()
    #Support Vector Machines (SVM)
    from sklearn.svm import SVC
    svc = SVC(kernel='linear')
    #Naive Bayes
    from sklearn.naive_bayes import GaussianNB
    gnb = GaussianNB()
    #KNN
    from sklearn import neighbors
    knn = neighbors.KNeighborsClassifier(n_neighbors=3)
    
  17. Model Fitting.


  18. 
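    #Supervised learning
    lr.fit(X_train, y_train)
    knn.fit(X_train, y_train)
    svc.fit(X_train, y_train)
    #Unsupervised Learning
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    # n_clusters and n_components are example values
    k_means = KMeans(n_clusters=3, random_state=0)
    pca = PCA(n_components=2)
    k_means.fit(X_train)
    # fit_transform returns the projected data, not the fitted model
    pca_model = pca.fit_transform(X_train)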
    
  19. Prediction.


  20. 
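    #Supervised Estimators
    # Input must have the same number of features as the training data
    y_pred = svc.predict(X_test)
    y_pred = lr.predict(X_test)
    # predict_proba returns class probabilities, not labels
    y_proba = knn.predict_proba(X_test)
    #Unsupervised Estimators
    y_pred = k_means.predict(X_test)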
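
The difference between `predict` and `predict_proba` is easy to miss. The sketch below, run on the bundled iris dataset (an assumption for the example), shows that the first returns one label per sample while the second returns one probability per class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

y_pred = knn.predict(X_test)          # shape (n_samples,): class labels
y_proba = knn.predict_proba(X_test)   # shape (n_samples, n_classes): rows sum to 1
print(y_pred[:5])
print(y_proba[:5])
```

Label-based metrics such as accuracy expect the output of `predict`, not the probability matrix.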
    
  21. Evaluating Model’s Performance - Classification Metrics.


  22. 
    #Accuracy Score
    knn.score(X_test, y_test)
    from sklearn.metrics import accuracy_score
    # Metrics compare true labels with predicted labels (not probabilities)
    y_pred = knn.predict(X_test)
    accuracy_score(y_test, y_pred)
    #Classification Report
    from sklearn.metrics import classification_report
    print(classification_report(y_test, y_pred))
    #Confusion Matrix
    from sklearn.metrics import confusion_matrix
    print(confusion_matrix(y_test, y_pred))
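
A hand-checkable example can make these metrics easier to read. The labels below are made up for illustration: five of the six predictions are correct, and the one mistake is a true 1 predicted as 0.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
print(acc)   # 5 of 6 predictions are correct
print(cm)    # rows: true class, columns: predicted class
```

In the confusion matrix, the single error appears in row 1, column 0.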
    
  23. Evaluating Model’s Performance - Regression Metrics.


  24. 
    #Mean Absolute Error
    from sklearn.metrics import mean_absolute_error
    # Compare the regressor's predictions with the true test targets
    y_pred = lr.predict(X_test)
    mean_absolute_error(y_test, y_pred)
    #Mean Squared Error
    from sklearn.metrics import mean_squared_error
    mean_squared_error(y_test, y_pred)
    #R² Score
    from sklearn.metrics import r2_score
    r2_score(y_test, y_pred)
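
The three regression metrics can be verified by hand on a small example. The values below are made up for illustration; the absolute errors are 0.5, 0.5, 0 and 1.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)   # (0.5 + 0.5 + 0 + 1) / 4
mse = mean_squared_error(y_true, y_pred)    # (0.25 + 0.25 + 0 + 1) / 4
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot
print(mae, mse, round(r2, 4))
```

Both arrays must have the same length and refer to the same samples, otherwise the metrics raise an error or are meaningless.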
    
  25. Evaluating Model’s Performance - Clustering Metrics.


  26. 
    #Adjusted Rand Index
    from sklearn.metrics import adjusted_rand_score
    # y_test holds the ground-truth labels, y_pred the cluster assignments
    y_pred = k_means.predict(X_test)
    adjusted_rand_score(y_test, y_pred)
    #Homogeneity
    from sklearn.metrics import homogeneity_score
    homogeneity_score(y_test, y_pred)
    #V-measure
    from sklearn.metrics import v_measure_score
    v_measure_score(y_test, y_pred)
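
Cluster IDs are arbitrary, so these metrics compare groupings rather than label values. A small sketch with made-up labels shows that swapping the cluster names does not change a perfect score, while a grouping that cuts across the true classes scores lower.

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score

y_true = [0, 0, 1, 1]
swapped = [1, 1, 0, 0]   # same grouping, cluster IDs exchanged
mixed = [0, 1, 0, 1]     # every cluster contains both true classes

ari_swapped = adjusted_rand_score(y_true, swapped)
ari_mixed = adjusted_rand_score(y_true, mixed)
print(ari_swapped, ari_mixed)
print(homogeneity_score(y_true, swapped), v_measure_score(y_true, swapped))
```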
    
  27. Evaluating Model’s Performance - Cross-Validation.


  28. 
    from sklearn.model_selection import cross_val_score
    print(cross_val_score(knn, X_train, y_train, cv=4))
    print(cross_val_score(lr, X, y, cv=2))
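
`cross_val_score` returns one score per fold, so it is common to summarize the result with its mean and spread. A minimal sketch on the bundled iris dataset (an assumption for the example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# cv=4 trains and evaluates the model four times, each time on a different held-out fold
scores = cross_val_score(knn, X, y, cv=4)
print(scores)
print(scores.mean(), scores.std())
```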
    
  29. Model Tuning - Grid Search.


  30. 
    from sklearn.model_selection import GridSearchCV
    params = {"n_neighbors": np.arange(1,3),
              "metric": ["euclidean", "cityblock"]}
    grid = GridSearchCV(estimator=knn,
                        param_grid=params)
    grid.fit(X_train, y_train)
    print(grid.best_score_)
    print(grid.best_estimator_.n_neighbors)
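
After a grid search, the fitted search object can be used like a regular estimator. The sketch below, run on the bundled iris dataset with an example parameter grid (both are assumptions), reads the winning combination from `best_params_` and scores the refitted model on held-out data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 5 values x 2 metrics = 10 candidate combinations
params = {"n_neighbors": np.arange(1, 6),
          "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=params, cv=4)
grid.fit(X_train, y_train)

print(grid.best_params_)
# With the default refit=True, the best model is refit on all of X_train
test_score = grid.score(X_test, y_test)
print(test_score)
```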
    
  31. Model Tuning - Randomized Parameter Optimization.


  32. 
    from sklearn.model_selection import RandomizedSearchCV
    params = {"n_neighbors": range(1,5),
              "weights": ["uniform", "distance"]}
    rsearch = RandomizedSearchCV(estimator=knn,
                                 param_distributions=params,
                                 cv=4,
                                 n_iter=8,
                                 random_state=5)
    rsearch.fit(X_train, y_train)
    print(rsearch.best_score_)
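
Randomized search becomes more useful when parameters are sampled from distributions instead of short lists. The following sketch assumes SciPy is available and uses `scipy.stats.randint` on the bundled iris dataset; the parameter ranges are example values.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_distributions = {
    "n_neighbors": randint(1, 15),          # integers from 1 to 14
    "weights": ["uniform", "distance"],     # lists are sampled uniformly
}
rsearch = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_distributions,
    n_iter=8,          # number of sampled candidates
    cv=4,
    random_state=5)
rsearch.fit(X, y)
print(rsearch.best_params_, rsearch.best_score_)
```

Here `n_iter` controls the budget: only 8 candidates are evaluated, however large the search space is.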
    


