Data set Prep Powerhouse: Unleash the Potential of SQL.

Data journey with SQL.


After using SQL (structured query language) for data analysis let's dive deeper in data handling process and concentrate on dataset preparation task.

We will see how describe input data, work with attributes, handle missing data, find and filter outliers, encode primary data, split and combine datasets.

SQL dataset preparation.
SQL dataset preparation meme.

Python Knowledge Base: Make coding great again.
- Updated: 2024-09-12 by Andrey BRATUS, Senior Data Analyst.




    Although SQL is not considered as modern tool as Python for data handling, but it gives you nice to have additional instrumets in your data exploration journey.


  1. Dataset description:


  2. Volume - number of rows.

    
    SELECT COUNT(*) FROM table1;
    

    Rows by date.

    
    SELECT start_date, COUNT(*) FROM table1
    GROUP BY start_date
    ORDER BY start_date desc;
    

    Attributes selection.

    
    SELECT col1, col2, col3 FROM table1;
    

    Incomplete records.

    
    SELECT * FROM table1
    WHERE col1 IS NULL
    AND col2 IS NULL
    AND col3 IS NULL;
    


  3. Working with attributes:


  4. Distinct values.

    
    SELECT DISTINCT(col1) FROM table1;
    

    Missing values.

    
    SELECT * FROM table1
    WHERE col1 IS NULL;
    

    Column Range.

    
    SELECT MIN(col1), MAX(col1), AVG(col1) FROM table1;
    

    Data type.

    
    SELECT
        column_name, data_type, data_length
    FROM
        all_tab_columns
    WHERE
        upper(table_name) = 'TABLE1'
    

    Outliers filtering with 95% confidence.

    
    WITH outliers as (
    SELECT STDDEV(col2) as STDEV FROM table1)
    SELECT col1, col2 from table1 a
    CROSS JOIN outliers c
    WHERE a.col2>c.STDEV*2;
    

    Equiwidth Histogram / Distribution.

    
    SELECT col2,
    WIDTH_BUCKET(col2,100,400,10)
    FROM table1;
    


    Converting Data types.

    
    SELECT CAST(col1 as DATE),
    CAST(col2 as INT) FROM table1;
    

    Replacing column values by pattern.

    
    SELECT reg,CASE WHEN col1 is NULL then
    REPLACE(col2,0,100) ELSE to_char(col2) END col2 FROM table1;
    

    Data reFormatting.

    
    SELECT UPPER(col1), REPLACE(col2,',','_') FROM table1;
    

    Scaling.

    
    SELECT col3, col2/(MAX(col2)OVER
    (PARTITION BY col3)) from table1
    WHERE col2<>0;
    

    Buckets/Bins.

    
    SELECT col1,
    CASE WHEN col3<=100 THEN 'bin1'
     WHEN col3>100 THEN 'bin12'
       ELSE 'bin3' END as col3_bin
         FROM table1;
    

    Date parts.

    
    SELECT TO_CHAR (SYSDATE, 'Day, Month, DD, YYYY')"TO_CHAR example" 
    FROM DUAL;
    

    Date difference.

    
    SELECT TRUNC(SYSDATE) - TO_DATE('26/06/2022','DD/MM/YYYY') "Days" 
    FROM DUAL;
    

    Date year ago.

    
    SELECT ADD_MONTHS(SYSDATE, -12) FROM DUAL;
    

    Dummy Encoding.

    
    SELECT col1,
    CASE
      WHEN (animal='dog' ) THEN 1
      WHEN (animal='cat' ) THEN 2
      ELSE 0
      END animal_encoded
    from table1;  
    

    Quantitative Outliers handling.

    
    SELECT
    CASE
      WHEN (col2<0 ) THEN 0
      WHEN (col2>200 ) THEN 200
      ELSE col2
      END col2_clean
    from table1; 
    

    Missing values fill with average or 0.

    
    SELECT  col2, COALESCE(col2,AVG(col2) OVER()),
    COALESCE(col2,0) FROM table1;
    

    Incorrect values replacement.

    
    SELECT REPLACE(col2, 0,1000) from table1;
    

  5. Combining and Splitting datasets:


  6. Horizontal join with full match.

    
    SELECT t1.col1, t2.col2 FROM  t1
    INNER JOIN  t2 on t1.col3=t2.col3;
    

    Horizontal join with partial match.

    
    SELECT t1.col1, t2.col2 FROM  t1
    LEFT JOIN  t2 on t1.col3=t2.col3;
    

    Vertical union without duplicates.

    
    SELECT col1, col2 FROM  table1
    UNION
    SELECT col1, col2 FROM  table2
    

    Vertical union with duplicates.

    
    SELECT col1, col2 FROM  table1
    UNION ALL
    SELECT col1, col2 FROM  table12
    

    Simple NULL value filter.

    
    SELECT col1, col2 FROM table1
    WHERE col1 IS NOT NULL;
    

    Aggregation based filter.

    
    SELECT SUM(col1), col2 FROM table1
    GROUP BY col2
    HAVING SUM(col1)>0.3;
    

    Random sampling.

    
    SELECT col1,ROW_NUMBER() OVER
    (ORDER BY DBMS_RANDOM.RANDOM()) as rand FROM table1;
    

    Sampling by buckets.

    
    SELECT col1, NTILE(4)OVER
    (ORDER BY col2) as qualt FROM table1;
    




See also related topics: