Data preparation with SQL essential cheat sheet.

Data journey with SQL.


After using SQL (structured query language) for data analysis let's dive deeper in data handling process and concentrate on dataset preparation task.

We will see how describe input data, work with attributes, handle missing data, find and filter outliers, encode primary data, split and combine datasets.

SQL dataset preparation.



Although SQL is not considered as modern tool as Python for data handling, but it gives you nice to have additional instrumets in your data exploration journey.


Dataset description:


Volume - number of rows.


SELECT COUNT(*) FROM table1;

Rows by date.


SELECT start_date, COUNT(*) FROM table1
GROUP BY start_date
ORDER BY start_date desc;

Attributes selection.


SELECT col1, col2, col3 FROM table1;

Incomplete records.


SELECT * FROM table1
WHERE col1 IS NULL
AND col2 IS NULL
AND col3 IS NULL;


Working with attributes:


Distinct values.


SELECT DISTINCT(col1) FROM table1;

Missing values.


SELECT * FROM table1
WHERE col1 IS NULL;

Column Range.


SELECT MIN(col1), MAX(col1), AVG(col1) FROM table1;

Data type.


SELECT
    column_name, data_type, data_length
FROM
    all_tab_columns
WHERE
    upper(table_name) = 'TABLE1'

Outliers filtering with 95% confidence.


WITH outliers as (
SELECT STDDEV(col2) as STDEV FROM table1)
SELECT col1, col2 from table1 a
CROSS JOIN outliers c
WHERE a.col2>c.STDEV*2;

Equiwidth Histogram / Distribution.


SELECT col2,
WIDTH_BUCKET(col2,100,400,10)
FROM table1;


Converting Data types.


SELECT CAST(col1 as DATE),
CAST(col2 as INT) FROM table1;

Replacing column values by pattern.


SELECT reg,CASE WHEN col1 is NULL then
REPLACE(col2,0,100) ELSE to_char(col2) END col2 FROM table1;

Data reFormatting.


SELECT UPPER(col1), REPLACE(col2,',','_') FROM table1;

Scaling.


SELECT col3, col2/(MAX(col2)OVER
(PARTITION BY col3)) from table1
WHERE col2<>0;

Buckets/Bins.


SELECT col1,
CASE WHEN col3<=100 THEN 'bin1'
 WHEN col3>100 THEN 'bin12'
   ELSE 'bin3' END as col3_bin
     FROM table1;

Date parts.


SELECT TO_CHAR (SYSDATE, 'Day, Month, DD, YYYY')"TO_CHAR example" 
FROM DUAL;

Date difference.


SELECT TRUNC(SYSDATE) - TO_DATE('26/06/2022','DD/MM/YYYY') "Days" 
FROM DUAL;

Date year ago.


SELECT ADD_MONTHS(SYSDATE, -12) FROM DUAL;

Dummy Encoding.


SELECT col1,
CASE
  WHEN (animal='dog' ) THEN 1
  WHEN (animal='cat' ) THEN 2
  ELSE 0
  END animal_encoded
from table1;  

Quantitative Outliers handling.


SELECT
CASE
  WHEN (col2<0 ) THEN 0
  WHEN (col2>200 ) THEN 200
  ELSE col2
  END col2_clean
from table1; 

Missing values fill with average or 0.


SELECT  col2, COALESCE(col2,AVG(col2) OVER()),
COALESCE(col2,0) FROM table1;

Incorrect values replacement.


SELECT REPLACE(col2, 0,1000) from table1;

Combining and Splitting datasets:


Horizontal join with full match.


SELECT t1.col1, t2.col2 FROM  t1
INNER JOIN  t2 on t1.col3=t2.col3;

Horizontal join with partial match.


SELECT t1.col1, t2.col2 FROM  t1
LEFT JOIN  t2 on t1.col3=t2.col3;

Vertical union without duplicates.


SELECT col1, col2 FROM  table1
UNION
SELECT col1, col2 FROM  table2

Vertical union with duplicates.


SELECT col1, col2 FROM  table1
UNION ALL
SELECT col1, col2 FROM  table12

Simple NULL value filter.


SELECT col1, col2 FROM table1
WHERE col1 IS NOT NULL;

Aggregation based filter.


SELECT SUM(col1), col2 FROM table1
GROUP BY col2
HAVING SUM(col1)>0.3;

Random sampling.


SELECT col1,ROW_NUMBER() OVER
(ORDER BY DBMS_RANDOM.RANDOM()) as rand FROM table1;

Sampling by buckets.


SELECT col1, NTILE(4)OVER
(ORDER BY col2) as qualt FROM table1;




See also related topics: