Data journey with SQL.
After using SQL (Structured Query Language) for data analysis, let's dive deeper into the data handling process and concentrate on the dataset preparation task.
We will see how to describe input data, work with attributes, handle missing data, find and filter outliers, encode raw data, and split and combine datasets.
Updated: 2024-12-01 by Andrey BRATUS, Senior Data Analyst.
Contents: dataset description, working with attributes, combining and splitting datasets.
Although SQL is not considered as modern a tool as Python for data handling, it gives you nice-to-have additional instruments for your data exploration journey.
Volume - number of rows.
SELECT COUNT(*) FROM table1;
Rows by date.
SELECT start_date, COUNT(*) FROM table1
GROUP BY start_date
ORDER BY start_date desc;
Attributes selection.
SELECT col1, col2, col3 FROM table1;
Incomplete records.
SELECT * FROM table1
WHERE col1 IS NULL
AND col2 IS NULL
AND col3 IS NULL;
Distinct values.
SELECT DISTINCT col1 FROM table1;
Missing values.
SELECT * FROM table1
WHERE col1 IS NULL;
Column Range.
SELECT MIN(col1), MAX(col1), AVG(col1) FROM table1;
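A quick completeness check pairs well with the range: COUNT(col1) skips NULLs, so the difference from COUNT(*) gives the number of missing values (a minimal sketch).
SELECT COUNT(*) AS total_rows,
COUNT(col1) AS non_null_rows,
COUNT(*) - COUNT(col1) AS missing_rows
FROM table1;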
Data type.
SELECT column_name, data_type, data_length
FROM all_tab_columns
WHERE UPPER(table_name) = 'TABLE1';
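Note that all_tab_columns is Oracle-specific; on PostgreSQL, MySQL or SQL Server the equivalent catalogue view is information_schema.columns (a sketch, the table name casing may differ per engine).
SELECT column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_name = 'table1';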
Outliers filtering with 95% confidence (values more than two standard deviations away from the mean, assuming a roughly normal distribution).
WITH stats AS (
SELECT AVG(col2) AS mean_col2, STDDEV(col2) AS stdev_col2 FROM table1)
SELECT a.col1, a.col2 FROM table1 a
CROSS JOIN stats s
WHERE ABS(a.col2 - s.mean_col2) > 2 * s.stdev_col2;
Equiwidth Histogram / Distribution.
SELECT col2,
WIDTH_BUCKET(col2, 100, 400, 10) AS bucket
FROM table1;
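To turn the bucket assignment into an actual histogram, group and count the rows per bucket (same 100-400 range and 10 buckets as above).
SELECT WIDTH_BUCKET(col2, 100, 400, 10) AS bucket,
COUNT(*) AS cnt
FROM table1
GROUP BY WIDTH_BUCKET(col2, 100, 400, 10)
ORDER BY bucket;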
Converting Data types.
SELECT CAST(col1 as DATE),
CAST(col2 as INT) FROM table1;
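CAST to DATE relies on the session's NLS date format; when col1 is stored as a string, an explicit format mask with TO_DATE (the 'YYYY-MM-DD' mask below is an assumption about the source format) is the safer option.
SELECT TO_DATE(col1, 'YYYY-MM-DD') AS col1_date,
TO_NUMBER(col2) AS col2_num
FROM table1;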
Replacing column values by pattern.
SELECT col1, CASE WHEN col1 IS NULL
THEN REPLACE(TO_CHAR(col2), '0', '100') ELSE TO_CHAR(col2) END AS col2_fixed
FROM table1;
Data reformatting.
SELECT UPPER(col1), REPLACE(col2,',','_') FROM table1;
Scaling.
SELECT col3, col2 / MAX(col2) OVER (PARTITION BY col3) AS col2_scaled
FROM table1
WHERE col2 <> 0;
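A common alternative is min-max scaling to the [0, 1] range; a minimal sketch, with NULLIF guarding against division by zero when a partition is constant.
SELECT col3,
(col2 - MIN(col2) OVER (PARTITION BY col3))
/ NULLIF(MAX(col2) OVER (PARTITION BY col3)
- MIN(col2) OVER (PARTITION BY col3), 0) AS col2_minmax
FROM table1;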
Buckets/Bins.
SELECT col1,
CASE WHEN col3 <= 100 THEN 'bin1'
WHEN col3 > 100 THEN 'bin2'
ELSE 'bin3' END AS col3_bin
FROM table1;
Date parts.
SELECT TO_CHAR(SYSDATE, 'Day, Month, DD, YYYY') "TO_CHAR example"
FROM DUAL;
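TO_CHAR returns a formatted string; to get individual date parts as numbers, EXTRACT is the usual tool (sketched here on SYSDATE).
SELECT EXTRACT(YEAR FROM SYSDATE) AS yr,
EXTRACT(MONTH FROM SYSDATE) AS mon,
EXTRACT(DAY FROM SYSDATE) AS dy
FROM DUAL;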
Date difference.
SELECT TRUNC(SYSDATE) - TO_DATE('26/06/2022','DD/MM/YYYY') "Days"
FROM DUAL;
Date year ago.
SELECT ADD_MONTHS(SYSDATE, -12) FROM DUAL;
Dummy Encoding.
SELECT col1,
CASE
WHEN (animal='dog' ) THEN 1
WHEN (animal='cat' ) THEN 2
ELSE 0
END animal_encoded
from table1;
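Strictly speaking, the query above performs label encoding (one integer column); dummy or one-hot encoding creates a separate 0/1 indicator column per category. A minimal sketch for the same two categories:
SELECT col1,
CASE WHEN animal = 'dog' THEN 1 ELSE 0 END AS is_dog,
CASE WHEN animal = 'cat' THEN 1 ELSE 0 END AS is_cat
FROM table1;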
Quantitative Outliers handling.
SELECT
CASE
WHEN (col2<0 ) THEN 0
WHEN (col2>200 ) THEN 200
ELSE col2
END col2_clean
from table1;
Missing values fill with average or 0.
SELECT col2, COALESCE(col2,AVG(col2) OVER()),
COALESCE(col2,0) FROM table1;
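When the data has natural groups, a per-group average is often a better fill than the global one; a sketch assuming col3 is the grouping column.
SELECT col2,
COALESCE(col2, AVG(col2) OVER (PARTITION BY col3)) AS col2_filled
FROM table1;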
Incorrect values replacement.
SELECT REPLACE(col2, 0,1000) from table1;
Horizontal join with full match.
SELECT t1.col1, t2.col2 FROM t1
INNER JOIN t2 on t1.col3=t2.col3;
Horizontal join with partial match.
SELECT t1.col1, t2.col2 FROM t1
LEFT JOIN t2 on t1.col3=t2.col3;
Vertical union without duplicates.
SELECT col1, col2 FROM table1
UNION
SELECT col1, col2 FROM table2;
Vertical union with duplicates.
SELECT col1, col2 FROM table1
UNION ALL
SELECT col1, col2 FROM table2;
Simple NULL value filter.
SELECT col1, col2 FROM table1
WHERE col1 IS NOT NULL;
Aggregation based filter.
SELECT SUM(col1), col2 FROM table1
GROUP BY col2
HAVING SUM(col1)>0.3;
Random sampling.
SELECT col1, ROW_NUMBER() OVER
(ORDER BY DBMS_RANDOM.VALUE) AS rand FROM table1;
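The random rank alone is not yet a sample: wrap it in a subquery and cut it off at the desired size (the 100-row limit below is arbitrary). On Oracle, the SAMPLE clause is a shorter alternative.
SELECT col1 FROM (
SELECT col1, ROW_NUMBER() OVER (ORDER BY DBMS_RANDOM.VALUE) AS rand
FROM table1)
WHERE rand <= 100;
-- or, approximately 10% of the rows:
SELECT col1 FROM table1 SAMPLE(10);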
Sampling by buckets.
SELECT col1, NTILE(4) OVER
(ORDER BY col2) AS quartile FROM table1;
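The same idea can drive a simple train/test split of the dataset; a sketch assuming an 80/20 split on a random rank (the ratio is an arbitrary choice).
WITH ranked AS (
SELECT col1, col2, NTILE(5) OVER (ORDER BY DBMS_RANDOM.VALUE) AS grp
FROM table1)
SELECT col1, col2,
CASE WHEN grp <= 4 THEN 'train' ELSE 'test' END AS split
FROM ranked;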