Data Cleaning with SQL šŸ§¼

Data Cleaning with SQL šŸ§¼

Ā·

4 min read

The foundational preprocessing step explained the easy way!

Photo by Anton on Unsplash

The pandemic has taught us a lot about cleaning. Letā€™s go ahead and put those lessons to use on our data!

Curious about SQL? Check out the entire series here:

[Introduction to SQL for Data Analysis šŸ“ˆ
The what why and how of SQL when you are just getting started.medium.com](https://medium.com/@aakriti.sharma18/introduction-to-sql-for-data-analysis-1c4177b36eba "medium.com/@aakriti.sharma18/introduction-t..")

CAST

A Data Type gives the database information about how the information in a column is intended to be used and what operations can be performed on it. For instance, COUNT works with any data type, but SUM only works for numerical data.

Often while importing or creating the database, numbers and dates are treated as strings. This limits the functions that can be performed on them. To improve this, we can modify the data type as :

CAST(column_name AS data_type)

or

column_name :: data_type

TRIM

Sometimes, we are met with a situation where we need to remove certain characters from the fields. The TRIM function is used to remove characters from the beginning and end of a string. Here's an example:

SELECT column,  
       TRIM(position 'characters' FROM column)  
  FROM Tablename

The TRIM function takes 3 arguments. First, the position, whether you want to remove characters from the beginning ('leading'), the end ('trailing'), or both ('both'). Next, specify all characters to be trimmed. Any characters included in the single quotes will be removed from the beginning, end, or both sides of the string. The last argument is the column name.

LEFT / RIGHT

To retrieve a given length of characters from either the left or right side of the string and present as it a separate string.

Usage:

LEFT(column, number_of_characters)

SUBSTR

LEFT and RIGHT both create substrings of a specified length, but they only do so starting from either side of an existing string. If one wants to start in the middle of a string, SUBSTRcan be used. The syntax is -

SUBSTR(column, starting_character_position, number_of_characters)

POSITION / STRPOS

POSITION and STRPOS take a substring and return the position (counting from left) where that substring first appears in the target string.

POSITION( substring IN column )

OR

STRPOS( column , substring )

CONCAT

Strings can be combined from several columns together using CONCAT. Simply order the values you want to concatenate and separate them with commas as:

CONCAT ( col1,col2,col3ā€¦..)

Hardcoded values can be concatenated by enclosing them inside single quotes as:

CONCAT( Numerator , '/' , Denominator )

CONCATā€™s objective can also be achieved by || operator.

col1 || col2 || col3 ā€¦ā€¦ā€¦

Parsing Dates

If the date is written as a string we can surely convert it to date data type using CAST. But this only works when it is in an SQL identified format. But what if it is other formats like MM/DD/YYYY? We first need to convert it to a string in an acceptable format and then cast it. For example,

(SUBSTR(date, 7, 4) || ā€˜-ā€™ || LEFT(date, 2) || ā€˜-ā€™ || SUBSTR(date, 4, 2))::date AS cleaned_date

EXTRACT

A lot of times we need to take into account a specific part of the date, like sales in this month, admissions in this year. In such cases, we extract information from the date column as :

EXTRACT('year' FROM cleaned_date) AS year,
EXTRACT('month' FROM cleaned_date) AS month,
EXTRACT('day' FROM cleaned_date) AS day,
EXTRACT('hour' FROM cleaned_date) AS hour,
EXTRACT('minute' FROM cleaned_date) AS minute, EXTRACT('second' FROM cleaned_date) AS second, EXTRACT('decade' FROM cleaned_date) AS decade,
EXTRACT('dow' FROM cleaned_date) AS day_of_week

NOW

SQL provides a wide variety of functions to retrieve the current date, time, and timestamp. Fun fact: They can be printed without the FROM clause.

SELECT CURRENT_DATE AS date,
CURRENT_TIME AS time,
CURRENT_TIMESTAMP AS timestamp,
LOCALTIME AS localtime,
LOCALTIMESTAMP AS localtimestamp,
CURRENT_TIME AT TIME ZONE 'PST' AS time_pst,
NOW() AS now

Handling Missing Values

The most commonly found discrepancy in data is missing values. There are a lot of ways to handle this, a few listed here :

ā“ Delete the rows with missing values

ā“ Replace with mean/median/mode

ā“ Assign a unique category

ā“ Predict the missing values

Weā€™ll focus on first and third here,

NOT NULL

When we want to remove the rows with empty fields from our result we use NOT NULL that only prints the rows that have some value for the column specified.

SLELCT *
FROM Tablename
WHERE column NOT NULL

COALESCE

COALESCE is used to replace the NULL values with a specific value so they donā€™t hinder further analysis.

COALESCE(column, new value)

This was it for this iteration of the series, next part will explore Windows Functions in SQL.

Happy cleaning! āœØ

Next up in the series :

[Dealing with Databases in SQL šŸ‘½
SQL is monumental when dealing with relational databases. Letā€™s see how.medium.com](https://medium.com/@aakriti.sharma18/dealing-with-databases-in-sql-1bc455901137 "medium.com/@aakriti.sharma18/dealing-with-d..")

Interested in daily data analysis content? Follow me on Twitter and Medium.

Ā