Analyze Twitter’s Reaction to Taylor Swift with HarperDB

An end-to-end Data Analysis Portfolio Project.

2500000000000000000

If you had trouble reading that, let’s simplify it: it is 2.5 quintillion bytes, the amount of data estimated to be generated by humans every single day. Knowingly or unknowingly, all of us contribute to this in one way or another, making data one of the most precious commodities on the planet right now.

What’s even more valuable than the data itself is the ability to make something out of it. Businesses aren’t interested in filling up their storage with millions of data points; they are interested in the insights and takeaways that come out of them. This is why people who can extract those insights are highly valued across the globe. Building systems that extract, analyze, predict, and forecast from data seamlessly is the need of the hour.

To do so, we require high-performance database management systems that help scale and optimize the data transaction process. Services like HarperDB Cloud help manage huge databases on the cloud, making the process more scalable, accessible, and rapid, with a built-in HTTP API endpoint accessible from all major languages.

Problem Statement

It is safe to say that it’s nearly impossible to have escaped Taylor Swift’s music. The American singer and songwriter has many award-winning records and is known for her autobiographical songs. Her re-recorded songs dropped this month, and fans have been flooding timelines with their reactions, comments, thoughts, and of course conspiracy theories.

This project aims to extract tweets that mention ‘Taylor Swift’ and perform a detailed analysis on them using natural language processing.


Data Collection

To collect the data, we will scrape tweets containing the keyword ‘Taylor Swift’ using the Twint library in Python.

Installation

git clone --depth=1 https://github.com/twintproject/twint.git
cd twint
pip3 install . -r requirements.txt

Tip: Pin aiohttp in the requirements file to aiohttp==3.7.0 to avoid an error later on.

Scraping

import twint

c = twint.Config()

c.Search = "Taylor Swift"   # topic to search for
c.Limit = 2000000           # maximum number of tweets to scrape
c.Store_csv = True          # store tweets in a csv file
c.Output = "tweets.csv"     # path to the csv file

twint.run.Search(c)

I prefer Twint over Tweepy because of its simple syntax, the fact that it requires no authentication, and the absence of rate limits. The above code block creates a file with 100k+ tweets along with their properties. It takes a while to run, after which all the tweets are saved in the tweets.csv file in the following format:

The second component of the data is sourced from Kaggle. You can download it here, unzip it, and add it to your project folder.

Data Preprocessing

Since the data scraping process was automated, it is possible that errors and inconsistencies were introduced into the data. Data exploration is necessary to identify and remove any such faults.
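Assuming the tweets.csv produced by the scrape above, a minimal way to load it into pandas for exploration is:

import pandas as pd

# load the scraped tweets into a dataframe
df = pd.read_csv('tweets.csv')
df.head()   # peek at the first few records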

Null Values

First, let’s look at the count of null cells in each column.
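With pandas, a check along these lines gives that per-column count:

print(df.isnull().sum())   # number of null cells in each column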

It can be seen that 10+ columns have no or very few values in them. They can be removed with the following command:

df = df.drop(['near', 'geo', 'source', 'user_rt_id', 'user_rt',
              'retweet_id', 'retweet_date', 'translate', 'trans_src',
              'trans_dest'], axis=1)

After the columns, let’s remove the rows with empty values:

df = df.dropna()

Here we drop the rows with null values because they make up a very small portion of the data (< 2%). If the proportion were larger, we would adopt different methods, such as replacing missing values with the median/mean or predicting them.

Let’s check for null values now:
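Running the same check again:

print(df.isnull().sum())   # every column should now report zero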

Looks perfect!

The updated dimensions of the dataframe are worth a quick look too.
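One quick way to view them:

print(df.shape)   # (rows, columns) after dropping the sparse columns and empty rows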

Now that we are free of missing cells, we will look at the values of individual columns and make sure no unnecessary elements have been captured.

Investigating Columns

Starting with the timezone column, we can see that only one value is present for the entire dataset: the timezone of the country where the data was captured. This is inaccurate, as the tweets would certainly have been posted from all around the world, which makes the column irrelevant.

Any dataset needs a primary key to uniquely identify rows and speed up querying. Upon exploration, two columns stand out as candidates: id and conversation_id.

To serve as a primary key, a column must have a unique value for each row. To check whether these columns satisfy this condition, run the following:
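A minimal pandas sketch of that check, assuming the column names produced by the scrape:

# a column can act as a primary key only if every value is unique
print(len(df))                          # total number of rows
print(df['id'].nunique())               # unique values in id
print(df['conversation_id'].nunique())  # unique values in conversation_id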

We can see that the id column has a unique value for each row, i.e. the number of unique values in the column equals the number of rows, so it is used as the primary key. On the other hand, conversation_id is not a uniquely identifying attribute and can hence be dropped.

Next, looking at the values in a few rows, we can see that created_at is just a concatenation of the date and time fields, so it is redundant and safe to drop.
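Putting these observations together, one way to drop the three columns (assuming these exact column names in your dataframe) is:

# timezone holds a single constant value, conversation_id is not unique,
# and created_at merely duplicates date + time
df = df.drop(['timezone', 'conversation_id', 'created_at'], axis=1)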

Text preprocessing

Now that we only have the columns that actually add value to the data, we can begin with text preprocessing. The text of each tweet may itself contain redundant data that won’t contribute much to the analysis, so it is better removed up front. Text preprocessing involves passing textual data through a pipeline that cleans and prepares it for a specific context, which here is noise removal.

The noise here includes:

  • URLs
  • Mentions
  • Emojis
  • Reserved words
  • Hashtags

These might prove useful in a different context, but since we already have dedicated columns for them, we need not bulk up the tweet text with them as well.

  • Removing hashtags

Since we already have the list of hashtags, we simply iterate over the list and replace each one in the original tweet with an empty string. It requires a little string manipulation, given how the data is stored, but with this function you are good to go:

def hashtag_removal(r):
    # lowercase the tweet and parse the stored hashtag list, e.g. "['tag1', 'tag2']"
    twt = r['tweet'].lower()
    tags = list(r['hashtags'][1:-1].split(','))
    for i in tags:
        # strip whitespace and the surrounding quotes, then drop the hashtag from the text
        i = i.strip()
        i = i[1:-1]
        twt = twt.replace('#' + i, '')
    return twt

df['clean_tweet'] = df.apply(lambda x: hashtag_removal(x), axis=1)

  • Removing URLs , Mentions and Emojis

Retaining just the tweet text and removing all the mentions (usernames mentioned in the tweets), URLs, and emojis further preps the tweet for processing. This is done using the tweet-preprocessor module.

!pip install tweet-preprocessor
import preprocessor as p
df['clean_tweet'] = df['clean_tweet'].apply(lambda x: p.clean(x))

This leaves us with a cleaner, more direct version of the tweet with less noise.
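To see what the cleaner does, here is a made-up example tweet; both the text and the expected result are illustrative, and the exact output depends on the library’s default options:

sample = "@swiftie13 the vault tracks are unreal 😭 https://t.co/xyz"
print(p.clean(sample))
# expected to drop the mention, emoji, and URL, leaving roughly: 'the vault tracks are unreal'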

  • Removing punctuation

If you look closely at the tweet text, you’ll notice some bizarre character sequences that make no sense in a few tweets, like the one highlighted here:

These are HTML character entities (such as &amp;) that crept into the data when punctuation was scraped as its equivalent codes. Let’s remove them using Python’s regex module, matching the pattern they follow: starting with & and ending with ;

import regex as re

def punctuation_removal(r):
    # remove HTML character entities such as &amp; or &quot;
    ls = re.findall(r'&(\w+);', r)
    for i in ls:
        r = r.replace(f'&{i};', '')
    # remove any remaining punctuation characters
    punc = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''
    for ele in r:
        if ele in punc:
            r = r.replace(ele, '')
    return r

df['clean_tweet'] = df['clean_tweet'].apply(lambda x: punctuation_removal(x))

  • Remove digits

As mentioned before, preprocessing steps depend largely on the context in which they are used. If, for instance, we were dealing with sales data and analyzing that corpus, it would be necessary to keep all the numerical values. However, since the tweets lean more towards sentiment and do not need to be quantified, it’s safe to remove digits. We can again turn to regex for this.

df['clean_tweet'] = df['clean_tweet'].astype(str).str.replace(r'\d+', '', regex=True)

  • Tokenization

Tokenization is the process of splitting raw text into smaller units. Each of these units is called a token and can be a word, character, or sentence depending on how and where tokenization is applied. Tokenization helps capture context and build a better language model. The Natural Language Toolkit (NLTK) library in Python is used to carry this out.

import nltk
from nltk import word_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

# download the corpora needed for lemmatization and stop word removal
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
w_tokenizer = TweetTokenizer()

def lemmatize_text(text):
    # tokenize with the tweet-aware tokenizer, then lemmatize each token
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df['tokenized_tweet'] = df['clean_tweet'].apply(lambda x: lemmatize_text(x))

This creates a new column named tokenized_tweet which stores a list of tokens from the cleaned tweet.
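As a quick sanity check, running the helper on a made-up sentence shows the tokenization and lemmatization at work; the output shown as a comment is approximate:

print(lemmatize_text("the new albums are dropping tonight"))
# roughly: ['the', 'new', 'album', 'are', 'dropping', 'tonight']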

  • Stop Word Removal

As the final step of preprocessing, we remove stop words from the tokenized text. Stop words are commonly used words in a language; for English, ‘a’, ‘is’, and ‘are’ are a few examples. As they don’t add much to the meaning of a text and exist mainly for structure, we remove them from the tokens.

stop_words = set(stopwords.words('english'))
df['tokenized_tweet'] = df['tokenized_tweet'].apply(lambda x: [item for item in x if item not in stop_words])

The only downside to this process is that the data collected spans a wide variety of languages, 44 in total, and stop word corpora are not available for all of them yet. Since most of the data was in English, English stop words were used to carry this out.

And we are done!

Save the data to a CSV file:

df.to_csv('clean_tweets.csv', index=False)

Setting up HarperDB

HarperDB’s data and application platform delivers globally-distributed performance and simplicity in one package. It’s the first distributed database with hybrid SQL and NoSQL functionality in one, and the platform is enabling solutions like edge computing, real-time analytics, and rapid app development.

The most notable features of HarperDB are:

  • Single endpoint CRUD operations
  • Built-In HTTP API
  • Custom Functions
  • Simple UI
  • Free forever tier, with Enterprise, On-Prem, & SaaS Licensing
  • Multiple plugin support across major languages and frameworks
  • Both SQL and NoSQL supported
  • Returns arrays of JSON, removing the need for an ORM

Get started with setting up your database through the following steps -

Create your HarperDB account

Go to HarperDB’s signup page, and add all your details.

Create an instance

After signing up, you will be presented with the following screen.

Click on Create New HarperDB Cloud Instance. You’ll be presented with two options:

  1. Create a HarperDB cloud instance that is created, managed, and scaled by HarperDB itself.
  2. Install HarperDB in your own server and add it to the UI.

Launch the Instance

Click on Create AWS or Verizon Wavelength Instance for a cloud instance. Select AWS as the cloud provider, fill in your credentials, leave the default settings, and Confirm. It takes a few minutes to spin up the instance.

Create the Schema

Next, click on the instance, name your schema (the structure of the database), and add the names of your tables. I have used the names tweets and songs, with the hash attribute id for both.
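If you prefer scripting over the UI, a rough sketch of the equivalent calls through HarperDB’s built-in HTTP API looks like the following; the instance URL, schema name swift, and credentials are placeholders to replace with your own:

import requests

# placeholders — use your own instance URL and credentials
HARPERDB_URL = 'https://your-instance.harperdbcloud.com'
AUTH = ('HDB_ADMIN', 'your-password')

def hdb(body):
    # every HarperDB operation is a JSON body POSTed to the instance URL
    resp = requests.post(HARPERDB_URL, json=body, auth=AUTH)
    resp.raise_for_status()
    return resp.json()

hdb({'operation': 'create_schema', 'schema': 'swift'})
for table in ['tweets', 'songs']:
    hdb({'operation': 'create_table', 'schema': 'swift',
         'table': table, 'hash_attribute': 'id'})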

Import Data

Click the data import icon at the top right of the table browser. You will be directed to the CSV upload page, where you can choose to import a CSV by URL or upload a CSV file.

Drag your CSV files onto their respective tables and click Insert.

If you are prompted that the clean_tweets.csv file is too large to upload, try splitting the dataframe into smaller chunks and uploading them separately, as sketched below.
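One simple way to do the split with pandas and NumPy (the chunk count of five is arbitrary and can be tuned to whatever size the upload accepts):

import numpy as np
import pandas as pd

df = pd.read_csv('clean_tweets.csv')

# write the dataframe out as five smaller CSVs and upload them one by one
for i, chunk in enumerate(np.array_split(df, 5)):
    chunk.to_csv(f'clean_tweets_part_{i}.csv', index=False)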

Congratulations! You have reached the end of this tutorial. This was part 1 of the two-part end-to-end Data Analysis Project tutorial. It covered:

  • Scraping data
  • Setting up a CSV for processing
  • Utilizing Python for processing text data
  • Data preprocessing steps like stop word removal, lemmatization and noise reduction
  • Setting up a cloud database
  • Importing Data to the cloud instance

The next part will revolve around the following:

  • Setting up an ODBC driver in Tableau
  • Connecting it to the HarperDB cloud database
  • Creating meaningful visualization from the queried data
  • Hosting the interactive dashboards

Stay tuned!
