[Starbucks Twitter Sentiment Analysis] Instructions and Spark NLP
Setup with Confluent Kafka, Spark, Delta Lake with Databricks and AWS
Introduction
In this post, we will set up the environment to perform Starbucks Twitter sentiment analysis with Confluent Kafka, Spark, and Delta Lake on Databricks and AWS.
Step 1. Twitter API Credentials
As in the previous post, we need Twitter API credentials. After getting them, we save these credentials in a .env file. Make sure to include .env in .gitignore so that it is never committed.
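For illustration only, the .env file might look like the following; the variable names are placeholders, so use whatever names the producer code actually reads:

```
# Example .env (placeholder variable names -- match what producer.py expects)
TWITTER_API_KEY=<your-api-key>
TWITTER_API_SECRET=<your-api-secret>
TWITTER_ACCESS_TOKEN=<your-access-token>
TWITTER_ACCESS_TOKEN_SECRET=<your-access-token-secret>
TWITTER_BEARER_TOKEN=<your-bearer-token>
```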
Step 2. Confluent Cloud
Confluent Cloud is a resilient, scalable streaming data service based on Apache Kafka®, delivered as a fully managed service - Confluent Cloud. It lets users manage cluster resources easily.
2-1. Create a Confluent Cloud account and Kafka cluster
First, create a free Confluent Cloud account and create a Kafka cluster in Confluent Cloud. I created a Basic cluster, which supports single-zone availability, on the aws cloud provider.
2-2. Create a Kafka Topic named tweet_data with 2 partitions
From the navigation menu, click Topics, and on the Topics page, click Create topic. I set the topic name to tweet_data with 2 partitions. Once created on the Kafka cluster, the topic is available for use by producers and consumers.
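If you prefer the command line to the web UI, the Confluent CLI can create the same topic after you log in and select the cluster; this is shown here only as an alternative to the steps above:

```bash
# Alternative to the UI: requires `confluent login` and selecting the cluster first
confluent kafka cluster list
confluent kafka cluster use <cluster-id>
confluent kafka topic create tweet_data --partitions 2
```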
Step 3. Confluent Cloud API Credentials
API keys
From the navigation menu, click API keys under Data Integration. If there are no API keys available, click Add key to generate a new key pair (API_KEY, API_SECRET) and make sure to save it somewhere safe.
HOST: Bootstrap server
From the navigation menu, click Cluster settings under Cluster Overview. You will find an Identification block that contains the Bootstrap server information. Make sure to save it somewhere safe. It should look similar to pkc-w12qj.ap-southeast-1.aws.confluent.cloud:9092, in which case:
HOST = pkc-w12qj.ap-southeast-1.aws.confluent.cloud
Save these values in $HOME/.confluent/python.config
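For example, you could create the directory and open the file in vi:

```bash
mkdir -p $HOME/.confluent
vi $HOME/.confluent/python.config
```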
Press i to enter insert mode, then copy and paste the template below.
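The file content should be something like the following template, which follows the librdkafka configuration format used by Confluent's Python client examples:

```
bootstrap.servers={HOST}:9092
security.protocol=SASL_SSL
sasl.mechanisms=PLAIN
sasl.username={API_KEY}
sasl.password={API_SECRET}
session.timeout.ms=45000
```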
Then, replace HOST, API_KEY, and API_SECRET with the values from Step 3. Press Esc and type :wq to save the file.
Step 4. Create a Databricks Cluster
In this post, we are going to deploy Databricks on AWS. The instructions for creating a Databricks cluster on AWS are well explained HERE.
Click Compute in the navigation bar, click Create Cluster, and add a configuration like the one shown in the picture below.
After creating the Databricks cluster, it's time to explore the Databricks Workspace. Click Workspace in the navigation bar, click Users, then <user-account>, and create a Notebook.
Once you are done creating the Databricks notebook, please check my GitHub page for the source code of the Twitter data ingestion.
Step 4-1. Install Dependencies
When you are creating a cluster, you can find the Libraries tab next to the Configuration tab. If you need any dependencies later, you can install them there. Alternatively, you can install dependencies from a notebook cell with %pip install delta-spark spark-nlp==3.3.3 wordcloud contractions gensim pyldavis==3.2.0.
Step 5. Source Code for Twitter Data Ingestion
Check the source code on my GitHub page:
- producer/producer.py
- producer/ccloud_lib.py
- run.sh
- Dockerfile
- .env
- requirements.txt
Some modifications to the source code are still needed to match your own environment.
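The actual ingestion code is in producer/producer.py and producer/ccloud_lib.py; purely as an illustration of the pattern (not the repo's code), a minimal producer that reads the Confluent Cloud config and publishes one JSON record to tweet_data might look like this:

```python
# Minimal sketch of the producer pattern (not the full producer.py from the repo).
import json
from confluent_kafka import Producer

def read_ccloud_config(config_file):
    """Parse a librdkafka-style config file (key=value lines) into a dict."""
    conf = {}
    with open(config_file) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                key, value = line.split("=", 1)
                conf[key] = value.strip()
    return conf

if __name__ == "__main__":
    conf = read_ccloud_config("python.config")  # example path
    producer = Producer(conf)
    record = {"id": 1, "text": "I love my morning Starbucks latte!"}
    producer.produce("tweet_data", key=str(record["id"]), value=json.dumps(record))
    producer.flush()
```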
Procedure to run the Kafka Twitter data ingestion
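The exact commands live in run.sh in the repo; assuming the Dockerfile and the .env file above, a typical sequence might be (the image name here is just an example):

```bash
# Build the producer image and run it with the credentials from .env
docker build -t twitter-producer .
docker run --env-file .env twitter-producer
```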
Step 6. Spark Streaming in Databricks - Streaming Data Ingestion
Add the Confluent API credentials from Step 3, and use code like the sketch below to read streaming Kafka data in the notebook we created.
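The original notebook cell is on GitHub; as a sketch of the readStream call against Confluent Cloud (variable and column names are mine; fill in your own HOST, API_KEY, and API_SECRET):

```python
# Runs in a Databricks notebook, where `spark` is already defined.
from pyspark.sql import functions as F

KAFKA_BOOTSTRAP = "pkc-w12qj.ap-southeast-1.aws.confluent.cloud:9092"  # your HOST:9092
API_KEY = "<API_KEY>"
API_SECRET = "<API_SECRET>"

raw_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    # Databricks ships a shaded Kafka client, hence the kafkashaded prefix.
    .option(
        "kafka.sasl.jaas.config",
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
        + f'required username="{API_KEY}" password="{API_SECRET}";'
    )
    .option("subscribe", "tweet_data")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the value to a string for parsing.
tweets_df = raw_df.select(F.col("value").cast("string").alias("tweet_json"))
```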
In order to run the notebook, we need to attach it to the cluster we created before.
Step 7. Spark Streaming in Databricks - Streaming Data Transformation
Please check the whole procedure for streaming data transformation in notebooks/twitter_ingestion_transformation.ipynb on my GitHub page!
Step 8. Connect Databricks and Delta Lake
We can build a complete streaming data pipeline that consolidates the data by using Confluent Kafka as the input/source system for Spark Structured Streaming and Delta Lake as the storage layer.
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads, and it helps unify streaming and batch data processing. A Delta Lake table is both a batch table and a streaming source and sink. Since the data is stored as Parquet files, Delta Lake is storage agnostic: the storage could be an Amazon S3 bucket or an Azure Data Lake Storage container - Michelen Blog.
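As an illustration (the S3 paths below are placeholders), the streaming DataFrame from Step 6 can be written to a Delta table like this:

```python
# Sketch: persist the streaming tweets to a Delta table with a checkpoint location.
(
    tweets_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3a://<your-bucket>/checkpoints/tweet_data")
    .start("s3a://<your-bucket>/delta/tweet_data")
)
```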
Step 9. Spark NLP
Installing Spark NLP from PyPI is not enough to run Spark NLP in Databricks. We also need to install the Maven dependency com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4 (or something similar) under the Libraries tab of the cluster.
Alternatively, attach spark-nlp-1.3.0.jar to the cluster. This library can be downloaded from the spark-packages repository: https://spark-packages.org/package/JohnSnowLabs/spark-nlp.
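Once the Maven package is attached to the cluster, a quick sanity check in a notebook cell confirms the installation:

```python
# Confirm that Spark NLP is importable and report its version
import sparknlp
print(sparknlp.version())
```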
Create a Preprocessing Stages Pipeline
The pipeline consists of the preprocessing stages sketched below.
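The exact stages are in the notebook linked above; as a rough sketch, a Spark NLP preprocessing pipeline combining a document assembler, tokenizer, normalizer, stop-word cleaner, and lemmatizer might look like this (the stage choices and column names such as tweet_text are assumptions, not the notebook's exact code):

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer, StopWordsCleaner, LemmatizerModel

# Turn the raw tweet text column into Spark NLP's document annotation.
document_assembler = DocumentAssembler() \
    .setInputCol("tweet_text") \
    .setOutputCol("document")

# Split documents into tokens.
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Lowercase and strip punctuation/special characters.
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True)

# Remove common English stop words.
stop_words = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("clean_tokens")

# Reduce tokens to their lemmas using a pretrained model.
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["clean_tokens"]) \
    .setOutputCol("lemma")

# Convert annotations back to plain string arrays for downstream ML.
finisher = Finisher() \
    .setInputCols(["lemma"]) \
    .setOutputCols(["tokens"])

pipeline = Pipeline(stages=[
    document_assembler, tokenizer, normalizer, stop_words, lemmatizer, finisher
])
```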
[Option] Integrate the Databricks Notebook with GitHub
You can connect a Databricks notebook with GitHub to keep the revision history. The procedure is described here:
Jupyter Notebook
Reference
- https://github.com/scoyne2/kafka_spark_streams
- https://blogit.michelin.io/kafka-to-delta-lake-using-apache-spark-streaming-avro/
- https://medium.com/@lorenagongang/sentiment-analysis-on-streaming-twitter-data-using-kafka-spark-structured-streaming-python-part-b27aecca697a
- https://winf-hsos.github.io/databricks-notebooks/big-data-analytics/ss-2020/Word%20Clouds%20with%20Python.html