Google Colab
In this post, we will implement a basic dataset analysis in a Google Colab notebook using Python and PySpark. Google Colab is invaluable for data scientists when it comes to managing massive datasets and running complex models, while setting up a local PySpark/Python environment is often a barrier to learning and practice for data engineers. So what happens if we put these two together and let each do what it does best?
Implementing Python and PySpark Code in a Google Colab Notebook
Together, they come close to being the perfect setup for many machine learning and data science problems!
We’ll examine how to run Python and PySpark code in a Google Colaboratory notebook in this post. Additionally, we will work through some basic data exploration exercises that are common to the majority of data science problems. Let’s get to work, then!
Apache Spark, a powerful tool for real-time analysis and for building machine learning models, began in 2010 as a class project at UC Berkeley [1]. Spark is a distributed data processing platform that is well suited to handling large amounts of data thanks to its computing power and scalability. Unlike Python or Java, it is not a programming language.
PySpark is an Apache Spark interface that lets users build Spark applications with Python APIs and interactively analyze massive volumes of data in distributed systems. PySpark provides access to the majority of Spark features, such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning Library), and Spark Core.
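As a quick, self-contained illustration of that Python API, here is a minimal sketch (it assumes PySpark is already installed, as covered in the next section; the example data is made up purely for demonstration):

from pyspark.sql import SparkSession

# Start a local Spark session and run a small Spark SQL query through PySpark.
spark = SparkSession.builder.appName("QuickExample").getOrCreate()

people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()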
Installing Python and PySpark Libraries in Google Colab Notebook
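A fresh Colab runtime does not always come with PySpark ready to use, so the safest first step is to install it with pip. This is a minimal sketch; openpyxl is added only because pandas needs an Excel engine to read .xlsx files, and recent Colab images may already include both packages:

# Install PySpark, plus openpyxl so pandas can read .xlsx files,
# in case they are not already available in the runtime.
!pip install pyspark openpyxl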
# necessary libraries
import pandas as pd
from pyspark.sql.functions import row_number, lit, desc, monotonically_increasing_id
from pyspark.sql.window import Window
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType, FloatType
Connecting to Google Drive
Mounting your Google Drive should be one of your first steps when using Colab. Doing this lets you access any directory on your Drive from within the Colab notebook.
Google Drive mounting is important in Google Colab because it allows you to access files and data stored on your Google Drive directly from your Colab notebooks. Google Colab is a free, hosted Jupyter notebook environment that provides a platform for machine learning and data analysis.
When you mount your Google Drive in Colab, it appears as part of the notebook’s filesystem, and you can work with it from Python and PySpark code. This means you can access files, read and write data, and perform other file-related tasks in your Google Colab notebook just as you would on a local computer.
Another benefit of mounting your Google Drive in Colab is that it lets you store and share large datasets with your team without worrying about local storage constraints. Working on projects across computers is also simpler because you can access your data from any location with an internet connection.
To work efficiently with data in Google Colab, linking your Google Drive is a crucial first step. This integration enables seamless data management and sharing, significantly boosting productivity for data analysis and machine learning projects.
From the google.colab module, import drive and mount your Google Drive:
# mount Google Drive first
from google.colab import drive
drive.mount('/content/drive')
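Once the mount succeeds, your Drive contents appear under /content/drive. As a quick check, here is a small sketch; the folder and file names below are hypothetical and only for illustration:

import os

# 'MyDrive' is the root of your personal Drive once mounting succeeds.
print(os.listdir('/content/drive/MyDrive'))

# A file on Drive can then be read directly; this path is hypothetical.
# df = pd.read_excel('/content/drive/MyDrive/datasets/IceCream.xlsx')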
I’m assuming that you have already uploaded your dataset Excel file using the upload button located in the upper left corner of your Google Colab notebook.
Creating a Spark Session and Performing Analysis
Since we’re assuming you already understand the fundamentals of Spark and datasets, I’ll be sharing code here for a more thorough understanding.
# create the Spark session
spark = SparkSession.builder.appName("Basics").getOrCreate()

# upload the file first; you can copy the path of the uploaded file by
# right-clicking it in the Files panel, choosing "Copy path", and pasting it here
df = pd.read_excel('/content/IceCream.xlsx')

# quick pandas exploration
df.head()
df.describe()
df.info()
df.columns

# create a schema for your DataFrame
schema = StructType([
    StructField("SalesDate", DateType(), True),
    StructField("SalesQty", IntegerType(), True),
    StructField("SalesAmount", FloatType(), True),
    StructField("ProductCategory", StringType(), True),
    StructField("ProductSubCategory", StringType(), True),
    StructField("ProductName", StringType(), True),
    StructField("StoreName", StringType(), True),
    StructField("StoreRegion", StringType(), True),
    StructField("StoreProvince", StringType(), True),
    StructField("StoreZone", StringType(), True),
    StructField("StoreArea", StringType(), True),
    StructField("PaymentTerms", StringType(), True),
    StructField("SalesMan", StringType(), True),
    StructField("Route", StringType(), True),
    StructField("Category", StringType(), True)
])

# convert the pandas DataFrame into a Spark DataFrame using the schema
df2 = spark.createDataFrame(df, schema=schema)
df2.show()
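The code above only builds and displays the Spark DataFrame. As a next step, here is a minimal sketch of the kind of analysis the window-function imports earlier make possible, for example ranking products by total sales within each store region. The aggregation and column choices here are illustrative, not part of the original dataset walkthrough:

from pyspark.sql import functions as F

# Total sales per product within each region, then keep the top seller per region.
region_window = Window.partitionBy("StoreRegion").orderBy(desc("TotalSales"))

top_products = (
    df2.groupBy("StoreRegion", "ProductName")
       .agg(F.sum("SalesAmount").alias("TotalSales"))
       .withColumn("rank", row_number().over(region_window))
       .filter("rank = 1")
)
top_products.show()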
You may also read our other blog: Social Media and Its Impact