
Installing Apache Spark on Windows 10

A Quick and Dirty Self-Note on Installing Spark

Frank Ceballos

Feb 11, 2020


So I just got hold of some election data, and when I tried crunching some numbers, well, my computer wasn't too happy. I finally decided that I needed to learn Spark, because someone needs to look into this election data and make cool maps (obviously me).

I've been browsing the web trying to find the easiest way to install Spark on my Windows machine. It looks like most guides require tons of steps, and I'm not about to invest a significant amount of time following them only to fail. Here is the simplest way to do it, assuming you have Anaconda already installed.

Note: In case you're starting from scratch, I'd advise you to follow this article and install a machine learning environment with Anaconda.

Installing the Java Development Kit

Once you have Anaconda installed, we will proceed with installing the Java Development Kit (JDK). This step is necessary because Spark is written in Scala, and Scala runs on the Java Virtual Machine, which ships with the JDK. So head over to Google, search for jdk, and click on the first result.

This will take you to Java downloads. Scroll down until you see the section below and click on the Download button.

This will take you to the download page. Scroll down to the section shown below and accept the License Agreement and select the download option for your operating system.

Once you select the JDK for your operating system, you will need to sign in or create an Oracle account in order to download the file. I thought this was weird, but whatever, it takes like 30 seconds to make an account.

Launch the exe file you downloaded. In my case the file name is:

jdk-8u241-windows-x64.exe

This window will pop open. Just click Next.

Next, this window will show up:

Click on Change and change the path to:

C:\jdk

The reason we are changing where the JDK is installed is that some versions of Spark won't work if the path has a space in it. Notice how the default path contains "Program Files," which has a space between those two words. Once you make the change, your window should look like this:

Click Next and proceed with the installation. After a minute or two, you will be prompted to install the Java Runtime Environment (JRE). The following window will show up.

Click on Change. You will then need to browse for a folder where the Java Runtime Environment will be installed. Go to the C drive and make a new folder named 'jre'. See the figure below for an example.

Click OK. The following changes will be implemented.

Notice the path is now C:\jre. That's what we want. Click Next and finish the installation.
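
As a quick sanity check, you can open a new Command Prompt and run the line below; it should print the version of the JDK you just installed. If the command isn't recognized, you may need to add C:\jdk\bin to your Path.

java -version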

Finally, to make Spark work you will need to download winutils.exe from here. Now, go to your C drive and create a folder named winutils. Inside the folder winutils, create a subfolder named bin. Inside bin, paste the executable file winutils.exe. If you did everything correctly, you should have winutils.exe located at C:\winutils\bin, see the figure below.
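
If you prefer the command line, the same folder structure can be created in one shot from a Command Prompt (you still need to copy winutils.exe into it afterwards):

mkdir C:\winutils\bin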

In this last step, we will tell Hadoop where to find winutils.exe by creating an environment variable. In Windows 10, go to the search bar, type advanced system settings, and click on the first result.

The following window titled System Properties will pop up.

Click on the Advanced tab and then click on Environment Variables. The following window should show up.

Under User variables, click New. A New User Variable window will pop up. Now create the HADOOP_HOME variable and set its value to C:\winutils, the folder that contains bin, see the image below.

Click OK. We are now set with this part of the installation.
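
As an alternative to clicking through the dialogs, you can set the same variable from a Command Prompt with the line below. This assumes you placed winutils.exe under C:\winutils\bin as described above; note that setx only takes effect in newly opened windows.

setx HADOOP_HOME C:\winutils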

Installing Apache Spark

We are now ready to install Apache Spark. This will take about 2 minutes, so bear with me.

Open Anaconda Prompt and activate the environment where you want to install PySpark. I will be installing PySpark in an environment named PythonFinance.
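
In my case, activating the environment looks like this (swap in your own environment name):

conda activate PythonFinance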

Next, we will install the PySpark package from PyPI using pip. (PyPI is the Python Package Index, the repository pip installs from, so there is nothing separate to install for it.)

To install PySpark run:

pip install pyspark

If you don't see any nasty errors, Spark should be installed ;)
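
If you want to double-check from the same prompt, this one-liner should print the installed PySpark version:

python -c "import pyspark; print(pyspark.__version__)"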

Running Spark

Let's make sure everything is working. So type pyspark in the Anaconda Prompt and hit Enter.

If everything was installed correctly, you should see the Spark welcome banner followed by a Python >>> prompt, like the window above shows.
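
While you're in that shell, you can run a tiny job as a sanity check. In Spark 2.x and later the pyspark shell pre-creates a SparkSession named spark, so the following should print 1000:

>>> spark.range(1000).count()
1000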

You could also launch Spyder and run a quick test script.
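
Here is a minimal sketch of one, assuming nothing beyond the packages installed above (the app name is arbitrary):

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session; the app name is just a label
spark = SparkSession.builder.appName("SmokeTest").getOrCreate()

# Build a tiny DataFrame and show it to prove the install works end to end
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

# Shut the session down cleanly
spark.stop()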

If you get no import errors, you should be good to go ;)

Find me on LinkedIn. Until next time! Take care and code every day!
