Tanveer Khan Data Scientist @ NextGen Invent | Research Scholar @ Jamia Millia Islamia

Superstore sales data analysis using Big Data Hadoop framework


Big data analytics refers to the practice of analyzing large volumes of data, or big data. This data is gathered from a wide variety of sources, including social networks, videos, digital images, log files, sensor data, and sales transaction records. The aim is to analyze all this data to discover patterns, findings, and trends that help the concerned stakeholders make informed decisions.


This section discusses the general and specific objectives.

General objective:

The main objective is to use the Big Data Hadoop framework to analyze superstore sales data and to design a graphical user interface (GUI) for the same.

Specific objectives:


Superstore Sales - This dataset contains transaction records of US customers from 2014 to 2018 for an e-commerce platform that allows people to buy products ranging from books, toys, clothes, and shoes to food, furniture, and other household items.


Our motivation lies in exploring the superstore dataset to find answers to these questions.


Hardware requirements:

Software requirements:


The proposed methodology for the design and development of the graphical user interface (GUI), and for writing the MapReduce jobs and Hive queries used for analysis, is discussed here.
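To make the MapReduce flow concrete, the following is a minimal local sketch of the map, shuffle/sort, and reduce stages using awk and sort. It is not one of the project's jobs: the sample rows are made up, the three-column tab-separated layout (order id, category, sales) is an assumption, and all function names are hypothetical.

```shell
# Made-up sample rows in an ASSUMED layout: order_id<TAB>category<TAB>sales
sample_rows() {
    printf 'A-1\tFurniture\t261.96\n'
    printf 'A-2\tOffice Supplies\t14.62\n'
    printf 'A-3\tFurniture\t731.94\n'
    printf 'A-4\tTechnology\t907.15\n'
    printf 'A-5\tOffice Supplies\t22.37\n'
}

# Map: emit one category<TAB>sales pair per input row
map_sales() {
    awk -F'\t' '{ print $2 "\t" $3 }'
}

# Shuffle/sort: bring identical keys next to each other
shuffle_sort() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1
}

# Reduce: sum the values of each run of identical keys
reduce_sales() {
    LC_ALL=C awk -F'\t' '
        NR > 1 && $1 != key { printf "%s\t%.2f\n", key, total; total = 0 }
        { key = $1; total += $2 }
        END { if (NR > 0) printf "%s\t%.2f\n", key, total }
    '
}

sample_rows | map_sales | shuffle_sort | reduce_sales
# Furniture         993.90
# Office Supplies   36.99
# Technology        907.15
```

On a cluster, Hadoop performs the sort between the two stages itself and runs many mappers and reducers in parallel; the pipeline above only mirrors the data flow on a single machine.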

Building a graphical user interface (GUI)

NetBeans was used to design the GUI. The figures below show the flow chart and data flow diagram of the system.

Flow chart of the proposed system.

Data flow diagram of the proposed system.

The figures below show the developed GUI for the proposed system. To analyze the superstore dataset, the GUI has screens for authorization, execution of MapReduce jobs, generating reports, and saving the reports to the local system.

Login screen design for authorization purpose.


Screen design for generating MapReduce jobs.

Screen design for generating reports.


The source code for the design and development of the proposed system can be found here.

Single-Node Hadoop Installation on Ubuntu Linux 16.04:

Installing Java

    :~$ cd ~

    # Update the source list
    :~$ sudo apt-get update

    # The OpenJDK project is the default version of Java
    # that is provided from a supported Ubuntu repository.
    :~$ sudo apt-get install default-jdk

    :~$ java -version
    java version "1.7.0_65"
    OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
    OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Adding a dedicated Hadoop user

    :~$ sudo addgroup hadoop
    Adding group `hadoop' (GID 1002) ...
    Done.

    :~$ sudo adduser --ingroup hadoop hduser
    Adding user `hduser' ...
    Adding new user `hduser' (1001) with group `hadoop' ...
    Creating home directory `/home/hduser' ...
    Copying files from `/etc/skel' ...
    Enter new UNIX password:
    Retype new UNIX password:
    passwd: password updated successfully
    Changing the user information for hduser
    Enter the new value, or press ENTER for the default
        Full Name []:
        Room Number []:
        Work Phone []:
        Home Phone []:
        Other []:
    Is the information correct? [Y/n] Y

Install Hadoop

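    :~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
    :~$ tar xvzf hadoop-2.6.0.tar.gz
    :/home/hduser$ sudo su hduser
    # Create the target directory first; mv with several sources fails if it is missing
    :~$ sudo mkdir -p /usr/local/hadoop
    :~$ cd hadoop-2.6.0
    :~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop
    :~/hadoop-2.6.0$ sudo chown -R hduser:hadoop /usr/local/hadoop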

The following files will have to be modified to complete the Hadoop setup:

    ~/.bashrc
    /usr/local/hadoop/etc/hadoop/hadoop-env.sh
    /usr/local/hadoop/etc/hadoop/core-site.xml
    /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
    /usr/local/hadoop/etc/hadoop/hdfs-site.xml

1. ~/.bashrc:
Before editing the .bashrc file in our home directory, we need to find the path where Java is installed, using the following command, so that we can set the JAVA_HOME environment variable.

    hduser@laptop:~$ update-alternatives --config java
    There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
    Nothing to configure.

    # Now we can append the following to the end of ~/.bashrc:
    hduser@laptop:~$ vi ~/.bashrc

    #HADOOP VARIABLES START
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
    export HADOOP_INSTALL=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_INSTALL/bin
    export PATH=$PATH:$HADOOP_INSTALL/sbin
    export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_HOME=$HADOOP_INSTALL
    export HADOOP_HDFS_HOME=$HADOOP_INSTALL
    export YARN_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
    #HADOOP VARIABLES END

    hduser@laptop:~$ source ~/.bashrc
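Note that update-alternatives reports the path of the java binary, while JAVA_HOME is set to the JDK root above it. As a minimal sketch of how one is derived from the other, the hypothetical helper below (its name is not part of Hadoop or Java) strips the trailing bin/java and an optional jre directory:

```shell
# Hypothetical helper: turn the path of a java binary into a JAVA_HOME value
java_home_from_binary() {
    local path="$1"
    path="${path%/bin/java}"   # drop the binary itself
    path="${path%/jre}"        # drop a trailing jre directory, if present
    printf '%s\n' "$path"
}

java_home_from_binary /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
# prints /usr/lib/jvm/java-7-openjdk-amd64
```

On a live system the argument could come from `readlink -f "$(command -v java)"`, which resolves the /usr/bin/java symlink chain to the real binary.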

2. Editing /usr/local/hadoop/etc/hadoop/hadoop-env.sh:
We need to set JAVA_HOME by modifying the hadoop-env.sh file.

    hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
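The same change can be made without opening an editor. The sketch below assumes GNU sed (as shipped with Ubuntu) and uses a hypothetical function name; it rewrites an existing `export JAVA_HOME=` line, or appends one if the file has none.

```shell
# Hypothetical helper: set JAVA_HOME in a hadoop-env.sh style file
set_java_home() {
    local env_file="$1" java_home="$2"
    if grep -q '^export JAVA_HOME=' "$env_file"; then
        # Rewrite the existing line in place (GNU sed -i)
        sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=${java_home}|" "$env_file"
    else
        # No such line yet: append one
        printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"
    fi
}
```

For the installation described here, the call would be `set_java_home /usr/local/hadoop/etc/hadoop/hadoop-env.sh /usr/lib/jvm/java-7-openjdk-amd64`, run as a user who can write to that file.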

3. Editing /usr/local/hadoop/etc/hadoop/core-site.xml:
The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that Hadoop uses when starting up. This file can be used to override the default settings that Hadoop starts with. Open the file and enter the following between the <configuration> and </configuration> tags:

    hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/app/hadoop/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>

4. Editing the mapred-site.xml.template file:
By default, the /usr/local/hadoop/etc/hadoop/ folder contains /usr/local/hadoop/etc/hadoop/mapred-site.xml.template, which has to be copied to mapred-site.xml in the same folder because Hadoop reads only the latter. The mapred-site.xml file is used to specify which framework is being used for MapReduce. We need to enter the following content between the <configuration> and </configuration> tags:

    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
    </property>
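Hadoop 2.x reads mapred-site.xml rather than the shipped .template file, so the template is usually copied before it is edited. A minimal sketch of that step follows; the function name is hypothetical, and the default directory is the one used in this installation.

```shell
# Hypothetical helper: create mapred-site.xml from the shipped template
enable_mapred_site() {
    local conf_dir="${1:-/usr/local/hadoop/etc/hadoop}"
    # Leave an existing mapred-site.xml untouched
    if [ -e "$conf_dir/mapred-site.xml" ]; then
        return 0
    fi
    cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml"
}
```

Calling `enable_mapred_site` with no argument acts on /usr/local/hadoop/etc/hadoop; the properties shown above then go into the new mapred-site.xml.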

5. Editing /usr/local/hadoop/etc/hadoop/hdfs-site.xml:
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster. It specifies the directories that will be used by the namenode and the datanode on that host. Before editing this file, we need to create two directories that will hold the namenode and datanode data for this Hadoop installation. Open the file and enter the following content between the <configuration> and </configuration> tags:

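    hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml

    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
    </property>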
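The two directories named in hdfs-site.xml have to exist and be owned by the Hadoop user before HDFS is formatted. The following is a minimal sketch of that preparation step; the function name is hypothetical, and the owner argument is optional so the function can be tried on a scratch directory first.

```shell
# Hypothetical helper: create the namenode and datanode directories
create_hdfs_dirs() {
    local base="$1" owner="${2:-}"
    mkdir -p "$base/hdfs/namenode" "$base/hdfs/datanode" || return 1
    # Hand the tree over to the Hadoop user when an owner is given
    if [ -n "$owner" ]; then
        chown -R "$owner" "$base" || return 1
    fi
}
```

For this installation the call would be `create_hdfs_dirs /usr/local/hadoop_store hduser:hadoop`, run with root privileges because /usr/local is not writable by ordinary users.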

Format the new Hadoop filesystem

The Hadoop file system now needs to be formatted before we can start using it. The format command must be issued by a user with write permission, since it creates the current directory under the /usr/local/hadoop_store/hdfs/namenode folder. After this, we are good to go!
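The formatting step itself can be sketched as follows. `hdfs namenode -format` is the Hadoop 2.x command for it; the wrapper function and its two pre-flight checks are hypothetical additions that surface the write-permission requirement before Hadoop is invoked.

```shell
# Hypothetical wrapper around the real `hdfs namenode -format` command
format_namenode() {
    local name_dir="${1:-/usr/local/hadoop_store/hdfs/namenode}"
    # The format step writes into the namenode directory
    if [ ! -w "$name_dir" ]; then
        echo "error: $name_dir is missing or not writable by the current user" >&2
        return 1
    fi
    # hdfs is on PATH only after ~/.bashrc has been sourced
    if ! command -v hdfs >/dev/null 2>&1; then
        echo "error: hdfs not found on PATH; run 'source ~/.bashrc' first" >&2
        return 2
    fi
    hdfs namenode -format
}
```

Formatting erases any existing HDFS metadata in that directory, so it is normally run once, as hduser, right after installation.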


The proposed system provides a graphical user interface (GUI) for executing MapReduce jobs easily and efficiently when analyzing a large dataset on a single-node Big Data Hadoop cluster. Additionally, the proposed system can act as a baseline for other datasets, since it can be modified by writing MapReduce jobs for whichever reports are needed.