Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. Many beginners find it difficult to install Hadoop on an Ubuntu system, so this post walks through the setup step by step.
Software Required:
- Java Development Kit (JDK 1.8); see the steps to install the JDK on Linux.
- ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons. To install ssh on Ubuntu Linux: $ sudo apt-get install ssh
- Hadoop distribution: download a recent stable release from one of the Apache Download Mirrors.
After downloading the Hadoop archive, unpack it and edit etc/hadoop/hadoop-env.sh to define the root of your Java installation, for example export JAVA_HOME=/home/java_folder_name . You can then set the Hadoop environment variables by appending the following lines to your ~/.bashrc file:
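A minimal sketch of those entries, assuming Hadoop was unpacked to /home/<username>/hadoop (adjust the paths to match where you actually unpacked Hadoop and installed Java):

export HADOOP_HOME=/home/<username>/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin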
After editing the file, save it and run source ~/.bashrc from your home directory so the changes take effect.
Configuration File Set Up:
We will run Hadoop in pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process on a single machine. Edit the configuration files as shown below:
etc/hadoop/core-site.xml:
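A minimal sketch of core-site.xml for a single-node setup, following the Apache pseudo-distributed example (port 9000 is the usual HDFS default):

<configuration>
    <property>
        <!-- URI of the default file system (the local NameNode) -->
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>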
etc/hadoop/hdfs-site.xml:
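A minimal sketch of hdfs-site.xml, setting the replication factor to 1 since there is only one DataNode:

<configuration>
    <property>
        <!-- With a single DataNode, keep only one copy of each block -->
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>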
etc/hadoop/mapred-site.xml:
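A minimal sketch of mapred-site.xml, telling MapReduce to run on YARN (on Hadoop 3 you may also need to set mapreduce.application.classpath as described in the Apache single-node documentation):

<configuration>
    <property>
        <!-- Run MapReduce jobs on YARN instead of the local runner -->
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>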
etc/hadoop/yarn-site.xml:
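A minimal sketch of yarn-site.xml, enabling the shuffle service that MapReduce needs:

<configuration>
    <property>
        <!-- Auxiliary service used by MapReduce for the shuffle phase -->
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>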
Set up passphrase-less ssh:
- ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- chmod 0600 ~/.ssh/authorized_keys
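To check that passphrase-less login works, try connecting to your own machine; it should log you in without asking for a password:

$ ssh localhost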
Execution Steps:
The following instructions are to run a MapReduce job locally.
Format the filesystem:
$ bin/hdfs namenode -format
Start the NameNode and DataNode daemons (inside the hadoop/sbin directory):
$ start-dfs.sh
To start all the daemons (HDFS and YARN) at once, use:
$ start-all.sh (inside the hadoop/sbin directory)
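Note that start-all.sh is deprecated in recent Hadoop releases; the equivalent is to start HDFS and YARN separately:

$ start-dfs.sh
$ start-yarn.sh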
The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
Basic Hadoop Commands:
Make the HDFS directories required to execute MapReduce jobs:
hadoop fs -mkdir /user
hadoop fs -mkdir -p /user/<username>/<project_name>
Copy the input files into the distributed file system:
hadoop fs -put <input file> /user/<username>/<project_name>/
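As a quick test you can run one of the example jobs bundled with Hadoop, for instance wordcount (the jar version below is a placeholder; substitute the one matching your release, and note that the output directory must not already exist):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-<version>.jar wordcount /user/<username>/<project_name>/<input file> /user/<username>/<project_name>/output
hadoop fs -cat /user/<username>/<project_name>/output/*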
You can check which daemons are running on your machine with the command:
$ jps
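In a working pseudo-distributed setup, the jps listing would typically include entries like the following (the process IDs will differ on your machine):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps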
When you’re done, stop the daemons with:
$ stop-dfs.sh (inside the hadoop/sbin directory)
Or to stop all the daemons at once, use:
$ stop-all.sh (inside the hadoop/sbin directory)
Web interfaces:
Hadoop 2 HDFS (NameNode): http://localhost:50070
Hadoop 3 HDFS (NameNode): http://localhost:9870
YARN ResourceManager (applications): http://localhost:8088
MapReduce Job History Server: http://localhost:19888
If you get stuck or hit an error, feel free to leave a comment.