How can we implement Big Data technologies for data visualization? | Spark engine, Elasticsearch and Kibana.

I took the course CSCI 765, Introduction to Database Systems, offered by the Computer Science Department. During this course I had to complete a small project in under 30 hours. As a precision agriculture researcher, I decided to do a project related to precision agriculture, so I used sensor data for visualization. For the project I tried to use open source software for data processing, storage, and visualization.

Abstract

Climate change is one of the greatest challenges in agriculture, and data can play a vital role in lowering its impact. Data from the different sensors deployed on a farm can be used to visualize weather, rainfall, soil temperature, and similar patterns over a given date range. As time passes, these data grow rapidly, and it becomes challenging to visualize them and extract meaningful information for solving problems in agriculture. Big data technologies can play a key role in handling, processing, and visualizing such rapidly growing data. This project is an attempt to apply big data technologies to agricultural data visualization on a low-resource PC, in a way that can later be moved to the cloud as the data volume increases. We use Spark, Elasticsearch, and Kibana on a low-resource system to visualize the data.

1 Introduction

Agriculture has been an important part of society since the beginning of human civilization, and it must feed the world's population. Climate change, however, has become a great threat to this field. Big data [1] can play an important role in lowering the impact of climate change and increasing crop yields. Spark [2], Elasticsearch, and Kibana [3] are technologies that can help deal with the rapidly growing volume of data in agriculture. In our project we used the Spark engine for data processing and the sbt [4] tool to compile, build, and package the project.

Big Data in Agriculture. Precision agriculture is an emerging field that uses technology to solve problems and automate tasks in agriculture. Big data is a promising way to deal with the growing volume of data, and the term often appears together with machine learning and deep learning.

Apache Spark. Apache Spark [2] is a unified analytics engine for large-scale data processing. It is an open source project for big data and machine learning. It extends the MapReduce model popularized by Hadoop so that it can be used efficiently for more types of computation, including interactive queries and stream processing. The main feature of Spark is in-memory cluster computing.

Figure 1. Apache Spark Engine and its use by different databases

HDFS. The Hadoop Distributed File System (HDFS) is a distributed file system that runs conveniently on commodity hardware and can hold unstructured data for processing. It is fault tolerant because data are replicated across several machines. Spark can run on top of HDFS, as shown in Figure 2. In our project, Spark reads its input from HDFS, where the data are stored in comma-delimited CSV format.

Figure 2. Spark on top of Hadoop

sbt tool. sbt is an open source build tool for Scala and Java projects, similar to Maven for Java. It manages the Spark and Elasticsearch library dependencies of the project.

Elasticsearch. Elasticsearch [3] is a distributed, open source, near real-time full-text search and analytics engine. It can scale to petabytes of structured and unstructured data and is categorized as a NoSQL database. Its distributed design makes it easy to scale and to integrate into a large organization.

Kibana. Kibana is an open source data visualization dashboard for Elasticsearch. It provides visualization capabilities on top of the content indexed in an Elasticsearch cluster. Users can create bar, line, and scatter plots, or pie charts and maps, on top of large volumes of data. These visualizations make it easy to see trends and changes in the input data.

Outline. The remainder of this article is organized as follows. Section 2 reviews the literature and gives an account of previous work. Section 3 describes the implementation of the project. Section 4 discusses the output and result analysis, and Section 5 presents the conclusions.

2 Literature Review

To date, work and research on agriculture using big data technology has been limited. Precision agriculture, which focuses on using technology to solve problems in agriculture, is still evolving, and research in this area opens the door to experiments [1]. Elasticsearch [3] has been used for data analysis on social media platforms [5] and as a proof of concept in research [6]. Similar data processing has been done on health care data [7]. Big data processing and analytics on big data platforms [7] [8] [9] have thus been performed in a wide variety of fields, such as health and climate. Spark [2] can help deal with the unprecedented growth of agricultural data, as it has with climate data [10] [8]. Data visualization on GitHub repository parameters [11] has also been performed previously. Similarly, we perform data visualization on agricultural data.

3 Methodology

This section describes the overall design, architecture, and operation of the project, along with the tools and techniques used. Because the project also demonstrates how to develop a big data project with limited resources, we run it on a low-resource PC. The flowchart of the project is shown in Figure 3.

Figure 3. Flowchart showing overall project flow.

In this project the Spark engine is used for data processing and reads its data from the Hadoop Distributed File System. HDFS contains data from the different sensors deployed on an agricultural farm. We use data from a wind speed sensor, a wind direction sensor, a CO2 sensor, a soil moisture and temperature sensor, an atmospheric pressure sensor, a light intensity sensor, a rain gauge, and an atmospheric temperature and humidity sensor. The input is a comma-delimited CSV file, shown in Figure 4.

Figure 4. Input file in HDFS
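
As an illustration of what reading such a file involves, here is a minimal sketch in plain Python. It is not the project's Spark code. The column names and sample rows are hypothetical (the real ones are those in Figure 4), and skipping incomplete rows is an assumed cleaning rule.

```python
import csv
import io
from datetime import datetime

# Hypothetical header and rows; the real columns are the ones shown in Figure 4.
SAMPLE_CSV = """timestamp,soil_moisture,soil_temp,humidity,light_intensity
2019-10-10 08:00:00,31.2,12.5,64.1,388
2019-10-10 09:00:00,30.9,13.1,61.7,512
2019-10-10 10:00:00,,13.8,59.9,640
"""

NUMERIC_FIELDS = ["soil_moisture", "soil_temp", "humidity", "light_intensity"]


def parse_readings(text):
    """Turn CSV text into typed records, skipping rows with missing or bad values."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            record = {
                "timestamp": datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
            }
            for name in NUMERIC_FIELDS:
                record[name] = float(row[name])
        except (KeyError, ValueError, TypeError):
            continue  # incomplete or malformed sensor reading (assumed rule: drop it)
        records.append(record)
    return records


readings = parse_readings(SAMPLE_CSV)
```

The third sample row has an empty soil moisture value, so it is dropped and two typed records remain.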

We then build the project using the sbt tool. We use the spark-core, spark-sql, and elasticsearch-spark libraries, which are defined in the build.sbt file as shown in Figure 5.

Figure 5. Spark and Elasticsearch libraries used to build the sbt project

To submit a job to the Spark engine, we run the packaged project from the command line using spark-submit with some parameters. The Elasticsearch database must also be up and running, with cluster health green. We use Elasticsearch version 7.4 for this project. Before submitting a job, we need to create the Elasticsearch index and apply a mapping to it (Figure 6), which defines the data type of each field. Although Elasticsearch is a NoSQL database, we define the fields and their data types in the index before writing any data. Finally, the data are written to Elasticsearch. Both Spark and Elasticsearch can be deployed in a distributed environment and run on a multi-node cluster.

Figure 6. Elasticsearch mapping Bash script
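
To make the idea of a mapping concrete, the sketch below builds a mapping body in plain Python. It is not the Bash script from Figure 6. The field names, their types, and the date format are assumptions chosen for illustration. The overall shape (a "mappings" object holding "properties") follows the typeless mapping format used by Elasticsearch 7.x.

```python
import json

# Hypothetical fields and types; the project's real mapping is the one in Figure 6.
FIELD_TYPES = {
    "timestamp": "date",
    "soil_moisture": "float",
    "soil_temp": "float",
    "humidity": "float",
    "light_intensity": "float",
}


def build_mapping(field_types, date_format="yyyy-MM-dd HH:mm:ss"):
    """Build an index-creation body that declares a data type for each field."""
    properties = {}
    for name, es_type in field_types.items():
        spec = {"type": es_type}
        if es_type == "date":
            # Tell Elasticsearch how the timestamp strings are written (assumed format).
            spec["format"] = date_format
        properties[name] = spec
    return {"mappings": {"properties": properties}}


mapping_body = json.dumps(build_mapping(FIELD_TYPES), indent=2)
```

A body like this would be sent in the request that creates the index, before any documents are written, so that every field is stored with the intended type.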

We then run Kibana alongside Elasticsearch, which gives Kibana access to the data in Elasticsearch. Finally, the Kibana dashboard is used for data visualization. We discuss the results of the visualization in Section 4.

4 Output and Result Analysis

In this section we discuss the results from the Kibana dashboard visualizations. Kibana has filter and aggregation features that are used to create different graphs and trends from the data in an Elasticsearch index. Figure 7 shows the soil moisture median graph obtained from the Kibana dashboard for the date interval Oct 10, 2019 to Nov 26, 2019.

Figure 7. Soil moisture median graph for certain date interval
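
A graph like Figure 7 comes from grouping readings into date buckets and taking the median in each bucket. Kibana delegates this work to Elasticsearch aggregations. The plain Python sketch below shows only the underlying idea, with one bucket per day. The sample values are made up, and the daily bucket size is an assumption.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Made-up (timestamp, soil moisture) pairs, for illustration only.
SAMPLE = [
    ("2019-10-10 08:00:00", 31.0),
    ("2019-10-10 12:00:00", 29.0),
    ("2019-10-10 18:00:00", 30.0),
    ("2019-10-11 08:00:00", 28.0),
    ("2019-10-11 18:00:00", 27.0),
]


def daily_median(samples):
    """Group readings by calendar day and return the median value for each day."""
    buckets = defaultdict(list)
    for stamp, value in samples:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date()
        buckets[day].append(value)
    # Sort by day so the result reads like a time series.
    return {day.isoformat(): median(values) for day, values in sorted(buckets.items())}


medians = daily_median(SAMPLE)
```

The exact median computed here is only practical for small data. On large indices, Elasticsearch estimates percentiles approximately.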

Similarly, we can obtain whatever graphs and visualizations we need for the required fields. We can also feed data into the Machine Learning tab in Kibana and get results such as those shown in Figure 8, which shows the distribution of humidity values and the top values of light intensity. Altogether we have 12932 documents in Elasticsearch. Each document is unique, and the data contain 9699 distinct values for humidity and 388 distinct values for light intensity.

Figure 8. Humidity distribution of values and light intensity top values
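
The document count, distinct-value count, and top values reported for Figure 8 are simple summaries of a field. The plain Python sketch below shows how such numbers are derived. The sample readings are made up and far smaller than the real index.

```python
from collections import Counter

# Made-up light intensity readings, for illustration only.
LIGHT = [388, 512, 388, 640, 512, 388]


def summarize(values, top_n=2):
    """Return the document count, distinct-value count, and most frequent values."""
    counts = Counter(values)
    return {
        "documents": len(values),
        "distinct": len(counts),
        "top": counts.most_common(top_n),  # (value, occurrences), most frequent first
    }


summary = summarize(LIGHT)
```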

In Kibana we can shape our data using a variety of charts and tables. We discuss only a few example visualizations of our data here. Figure 9 shows the percentile rank aggregation of the data over a certain date interval, and Figure 10 shows the document count over a certain date interval along with the top values of ambient humidity.

Figure 9. Percentile Rank Graph

Figure 10. Document Count Graph and ambient Humidity value
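
A percentile rank, as plotted in Figure 9, answers the question "what percentage of readings fall at or below a given value?" The plain Python sketch below computes it exactly on a small made-up list. The "at or below" convention is an assumption, and Elasticsearch may approximate the result on large data.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to the threshold."""
    if not values:
        raise ValueError("values must not be empty")
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


# Made-up humidity readings, for illustration only.
HUMIDITY = [55.0, 60.0, 62.5, 70.0, 71.0, 80.0, 85.0, 90.0]

# Four of the eight readings are at or below 70.0.
rank_70 = percentile_rank(HUMIDITY, 70.0)
```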

5 Conclusions

This experiment made a couple of things clear. First, we can develop a big data application on a local, low-resource platform and then scale it to a distributed environment as the data volume grows. Such a project can be useful for visualizing decades of data from multiple agricultural farms. Second, we can use these technologies to help deal with the problem of climate change in agriculture: we can see data trends over a long period of time and use them for future prediction. When the term big data comes up, people fear that the application must be developed in the cloud and that a lot of money must be spent on testing it. Instead, we can develop on low computing resources and scale the project to the cloud or a high-performance distributed environment as needed.

This project uses structured, well-formatted data as input. In the future we can extend it to take unstructured sensor data as input and perform the same visualization operations. Furthermore, we can scale the project up to the cloud to make fuller use of big data technology.

References

[1] K. Coble, A. Mishra, S. Ferrell, and T. Griffin, “Big Data in Agriculture: A Challenge for the Future,” Appl. Econ. Perspect. Policy, vol. 40, pp. 79–96, Mar. 2018.

[2] “Apache Spark™ – Unified Analytics Engine for Big Data.” [Online]. Available: https://spark.apache.org/. [Accessed: 20-Nov-2019].

[3] “Open Source Search: The Creators of Elasticsearch, ELK Stack & Kibana | Elastic.” [Online]. Available: https://www.elastic.co/. [Accessed: 20-Nov-2019].

[4] “sbt – The interactive build tool.” [Online]. Available: https://www.scala-sbt.org/. [Accessed: 20-Nov-2019].

[5] N. Shah, D. Willick, and V. Mago, “A framework for social media data analytics using Elasticsearch and Kibana,” Wirel. Networks, Dec. 2018.

[6] R. Taylor, M. H. Ali, and I. Varley, “Automating the processing of data in research. A proof of concept using elasticsearch,” Int. J. Surg., vol. 55, p. S41, Jul. 2018.

[7] D. Chen et al., “Real-Time or Near Real-Time Persisting Daily Healthcare Data into HDFS and ElasticSearch Index inside a Big Data Platform,” IEEE Trans. Ind. Informatics, vol. PP, p. 1, Dec. 2016.

[8] F. Hu et al., “ClimateSpark: An in-memory distributed computing framework for big climate data analytics,” Comput. Geosci., vol. 115, pp. 154–166, Jun. 2018.

[9] K. Soomro, M. N. M. Bhutta, Z. Khan, and M. A. Tahir, “Smart city big data analytics: An advanced review,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. Wiley-Blackwell, 2019.

[10] R. Palamuttam et al., “SciSpark: Applying in-memory distributed computing to weather event detection and tracking,” in Proceedings – 2015 IEEE International Conference on Big Data, IEEE Big Data 2015, 2015, pp. 2020–2026.

[11] M. K. J, S. Dubey, B. B., D. Rao, and D. Rao, “Data Visualization on GitHub Repository Parameters Using Elastic Search and Kibana,” in 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), 2018, pp. 554–558.

 

About sgc908

Graduate Research Assistant at North Dakota State University, Precision Agriculture, Machine Learning, Deep Learning and Big Data.
