Ensemble based J48 and random forest based C6H6 air pollution detection

Air pollution has become a critical challenge in today’s world. Efficient monitoring of air-polluting gases can help to reduce pollution. Air pollution causes many diseases, including cancer. Benzene (C6H6) is a particularly challenging pollutant to monitor because its sensors are costly to deploy, and it is not feasible to install many sensors in urban areas. Therefore, in this paper, C6H6 is monitored efficiently using an ensemble approach. It is feasible to estimate C6H6 using machine learning because relationships exist between the concentrations of different gases. Extensive experiments have been carried out to evaluate the effectiveness of the proposed technique. The proposed technique is found to significantly improve on the performance of existing machine learning techniques.


INTRODUCTION
Adverse global changes in the atmosphere, such as the rapid increase in greenhouse gas concentrations, air quality degradation, the growing abundance of tropospheric oxidants including ozone, stratospheric ozone depletion, the accompanying global warming, the looming threat of climate change, and the loss of biodiversity, are all fuelled by human activities [27]. The sources responsible for air pollution fall into two categories: natural sources and man-made sources. Natural sources include forest fires, volcanic eruptions, wind erosion of soil, natural radioactivity, and the decomposition of organic matter by bacteria. Man-made sources are much more diverse. They include automobiles, industries, thermal power plants, and agricultural activities. Fossil fuels (coal, oil, natural gas) are burnt in industries, thermal power plants, and automobiles. Different hydrocarbons (methane, butane, ethylene, benzene) and suspended particulate matter (dust, lead, cadmium, chromium, arsenic salts, etc.) are also present in these emissions. The gases and suspended particulate matter (SPM) produced by burning fossil fuels are the greatest source of air pollution. The pollutants released from natural sources are dispersed over a vast area and do not cause any serious damage.
Most of the air pollutants that affect health come from man-made sources. In large cities, breathing polluted air is harmful to human health. Carbon monoxide, a serious air pollutant, reduces the oxygen-carrying capacity of blood and causes nausea, headache, muscular weakness, and slurring of speech. Oxides of nitrogen can damage the lungs, heart, and kidneys of humans and other creatures. The presence of hydrocarbons in air causes irritation of the eyes, bronchial constriction, sneezing, and coughing. In densely populated cities, air pollution may take the form of industrial smog and photochemical smog.
Air pollution is one of the biggest public health issues confronting the world today, and it is increasing at a rapid rate. Toxic levels of air pollution around the world pose a serious menace. Population growth, emissions from industrial and manufacturing activities, and automobile exhaust all contribute to air pollution, and many countries have declared it a major threat to human life. Benzene (C6H6) is considered a cause of various diseases, and air pollutants such as benzene have accelerated the rate of cancer among human beings. Efficient monitoring of benzene is therefore a challenging issue. Currently, atmospheric contamination is measured using spatially distributed sensor networks with a limited number of sensors. However, the expense and size of the sensors limit operational efficiency. Therefore, many researchers have proposed air pollution detection systems that use machine learning models to predict the concentration of benzene in the air without deploying dedicated benzene sensors, which reduces the cost of the monitoring system. This is possible because the concentrations of various atmospheric gases are related, so regression can be used to estimate C6H6 when the concentration levels of other gases are known.
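The idea of estimating C6H6 from a related gas can be illustrated with a minimal sketch. It assumes a single co-pollutant (carbon monoxide is used here purely as an example) and an ordinary least-squares line; the function names and the readings are hypothetical and are not taken from this study, which uses tree-based models rather than a linear fit.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict(model, x_new):
    """Apply a fitted (slope, intercept) pair to new co-pollutant readings."""
    slope, intercept = model
    return [slope * xi + intercept for xi in x_new]


# Made-up illustrative readings (assumption, not data from this study).
co_readings = [1.0, 2.0, 3.0, 4.0, 5.0]
c6h6_readings = [4.1, 8.0, 11.9, 16.2, 19.8]

model = fit_line(co_readings, c6h6_readings)
estimated_c6h6 = predict(model, [2.5])  # estimate without a benzene sensor
```

In practice several gases would be used as inputs at once, and a nonlinear learner may capture the relationship better than a straight line.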
Air pollution, especially in and around urban areas, has become one of the most important ecological problems, including in developing states around the world. Air quality issues are among the most complex environmental problems, and numerous research studies have already reported the impacts of atmospheric pollution on human health and the environment.
Air pollution has been defined as any constituent present in the air, as a result of anthropogenic activity or natural processes, that causes adverse effects to human or animal health and physiology, vegetation, or materials. Thousands of different chemical compounds (almost all in gaseous form) are present in the earth's atmosphere, many of them in trace amounts beyond the practical limits of detection of ordinary analytical set-ups. Many of these gases have the potential to affect the environment adversely, either directly or by interacting with other substances in the atmosphere. Pollutants that are emitted directly into the atmosphere, such as the oxides of sulphur released by burning fossil fuels, are known as primary pollutants. Secondary air pollutants are formed as products of reactions between chemical species existing in the atmosphere, under the influence of heat (the thermal state of the species) and radiation from the Sun. The most important secondary air pollutant is ozone (O3), which is produced through complex photochemical reactions involving the primary air pollutant NO2 and atmospheric oxygen under the influence of solar radiation of wavelength less than 424 nm. Concentration, location, and time scale are important features that characterize air pollution in the atmosphere. In addition, meteorological conditions provide the basis for understanding an air pollution episode in a given region.
Sources of air pollution can be divided into four kinds: mobile sources, stationary sources, area sources, and natural sources, as shown in Fig. 1.

Consequences of air pollution
Polluted air is hazardous to health. Higher concentrations of pollutants cause breathing difficulties, chronic cough, and respiratory diseases, and they impair lung function. According to a World Health Organization estimate for 2002, indoor air pollution was responsible for 1.5 million deaths, whereas 2.4 million deaths each year are directly attributable to air pollution (WHO, 2002). In America, more than 500,000 people die each year from cardiopulmonary disease related to breathing fine particulate air pollutants (American Chemical Society). In India, 527,700 people died due to air pollution (WHO, 2002).

1. Ozone (O3): Ozone is colourless and invisible; however, it regularly occurs alongside other, more visible types of pollutants in significant pollution events. It is a gas formed by a set of photochemical reactions in the presence of sunlight and primary air pollutants.
2. Nitrogen Oxides (NOx = NO2 + NO): NOx is the generic term for a group of highly reactive gases that contain nitrogen and oxygen in varying amounts, such as nitric oxide (NO) and nitrogen dioxide (NO2).
3. Carbon Monoxide (CO): Carbon monoxide is an odourless, colourless, toxic gas. CO is generated by the incomplete combustion of fossil fuels, for example in unvented kerosene and gas space heaters.
4. Volatile Organic Compounds (VOCs): VOCs are a group of chemicals that contain organic carbon and readily evaporate, changing from liquid to gas when exposed to air at normal temperature.
5. Sulfur Dioxide (SO2): Sulfur dioxide is a colourless gas. It is transformed into sulphuric acid in the presence of water vapour. SO2 can be oxidized to form acid aerosols.
The contribution of this paper is to establish a significant statistical relationship between several pollutants. The pollutant features that characterize the dataset must be generic. The main research points are as follows:
- A combination of data mining methods can be used to further improve the accuracy of benzene detection.
- The combination of random forest and J48, which could further improve the accuracy of benzene detection, has been neglected.
- The effect of tuning performance metrics has also been neglected in existing work.
- Hence, this study develops a combined machine learning method that estimates benzene from air pollution data effectively.

RELATED WORK
This section presents a comprehensive review of well-known air quality prediction techniques proposed by various researchers.
Siwek and Osowski (2016) [8] assessed the use of neural networks together with on-field recordings to calibrate a multi-sensor device for benzene estimation. The scenario is characterized by strong correlations among several pollutant species. The proposed sensor fusion subsystem was chosen to exploit both the specificity of each sensor and the scenario-related correlations.
Vlachokostas et al. (2011) [9] discussed the consistent relationship between traffic-related air pollution and respiratory symptoms. However, many urban areas lack the necessary monitoring infrastructure, particularly for benzene (C6H6), which is a known human carcinogen. The presented results illustrated that the adopted approach is capable of predicting C6H6 and ought to be considered complementary to air quality monitoring.
Singh et al. (2013) [20] developed tree ensemble models for seasonal discrimination and air quality prediction. Principal Component Analysis (PCA) was used to identify air pollution sources, and air quality indices were used to assess health risk. Bagging and boosting algorithms enhanced the predictive ability of the ensemble models, and the ensemble classification and regression models performed better than SVMs. The proposed models can be used as tools for air quality prediction and management.
Qi et al. (2017) [24] proposed a general and effective approach, called Deep Air Learning (DAL), that solves three problems (interpolation, prediction, and feature analysis) in one model. The main idea of DAL lies in embedding feature selection and semi-supervised learning in different layers of the deep learning network. The approach uses information from unlabelled spatio-temporal data to improve the performance of interpolation and prediction, and performs feature selection and association analysis to reveal the features most relevant to variation in air quality. The researchers evaluated the approach with extensive experiments based on real data sources obtained in Beijing, China. The experiments show that DAL is superior to peer models from the recent literature for interpolation, prediction, and feature analysis of fine-grained air quality.
The primary motivation for this research comes from a survey of existing techniques, from which the following gaps have been identified:
- Sufficient data: To develop an effective machine learning method, it is vital to have a sufficient quantity of data for training.
- Noise in data: Noise refers to unwanted data present in the database. Out-of-range data can also be regarded as noise.
- Premature convergence: The majority of existing meta-heuristic-based machine learning models, such as particle swarm optimization and genetic algorithms, suffer from premature convergence, which limits the performance of air pollution prediction techniques.
- Data uncertainty: Type-I fuzzy logic cannot adequately represent the degree of uncertainty. Therefore, an efficient type-II fuzzy logic is required to improve the accuracy of benzene (C6H6) prediction.
- Stuck in local optima: The majority of existing meta-heuristic-based benzene prediction models tend to become stuck in local optima.
- Computational speed: The majority of existing meta-heuristic-based machine learning models suffer from poor computational speed.

METHODOLOGY
The key contributions of the proposed method are as follows:
- The proposed work aims to attain improved accuracy compared to existing machine learning methods.
- The air pollution detection method is suitable for this work because it uses less space.
- It estimates benzene with the help of regression-based methods. The proposed method detects the association between other gases and C6H6.
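As a rough illustration of combining a single tree with a forest of trees for regression, the sketch below averages the two predictions. It rests on several assumptions that are not stated in the paper: a depth-1 regression tree (a stump) stands in for the single tree (J48 itself is Weka's C4.5 implementation), a set of stumps trained on bootstrap samples stands in for the random forest, only one input feature is used, and the two models are weighted equally. All function names are hypothetical.

```python
import random
from statistics import mean


def fit_stump(x, y):
    """Depth-1 regression tree on one feature: choose the split threshold
    that minimises the squared error of the two leaf means."""
    xs = sorted(set(x))
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        l_mean, r_mean = mean(left), mean(right)
        sse = (sum((v - l_mean) ** 2 for v in left)
               + sum((v - r_mean) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    if best is None:  # all x values identical: predict the overall mean
        m = mean(y)
        return (xs[0], m, m)
    return best[1:]


def stump_predict(stump, xi):
    t, l_mean, r_mean = stump
    return l_mean if xi <= t else r_mean


def fit_forest(x, y, n_trees=25, seed=0):
    """Train stumps on bootstrap resamples (bagging), as a forest stand-in."""
    rng = random.Random(seed)
    n = len(x)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return forest


def forest_predict(forest, xi):
    return mean(stump_predict(s, xi) for s in forest)


def ensemble_predict(tree, forest, xi):
    """Equal-weight average of the single tree and the forest (assumption)."""
    return 0.5 * stump_predict(tree, xi) + 0.5 * forest_predict(forest, xi)
```

The equal weighting is only one possible choice; the weights could instead be tuned on held-out data, for example to minimise RMSE.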
This work follows a step-by-step methodology to attain its objectives. The proposed method is designed and implemented in MATLAB, on an Intel Core i5 processor with 8 GB RAM and a 2 GB graphics card. To evaluate the proposed model and perform a comparative analysis, the following parameters were used:
1. Accuracy
Tables 1 and 2 show the accuracy analysis of the proposed method compared with other methods. In both tables, the model is trained and tested on the same dataset, which is why the result is reported as training accuracy.
Tables 1 and 2 therefore reveal that the proposed method achieves better accuracy than other methods such as the linear model, neural network, support vector machine (SVM), and J48.
Tables 3 and 4 show the correlation analysis of the proposed method and the other methods. As noted earlier, correlation lies in the range [-1, 1], and a positive correlation approaching 1 indicates that the predicted values closely follow the actual values. From Tables 4 and 5 it has been observed that the proposed method provides more significant results than earlier approaches.
Tables 5 and 6 present the root mean squared error (RMSE) analysis of the proposed and other machine learning approaches. RMSE represents the difference between the actual and predicted C6H6 values, so it should be as small as possible. From Tables 6 and 7 it has been observed that the proposed method yields a lower RMSE than the others and therefore provides more accurate C6H6 results.
Tables 9 and 10 present the analysis of the coefficient of determination (R). It is the ratio of the variance of the fitted values to the variance of the observed values of the dependent variable. R is a statistic that gives information about the goodness of fit of a model. In regression, R is a statistical measure of how well the regression line approximates the real data points. An R of 1 indicates that the regression line fits the data perfectly. From Tables 9 and 10 it has been observed that the proposed method has a better R than the other methods.
Figure 14: Analysis of Testing Coefficient of determination (R)
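The three regression metrics discussed above can be computed as in the following minimal sketch. The function names are hypothetical, and the coefficient of determination is written in the common form 1 - SS_res / SS_tot, which is an assumption: it coincides with the variance ratio described above for a least-squares fit with an intercept, but the paper does not state which form was used.

```python
from math import sqrt
from statistics import mean


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


def pearson(actual, predicted):
    """Pearson correlation coefficient, which lies in [-1, 1]."""
    a_bar, p_bar = mean(actual), mean(predicted)
    cov = sum((a - a_bar) * (p - p_bar) for a, p in zip(actual, predicted))
    var_a = sum((a - a_bar) ** 2 for a in actual)
    var_p = sum((p - p_bar) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)


def r_squared(actual, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    a_bar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - a_bar) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

A prediction that is perfectly correlated with the actual values can still have a large RMSE if it is offset by a constant, which is why correlation and RMSE are reported separately.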

CONCLUSIONS AND FUTURE WORK
Human exposure to benzene has been associated with a range of acute and long-term adverse health effects and diseases, including cancer and aplastic anaemia. Exposure can occur occupationally and domestically as a result of the widespread use of benzene-containing petroleum products, including motor fuels and solvents. Active and passive exposure to tobacco smoke is also a significant source of exposure. Benzene is highly volatile, and exposure occurs mostly through inhalation. Public health measures are needed to reduce the exposure of both workers and the general population to benzene. Benzene (C6H6) is the simplest organic aromatic hydrocarbon and the parent compound of numerous important aromatic compounds. It is a colourless liquid with a characteristic odour and is used primarily in the production of polystyrene. It is highly toxic and is a known carcinogen; exposure to it may cause leukemia. Therefore, there are strict controls on benzene emissions.
In this work, an ensemble of data mining techniques has been used to improve the accuracy of machine-learning-based benzene (C6H6) detection. This has been achieved by integrating random forest and J48 based machine learning techniques. Root mean squared error (RMSE) tuning has also been performed. First, the performance of existing machine learning techniques for the detection of C6H6 was evaluated. Then, the proposed technique was implemented to estimate C6H6 in air. Finally, the existing machine learning algorithms and the proposed technique were compared using: a) root mean squared error, b) correlation, c) accuracy, d) error rate, and e) coefficient of determination.

Future work
The following points describe future directions for the proposed work:
1. The ensemble of random forest and J48 based machine learning techniques does not guarantee the lowest error rate, because random forest is limited by its number of trees. Therefore, in the near future, meta-heuristic techniques such as ant colony optimization and artificial bee colony will be considered to improve the results further.
2. The proposed technique will also be applied to other fields, such as biomedical processing and image-based machine learning, to evaluate its performance in other applications.
3. The proposed technique will also be applied to real-time data taken from sensors.