TOPICS IN DATA SCIENCE

CP-8210

 

FINAL REPORT

STRUCTURED AND UNSTRUCTURED DATA

 

 

Submitted to :- Abdolreza Abhari

 

 

 

 

Submitted by :- Gurpreet Singh

Student Number:-  500802475

 

DATE 21/Dec/2017

Introduction

 

Data mining is the process companies use to turn raw data into useful
information. By mining their data, companies can find patterns, understand
their customers better, and devise more effective strategies, which in turn
can increase sales and reduce costs.

It is an essential process in which intelligent methods are applied to extract
data patterns, and it is an interdisciplinary subfield of computer science. The
overall goal of the data mining process is to extract information from a data
set and transform it into an understandable structure for further use. Aside
from the raw analysis step, it involves database and data management aspects,
data pre-processing, model and inference considerations, interestingness
metrics, complexity considerations, post-processing of discovered structures,
visualization, and online updating. Data mining is the analysis step of the
"knowledge discovery in databases" process, or KDD.

 

In data mining, the data is stored electronically and the search is automated
by computer. The idea is not new: statisticians and engineers have long worked
on the premise that patterns in data can be found automatically, validated, and
used for prediction. What is new is the scale. The amount of data held in
databases has been estimated to double roughly every 20 months, although such a
figure is hard to justify in any quantitative sense. As the world grows in
complexity and generates ever more data, the opportunities for data mining will
increase, and data mining becomes the only hope for elucidating the hidden
patterns. Intelligently analysed data is a valuable resource that can lead to
new insights and further advantages.

 

Data mining is about solving problems by analysing data that is already present
in databases. Take, for instance, the problem of customer loyalty in a highly
competitive market. The key to this problem is a database of customer choices
together with customer profiles. The behaviour patterns of former customers can
be analysed to identify the characteristics of those who remain loyal and those
who change products. Once these characteristics are known, a company can
identify the present customers who are likely to jump ship and target that
group with special treatment. The same technique can be used to identify
customers who might be attracted to other services. So, in today's competitive
world, data is the raw material that can fuel the growth of any business, but
only if it is mined.

 

 

 

 

 

And how are the patterns expressed?

Useful patterns allow nontrivial predictions to be made on new data. There are
two ways to express a pattern: as a black box whose innards are
incomprehensible, or as a transparent box whose construction reveals the
structure of the pattern. Assume that both make good predictions. The
difference between them is whether or not the mined patterns are represented in
terms of a structure that can be used to inform future decisions. Patterns of
this kind are called structural because they capture the decision structure in
an explicit way. In other words, they help to explain something about the data.

 

Describing Structural Patterns

What are structural patterns?

They are best described with an illustration, as follows :-

 

If tear production rate = reduced then recommendation = none

Otherwise, if age = young and astigmatic = no then recommendation = soft

 

Structural
descriptions need not necessarily be couched as rules such as these. Decision trees,
which specify the sequences of decisions that need to be made along with the
resulting recommendation, are another popular means of expression.
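
Read as an ordered chain of tests, the two rules quoted earlier already form a
small decision procedure. The sketch below is only an illustration of that
reading: the function name `recommend`, the argument names, and the value
"normal" for the tear production rate are assumptions, and cases the two rules
do not cover are returned as `None` rather than guessed.

```python
def recommend(age, astigmatic, tear_rate):
    """Apply the two quoted rules in order; return None if neither fires."""
    # Rule 1: a reduced tear production rate always means no recommendation.
    if tear_rate == "reduced":
        return "none"
    # Rule 2: otherwise, young and non-astigmatic gives a soft recommendation.
    if age == "young" and astigmatic == "no":
        return "soft"
    # The quoted rules say nothing about the remaining cases.
    return None


print(recommend("young", "no", "reduced"))  # none
print(recommend("young", "no", "normal"))   # soft
```

Because the first test is checked before the second, the order of the rules
matters, which is exactly the "Otherwise" in the textual form.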

This example is a very simplistic one. For a start, all combinations of
possible values are represented in the table. There are 24 rows, representing
three possible values of age and two values each for spectacle prescription,
astigmatism, and tear production rate (3 × 2 × 2 × 2 = 24). The rules do not
really generalize from the data; they merely summarize it. In most learning
situations, the set of examples given as input is far from complete, and part
of the job is to generalize to other, new examples. You can imagine omitting
some of the rows in the table for which the tear production rate is reduced and
still coming up with the rule

If tear production rate = reduced then recommendation = none

This would generalize to the missing rows and fill them in correctly. Second,
values are specified for all the features in all the examples. Real-life
datasets invariably contain examples in which the values of some features, for
some reason or other, are unknown; for example, measurements were not taken or
were lost. Third, the preceding rules classify the examples correctly, whereas
often, because of errors or noise in the data, misclassifications occur even on
the data that is used to create the classifier.
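
The row count and the idea of generalizing to omitted rows can be checked with
a short sketch. The attribute value names other than "young", "no" and
"reduced" are assumptions made for illustration, as is the choice of which rows
to leave out.

```python
from itertools import product

# Three ages and two values for each other attribute (value names assumed).
AGES = ["young", "pre-presbyopic", "presbyopic"]
PRESCRIPTIONS = ["myope", "hypermetrope"]
ASTIGMATIC = ["no", "yes"]
TEAR_RATES = ["reduced", "normal"]

rows = list(product(AGES, PRESCRIPTIONS, ASTIGMATIC, TEAR_RATES))
print(len(rows))  # 24


def tear_rule(row):
    """The single rule from the text: reduced tear rate means 'none'."""
    return "none" if row[3] == "reduced" else None


# Pretend the young, reduced-tear rows were never observed.
missing = [r for r in rows if r[0] == "young" and r[3] == "reduced"]

# The rule still assigns a recommendation to every one of the omitted rows.
filled = {r: tear_rule(r) for r in missing}
print(len(missing), set(filled.values()))
```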

 

Data Mining

 

The learning techniques considered here do not raise conceptual problems about
what learning means; they are known as machine learning. Data mining involves
learning in a practical sense rather than a theoretical one. We are interested
in techniques for finding structural patterns in data and for making
predictions from it. The knowledge is drawn from a set of examples, for
instance examples of clients who have switched loyalties.

The output may be a prediction of whether a particular customer will switch
loyalty under different circumstances, but it may also include an explicit
description of a structure that can be used to classify unknown examples.

In addition, it is useful to supply an explicit representation of the knowledge
that is acquired. In essence, this reflects two senses of learning: the
acquisition of knowledge and the ability to use it. Many learning techniques
look for structural descriptions of what is learned. These descriptions can
become fairly complex and are typically expressed as sets of rules, such as the
ones shown earlier, or as decision trees. Because people can understand them,
these descriptions serve to explain what has been learned, in other words, to
explain the basis for new predictions.

 

 

Experience shows that in most applications of data mining, the explicit
knowledge structures, that is, the structural descriptions, are at least as
important as the ability to perform well on new instances. People usually use
data mining to gain knowledge, not just predictions, and gaining knowledge from
the available data is a sound idea.

 

Data mining deals with the kinds of patterns that can be mined. On the basis of
the kind of data to be mined, there are two categories of functions involved in
data mining:

Descriptive
Classification and Prediction
Descriptive Function

The descriptive function deals with the general properties of data in the
database. Here is the list of descriptive functions:

Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters

Class/Concept Description

Class/Concept refers to the data to be associated with classes or concepts. For
example, in a company, the classes of items for sale include computers and
printers, and the concepts of customers include big spenders and budget
spenders. Such descriptions of a class or a concept are called class/concept
descriptions. These descriptions can be derived in the following two ways:

· Data Characterization: This refers to summarizing the data of the class under
study. The class under study is called the target class.

· Data Discrimination: This refers to the mapping or classification of a class
against some predefined group or class, so that the two can be compared.

Mining of Frequent Patterns

Frequent patterns are those patterns that occur frequently in transactional
data. Here is the list of kinds of frequent patterns:

· Frequent Item Set: A set of items that frequently appear together, for
example, milk and bread.

· Frequent Subsequence: A sequence of patterns that occurs frequently, such as
the purchase of a camera being followed by the purchase of a memory card.

· Frequent Substructure: Substructure refers to different structural forms,
such as graphs, trees, or lattices, which may be combined with item sets or
subsequences.
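
As a minimal sketch of frequent item set mining, the code below counts how
often each pair of items appears in the same basket and keeps the pairs whose
support (fraction of baskets) reaches a threshold. The transactions, the
function name `frequent_pairs`, and the 0.5 threshold are made up for
illustration; practical algorithms such as Apriori avoid enumerating every
pair.

```python
from collections import Counter
from itertools import combinations

# Toy transactional data (hypothetical).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "biscuits"},
    {"milk", "bread", "biscuits"},
    {"milk", "eggs"},
]


def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs meeting the support threshold."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


pairs = frequent_pairs(transactions, min_support=0.5)
print(pairs)  # {('bread', 'milk'): 0.6}
```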

Mining of Associations

Associations are used in retail sales to identify items that are frequently
purchased together. Association mining is the process of uncovering
relationships among data and determining association rules.

For example, a retailer generates an association rule showing that milk is sold
with bread 70% of the time, while biscuits are sold with bread only 30% of the
time.
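
The percentages in such a rule are usually read as the rule's confidence: of
the baskets that contain the first item, the fraction that also contain the
second. The sketch below assumes that reading, taking bread as the first item,
and uses ten invented baskets chosen so that the figures match the example.

```python
def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)


# Ten hypothetical baskets that all contain bread.
baskets = [{"bread", "milk"}] * 7 + [{"bread", "biscuits"}] * 3

print(confidence(baskets, "bread", "milk"))      # 0.7
print(confidence(baskets, "bread", "biscuits"))  # 0.3
```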

Mining of Correlations

Correlation mining is an additional analysis performed to uncover interesting
statistical correlations between associated attribute-value pairs or between
two item sets, in order to determine whether they have a positive, a negative,
or no effect on each other.
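
One common way to quantify such an effect is lift, the ratio of how often two
items occur together to how often they would if they were independent. The text
does not name a measure, so the choice of lift here is an assumption, and the
baskets are invented. A lift above 1 suggests a positive effect, below 1 a
negative one, and close to 1 no effect.

```python
def lift(baskets, a, b):
    """P(a and b) / (P(a) * P(b)); 0.0 if either item never occurs."""
    n = len(baskets)
    p_a = sum(1 for x in baskets if a in x) / n
    p_b = sum(1 for x in baskets if b in x) / n
    p_ab = sum(1 for x in baskets if a in x and b in x) / n
    if p_a == 0 or p_b == 0:
        return 0.0
    return p_ab / (p_a * p_b)


together = [{"camera", "card"}, {"camera", "card"}, {"pen"}, {"pen"}]
apart = [{"tea"}, {"tea"}, {"coffee"}, {"coffee"}, {"tea", "coffee"}]

print(lift(together, "camera", "card"))  # above 1: bought together
print(lift(apart, "tea", "coffee"))      # below 1: tend to exclude each other
```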

Mining of Clusters

A cluster is a group of similar objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from
the objects in other clusters.
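
A small sketch of cluster analysis on one-dimensional data follows. It uses
k-means, which is one of many clustering methods and is not named in the text;
the spending figures and the starting centres are invented.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for numbers: assign to nearest centre, then re-average."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centre.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend: a low-spending and a high-spending group.
spend = [12, 15, 14, 90, 95, 88]
centers, clusters = kmeans_1d(spend, centers=[0.0, 100.0])
print(clusters)  # [[12, 15, 14], [90, 95, 88]]
```

The two groups that come out correspond to the budget spenders and big spenders
mentioned under class/concept description, except that here the groups are
discovered rather than predefined.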

Classification and Prediction

Classification is the process of finding a model that describes the data
classes or concepts. The purpose is to be able to use this model to predict the
class of objects whose class label is unknown. This derived model is based on
the analysis of sets of training data. The derived model can be presented in
the following forms:

Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks

The functions involved in these processes are as follows:

· Classification: It predicts the class of objects whose class label is
unknown. Its objective is to find a derived model that describes and
distinguishes data classes or concepts. The derived model is based on the
analysis of a set of training data, i.e. data objects whose class labels are
known.

· Prediction: It is used to predict missing or unavailable numerical data
values rather than class labels. Regression analysis is generally used for
prediction. Prediction can also be used to identify distribution trends based
on the available data.

· Outlier Analysis: Outliers may be defined as the data objects that do not
comply with the general behavior or model of the available data.

· Evolution Analysis: Evolution analysis refers to the description and
modelling of regularities or trends for objects whose behavior changes over
time.
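
To make the prediction function concrete, here is a minimal regression sketch:
an ordinary least-squares line fitted to a few numeric observations and then
used to predict an unavailable value. The sales figures and the function name
`fit_line` are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sales for months 1 to 4; month 5 is the value to predict.
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]

slope, intercept = fit_line(months, sales)
predicted = slope * 5 + intercept
print(slope, intercept, predicted)  # 2.0 8.0 18.0
```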

Data Mining Task Primitives

We can specify a data mining task in the form of a data mining query. This
query is input to the system. A data mining query is defined in terms of data
mining task primitives.

Note: These primitives allow us to communicate in an interactive manner with
the data mining system. Here is the list of data mining task primitives:

Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
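
One way to picture a data mining query is as a record with one field per
primitive listed above. The class below is purely a hypothetical sketch: the
name `MiningQuery`, its field names, and the example values are assumptions and
do not correspond to any particular data mining system or query language.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container with one field per task primitive."""
    relevant_data: list                       # attributes or dimensions of interest
    knowledge_kind: str                       # e.g. "classification", "clustering"
    background_knowledge: dict = field(default_factory=dict)  # e.g. concept hierarchies
    interestingness: dict = field(default_factory=dict)       # measures and thresholds
    presentation: str = "rules"               # how discovered patterns are displayed


query = MiningQuery(
    relevant_data=["customer.age", "customer.income", "sales.amount"],
    knowledge_kind="classification",
    background_knowledge={"location": ["city", "country"]},
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
    presentation="decision tree",
)
print(query.knowledge_kind, query.presentation)
```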

Set of task relevant data to be mined

This is the portion of the database in which the user is interested. This
portion includes the following:

Database Attributes
Data Warehouse dimensions of interest

Kind of knowledge to be mined

It refers to the kind of functions to be performed. These functions are:

Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis

Background knowledge

Background knowledge allows data to be mined at multiple levels of abstraction.
Concept hierarchies, for example, are one form of background knowledge that
makes this possible.
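
A concept hierarchy can be as simple as a mapping from a lower-level concept to
a higher-level one. The sketch below rolls city-level sales up to country
level; the cities, amounts, and the function name `roll_up` are invented for
illustration.

```python
from collections import defaultdict

# Hypothetical one-step concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}


def roll_up(sales_by_city, hierarchy):
    """Aggregate values from the lower level to the higher level."""
    totals = defaultdict(float)
    for city, amount in sales_by_city.items():
        totals[hierarchy[city]] += amount
    return dict(totals)


city_sales = {"Toronto": 120.0, "Vancouver": 80.0, "Chicago": 50.0}
country_sales = roll_up(city_sales, CITY_TO_COUNTRY)
print(country_sales)  # {'Canada': 200.0, 'USA': 50.0}
```

The same data can then be mined either at the city level or at the country
level.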

Interestingness measures and thresholds for pattern evaluation

These are used to evaluate the patterns discovered by the process of knowledge
discovery. There are different interestingness measures for different kinds of
knowledge.

Representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed. These
representations may include the following:

Rules
Tables
Charts
Graphs
Decision Trees
Cubes

 

Issues

Data mining is not an easy task, as the algorithms used can get very complex
and the data is not always available in one place. It needs to be integrated
from various heterogeneous data sources. These factors also create some issues.
In this section, we discuss the major issues regarding:

Mining Methodology and User Interaction
Issues in Performance
Issues in Diverse data types

The following diagram describes the major issues.

 

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues:

• Mining different kinds of knowledge in databases: Different users may be
interested in different kinds of knowledge. Therefore it is necessary for data
mining to cover a broad range of knowledge discovery tasks.

• Interactive mining of knowledge at multiple levels of abstraction: The data
mining process needs to be interactive because this allows users to focus the
search for patterns, providing and refining data mining requests based on the
returned results.

• Handling noisy or incomplete data: Data cleaning methods are required to
handle noise and incomplete objects while mining the data regularities. Without
data cleaning methods, the accuracy of the discovered patterns will be poor.

• Pattern evaluation: The patterns discovered should be interesting. Many are
not, because they merely represent common knowledge or lack novelty.
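
For the noisy or incomplete data issue above, one simple cleaning step is to
fill a missing numeric value with the mean of the known values. This is only
one possible strategy, chosen here as an assumption; the records and the
function name `fill_missing_with_mean` are hypothetical.

```python
def fill_missing_with_mean(records, field_name):
    """Return copies of the records with None in `field_name` replaced by the mean."""
    known = [r[field_name] for r in records if r[field_name] is not None]
    if not known:
        raise ValueError(f"no known values for {field_name!r}")
    mean = sum(known) / len(known)
    cleaned = []
    for r in records:
        fixed = dict(r)  # copy, so the original records are left untouched
        if fixed[field_name] is None:
            fixed[field_name] = mean
        cleaned.append(fixed)
    return cleaned


customers = [
    {"id": 1, "income": 40.0},
    {"id": 2, "income": None},   # measurement not taken or lost
    {"id": 3, "income": 60.0},
]
cleaned = fill_missing_with_mean(customers, "income")
print(cleaned[1])  # {'id': 2, 'income': 50.0}
```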

Performance Issues

Performance-related issues include the following:

• Efficiency and scalability of data mining algorithms: To extract information
effectively from the huge amounts of data in databases, data mining algorithms
must be efficient and scalable.

• Parallel, distributed, and incremental mining algorithms: Factors such as the
huge size of databases, the wide distribution of data, and the complexity of
data mining methods motivate the development of parallel and distributed data
mining algorithms. These algorithms divide the data into partitions, which are
then processed in parallel, and the results from the partitions are merged.
Incremental algorithms incorporate database updates without mining the data
again from scratch.
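
The partition, merge, and incremental ideas can be sketched with simple item
counting. In the sketch below each partition is counted independently (the
loop stands in for work that could run on separate workers), the partial counts
are merged, and a later batch is folded in without recounting the old data. The
data and function names are assumptions for illustration.

```python
from collections import Counter


def count_items(partition):
    """Count item occurrences within one partition of the data."""
    counts = Counter()
    for basket in partition:
        counts.update(basket)
    return counts


def mine_partitioned(partitions):
    """Count each partition separately, then merge the partial results."""
    total = Counter()
    for part in partitions:  # each call is independent and could run in parallel
        total += count_items(part)
    return total


partitions = [
    [{"milk", "bread"}, {"bread"}],
    [{"milk"}, {"milk", "bread"}],
]
counts = mine_partitioned(partitions)

# Incremental step: fold in a new batch without re-mining the old partitions.
new_batch = [{"bread", "eggs"}]
updated = counts + count_items(new_batch)
print(counts["milk"], counts["bread"], updated["bread"], updated["eggs"])  # 3 3 4 1
```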


Diverse Data Types Issues

• Handling of relational and complex types of data: The database may contain
complex data objects, multimedia data objects, spatial data, temporal data,
etc. It is not possible for one system to mine all these kinds of data.

• Mining information from heterogeneous databases and global information
systems: The data is available from different data sources on a LAN or WAN.
These data sources may be structured, semi-structured or unstructured, so
mining knowledge from them adds challenges to data mining.

 

 

 

 

 

Applications

Data Mining Applications in Sales/Marketing

Data mining gives a better understanding of the hidden patterns inside
historical purchasing transaction data, which enables new campaigns to be
launched in the market in a cost-efficient way. The data mining applications
are described below :-

Data mining is used for market basket analysis to provide information on which
product combinations were purchased together, when they were bought, and in
what sequence. This information helps businesses promote their most profitable
products and maximize profit. In addition, it encourages customers to purchase
related products that they may have missed or overlooked.
Retail companies use data mining to identify customers' buying behaviour
patterns.

 

Data Mining Applications in Banking / Finance

Data mining techniques are used to help detect credit card fraud.
Customer loyalty is identified with data mining techniques, i.e. by analysing
the purchasing activities of customers, for example the frequency of purchases
in a period of time, the total monetary value of all purchases, and when the
last purchase was made. After analysing those dimensions, a relative measure is
generated for each customer. The higher the score, the more loyal the customer
is in relative terms.
By using data mining, credit card spending by customers can be identified.
Data mining also helps in identifying stock trading rules from historical data.
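
The loyalty measure described above combines three dimensions: how often a
customer buys in a period, the total value of the purchases, and how recent the
last purchase was. The sketch below shows one possible way to fold them into a
single score. The weights, the 365-day window, the purchase histories, and the
function name `loyalty_score` are all arbitrary assumptions; a real scoring
scheme would be calibrated on actual customer data.

```python
from datetime import date


def loyalty_score(purchases, today, window_days=365):
    """Combine frequency, monetary value and recency into one relative score."""
    recent = [
        (d, amount) for d, amount in purchases
        if 0 <= (today - d).days <= window_days
    ]
    if not recent:
        return 0.0
    frequency = len(recent)                           # purchases in the window
    monetary = sum(amount for _, amount in recent)    # total value of purchases
    recency_days = min((today - d).days for d, _ in recent)  # days since last one
    # Arbitrary illustrative weighting: higher means relatively more loyal.
    return frequency + monetary / 100.0 + 1.0 / (1 + recency_days)


today = date(2017, 12, 21)
regular = [(date(2017, 12, 1), 120.0), (date(2017, 11, 1), 80.0),
           (date(2017, 10, 1), 100.0)]
occasional = [(date(2017, 3, 1), 50.0)]

print(loyalty_score(regular, today) > loyalty_score(occasional, today))  # True
```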

 

 

 

Data Mining Applications in Health Care and Insurance

 

The growth of the insurance industry depends entirely on the ability to convert
data into knowledge, information or intelligence about customers, competitors,
and markets. Data mining has been applied in the insurance industry only
recently, but it has brought tremendous competitive advantages to the companies
that have implemented it successfully. The data mining applications in the
insurance industry are as under:

 

• Data mining is applied in claims analysis, for example to identify which
medical procedures are claimed together.

• Data mining makes it possible to forecast which customers will potentially
purchase new policies.

• Data mining allows insurance companies to detect risky customers' behaviour
patterns.

• Data mining helps detect fraudulent behaviour.