Showing posts with label Data Warehouse and Mining. Show all posts
Showing posts with label Data Warehouse and Mining. Show all posts

Write short note on Decision Tree based Classification Approach

  • Training dataset should be class-labeled for learning of decision trees in decision tree induction.
  • A decision tree represents rules and it is very a popular tool for classification and prediction.
  • Rules are easy to understand and can be directly used in SQL to retrieve the records from the database.
  • To recognize and approve the discovered knowledge acquired from decision model is a crucial task.
  • There are many algorithms to build decision tree:
    • ID3 (Iterative Dichotomiser)
    • C4.5 (Successor of ID3)
    • CART (Classification and Regression Tree)
    • CHAID (Chi-square Automatic Interaction Detector)
Decision Tree representation
  • A decision tree classifier has a tree type structure which has leaf-nodes and decision nodes.
  • A leaf node is that last node of each branch and indicates the class label or value of a target attribute.
  • A decision node is the node of a tree which has leaf node or sub-tree. Some test to be carried on each value of decision node to get the decision of class label or to get next sub-tree.
Decision Tree represents for play tennis

Describe the various functionalities of Data Mining as a step in the process of Knowledge Discovery.


  • Knowledge discovery in the database is the process of searching for hidden knowledge in the massive amounts of data that we are technically capable of generating and storing.
  • The basic task of KDD is to extract knowledge (or information) from a lower level data (databases).
  • It is the non-trivial (significant) process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
  • The goal is to distinguish between unprocessed data something that may not be obvious but is valuable or enlightening in its discovery.
  • The overall process of finding and interpreting patterns from data involves the repeated application of the following steps:
    1. Data Cleaning
    2. Data Integration
    3. Data Selection
    4. Data Transformation
    5. Data Mining
    6. Pattern Evaluation
    7. Knowledge Presentation
  • Steps 1 to 4 are different forms of data preprocessing, where data are prepared for mining.
Fig. Data Mining as a step in the process of knowledge discovery
  1. Data Cleaning
    • Removal of noise, inconsistent data, and outliers
    • Strategies to handle missing data fields.
  2. Data Integration
    • Data from various sources such as databases, data warehouse, and transactional data are integrated.
    • where multiple data sources may be combined into a single data format.
  3. Data Selection
    • Data relevant to the analysis task is retrieved from the database.
    • Collecting only necessary information to the model.
    • Finding useful features to represent data depending on the goal of the task.
  4. Data Transformation
    • Data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.
    • By using transformation methods invariant representations for the data is found.
  5. Data Mining
    • An essential process where intelligent methods are applied to extract data patterns.
    • Deciding which model and parameter may be appropriate.
  6. Pattern Evaluation
    • To identify the truly interesting patterns representing knowledge based on interesting measures.
  7. Knowledge Presentation
    • Visualization and knowledge representation techniques are used to present mined knowledge to users.
    • Visualizations can be in form of graphs, charts or table.

What is meant by metadata in the context of a Data warehouse? Explain different types of metadata stored in a data warehouse.


  • Metadata are data about data.When used in a data warehouse, metadata are the data that define warehouse objects.
  • Metadata are created for the data names and definitions of given warehouse.
  • Additional metadata are created and captured for timestamping any extracted data, the source of the extracted data and missing fields that have been added by data cleaning or integration process.
  • Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data warehouse.
  • Metadata is the road map to a data warehouse.
  • Metadata can be broadly categorized into three types:
  1. Business Metadata: It has the data ownership information, business definition and changing policies.
  2. Technical Metadata: It includes database system names, table and column names and sizes, data types and allowed values. It also includes structural information such as primary and foreign key attributes and indices.
  3. Operational Metadata: It includes currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of data migrated and transformation applied on it. 

  • Metadata play a very important role than other data warehouse data and are important for many reasons. For example, metadata are used as a directory to help the decision support system analyst to locate the contents of the data warehouse and as a guide to the data mapping when data are transformed from the operational environment to the data warehouse environment.
  • Metadata also serve as a guide to the algorithms used for summarization between the current detailed data and the lightly summarized data, and between the lighly summarized data and higly summarized data.
  • Metadata should be stored and managed persistently (i.e., on disk).

Fact less Fact table

  • A fact less fact table is a fact table that does not have any measures i.e, it captures the many-to-many relationships between dimensions but contains no numeric or textual facts. It is essentially an intersection of dimensions.
  • On the surface, a factless fact table does not make sense since a fact table is after all about facts.
  • However, there are situations where having this kind of relationship makes sense in data warehousing.
  • For example,
    • Think about a record of students attendance in a class.
    • In this case, the fact table would consist of three dimensions: student dimension, time dimension, and class dimension.
    • This factless fact table would look like:
    • Fact_Attendance
      Student_ID
      Class_ID
      Time_ID
  • Factless fact tables offer the most flexibility in data warehouse design.
  • For eg, one can easily answer the following questions with the factless fact table:
    • How many students attended a particular class on a particular day?
    • How many classes on average does student attend on a given day?
  • Without using a factless fact table, we need to separate two fact tables to answer the above two questions.

Difference between OLTP and OLAP.


OLTP OLAP
OLTP, Online Transaction Processing are the operational data. OLAP, Online Analytical Processing are the consolidation data.
OLTP are the orignal source data. OLAP data comes from various OLTP databases.
It's purpose is to control and run fundamental business task. To help with planning, problem solving and decision support.
Short and fast inserts and updates initiated by end users. Periodic long running batch jobs refresh the data.
Relatively standardized and simple queries. Often complex queries involving aggregations.
Typically very fast procesising speed. Processing speed depends on the amount of data involved. Complex queries may take many hours.
Space: can be relatively small if historical data is archived. Space: Larger due to the existence of aggregation structures and history data.
Highly normalized with many tables. Typically de-normalized with fewer tables. Use of star and snowflake schema.
OLTP systems are used by clerks, DBAs or database professional. OLAP systems are used by knowledge workers such as executives, managers and analyst.
It focuses on Data-In. It focuses on Data-out.
Provides detailed and flat relational view of data. Provide summarized and multi-dimensional view of data.
Number of record accessed is in tens. Number of record accessed is in millions.
Database size is from 100MB to 1 GB Database size is from 100 GB to 1 TB.
Provides high performance. Highly flexible.

Difference between Data Warehouse vs Data Mart

Here the differences in a data warehouse and data mart.
It is one of the most important question as it is been asked in many exams as well as in viva exams

Data Warehouse Data Mart
1) It gives enterprise wide view of data. 1) It gives departmental view of data.
2) Union of all data marts. 2) Subset of data warehouse or single business process.
3) Takes longer time to implement. 3) Takes less time to implement.
4) Size is more than 100TB. 4) Size is less than 10TB.
5) Slower response. 5) Faster response.
6) Detailed Information. 6) Summarized data.
7) It is difficult to build. 7) Is easy to build.
8) Complex and difficult to manage. 8) Easy to design and use/manage.