DWDM 2024 question paper with solution
Question paper DWDM 2024. https://www.abesit.in/wp-content/uploads/2024/11/KOE093-DATA-WAREHOUSING-DATA-MINING.pdf
Solution of the DWDM 2024 question paper.
SECTION A – Attempt all questions in brief (2 marks each)
a. Define Data Warehousing.
A data warehouse is a centralized repository that stores integrated data from multiple sources, enabling analysis, reporting, and decision-making.
b. Discuss the Fact Constellation.
Fact Constellation is a schema that contains multiple fact tables sharing dimension tables, also known as galaxy schema, suitable for complex databases.
c. Explain Distributed DBMS implementation.
A Distributed DBMS manages a database distributed across multiple locations, ensuring transparency, consistency, and coordination of data across sites.
d. Define Warehousing Software.
Warehousing software is a tool or application used to construct, manage, and access a data warehouse for efficient data integration, analysis, and reporting.
e. Are all the patterns interesting?
No, not all patterns are interesting; only those that are valid, novel, useful, and potentially actionable are considered interesting in data mining.
f. Differentiate between binary symmetric attributes and asymmetric attributes.
Symmetric: Equal importance (e.g., gender).
Asymmetric: One value is more important (e.g., presence of disease: yes/no).
g. Find mode of: 12,13,34,32,21,29,40,11,39,32. What is advantage of mode over mean and median?
Mode = 32 (occurs twice)
Advantage: Unlike the mean, the mode is not affected by extreme values, and unlike both the mean and the median it can be used for categorical (nominal) data, since it simply reports the most frequent value.
h. Given tuples (22, 4, 25, 10) and (20, 10, 26, 2), compute Manhattan distance.
|22 - 20| + |4 - 10| + |25 - 26| + |10 - 2| = 2 + 6 + 1 + 8 = 17
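To double-check the arithmetic, here is a minimal Python sketch of the same city-block computation (tuples taken from the question):

```python
# Manhattan (city-block) distance: sum of absolute coordinate differences
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((22, 4, 25, 10), (20, 10, 26, 2)))  # 17
```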
i. What do you mean by Temporal Mining?
Temporal Mining is the process of discovering patterns and relationships in temporal data, i.e., data with time-based attributes.
j. Discuss Data Visualization.
Data Visualization refers to graphical representation of data and findings (e.g., charts, graphs) to make analysis and trends easier to understand.
SECTION B – Attempt any three (3 × 10 = 30 marks)
a. Write short notes on:
i. Steps of Knowledge Discovery in Data (KDD):
Data Cleaning: Remove noise and inconsistencies.
Data Integration: Combine data from multiple sources.
Data Selection: Choose relevant data for analysis.
Data Transformation: Convert data into suitable format.
Data Mining: Apply algorithms to extract patterns.
Pattern Evaluation: Identify interesting patterns.
Knowledge Presentation: Visualize and present mined knowledge.
ii. Explain Snowflake Schema in detail:
A snowflake schema is a logical arrangement of tables in a multidimensional database that resembles a snowflake shape.
It normalizes dimension tables into multiple related tables.
Example: "Customer" dimension may be split into "Customer" and "Address" tables.
Advantages: Reduces redundancy.
Disadvantages: More complex joins.
b. Explain Market Basket Analysis.
Market Basket Analysis is a data mining technique used to identify purchasing patterns by analyzing the co-occurrence of items in transactions.
Association Rules: A → B [support, confidence]
Example: If customers buy bread and butter, they are likely to buy milk.
Useful in cross-selling, promotions, and store layout.
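To show how support and confidence are actually computed, here is a minimal Python sketch over a small made-up transaction list (the transactions and the rule {bread, butter} → {milk} are illustrative, not from the paper):

```python
# Made-up transactions, for illustration only
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"bread", "butter"}, {"milk"}
sup = support(antecedent | consequent)                  # support of the whole rule
conf = sup / support(antecedent)                        # confidence of A -> B
print(f"support = {sup:.2f}, confidence = {conf:.2f}")  # support = 0.40, confidence = 0.67
```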
c. Draw box-and-whisker plot for dataset: 4.2, 5.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4
Steps:
Order data: 3.9, 4.1, 4.2, 4.2, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1, 5.3
Minimum: 3.9
Q1 (25th percentile): 4.3
Median (Q2): 4.45
Q3 (75th percentile): 4.8
Maximum: 5.3
Draw a box from Q1 to Q3, mark median, extend whiskers to min and max.
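A minimal Python sketch that reproduces this five-number summary (it uses the median-of-halves convention for Q1 and Q3, so interpolating library quartile functions may differ slightly; matplotlib is assumed only for the drawing step):

```python
import statistics
import matplotlib.pyplot as plt

data = sorted([4.2, 5.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7,
               4.1, 4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4])

n = len(data)                               # 18 values
q2 = statistics.median(data)                # 4.45
q1 = statistics.median(data[:n // 2])       # median of lower half -> 4.3
q3 = statistics.median(data[n // 2:])       # median of upper half -> 4.8
print(min(data), q1, q2, q3, max(data))     # 3.9 4.3 4.45 4.8 5.3

plt.boxplot(data, whis=(0, 100))            # whiskers stretched to min and max
plt.savefig("boxplot.png")
```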
d. Cluster points using K-Means (Euclidean distance): (2,4), (6,8), (1,2), (4,5), (3,5)
Assume initial centroids: C1=(2,4), C2=(6,8)
Step 1: Assign clusters based on Euclidean distance
Calculate distance to each centroid and assign:
(2,4) → C1
(6,8) → C2
(1,2) → C1
(4,5) → C1
(3,5) → C1
Step 2: Recalculate Centroid of C1:
Mean of (2,4), (1,2), (4,5), (3,5)
= ((2+1+4+3)/4, (4+2+5+5)/4) = (2.5, 4)
New centroids:
C1 = (2.5, 4), C2 = (6,8)
Step 3: Repeat the assignment with the new centroids. No point changes its cluster, so the centroids stabilize and the algorithm converges with clusters C1 = {(2,4), (1,2), (4,5), (3,5)} and C2 = {(6,8)}.
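A minimal Python sketch of the same two-cluster run, using only the standard library (points and initial centroids as above; it assumes no cluster ever becomes empty):

```python
from math import dist  # Euclidean distance (Python 3.8+)

points = [(2, 4), (6, 8), (1, 2), (4, 5), (3, 5)]
centroids = [(2, 4), (6, 8)]                     # initial centroids as assumed above

while True:
    # Assignment step: each point goes to its nearest centroid
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(p)
    # Update step: new centroid = mean of the points in each cluster
    new_centroids = [tuple(sum(c) / len(c) for c in zip(*cl)) for cl in clusters]
    if new_centroids == centroids:               # centroids stabilized -> converged
        break
    centroids = new_centroids

print(centroids)   # [(2.5, 4.0), (6.0, 8.0)]
print(clusters)    # [[(2, 4), (1, 2), (4, 5), (3, 5)], [(6, 8)]]
```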
e. Explain ROLAP, MOLAP, and HOLAP in detail:
| Type | Storage | Query Performance | Pros | Cons |
|---|---|---|---|---|
| ROLAP | Relational database | Slower | Highly scalable | Slower query response |
| MOLAP | Multidimensional cubes | Fast | Fast analysis | Limited data size |
| HOLAP | Hybrid of ROLAP and MOLAP | Balanced | Combines advantages of both | More complex |
SECTION C (10 × 2 = 20 marks)
Q3. Attempt any one part
a. How is a 2D table mapped into a multidimensional data model? Explain with a suitable example.
Mapping a 2D relational table into a multidimensional data model involves converting rows and columns into facts and dimensions.
Example: Sales Table
| Date | Product | Store | Revenue |
|---|---|---|---|
| Jan 1 | Laptop | Delhi | 50,000 |
| Jan 2 | Mouse | Mumbai | 2,000 |
This table can be mapped to a fact table:
Fact: Revenue
Dimensions: Date, Product, Store
Multidimensional Schema:
Fact Table: Contains revenue, foreign keys to dimensions
Dimension Tables:
Date Dimension: (Date_ID, Day, Month, Year)
Product Dimension: (Product_ID, Name, Category)
Store Dimension: (Store_ID, City, Region)
This structure supports OLAP operations like slicing, dicing, roll-up, and drill-down.
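A minimal pandas sketch of this mapping (the surrogate-key and column names such as Date_ID are illustrative, not prescribed by the paper):

```python
import pandas as pd

# Flat 2D sales table from the example above
sales = pd.DataFrame({
    "Date":    ["Jan 1", "Jan 2"],
    "Product": ["Laptop", "Mouse"],
    "Store":   ["Delhi", "Mumbai"],
    "Revenue": [50000, 2000],
})

def make_dimension(column, key):
    # One row per distinct member, with a surrogate key column
    dim = sales[[column]].drop_duplicates().reset_index(drop=True)
    dim.insert(0, key, list(range(len(dim))))
    return dim

date_dim    = make_dimension("Date", "Date_ID")
product_dim = make_dimension("Product", "Product_ID")
store_dim   = make_dimension("Store", "Store_ID")

# Fact table: the Revenue measure plus foreign keys to the three dimensions
fact = (sales.merge(date_dim, on="Date")
             .merge(product_dim, on="Product")
             .merge(store_dim, on="Store")
             [["Date_ID", "Product_ID", "Store_ID", "Revenue"]])
print(fact)
```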
b. Write short notes on:
i. Data Characterization and Data Discrimination
Data Characterization: Summarizes the general features of a target class (e.g., average income of customers in region X).
Data Discrimination: Compares features of target class against contrasting class (e.g., loyal vs. occasional customers).
ii. Snowflake Schema in detail
[Already covered in Section B, Q2-a-ii]
Q4. Attempt any one part
a. Differentiate between:
(i) Min-Max and Z-score Normalization
| Feature | Min-Max Normalization | Z-score Normalization |
|---|---|---|
| Formula | (X - min) / (max - min) | (X - mean) / standard deviation |
| Resulting range | Scales values to [0, 1] | Mean = 0, standard deviation = 1 |
| Use case | When the minimum and maximum are known | When the data is approximately normally distributed |
| Sensitivity to outliers | High (outliers shift min and max) | Lower (outliers have less influence) |
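A minimal Python sketch of both normalizations on a small made-up list of values (the data is illustrative only):

```python
import statistics

values = [10, 20, 30, 40, 100]        # made-up data; 100 acts as an outlier

lo, hi = min(values), max(values)
min_max = [(x - lo) / (hi - lo) for x in values]      # scaled into [0, 1]

mean = statistics.mean(values)
std = statistics.pstdev(values)                       # population standard deviation
z_score = [(x - mean) / std for x in values]          # mean 0, std dev 1

print([round(v, 2) for v in min_max])   # the outlier pins the top of the [0, 1] range
print([round(v, 2) for v in z_score])
```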
(ii) Binary and Nominal data variables
| Feature | Binary | Nominal |
|---|---|---|
| Values | Two (e.g., 0/1, Yes/No) | Multiple categories (e.g., Red, Blue) |
| Example | Gender (M/F), Pass/Fail | City, Department Name |
b. Explain major components of Data Mining Architecture.
Data Sources: Databases, data warehouses, flat files.
Data Warehouse Server: Stores pre-processed data.
Data Mining Engine: Core component performing algorithms like classification, clustering.
Pattern Evaluation Module: Identifies interesting patterns.
User Interface: Allows user interaction with mining process.
Knowledge Base: Stores metadata and domain knowledge.
SECTION D (10 × 3 = 30 marks)
Q5. Attempt any one part
a. Discuss Decision Tree-Based Classifiers in detail.
Decision tree classifiers use a tree-like model of decisions and their consequences.
Key Concepts:
Nodes: Represent tests on attributes.
Branches: Represent outcomes of tests.
Leaves: Represent class labels (decision).
Algorithm:
Common algorithms: ID3, C4.5, CART.
Use attribute selection measures (e.g., Information Gain, Gini Index) to choose the best attribute to split.
Recursively partition the dataset.
Advantages:
Easy to understand and interpret.
Handles both numerical and categorical data.
Disadvantages:
Prone to overfitting.
Sensitive to noisy data.
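As a small illustration of one attribute selection measure mentioned above, here is a minimal Python sketch that computes entropy and the information gain of a candidate binary split (the class labels are made up for illustration):

```python
from math import log2
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

# Hypothetical parent node and one candidate binary split of it
parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes", "no"], ["no", "no"]

gain = entropy(parent) \
       - (len(left) / len(parent)) * entropy(left) \
       - (len(right) / len(parent)) * entropy(right)
print(round(gain, 3))   # ~0.459: the split reduces impurity
```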
b. Classify the tuple X = (youth, medium, student=yes, credit_rating=fair) using Bayes Theorem.
We apply Naive Bayes Classification.
Calculate Prior Probabilities:
P(Yes) = 3/6, P(No) = 3/6
Calculate Likelihoods for each attribute:
From the table:
| Attribute | P(Attribute \| Yes) | P(Attribute \| No) |
|---|---|---|
| Age = youth | 1/3 | 2/3 |
| Income = medium | 1/3 | 0/3 |
| Student = yes | 1/3 | 1/3 |
| Credit = fair | 2/3 | 1/3 |
Calculate Posterior Probabilities:
P(Yes | X) ∝ P(Yes) × (1/3) × (1/3) × (1/3) × (2/3) = 0.5 × (1/3)^3 × (2/3) = 1/81 ≈ 0.012
P(No | X) ∝ P(No) × (2/3) × (0/3) × (1/3) × (1/3) = 0 (the zero likelihood of Income = medium in the No class makes the whole product zero)
Since P(Yes | X) > P(No | X), final classification: Yes
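A minimal Python sketch of the same posterior computation (likelihoods as read from the table above; Fraction keeps the arithmetic exact):

```python
from fractions import Fraction as F

priors = {"Yes": F(3, 6), "No": F(3, 6)}

# Likelihoods for (age=youth, income=medium, student=yes, credit=fair), from the table
likelihoods = {
    "Yes": [F(1, 3), F(1, 3), F(1, 3), F(2, 3)],
    "No":  [F(2, 3), F(0, 3), F(1, 3), F(1, 3)],
}

for cls, prior in priors.items():
    score = prior
    for p in likelihoods[cls]:
        score *= p
    print(cls, score)   # Yes 1/81, No 0  ->  classify X as Yes
```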
Q6. Attempt any one part
a. Explain various types of clustering methods. Discuss any one partitioning clustering algorithm.
Clustering Methods:
Partitioning Methods (e.g., K-Means): Divide data into k groups.
Hierarchical Methods: Build a tree of clusters (Agglomerative, Divisive).
Density-Based Methods (e.g., DBSCAN): Discover clusters based on density.
Grid-Based Methods (e.g., STING): Divide space into grid structure.
K-Means Algorithm:
Choose k initial centroids.
Assign points to the nearest centroid.
Recalculate centroids.
Repeat until convergence.
b. Discuss DBSCAN clustering algorithm with suitable example.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Groups together points close to each other (high density) and marks outliers (low density).
Parameters:
ε (Epsilon): Radius to search for neighbors.
MinPts: Minimum number of points to form a dense region.
Steps:
Start from a point.
If it has MinPts within ε → core point.
Expand cluster from core point.
Mark points not reachable as noise.
Example:
Given dataset: (1,2), (2,2), (2,3), (8,7), (8,8), (25,80)
With ε = 1.5 and MinPts = 2 → two clusters: one near (2,2) and one near (8,7). (25,80) is noise.
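A minimal sketch of the same example, assuming scikit-learn is available (its min_samples counts the point itself; the label -1 marks noise):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)])

# eps and min_samples correspond to the epsilon and MinPts values above
labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(X)
print(labels)   # e.g. [ 0  0  0  1  1 -1] -> two clusters and one noise point
```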
Q7. Attempt any one part
a. Differentiate between:
(i) OLAP and OLTP
| Feature | OLAP | OLTP |
|---|---|---|
| Purpose | Analytical processing | Transactional processing |
| Data | Historical, consolidated | Current, operational |
| Query type | Complex queries | Simple, fast queries |
| Example | Sales report generation | Bank transaction |
(ii) Slice and Dice Operations
Slice: Fixes a single value on one dimension, producing a sub-cube (e.g., sales for the year 2023).
Dice: Selects specific values on two or more dimensions (e.g., sales in Q1 of 2023 for region X and product Y).
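A minimal pandas sketch of how slice and dice translate into filters on a small made-up sales cube (column names and values are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2022],
    "quarter": ["Q1", "Q2", "Q1", "Q1"],
    "region":  ["X", "X", "Y", "X"],
    "product": ["Y", "Z", "Y", "Y"],
    "revenue": [100, 150, 80, 90],
})

# Slice: fix a single value on one dimension (year = 2023)
slice_2023 = sales[sales["year"] == 2023]

# Dice: fix values on two or more dimensions at once
dice = sales[(sales["year"] == 2023) & (sales["quarter"] == "Q1")
             & (sales["region"] == "X") & (sales["product"] == "Y")]

print(slice_2023["revenue"].sum(), dice["revenue"].sum())   # 330 100
```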
b. Define Spatial Data. How mining of spatial data is done?
Spatial Data: Refers to data representing the physical location and shape of objects, like coordinates, regions, maps.
Spatial Data Mining:
Involves extracting patterns from spatial datasets.
Techniques include spatial clustering, spatial association rules, spatial classification.
Applications: GIS, urban planning, agriculture.