DWDM 2024 question paper with solution

 




Question paper DWDM 2024: https://www.abesit.in/wp-content/uploads/2024/11/KOE093-DATA-WAREHOUSING-DATA-MINING.pdf

Solution of the DWDM 2024 paper

SECTION A – Attempt all questions in brief (2 marks each)

a. Define Data Warehousing.

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile central repository that stores data from multiple sources to support analysis, reporting, and decision-making.

b. Discuss the Fact Constellation.

A fact constellation (also known as a galaxy schema) is a schema that contains multiple fact tables sharing common dimension tables; it is used for complex data warehouse designs.

c. Explain Distributed DBMS implementation.

A Distributed DBMS manages a database distributed across multiple locations, ensuring transparency, consistency, and coordination of data across sites.

d. Define Warehousing Software.

Warehousing software is a tool or application used to construct, manage, and access a data warehouse for efficient data integration, analysis, and reporting.

e. Are all the patterns interesting?

No, not all patterns are interesting; only those that are valid, novel, useful, and potentially actionable are considered interesting in data mining.

f. Differentiate between binary symmetric attributes and asymmetric attributes.

Symmetric binary attribute: both states are equally important (e.g., gender: male/female).

Asymmetric binary attribute: the two states are not equally important; usually only the rarer "positive" state matters (e.g., disease test result: positive/negative).

g. Find mode of: 12,13,34,32,21,29,40,11,39,32. What is advantage of mode over mean and median?

Mode = 32 (occurs twice)

Advantage: Unlike the mean, the mode is not affected by extreme values (outliers), and it is the only measure of central tendency that can be used for categorical (nominal) data.

h. Given tuples (22, 4, 25, 10) and (20, 10, 26, 2), compute Manhattan distance.

|22 - 20| + |4 - 10| + |25 - 26| + |10 - 2| = 2 + 6 + 1 + 8 = 17
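
As a quick check, a minimal Python sketch of the same Manhattan (city-block) distance computation:

```python
# Manhattan distance between the two 4-dimensional tuples above.
a = (22, 4, 25, 10)
b = (20, 10, 26, 2)
print(sum(abs(x - y) for x, y in zip(a, b)))  # -> 17
```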

i. What do you mean by Temporal Mining?

Temporal Mining is the process of discovering patterns and relationships in temporal data, i.e., data with time-based attributes.

j. Discuss Data Visualization.

Data Visualization refers to graphical representation of data and findings (e.g., charts, graphs) to make analysis and trends easier to understand.


SECTION B – Attempt any three (3 × 10 = 30 marks)

a. Write short notes on:

i. Steps of Knowledge Discovery in Databases (KDD):

Data Cleaning: Remove noise and inconsistencies.

Data Integration: Combine data from multiple sources.

Data Selection: Choose relevant data for analysis.

Data Transformation: Convert data into suitable format.

Data Mining: Apply algorithms to extract patterns.

Pattern Evaluation: Identify interesting patterns.

Knowledge Presentation: Visualize and present mined knowledge.

ii. Explain Snowflake Schema in detail:

A snowflake schema is a logical arrangement of tables in a multidimensional database that resembles a snowflake shape.

It normalizes dimension tables into multiple related tables.

Example: "Customer" dimension may be split into "Customer" and "Address" tables.

Advantages: Reduces redundancy.

Disadvantages: More complex joins.


b. Explain Market Basket Analysis.

Market Basket Analysis is a data mining technique used to identify purchasing patterns by analyzing the co-occurrence of items in transactions.

Association Rules: A → B [support, confidence], where support is the fraction of transactions containing both A and B, and confidence is the fraction of transactions containing A that also contain B.

Example: If customers buy bread and butter, they are likely to buy milk.

Useful in cross-selling, promotions, and store layout.
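
A minimal Python sketch of how support and confidence would be computed for the rule {bread, butter} → {milk}; the transaction list is made up purely for illustration:

```python
# Each transaction is a set of purchased items (illustrative data, not from the paper).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

antecedent = {"bread", "butter"}   # A
consequent = {"milk"}              # B

n = len(transactions)
count_a  = sum(1 for t in transactions if antecedent <= t)                 # transactions containing A
count_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)  # containing both A and B

support = count_ab / n           # fraction of all transactions with A and B together
confidence = count_ab / count_a  # of the transactions with A, how many also contain B

print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # support = 0.40, confidence = 0.67
```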


c. Draw box-and-whisker plot for dataset: 4.2, 5.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4

Steps:

Order data: 3.9, 4.1, 4.2, 4.2, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1, 5.3

Minimum: 3.9

Q1 (25th percentile): 4.3

Median (Q2): 4.45

Q3 (75th percentile): 4.8

Maximum: 5.3

Draw a box from Q1 to Q3, mark the median inside it, and extend whiskers to the minimum and maximum (here IQR = Q3 − Q1 = 0.5, and no value lies more than 1.5 × IQR beyond the box, so the whiskers reach 3.9 and 5.3 and there are no outliers).
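
A minimal Python sketch that reproduces the five-number summary (using the "median of each half" quartile convention applied above) and draws the plot with matplotlib:

```python
import matplotlib.pyplot as plt

data = sorted([4.2, 5.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7,
               4.1, 4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4])

def median(xs):
    m = len(xs)
    return xs[m // 2] if m % 2 else (xs[m // 2 - 1] + xs[m // 2]) / 2

n = len(data)
q1 = median(data[:n // 2])        # lower half (first 9 values) -> 4.3
q2 = median(data)                 # overall median              -> 4.45
q3 = median(data[(n + 1) // 2:])  # upper half (last 9 values)  -> 4.8
print(min(data), q1, q2, q3, max(data))  # -> 3.9 4.3 4.45 4.8 5.3

plt.boxplot(data)  # note: other quartile conventions may give slightly different box edges
plt.show()
```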


d. Cluster points using K-Means (Euclidean distance): (2,4), (6,8), (1,2), (4,5), (3,5)

Assume initial centroids: C1=(2,4), C2=(6,8)

Step 1: Assign clusters based on Euclidean distance

Calculate distance to each centroid and assign:

(2,4) → C1

(6,8) → C2

(1,2) → C1

(4,5) → C1

(3,5) → C1

Step 2: Recalculate Centroid of C1:

Mean of (2,4), (1,2), (4,5), (3,5)

= ((2+1+4+3)/4, (4+2+5+5)/4) = (2.5, 4)

New centroids: C1 = (2.5, 4); C2 = (6, 8) remains unchanged, since it contains only one point.

Step 3: Reassign all points to the new centroids. No assignment changes, so the algorithm has converged.

Final clusters: Cluster 1 = {(2, 4), (1, 2), (4, 5), (3, 5)} with centroid (2.5, 4); Cluster 2 = {(6, 8)} with centroid (6, 8).
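
A minimal NumPy sketch of the same K-Means run, starting from the assumed initial centroids C1 = (2, 4) and C2 = (6, 8):

```python
import numpy as np

points = np.array([[2, 4], [6, 8], [1, 2], [4, 5], [3, 5]], dtype=float)
centroids = np.array([[2, 4], [6, 8]], dtype=float)  # assumed initial centroids

while True:
    # Assignment step: each point goes to the nearest centroid (Euclidean distance).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points
    # (assumes no cluster becomes empty, which holds for this data).
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)     # -> [0 1 0 0 0]  (0 = cluster C1, 1 = cluster C2)
print(centroids)  # -> [[2.5 4. ] [6.  8. ]]
```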


e. Explain ROLAP, MOLAP, and HOLAP in detail:

| Type  | Storage                 | Query Performance | Pros                        | Cons              |
|-------|-------------------------|-------------------|-----------------------------|-------------------|
| ROLAP | Relational database     | Slower            | Scalable                    | Slower response   |
| MOLAP | Multidimensional cubes  | Fast              | Fast analysis               | Limited data size |
| HOLAP | Hybrid of ROLAP + MOLAP | Balanced          | Combines advantages of both | Complex           |




SECTION C – Attempt any one part from each question (2 × 10 = 20 marks)


Q3. Attempt any one part

a. How is a 2D table mapped into a multidimensional data model? Explain with a suitable example.

Mapping a 2D relational table into a multidimensional data model involves converting rows and columns into facts and dimensions.

Example: Sales Table

| Date  | Product | Store  | Revenue |
|-------|---------|--------|---------|
| Jan 1 | Laptop  | Delhi  | 50,000  |
| Jan 2 | Mouse   | Mumbai | 2,000   |


This table can be mapped to a fact table:

Fact: Revenue

Dimensions: Date, Product, Store

Multidimensional Schema:

Fact Table: Contains revenue, foreign keys to dimensions

Dimension Tables:

Date Dimension: (Date_ID, Day, Month, Year)

Product Dimension: (Product_ID, Name, Category)

Store Dimension: (Store_ID, City, Region)

This structure supports OLAP operations like slicing, dicing, roll-up, and drill-down.
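
A minimal pandas sketch of the same fact/dimension split and a roll-up query over it (table and column names are illustrative, not prescribed by the paper):

```python
import pandas as pd

# Fact table: one row per sale, with foreign keys into the dimension tables.
fact_sales = pd.DataFrame({
    "date_id":    [1, 2],
    "product_id": [101, 102],
    "store_id":   [11, 12],
    "revenue":    [50000, 2000],
})

# Store dimension: descriptive attributes of each store.
dim_store = pd.DataFrame({
    "store_id": [11, 12],
    "city":     ["Delhi", "Mumbai"],
    "region":   ["North", "West"],
})

# Roll-up: total revenue per region (join the fact table with the Store dimension, then aggregate).
rollup = fact_sales.merge(dim_store, on="store_id").groupby("region")["revenue"].sum()
print(rollup)
```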


b. Write short notes on:

i. Data Characterization and Data Discrimination

Data Characterization: Summarizes the general features of a target class (e.g., average income of customers in region X).

Data Discrimination: Compares features of target class against contrasting class (e.g., loyal vs. occasional customers).

ii. Snowflake Schema in detail

[Already covered in Section B, part (a)(ii)]


Q4. Attempt any one part

a. Differentiate between:

(i) Min-Max and Z-score Normalization

| Feature                 | Min-Max Normalization        | Z-score Normalization                         |
|-------------------------|------------------------------|-----------------------------------------------|
| Formula                 | (X - min) / (max - min)      | (X - mean) / standard deviation               |
| Range                   | Scales values to [0, 1]      | Mean = 0, standard deviation = 1              |
| Use case                | When min and max are known   | When the data is roughly normally distributed |
| Sensitivity to outliers | High                         | Lower (less affected by extreme values)       |
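
A minimal NumPy sketch of both normalizations, applied to the numeric data from Section A, question (g):

```python
import numpy as np

x = np.array([12, 13, 34, 32, 21, 29, 40, 11, 39, 32], dtype=float)

min_max = (x - x.min()) / (x.max() - x.min())  # every value rescaled into [0, 1]
z_score = (x - x.mean()) / x.std()             # mean 0, standard deviation 1 (population std)

print(min_max.round(2))
print(z_score.round(2))
```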


(ii) Binary and Nominal data variables

| Type    | Binary                   | Nominal                               |
|---------|--------------------------|---------------------------------------|
| Values  | Two (e.g., 0/1, Yes/No)  | Multiple categories (e.g., Red, Blue) |
| Example | Gender (M/F), Pass/Fail  | City, Department Name                 |



b. Explain major components of Data Mining Architecture.

Data Sources: Databases, data warehouses, flat files.

Data Warehouse Server: Stores pre-processed data.

Data Mining Engine: Core component performing algorithms like classification, clustering.

Pattern Evaluation Module: Identifies interesting patterns.

User Interface: Allows user interaction with mining process.

Knowledge Base: Stores metadata and domain knowledge.



SECTION D – Attempt any one part from each question (3 × 10 = 30 marks)


Q5. Attempt any one part

a. Discuss Decision Tree-Based Classifiers in detail.

Decision tree classifiers use a tree-like model of decisions and their consequences.

Key Concepts:

Nodes: Represent tests on attributes.

Branches: Represent outcomes of the tests.

Leaves: Represent class labels (the decisions).

Algorithm:

Common algorithms: ID3, C4.5, CART.

Use an attribute selection measure (e.g., Information Gain, Gini Index) to choose the best attribute to split on.

Recursively partition the dataset until a stopping criterion is met (e.g., all tuples in a node belong to one class).

Advantages:

Easy to understand and interpret.

Handles both numerical and categorical data.

Disadvantages:

Prone to overfitting.

Sensitive to noisy data.
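
A minimal scikit-learn sketch of training and inspecting a decision tree (CART with the Gini index); the tiny encoded dataset below is made up purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative encoding: age (0=youth, 1=middle, 2=senior), student (0/1), credit (0=fair, 1=excellent).
X = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [2, 1, 1], [2, 0, 1], [1, 1, 0]]
y = ["no", "yes", "yes", "yes", "no", "yes"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "student", "credit"]))  # print the learned splits
print(tree.predict([[0, 1, 0]]))  # classify a new (youth, student=yes, credit=fair) example
```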




b. Classify the tuple X = (youth, medium, student=yes, credit_rating=fair) using Bayes Theorem.

We apply Naive Bayes Classification.

Calculate prior probabilities:

P(Yes) = 3/6 = 0.5, P(No) = 3/6 = 0.5



Calculate likelihoods for each attribute, from the given training table:

| Attribute       | P(Attribute | Yes) | P(Attribute | No) |
|-----------------|--------------------|-------------------|
| Age = youth     | 1/3                | 2/3               |
| Income = medium | 1/3                | 0/3               |
| Student = yes   | 1/3                | 1/3               |
| Credit = fair   | 2/3                | 1/3               |

Calculate posterior probabilities (comparing numerators only, since the evidence P(X) is the same for both classes):

P(Yes | X) ∝ P(Yes) × (1/3) × (1/3) × (1/3) × (2/3) = 0.5 × 2/81 = 1/81 ≈ 0.0123

P(No | X) ∝ P(No) × (2/3) × (0/3) × (1/3) × (1/3) = 0 (because P(Income = medium | No) = 0)

Since P(Yes | X) > P(No | X), the tuple X is classified as Yes.
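
A minimal Python sketch that simply multiplies out the priors and the likelihoods taken from the table above; in practice, Laplace (add-one) smoothing is used to keep a single zero likelihood from wiping out an entire class:

```python
from functools import reduce

priors = {"yes": 3/6, "no": 3/6}
likelihoods = {                    # values copied from the likelihood table above
    "yes": [1/3, 1/3, 1/3, 2/3],   # P(youth|yes), P(medium|yes), P(student=yes|yes), P(fair|yes)
    "no":  [2/3, 0/3, 1/3, 1/3],   # the same four attributes, conditioned on class = no
}

scores = {c: priors[c] * reduce(lambda a, b: a * b, likelihoods[c]) for c in priors}
print(scores)                       # {'yes': 0.0123..., 'no': 0.0}
print(max(scores, key=scores.get))  # -> 'yes'
```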


Q6. Attempt any one part

a. Explain various types of clustering methods. Discuss any one partitioning clustering algorithm.

Clustering Methods:

Partitioning Methods (e.g., K-Means): Divide the data into k groups.

Hierarchical Methods: Build a tree of clusters (agglomerative or divisive).

Density-Based Methods (e.g., DBSCAN): Discover clusters of arbitrary shape based on density.

Grid-Based Methods (e.g., STING): Divide the data space into a grid structure.

K-Means Algorithm (partitioning method):

Choose k initial centroids.

Assign each point to the nearest centroid.

Recalculate each centroid as the mean of its assigned points.

Repeat until the centroids stop changing (a worked example appears in Section B, part d; a complementary hierarchical sketch follows below).
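
The partitioning algorithm (K-Means) is already worked through in Section B, part (d); as a complement, here is a minimal SciPy sketch of a hierarchical (agglomerative) method, run on the non-noise points of the DBSCAN example below:

```python
from scipy.cluster.hierarchy import linkage, fcluster

points = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]]  # borrowed from the DBSCAN example, outlier removed

Z = linkage(points, method="single")             # bottom-up merges using single-link distance
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 flat clusters
print(labels)                                    # e.g. [1 1 1 2 2]
```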




b. Discuss DBSCAN clustering algorithm with suitable example.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

Groups together points that are close to each other (high-density regions) and marks points in low-density regions as outliers (noise).

Parameters:

ε (Epsilon): Radius within which to search for neighbours.

MinPts: Minimum number of points required to form a dense region.

Steps:

Start from an unvisited point.

If it has at least MinPts points within ε, it is a core point.

Expand a cluster from the core point by adding all density-reachable points.

Mark points not reachable from any core point as noise.

Example:

Given dataset: (1,2), (2,2), (2,3), (8,7), (8,8), (25,80)

With ε = 1.5 and MinPts = 2 → two clusters: {(1,2), (2,2), (2,3)} and {(8,7), (8,8)}; the point (25,80) is noise.
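
A minimal scikit-learn sketch that checks the worked example (scikit-learn's min_samples includes the point itself, so MinPts = 2 is taken as min_samples=2 here):

```python
from sklearn.cluster import DBSCAN

X = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]]
labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(X)
print(labels)  # -> [ 0  0  0  1  1 -1]   (label -1 marks the noise point (25, 80))
```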


Q7. Attempt any one part

a. Differentiate between:

(i) OLAP and OLTP

| Feature    | OLAP                     | OLTP                     |
|------------|--------------------------|--------------------------|
| Purpose    | Analytical processing    | Transactional processing |
| Data       | Historical, consolidated | Current, operational     |
| Query type | Complex queries          | Simple, fast queries     |
| Example    | Sales report generation  | Bank transaction         |


(ii) Slice and Dice Operations

Slice: Select a single value along one dimension, producing a sub-cube (e.g., all sales for the year 2023).

Dice: Select specific values along two or more dimensions (e.g., sales in Q1 of 2023 for region X and product Y).
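
A minimal pandas sketch contrasting the two operations on a small made-up sales table (column names are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2022, 2023],
    "quarter": ["Q1", "Q2", "Q1", "Q1"],
    "region":  ["X",  "X",  "Y",  "Y"],
    "product": ["A",  "B",  "A",  "A"],
    "revenue": [100, 150, 80, 120],
})

# Slice: fix a single dimension (all 2023 sales).
slice_2023 = sales[sales["year"] == 2023]

# Dice: fix values on several dimensions at once (Q1 of 2023, region X).
dice_q1_x = sales[(sales["year"] == 2023) & (sales["quarter"] == "Q1") & (sales["region"] == "X")]

print(slice_2023)
print(dice_q1_x)
```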




b. Define Spatial Data. How is mining of spatial data done?

Spatial Data: Data that represents the physical location and shape of objects, such as coordinates, regions, and maps.

Spatial Data Mining:

Involves extracting interesting patterns and relationships from spatial datasets.

Techniques include spatial clustering, spatial association rules, and spatial classification.

Applications: GIS, urban planning, agriculture.
