Data Reduction Strategies
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results. Strategies for data reduction include the following:
Data Cube Aggregation, where aggregation operations are applied to the data in the construction of a data cube.
Attribute Subset Selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
Dimensionality Reduction, where encoding mechanisms are used to reduce the data set size.
Numerosity Reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models, or by nonparametric methods such as clustering, sampling, and the use of histograms.
Discretization and Concept Hierarchy Generation, where ranges or higher conceptual levels replace raw data values for attributes. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction. Note that the computational time spent on data reduction should not outweigh or erase the time saved by mining on a reduced data set.
Data Reduction Examples
An example from astronomy is the data reduction performed on board the Kepler satellite. The satellite records 95-megapixel images once every six seconds, generating dozens of megabytes of data per second, which is orders of magnitude more than the downlink bandwidth of 550 KBps. The on-board data reduction involves co-adding the raw frames for thirty minutes, reducing the bandwidth by a factor of 300. Furthermore, interesting targets are pre-selected and only the relevant pixels are processed, which is about 6% of the total. This reduced data is then sent to Earth, where it is processed further.
Research has also been carried out on the use of data reduction in wearable (wireless) devices for health monitoring and diagnosis applications. For instance, in the context of epilepsy diagnosis, data reduction has been used to increase the battery lifetime of a wearable EEG device by selecting and transmitting only the EEG data that is relevant for diagnosis and discarding background activity.
Data Reduction in Storage
Data reduction can increase storage efficiency and performance and reduce storage costs. It reduces the amount of data stored on a system using a number of methods. Systems that support data reduction pools, for example, offer thin-provisioned, compressed, and deduplicated volumes.
Why Data Reduction Is Necessary
Data reduction is the process of reducing the amount of capacity required to store data. It can increase storage efficiency and reduce costs. Storage vendors often describe storage capacity in terms of raw capacity and effective capacity, the latter referring to the capacity available after data reduction.
Data Reduction in Data Mining
Data reduction aims to achieve a condensed description of the original data that is much smaller in volume but preserves the quality of the original data.
The main methods of data reduction are explained below.
1. Data Cube Aggregation:
This technique is used to aggregate data into a simpler form. For instance, imagine that the data you gathered for your analysis covers the years 2012 to 2014 and includes your company's revenue for every three months. If you are interested in annual sales rather than quarterly figures, you can summarize the data so that the result shows total sales per year instead of per quarter. This summarization reduces the data without losing the information needed for the analysis.
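A minimal pandas sketch of this quarter-to-year roll-up; the column names and sales figures are illustrative, not from a real dataset:

```python
# Aggregating quarterly sales into annual totals with pandas.
import pandas as pd

quarterly = pd.DataFrame({
    "year":  [2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013],
    "sales": [224, 408, 350, 586, 310, 512, 440, 620],
})

# Roll the data up one level: quarter -> year.
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
#    year  sales
# 0  2012   1568
# 1  2013   1882
```

The reduced table has one row per year instead of four, yet still answers the annual-sales question exactly.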
2. Dimensionality Reduction:
Whenever we encounter attributes that are weakly relevant or irrelevant, we keep only the attributes required for our analysis. This reduces the data size by eliminating outdated or redundant features. Common attribute subset selection techniques include:
Stepwise Forward Selection – The selection begins with an empty set of attributes. At each step, the best of the remaining original attributes is determined and added to the set, based on its relevance to the analysis (measured, for example, by a p-value in statistics).
Stepwise Backward Selection – This selection starts with the full set of attributes in the original data and, at each step, eliminates the worst remaining attribute in the set.
Combination of Forward and Backward Selection – This combines both approaches, letting us remove the worst attributes and select the best ones, which saves time and makes the process faster. A minimal forward-selection sketch is shown below.
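The following greedy forward-selection sketch scores each candidate attribute by cross-validated model performance rather than a p-value; the diabetes dataset, the linear model, and the choice of keeping three attributes are all illustrative assumptions:

```python
# Greedy stepwise forward selection: start from an empty attribute set
# and repeatedly add the attribute that most improves a CV score.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))

for _ in range(3):  # keep the 3 most useful attributes (arbitrary choice)
    # Score each candidate attribute when added to the current set.
    scores = {f: cross_val_score(LinearRegression(),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("selected attribute indices:", selected)
```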
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms (e.g., Huffman encoding and run-length encoding). It can be divided into two types based on the compression technique used.
Lossless Compression – Encoding techniques such as run-length encoding allow a simple and modest reduction in data size. Lossless data compression uses algorithms that restore the exact original data from the compressed data; a run-length encoding sketch follows this list.
Lossy Compression – Methods such as the discrete wavelet transform and principal component analysis (PCA) are examples of this kind of compression. The JPEG image format, for instance, is lossy, yet the decompressed image still conveys the meaning of the original. In lossy compression, the decompressed data may differ from the original data but remain useful enough to retrieve information from.
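A minimal run-length encoding sketch in plain Python, illustrating the lossless case; the `rle_encode`/`rle_decode` helper names are hypothetical:

```python
# Run-length encoding: store each run of repeated values as a
# (value, count) pair instead of repeating the value itself.
def rle_encode(data):
    encoded = []
    for value in data:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1          # extend the current run
        else:
            encoded.append([value, 1])   # start a new run
    return encoded

def rle_decode(encoded):
    return [value for value, count in encoded for _ in range(count)]

raw = list("AAAABBBCCDAA")
packed = rle_encode(raw)
print(packed)                      # [['A', 4], ['B', 3], ['C', 2], ['D', 1], ['A', 2]]
assert rle_decode(packed) == raw   # lossless: the exact original is restored
```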
4. Numerosity Reduction:
In this reduction technique, the actual data are replaced with a mathematical model or a smaller representation of the data; with a parametric model, it is enough to store only the model parameters rather than the data themselves. Alternatively, nonparametric methods such as clustering, histograms, and sampling can be used.
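A small parametric sketch: instead of storing a thousand (x, y) points, fit a straight line and keep only its two parameters. The synthetic data and the NumPy least-squares fit are illustrative assumptions:

```python
# Parametric numerosity reduction: 1000 noisy points -> 2 model parameters.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1000, dtype=float)
y = 3.0 * x + 7.0 + rng.normal(scale=2.0, size=x.size)  # synthetic data

slope, intercept = np.polyfit(x, y, deg=1)  # store just these two numbers

# Any value can now be estimated from the model rather than looked up.
estimate = slope * 500 + intercept
print(f"stored parameters: slope={slope:.2f}, intercept={intercept:.2f}")
print(f"estimate at x=500: {estimate:.1f} (noise-free value: {3.0 * 500 + 7.0})")
```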
5. Discretization & Concept Hierarchy Operation:
Data discretization techniques are used to divide continuous attributes into intervals. We replace many constant values of an attribute with labels for small intervals. This means that mining results are presented in a concise and easily understandable way.
Top-down discretization – If you first consider one or a few points (so-called breakpoints or split points) to divide the whole range of attribute values, and then repeat this recursively on the resulting intervals, the method is known as top-down discretization, also referred to as splitting.
Bottom-up discretization – If you first consider all the constant values as split points and then discard some by merging neighboring values into intervals, that process is called bottom-up discretization, also referred to as merging.
Concept Hierarchies: These reduce the data size by collecting and then replacing low-level concepts (such as the age value 43) with higher-level concepts (categorical labels such as middle-aged or senior). A minimal sketch follows.
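A tiny sketch of such a hierarchy for age; the cut points at 30 and 60 are illustrative, not a standard:

```python
# Map raw ages to higher-level concept labels.
def age_concept(age):
    if age < 30:
        return "youth"
    elif age < 60:
        return "middle-aged"
    return "senior"

ages = [19, 43, 67, 25, 58]
print([age_concept(a) for a in ages])
# ['youth', 'middle-aged', 'senior', 'youth', 'middle-aged']
```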
For numeric data, the following techniques are often used:
Binning –
Binning is the process of converting numerical variables into categorical counterparts. The number of categorical counterparts depends on the number of bins specified by the user.
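A minimal binning sketch with pandas; the values, bin count, and labels are illustrative:

```python
# Bin a numeric variable into three user-specified categories.
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = pd.cut(prices, bins=3, labels=["low", "medium", "high"])
print(bins.tolist())
# ['low', 'low', 'medium', 'medium', 'medium', 'medium', 'high', 'high', 'high']
```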
Histogram analysis –
As with binning, a histogram is used to partition the values of an attribute X into disjoint ranges called buckets. There are several partitioning rules:
Equal-Frequency Partitioning: Partitioning the values based on their number of occurrences in the data set, so that each bucket holds roughly the same number of values.
Equal-Width Partitioning: Partitioning the values into ranges of a fixed width determined by the number of bins, e.g., a bucket covering the values 0–20.
Clustering: Grouping similar data together.
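A short sketch contrasting equal-width and equal-frequency partitioning with pandas; the sample values and the three-bucket choice are illustrative:

```python
# pd.cut splits the value range into fixed-width buckets (equal width),
# while pd.qcut puts roughly the same number of values in each bucket
# (equal frequency).
import pandas as pd

values = pd.Series([1, 2, 2, 3, 5, 8, 13, 21, 34, 55])

equal_width = pd.cut(values, bins=3)   # buckets of equal range
equal_freq  = pd.qcut(values, q=3)     # buckets of equal occupancy

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```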