Data Reduction Strategies


Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same analytical results.

Data Reduction Strategies


  1. Data Cube Aggregation, where aggregation operations are applied to the data in the construction of a data cube.

  2. Attribute Subset Selection, where irrelevant, weakly relevant, or redundant attributes of dimensions may be detected and removed.

  3. Dimensionality Reduction, where the encoding mechanisms are used to reduce the data set size.

  4. Numerosity Reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models, or by non-parametric methods such as clustering, sampling, and the use of histograms.

  5. Discretization and Concept Hierarchy Generation, where ranges or higher conceptual levels replace raw data values for attributes. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction. The computational time spent on data reduction should not outweigh or erase the time saved by mining on a reduced data set.


Data Reduction Example

An example from astronomy is the on-board data reduction of the Kepler satellite. The satellite records 95-megapixel images once every six seconds, generating dozens of megabytes of data per second, which is orders of magnitude more than the downlink bandwidth of 550 KBps. The on-board data reduction consists of co-adding the raw frames for thirty minutes, reducing the bandwidth by a factor of 300. Furthermore, interesting targets are pre-selected and only the relevant pixels are processed, which is 6% of the total. This reduced data is then sent to Earth, where it is processed further.


Research has also been carried out on the use of data reduction in wearable (wireless) devices for health monitoring and diagnosis applications. For instance, in the context of epilepsy diagnosis, data reduction has been used to increase the battery lifetime of a wearable EEG device by selecting and transmitting only the EEG data that is relevant for diagnosis and discarding background activity.


Data Reduction in Storage

Data reduction can increase storage efficiency and performance and reduce storage costs. It reduces the amount of data stored on the system using a number of methods. The system supports data reduction pools, which contain thin-provisioned, compressed, and deduplicated volumes.


Why Data Reduction Is Necessary

Data reduction is the process of reducing the amount of capacity required to store data. It can increase storage efficiency and reduce costs. Storage vendors often describe storage capacity in terms of raw capacity and effective capacity, the latter referring to the data after reduction.


Data Reduction in Data Mining

Data reduction aims to achieve a condensed description of the original data that is far smaller in volume but preserves the quality of the original data.


Methods of data reduction:

These are explained below.


1. Data Cube Aggregation:

This technique is used to aggregate data into a simpler form. For instance, imagine the information you gathered for your analysis for the years 2012 to 2014 includes the revenue of your company every three months. If you are interested in the annual sales rather than the quarterly figures, you can summarize the data so that the resulting data set reports the total sales per year instead of per quarter.
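As a minimal sketch of this roll-up (the revenue figures below are made up for illustration), the quarterly records can be aggregated to annual totals with a plain dictionary:

```python
# Quarterly revenue per (year, quarter); illustrative numbers only.
quarterly_sales = {
    (2012, "Q1"): 224, (2012, "Q2"): 408, (2012, "Q3"): 350, (2012, "Q4"): 586,
    (2013, "Q1"): 310, (2013, "Q2"): 296, (2013, "Q3"): 402, (2013, "Q4"): 512,
    (2014, "Q1"): 280, (2014, "Q2"): 455, (2014, "Q3"): 431, (2014, "Q4"): 598,
}

# Aggregate: twelve quarterly rows collapse into three annual rows.
annual_sales = {}
for (year, _quarter), revenue in quarterly_sales.items():
    annual_sales[year] = annual_sales.get(year, 0) + revenue

print(annual_sales)  # {2012: 1568, 2013: 1520, 2014: 1764}
```

The reduced representation keeps exactly the information the annual analysis needs while shrinking the data set by a factor of four.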


2. Dimension reduction:

Whenever we encounter attributes that are only weakly relevant, we keep just the attributes required for our analysis. This reduces data size because it eliminates outdated or redundant features.


Stepwise Forward Selection –

The selection begins with an empty set of attributes. At each step, we determine the best of the remaining original attributes and add it to the set, based on its relevance to the target; in statistics this relevance is commonly assessed with a p-value.


Stepwise Backward Selection –

This selection starts with the complete set of attributes in the original data and, at each step, eliminates the worst remaining attribute in the set.


Combination of Forward and Backward Selection –

It allows us to remove the worst and select the best attributes, saving time and making the process faster.
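A simplified sketch of stepwise forward selection follows. As an assumption for illustration, relevance is scored by absolute correlation with the target (a stand-in for the statistical test mentioned above), and the attribute values are invented toy data:

```python
def correlation(xs, ys):
    """Pearson correlation; returns 0.0 for a constant attribute."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def forward_select(attributes, target, k):
    """Start from an empty set; greedily add the most relevant attribute."""
    selected = []
    remaining = dict(attributes)
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda name: abs(correlation(remaining[name], target)))
        selected.append(best)
        del remaining[best]
    return selected

# Toy data: 'a' tracks the target perfectly, 'b' is noise, 'c' is constant.
attrs = {"a": [1, 2, 3, 4, 5], "b": [2, 9, 1, 7, 4], "c": [5, 5, 5, 5, 5]}
target = [2, 4, 6, 8, 10]
print(forward_select(attrs, target, 2))  # ['a', 'b']
```

This greedy variant ignores interactions between attributes; real feature-selection procedures re-evaluate relevance conditionally at each step.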


3. Data Compression:

The data compression technique reduces the size of files using different encoding mechanisms (Huffman encoding and run-length encoding). It can be divided into two types based on the compression technique.


Lossless Compression –

Encoding techniques such as run-length encoding allow simple and modest reduction of data size. Lossless data compression uses algorithms that restore the exact original data from the compressed data.
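A minimal run-length encoding sketch that illustrates the lossless property: decoding restores the exact original string. It assumes the input contains no digit characters.

```python
def rle_encode(s):
    """Collapse each run of repeated characters into (char, count)."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return "".join(f"{ch}{count}" for ch, count in runs)

def rle_decode(encoded):
    """Invert rle_encode: expand each char/count pair back into a run."""
    out = []
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        i += 1
        digits = ""
        while i < len(encoded) and encoded[i].isdigit():
            digits += encoded[i]
            i += 1
        out.append(ch * int(digits))
    return "".join(out)

original = "aaaabbbcca"
packed = rle_encode(original)
print(packed)                           # a4b3c2a1
print(rle_decode(packed) == original)   # True
```

RLE only pays off when the data contain long runs of identical values; on random data the "compressed" form can be larger than the input.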


Lossy Compression –

Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this kind of compression. For example, the JPEG image format is a lossy compression, yet we can still recognize the meaning of the original image. In lossy data compression, the decompressed data may differ from the original data but are useful enough to retrieve information from.


4. Numerosity Reduction:

In this reduction technique, the actual data are replaced with mathematical models or smaller representations of the data; it is then only necessary to store the model parameters. Alternatively, non-parametric methods such as clustering, histograms, and sampling can be used.
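A sketch of the parametric case: rather than storing every (x, y) point, fit y ≈ a + b·x by least squares and keep only the two parameters. The data points below are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 3.1, 4.9, 7.2, 9.0, 10.8]   # roughly y = 1 + 2x

a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # 1.07 1.97

# The six stored points are now represented by just two parameters;
# any y can be estimated as a + b*x.
```

The reduction is lossy in the same sense as the compression methods above: the model reproduces the data only approximately, which is acceptable when the residual error does not matter for the analysis.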


5. Discretization & Concept Hierarchy Operation:

Techniques of data discretization are used to divide attributes of a continuous nature into intervals. We replace many constant values of an attribute with labels of small intervals. This means that mining results are shown in a concise and easily understandable way.


Top-down discretization –

If you first consider one or a few points (so-called breakpoints or split points) to divide the whole range of values, and then repeat this recursively on the resulting intervals, the process is known as top-down discretization, also referred to as splitting.


Bottom-up discretization –

If you first consider all the continuous values as split points and then discard some by merging neighborhood values into intervals, that process is called bottom-up discretization, also referred to as merging.


Concept Hierarchies:

It reduces the data size by collecting and then replacing low-level concepts (such as the age 43) with higher-level concepts (categorical labels such as Middle-aged or Senior).
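The age example can be sketched as a simple mapping; the cut-off ages below are illustrative assumptions, not fixed rules:

```python
def age_concept(age):
    """Climb the concept hierarchy: raw age -> higher-level label."""
    if age < 30:
        return "Youth"
    elif age < 60:
        return "Middle-aged"
    return "Senior"

ages = [21, 43, 67, 35, 58]
print([age_concept(x) for x in ages])
# ['Youth', 'Middle-aged', 'Senior', 'Middle-aged', 'Middle-aged']
```

Many distinct raw values collapse into three labels, so the attribute can be mined at a coarser, more interpretable level of abstraction.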


For numeric data, the following techniques are often used:


Binning –

Binning is the process of converting numerical variables into categorical counterparts. The number of categorical counterparts depends on the number of bins specified by the user.


Histogram analysis –

As in binning, the histogram is used to partition the values of an attribute X into disjoint ranges called buckets. There are several partitioning rules:


Equal-Frequency Partitioning: Partitioning the values based on their number of occurrences in the data set, so that each bucket holds roughly the same number of values.


Equal-Width Partitioning: Partitioning the values into intervals of a fixed width determined by the number of bins, e.g., a group of values ranging from 0–20.


Clustering: Grouping similar data together.
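The first two partitioning rules above can be sketched as follows, on a small set of toy values; the equal-frequency variant here simply splits the sorted values into nearly equal groups:

```python
def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # Clamp the maximum so it falls in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return edges, counts

def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    groups = [ordered[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    groups.append(ordered[(n_bins - 1) * size:])
    return groups

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92]
print(equal_width_bins(data, 3)[1])   # counts per fixed-width bin: [5, 3, 2]
print(equal_frequency_bins(data, 3))
# [[5, 10, 11], [13, 15, 35], [50, 55, 72, 92]]
```

Note how skewed data fills equal-width bins unevenly, while equal-frequency bins absorb the skew into varying interval widths.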




Data Reduction tools

The Data Reduction Estimation Tool (DRET) is a command-line, host-based utility for estimating the data reduction savings on block storage devices. To help with the profiling and analysis of existing user workloads that must be migrated to a new system, IBM provides the highly accurate DRET, which supports both deduplication and compression. The tool scans target workloads on various legacy storage arrays (from IBM or another vendor), merges all scan results, and then provides an integrated system-level data reduction estimate.


1 LBC Data Reduction Tools

2 MODS Data Reduction Tools

3 LUCI Data Reduction Tools


There are no observatory-based and -supported pipelines for the reduction of data from the three pairs of facility instruments: LBCB+LBCR, LUCI1+2, and MODS 1+2. However, frequent users have developed data reduction software which they have made publicly available, and this page aims to gather links to their contributions. If you know of any software packages missing from this page or have any suggestions regarding the reduction of data from these instruments, please let us know at sciences at lbto dot org.


The MODS team has provided both a set of scripts (modsCCDRed) to remove instrumental signatures from MODS images and an IDL-based MODS data reduction pipeline, which was updated on 19-Feb-2019 to work with both MODS1 and MODS2 grating modes. These are also linked here, although they are discussed in more detail on dedicated sites.


LBC Data Reduction Tools

The INAF LBC data reduction pipeline, while not available for download, was described at the recent June 2017 Users' Meeting.


LBC-reduction (a script to reduce LBC data, written by Benjamin Weiner, UA, using IRAF, IDL, scamp, and sharp; last updated in 2013.)


LBC_Redux (similar to LBC-reduction but written by Neil Crighton, then at MPIA, using Python rather than IDL; last updated in 2014.)


The THELI image processing pipeline includes both LBC-Blue and LBC-Red among the list of imaging instruments whose data it can reduce. If using THELI, please acknowledge: Schirmer 2013, ApJS, 209, 21 and Erben, Schirmer, Dietrich, et al. 2005, AN 326, 43


If you choose to reduce your data with standard packages, keep in mind that the LBC bias level varies, so the right way to remove the bias offset is to use the overscan region. Bias frames are used to remove the remaining bias structure, which is stable. See the LBC Calibrations page for more details.


MODS Data Reduction Tools

mods_quickreduce --- an onsite quick-reduce tool, discussed in the observing section.


modsCCDRed --- modsCCDRed is a collection of Python scripts that process the MODS 2D full-frame CCD images. While they perform basic tasks such as bias (actually prescan) subtraction and flat-fielding, they also effectively deal with features that are unique to MODS, for instance the "even-odd" effect.


The modsCCDRed suite of scripts generates color-normalized flats, then bias-subtracts, flat-fields, and corrects for bad pixels and columns in the full-frame spectroscopic images.


modsIDL --- modsIDL is a suite of programs written in IDL for the reduction of MODS long-slit and multi-slit spectroscopy. Until recently it worked only for MODS1 spectra, both grating and prism; however, a new release issued on 02-19-2019 extends it to MODS2 grating spectra.


LUCI Data Reduction Tools

FLAME is an IDL-based data reduction pipeline for LUCI, written by Sirio Belli (MPE).

For a summary, see his Users' Meeting presentation about FLAME. The paper by Belli, Contursi, and Davies (2017) describes the pipeline more thoroughly and should be cited in works that make use of it.

