
Discarding outliers from dataset

Open monkeyK1n9 opened this issue 1 year ago • 0 comments

Definition

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. This can be extreme minima or extreme maxima. If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range or IQ. The following quantities (called fences) are needed for identifying extreme values in the tails of the distribution:

  1. lower inner fence: Q1 - 1.5*IQ
  2. upper inner fence: Q3 + 1.5*IQ
  3. lower outer fence: Q1 - 3*IQ
  4. upper outer fence: Q3 + 3*IQ
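As an illustration of the four fences above, here is a minimal Python sketch. The function name `compute_fences` is hypothetical, and the quartiles come from the standard library's `statistics.quantiles` with `method="inclusive"`. That is only one of several common quartile conventions, so other tools may give slightly different fences for the same data.

```python
from statistics import quantiles


def compute_fences(values):
    """Return the inner and outer fences for a list of numbers.

    Assumption: quartiles use the "inclusive" method; this is a choice
    made for the sketch, not something the definition prescribes.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iq = q3 - q1  # interquartile range (IQ)
    return {
        "lower_inner": q1 - 1.5 * iq,
        "upper_inner": q3 + 1.5 * iq,
        "lower_outer": q1 - 3 * iq,
        "upper_outer": q3 + 3 * iq,
    }
```

For the values 1 through 9, this convention gives Q1 = 3 and Q3 = 7, so IQ = 4 and the inner fences sit at -3 and 13.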

Problem context

When an individual enters job information (salary, etc.) in jobsika, the entry can be an outlier. This distorts the average values reported by jobsika: an extreme minimum drags the mean down, and an extreme maximum raises it to an unreasonable amount.

Solution approach

Plotting out the values?

The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distributions. A box plot is constructed by drawing a box between the upper and lower quartiles with a solid line drawn across the box to locate the median.

How to identify an outlier

A point beyond an inner fence on either side is considered a mild outlier. A point beyond an outer fence is considered an extreme outlier. Here is an example of a box plot drawn above a chart:

[Image: Example of outlier]
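The mild/extreme rule can be sketched in a few lines of Python. The function name `classify_points` and the label strings are hypothetical, and the quartile method ("inclusive") is an assumption of this sketch.

```python
from statistics import quantiles


def classify_points(values):
    """Label each value as "normal", "mild", or "extreme".

    Assumption: quartiles use statistics.quantiles with the
    "inclusive" method; other conventions shift the fences slightly.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iq = q3 - q1
    labelled = []
    for v in values:
        if v < q1 - 3 * iq or v > q3 + 3 * iq:
            label = "extreme"  # beyond an outer fence
        elif v < q1 - 1.5 * iq or v > q3 + 1.5 * iq:
            label = "mild"  # beyond an inner fence only
        else:
            label = "normal"
        labelled.append((v, label))
    return labelled
```

The outer fences are checked first because any point beyond an outer fence is also beyond the inner one.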

We can see the outlier being discarded. It might not be necessary to build a box plot in jobsika, but an algorithm that filters out these outliers, and exposes them if needed, would be a valuable addition.
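A rough sketch of such a filter, in Python for illustration only (it is not tied to jobsika's actual codebase). The function name `split_outliers`, the `k` parameter, the minimum sample size of 4, and the sample salaries are all hypothetical. Flagged values are returned rather than dropped, so they can still be exposed if needed.

```python
from statistics import mean, quantiles


def split_outliers(salaries, k=1.5):
    """Split salaries into (kept, flagged) using the fences Q1 - k*IQ and Q3 + k*IQ.

    k=1.5 corresponds to the inner fences, k=3 to the outer fences.
    Assumptions: "inclusive" quartiles, and samples with fewer than
    4 values are returned unfiltered because quartiles say little there.
    """
    if len(salaries) < 4:
        return list(salaries), []
    q1, _median, q3 = quantiles(salaries, n=4, method="inclusive")
    iq = q3 - q1
    low, high = q1 - k * iq, q3 + k * iq
    kept = [s for s in salaries if low <= s <= high]
    flagged = [s for s in salaries if not (low <= s <= high)]
    return kept, flagged


# Hypothetical salary entries, with one extreme maximum.
sample = [200, 220, 250, 260, 300, 310, 350, 5000]
kept, flagged = split_outliers(sample)
print("mean with outlier:", mean(sample))
print("mean without outlier:", mean(kept))
print("flagged entries:", flagged)
```

Whether to use the inner fences (`k=1.5`) or the outer fences (`k=3`) as the cutoff is a design choice: the inner fences discard more entries, while the outer fences only remove extreme ones.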

monkeyK1n9 avatar Feb 05 '23 12:02 monkeyK1n9