jobsika
Discarding outliers from dataset
Definition
An outlier is an observation that lies an abnormal distance from the other values in a random sample from a population. It can be an extreme minimum or an extreme maximum. If the lower quartile is Q1 and the upper quartile is Q3, the difference (Q3 - Q1) is called the interquartile range, or IQ. The following quantities, called fences, are used to identify extreme values in the tails of the distribution:
- lower inner fence: Q1 - 1.5*IQ
- upper inner fence: Q3 + 1.5*IQ
- lower outer fence: Q1 - 3*IQ
- upper outer fence: Q3 + 3*IQ
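As a minimal sketch of the four fences above, the snippet below computes them with Python's standard `statistics.quantiles`. The function name `fences` and the choice of the `"inclusive"` quartile method are assumptions made for illustration; other quartile conventions give slightly different fences on small samples.

```python
from statistics import quantiles


def fences(values, method="inclusive"):
    """Return the inner and outer fences for a sample (hypothetical helper).

    The quartile method is an assumption; the definition above does not
    prescribe one.
    """
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = quantiles(values, n=4, method=method)
    iq = q3 - q1  # interquartile range
    return {
        "lower_inner": q1 - 1.5 * iq,
        "upper_inner": q3 + 1.5 * iq,
        "lower_outer": q1 - 3 * iq,
        "upper_outer": q3 + 3 * iq,
    }


if __name__ == "__main__":
    # Made-up sample: Q1 = 3, Q3 = 7, so IQ = 4.
    print(fences([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

For the made-up sample in the usage line, Q1 is 3 and Q3 is 7, so the inner fences fall at -3 and 13 and the outer fences at -9 and 19.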
Problem context
When an individual enters job information (salary, etc.) in jobsika, that entry can be an outlier. Outliers distort the averages that jobsika reports: an extreme minimum drags the mean down, and an extreme maximum raises it to an unreasonable amount.
Solution approach
Plotting the values
The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distribution. A box plot is constructed by drawing a box between the upper and lower quartiles, with a solid line drawn across the box to locate the median.
How to identify an outlier
A point beyond an inner fence on either side is considered a mild outlier. A point beyond an outer fence is considered an extreme outlier. Here is an example of a box plot with an outlier:
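The mild/extreme rule can be sketched as a small classifier. The function name `classify`, the label strings, and the `"inclusive"` quartile method are assumptions for illustration, not part of jobsika.

```python
from statistics import quantiles


def classify(value, sample):
    """Label a value as 'normal', 'mild' or 'extreme' relative to a sample.

    Hypothetical helper: the labels and the quartile method are assumptions.
    """
    q1, _median, q3 = quantiles(sample, n=4, method="inclusive")
    iq = q3 - q1
    # Check the outer fences first: every extreme outlier is also
    # beyond the inner fences.
    if value < q1 - 3 * iq or value > q3 + 3 * iq:
        return "extreme"
    if value < q1 - 1.5 * iq or value > q3 + 1.5 * iq:
        return "mild"
    return "normal"


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data
    for v in (5, 14, 20):
        print(v, classify(v, sample))
```

The outer fences are tested before the inner ones so that a point beyond both is reported as extreme rather than mild.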
Example of outlier |
---|
![]() |
We can see the outlier being discarded. jobsika may not need to build a box plot, but an algorithm that filters out these outliers and exposes them when needed would be valuable work.
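One possible shape for such a filter is sketched below. It is an illustration under stated assumptions, not jobsika's implementation: the function name `split_outliers`, the parameter `k`, the salary figures, and the `"inclusive"` quartile method are all hypothetical. It returns the discarded values alongside the kept ones so they can still be exposed.

```python
from statistics import mean, quantiles


def split_outliers(values, k=1.5):
    """Split values into (kept, outliers) using fences at Q1 - k*IQ and Q3 + k*IQ.

    Hypothetical helper: k=1.5 uses the inner fences, k=3 the outer fences.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iq = q3 - q1
    low, high = q1 - k * iq, q3 + k * iq
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


if __name__ == "__main__":
    # Made-up salary figures with one extreme maximum.
    salaries = [40, 42, 45, 47, 50, 52, 55, 58, 400]
    kept, outliers = split_outliers(salaries)
    print("mean with outliers:   ", mean(salaries))
    print("mean without outliers:", mean(kept))
    print("discarded:", outliers)
```

Passing `k=3` would discard only the extreme outliers and keep the mild ones, which may suit small samples where the inner fences are too strict.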