What is the best method to fill missing data?

Data scientists routinely face data sets that are incomplete or that contain irrelevant data. Getting the data in order is crucial for any data science project, especially when a machine learning algorithm will be trained on it. There are several ways to fill gaps in data, but which one should you choose?

One way to fill missing data is an imputation technique. These methods take into account the context of what the data would normally look like and use an algorithm to make an educated guess about what is missing. A common imputation technique is k-nearest neighbors (kNN). For each record with a gap, it finds the most similar records, measured by distance on the values that are present, and fills the gap from their values, typically by averaging them. The data scientist chooses how many neighbors to use or how close they must be. Too small a neighborhood leaves few or no neighbors to draw on and makes the estimate unreliable, while too large a neighborhood pulls in dissimilar records.
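
As a rough illustration of the kNN idea, here is a minimal sketch in plain Python. The function name `knn_impute`, the toy data, and the choice of a simple mean over Euclidean neighbors are assumptions made for this example, and it assumes at least one complete row exists.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        # Columns that this row actually has values for.
        observed = [i for i, v in enumerate(row) if v is not None]

        def dist(other):
            # Euclidean distance measured only on the observed columns.
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in observed))

        neighbors = sorted(complete, key=dist)[:k]
        filled.append([
            v if v is not None else sum(n[i] for n in neighbors) / len(neighbors)
            for i, v in enumerate(row)
        ])
    return filled

data = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [10.0, 100.0],
    [2.5, None],   # the gap to fill
]
result = knn_impute(data, k=2)
print(result[-1])  # [2.5, 25.0]
```

The two nearest complete rows to 2.5 are 2.0 and 3.0, so the gap is filled with the mean of 20.0 and 30.0. In practice a library implementation such as scikit-learn's `KNNImputer` would likely be a better fit than hand-rolled code.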

Another way to fill missing data is a regression technique. It also makes assumptions about how the data should behave based on context and trends, but it treats the values that are known as training outputs: a model is fit on the complete records and then used to predict the values that are missing. For example, if one point is missing from data that otherwise follows a linear trend, you can fit a line to the known points, plug the missing point's x into the equation, and solve for y. This is a deliberately simple use of regression, but it shows the idea.
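
The linear example can be sketched as follows. The names `fit_line` and `regression_impute` and the `hours`/`score` data are made up for illustration, and the sketch assumes a single predictor with at least two distinct known x values.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def regression_impute(xs, ys):
    """Fit a line on the known (x, y) pairs and predict each missing y."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
    return [y if y is not None else slope * x + intercept
            for x, y in zip(xs, ys)]

hours = [1, 2, 3, 4, 5]
score = [52.0, 54.0, None, 58.0, 60.0]   # one value missing
filled_score = regression_impute(hours, score)
print(filled_score)  # [52.0, 54.0, 56.0, 58.0, 60.0]
```

The known points lie on the line y = 2x + 50, so the missing score at x = 3 is predicted as 56.0.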

One data scientist’s personal choice is to use kNN wherever a record has neighbors within a chosen distance. Where no neighbors fall within that distance, they fall back on a model-based technique such as linear regression or random forest regression to infer the missing value. This way the data set ends up with no gaps, and no single technique is stretched past the cases it handles well.
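
One possible reading of this combined approach is sketched below. The function `hybrid_impute`, the `radius` parameter, and the use of a plain least-squares line as the fallback (rather than a random forest) are assumptions made to keep the example small.

```python
def hybrid_impute(xs, ys, radius=1.0):
    """Use nearby known points when available, else a fitted line."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]

    # Least-squares line over the known points, used only as a fallback.
    n = len(known)
    mean_x = sum(x for x, _ in known) / n
    mean_y = sum(y for _, y in known) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in known)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in known) / sxx
    intercept = mean_y - slope * mean_x

    filled, source = [], []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)
            source.append("observed")
            continue
        # Known values whose x lies within the chosen distance.
        nearby = [ky for kx, ky in known if abs(kx - x) <= radius]
        if nearby:
            filled.append(sum(nearby) / len(nearby))
            source.append("neighbors")
        else:
            filled.append(slope * x + intercept)
            source.append("regression")
    return filled, source

xs = [1.0, 2.0, 3.0, 4.0, 10.0]
ys = [10.0, None, 30.0, 40.0, None]
filled, source = hybrid_impute(xs, ys, radius=1.0)
print(list(zip(filled, source)))
```

Here the gap at x = 2.0 has known neighbors within the radius and is filled from them, while the gap at x = 10.0 has none and falls back to the fitted line. Returning the `source` labels alongside the values makes it easy to see which technique produced each fill.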

Another way is to group the data by a certain feature and fill each gap with the median or mode of its group. This works much like kNN, but instead of choosing the number of neighbors you choose the feature to group on.
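
A minimal sketch of group-wise filling, using only the standard library. The function name `group_median_impute` and the `city`/`rent` records are hypothetical; a group with no observed values is left unfilled.

```python
from collections import defaultdict
from statistics import median

def group_median_impute(records, group_key, value_key):
    """Fill missing values with the median of the record's group."""
    groups = defaultdict(list)
    for rec in records:
        if rec[value_key] is not None:
            groups[rec[group_key]].append(rec[value_key])
    medians = {g: median(vals) for g, vals in groups.items()}
    return [
        # .get() leaves the gap as None when the group has no known values.
        {**rec, value_key: medians.get(rec[group_key])}
        if rec[value_key] is None else dict(rec)
        for rec in records
    ]

records = [
    {"city": "A", "rent": 1000},
    {"city": "A", "rent": 1200},
    {"city": "A", "rent": None},
    {"city": "A", "rent": 1400},
    {"city": "B", "rent": 700},
    {"city": "B", "rent": None},
    {"city": "B", "rent": 900},
    {"city": "C", "rent": None},   # no known values in this group
]
filled = group_median_impute(records, "city", "rent")
print([r["rent"] for r in filled])
```

With pandas, the same idea is commonly written as a `groupby` on the chosen feature followed by `transform("median")` and `fillna`.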
