scikit-learn
k-NN Regression/Classification
Algorithm Logic
Metric formula to calculate the Minkowski distance between two vectors:

$$d(x, q) = \left( \sum_{i=1}^{n} \lvert x_i - q_i \rvert^{p} \right)^{1/p}$$

Here $x$ is the feature vector of a training observation and $q$ is the observed vector whose value we need to predict.
First we sort the distance values in ascending order; selecting the first $k$ rows gives the required $k$ neighbors nearest to $q$.
After this we look up their true target values. In the case of classification we choose the most frequent one as our prediction; in the case of regression we average the values to get our prediction.
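As a rough NumPy sketch of these steps (the arrays train_X, train_y and the query point q are illustrative names, not from the original):

import numpy as np

# Illustrative training features, training targets and a query point
train_X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [6.0, 5.0]])
train_y = np.array([10.0, 12.0, 14.0, 30.0])
q = np.array([2.0, 2.0])
k, p = 2, 2

# Minkowski distance of q to every training row
dists = (np.abs(train_X - q) ** p).sum(axis=1) ** (1 / p)
# Sort ascending and keep the k nearest rows
nearest = np.argsort(dists)[:k]
# Regression: average their target values (for classification, take the mode instead)
prediction = train_y[nearest].mean()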
The error in our prediction is calculated by taking the Root Mean Squared Error:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}$$

Here $\hat{y}$ is the prediction vector and $y$ is the true value vector.
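A small worked sketch of the same formula (the values are illustrative):

import numpy as np

y_hat = np.array([11.0, 14.0, 29.0])        # predictions
y = np.array([10.0, 15.0, 30.0])            # true values
rmse = np.sqrt(np.mean((y_hat - y) ** 2))   # sqrt(1/n * sum((y_hat - y)^2)) = 1.0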
Importing
from sklearn.neighbors import KNeighborsRegressor
Instantiating a model
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute', p=2)
The parameter p specifies the power of the Minkowski distance metric: p=2 corresponds to Euclidean distance, p=1 to Manhattan distance.
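For example (the hyperparameter values here are illustrative, not from the original):

# p=1 gives Manhattan distance instead of Euclidean
knn_manhattan = KNeighborsRegressor(n_neighbors=5, algorithm='brute', p=1)

# For classification tasks the analogous class is KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=5, algorithm='brute', p=2)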
Training
col_list = [col_nm_01,...]          # names of the feature columns
train_data = df[col_list]           # feature matrix
target_data = df[target_col_nm]     # target column (true values)
knn.fit(train_data, target_data)
The model simply stores the training and target data at this point; k-NN is a lazy learner, so no real computation happens during fitting.
Predicting
test_data = df[col_list]
predictions = knn.predict(test_data)
The model performs the distance calculation, comparison and prediction steps at this point.
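To inspect the distance step directly, the fitted estimator also exposes a kneighbors() method (shown here with the objects defined above):

# distances to, and row indices of, the k nearest training observations for each test row
distances, indices = knn.kneighbors(test_data)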
Error
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(true_values, predictions)   # y_true first, y_pred second
rmse = mse ** 0.5
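Depending on the scikit-learn version, mean_squared_error() can also return the RMSE directly via its squared argument:

rmse = mean_squared_error(true_values, predictions, squared=False)   # RMSE in one call (newer versions)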
k-Fold Cross Validation
Algorithm Logic
- Split the dataset into k partitions.
- Select one partition and assign it as the Test Set.
- Randomize the ordering of the remaining rows and assign them as the Train Set.
- Fit the model on the Train Set and calculate the RMSE on the Test Set.
- Repeat with the other partitions in similar fashion.
- Average all RMSE values (see the sketch below).
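A minimal sketch of this procedure, assuming an illustrative feature matrix X and target vector y (the names and the synthetic data are assumptions, not from the original):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.random((100, 3))           # illustrative feature matrix
y = rng.random(100)                # illustrative target vector

k = 5
indices = np.array_split(np.arange(len(X)), k)   # k partitions of row indices
rmses = []
for i in range(k):
    test_idx = indices[i]                                      # one partition as Test Set
    train_idx = np.concatenate(indices[:i] + indices[i + 1:])  # remaining rows as Train Set
    model = KNeighborsRegressor(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    rmses.append(mean_squared_error(y[test_idx], preds) ** 0.5)

avg_rmse = np.mean(rmses)          # average all RMSE values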
Importing
Working with sklearn requires us to create a KFold object. It stores the cross-validation parameters: how many partitions to create, specified by the n_splits argument; whether to shuffle the observations; and, if so, the seed to use, via the random_state argument. Its split() method then yields the train/test row indices for each fold.
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False, random_state=None)   # e.g. 5 partitions
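For illustration, the fold indices come from split() (x_train here stands for whatever feature matrix we pass in, as in the snippet below):

# split() yields (train-row indices, test-row indices) for each of the n_splits folds
for train_idx, test_idx in kf.split(x_train):
    print(len(train_idx), len(test_idx))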
A KFold object is used in conjunction with the cross_val_score() function:
from sklearn.model_selection import cross_val_score
cross_val_score(model, x_train, y_train, scoring="string_val", cv=kf)
- model is the algorithm we want to use to fit values (k-NN, Regression, etc.)
- x_train is our training set
- y_train is the corresponding true values
- scoring is a string naming the scoring criterion, e.g. "neg_mean_squared_error"
- cv is an instance of the KFold class describing the cross-validation parameters. It can also be an integer value (the number of folds).
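Putting the pieces together for the k-NN regressor above (a sketch; the synthetic data, column count and fold count are illustrative, and "neg_mean_squared_error" returns negated MSE scores that we flip and square-root to get RMSEs):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
x_train = rng.random((100, 3))     # illustrative feature matrix
y_train = rng.random(100)          # illustrative true values

kf = KFold(n_splits=5, shuffle=True, random_state=1)
knn = KNeighborsRegressor(n_neighbors=5)

# One negated-MSE score per fold
mses = cross_val_score(knn, x_train, y_train, scoring="neg_mean_squared_error", cv=kf)
rmses = np.sqrt(np.abs(mses))      # flip the sign and take the square root
avg_rmse = rmses.mean()            # average all RMSE values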