Winton, a British investment management company, is looking to create novel statistical modelling and data mining techniques to predict stock returns.
Given historical stock performance and masked features, your task is to predict end-of-day returns. The dataset contains 40,000 stock samples, each covering a 5-day period: days D-2, D-1, D, D+1, and D+2. You can use the returns from days D-2 through D+1 to predict the return on day D+2. The dataset also has 25 features, Feature_1 through Feature_25.
| Variable | Description |
| --- | --- |
| Feature_1 to Feature_25 | Different (masked) features relevant to prediction |
| Ret_MinusTwo | The return from the close of trading on day D-2 to the close of trading on day D-1 (i.e. 1 day) |
| Ret_MinusOne | The return from the close of trading on day D-1 to the point at which the intraday returns start on day D (approximately 1/2 day) |
| Ret_2 to Ret_180 | Returns over approximately one minute each on day D; Ret_2 is the return between t=1 and t=2 |
| Ret_PlusOne | The return from the time Ret_180 is measured on day D to the close of trading on day D+1 (approximately 1 day) |
| Ret_PlusTwo | The return from the close of trading on day D+1 to the close of trading on day D+2 (i.e. 1 day). This is the target variable you will predict |
First, use pandas' pd.read_csv to read the CSV file.
Let's look at the top 5 rows in the dataset (the columns are partly shown below):
Get the information of the dataset:
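A minimal sketch of these three steps (the file name `train.csv` is an assumption):

```python
import pandas as pd

# Read the training data (the file name 'train.csv' is an assumption)
df = pd.read_csv('train.csv')

# Look at the top 5 rows
print(df.head())

# Get the information of the dataset: row/column counts, dtypes, memory usage
df.info()
```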
The dataset is 64.4 MB, with 40,000 rows and 211 columns, all of which are numeric.
Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. EDA generally starts with a high-level overview, then narrows into specific areas depending on your goals. It can also help with your modeling later, when you decide which features to use in your models.
In this dataset, there are two types of independent variables: the masked stock features (Feature_1 to Feature_25) and the stock returns. You can use different kinds of plots to explore each.
Below, six randomly selected stock samples are shown as line charts. Ret_2 to Ret_180 are returns over approximately one minute each on day D. The red and green triangles indicate the returns of D-1 (Ret_MinusOne) and D-2 (Ret_MinusTwo); similarly, the red and green circles indicate D+1 (Ret_PlusOne) and D+2 (Ret_PlusTwo), respectively.
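A hedged matplotlib sketch of how such a figure could be produced (the marker colors follow the description above; the x positions of the daily-return markers are arbitrary choices for display):

```python
import matplotlib.pyplot as plt

# Randomly select 6 stock samples
samples = df.sample(n=6, random_state=42)
ret_cols = [f'Ret_{i}' for i in range(2, 181)]  # intraday returns on day D

fig, axes = plt.subplots(3, 2, figsize=(14, 10))
for ax, (idx, row) in zip(axes.ravel(), samples.iterrows()):
    ax.plot(range(2, 181), row[ret_cols].values, lw=0.8, label='Ret_2..Ret_180')
    # Daily returns as markers; x positions are arbitrary, chosen for readability
    ax.scatter(-20, row['Ret_MinusTwo'], color='green', marker='^', label='Ret_MinusTwo')
    ax.scatter(-10, row['Ret_MinusOne'], color='red', marker='^', label='Ret_MinusOne')
    ax.scatter(190, row['Ret_PlusOne'], color='red', marker='o', label='Ret_PlusOne')
    ax.scatter(200, row['Ret_PlusTwo'], color='green', marker='o', label='Ret_PlusTwo')
    ax.set_title(f'Sample {idx}')
axes[0, 0].legend(fontsize=7)
plt.tight_layout()
plt.show()
```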
The green circle (Ret_PlusTwo, D+2) is the value you need to predict, but there appears to be no strong relationship between it and the other return values.
Now, plot the frequency distribution of each feature.
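A short sketch that draws one histogram per feature:

```python
import matplotlib.pyplot as plt

feature_cols = [f'Feature_{i}' for i in range(1, 26)]

# One histogram per feature in a 5x5 grid (NaNs are dropped automatically)
fig, axes = plt.subplots(5, 5, figsize=(16, 12))
for ax, col in zip(axes.ravel(), feature_cols):
    df[col].plot.hist(bins=50, ax=ax)
    ax.set_title(col, fontsize=8)
plt.tight_layout()
plt.show()
```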
From these distribution plots, you can see that some features are roughly normally distributed (e.g. Feature_2, Feature_3, Feature_4), while some features have outliers (e.g. Feature_6, Feature_16). Looking at all the histograms gives you a better idea of how to fill in the missing values in the next step.
First, you may need to remove the columns with a high percentage (over 40%) of missing values: ['Feature_1', 'Feature_10'].
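A sketch of that check and removal:

```python
# Percentage of missing values per column
missing_pct = df.isnull().mean() * 100

# Drop columns where more than 40% of values are missing
high_missing = missing_pct[missing_pct > 40].index.tolist()
print(high_missing)  # expected: ['Feature_1', 'Feature_10']
df = df.drop(columns=high_missing)
```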
For filling in missing values in the remaining numeric variables, you can use pandas' fillna() function, as sketched below.
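For example, a median fill is one reasonable choice given the outliers observed above (using the median rather than the mean is an assumption here, not the only option):

```python
# Fill remaining numeric missing values with each column's median,
# which is robust to the outliers seen in the histograms above
df = df.fillna(df.median(numeric_only=True))
```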
Multicollinearity occurs when your model includes multiple predictors that are correlated not just with the response variable but also with each other. It can inflate the standard errors of the coefficients, which can make some variables appear statistically insignificant when they should be significant. To keep genuinely significant features, you may need to remove highly correlated predictors from the model. Here, we remove features with a pairwise absolute correlation greater than 0.4.
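A sketch of that filter, dropping one feature from every pair whose absolute correlation exceeds 0.4:

```python
import numpy as np

feature_cols = [c for c in df.columns if c.startswith('Feature_')]

# Absolute pairwise correlations between the remaining features
corr = df[feature_cols].corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every highly correlated pair
to_drop = [col for col in upper.columns if (upper[col] > 0.4).any()]
print(to_drop)
df = df.drop(columns=to_drop)
```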
Split the data into a 70% train set and a 30% test set, and separate the dependent and independent variables. Since you will also use a linear regression model to make predictions, it is a good idea to prepare a standardized copy of the data for it.
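A sketch of the split and the standardization (this assumes identifier and weight columns, if any, have already been dropped, and that Ret_PlusTwo is the only column held out as the target):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate the independent variables from the target Ret_PlusTwo
X = df.drop(columns=['Ret_PlusTwo'])
y = df['Ret_PlusTwo']

# 70% train / 30% test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Standardized copies for the linear regression model
# (fit the scaler on the train set only to avoid leakage)
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```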
You can use the following metrics to evaluate different regression models:
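For reference, these metrics can be written as follows (MAPE is expressed here as a fraction rather than a percentage, matching the result tables below):

$$\text{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|, \qquad \text{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|A_t - F_t\right|, \qquad R^2 = 1 - \frac{\sum_{t=1}^{n}(A_t - F_t)^2}{\sum_{t=1}^{n}(A_t - \bar{A})^2}$$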
where $A_t$ represents the actual value, $F_t$ the forecast value, $\bar{A}$ the mean of the actual values, and $n$ the number of test samples.
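A sketch of the comparison, reusing the split from the previous step; the helper prints the same quantities the tables below report (treating "Mean/Std" as the mean and standard deviation of the residuals, which is an assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

def evaluate(name, y_true, y_pred):
    err = y_true - y_pred
    # Note: MAPE can blow up when actual returns are near zero
    mape = np.mean(np.abs(err / y_true))
    print(f'{name}: Mean/Std {err.mean():.4f}/{err.std():.4f}, '
          f'MAPE {mape:.6f}, MAE {mean_absolute_error(y_true, y_pred):.6f}, '
          f'R^2 {r2_score(y_true, y_pred):.6f}')

# Linear regression on the standardized data
lr = LinearRegression().fit(X_train_std, y_train)
evaluate('Linear Regression', y_test, lr.predict(X_test_std))

# Tree ensembles on the raw (unscaled) data
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
evaluate('Random Forest', y_test, rf.predict(X_test))

gbr = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
evaluate('Gradient Boosting', y_test, gbr.predict(X_test))
```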
| Linear Regression | | | |
| --- | --- | --- | --- |
| Mean/Std | 0.0003/0.0241 | MAPE | 2.957557 |
| MAE | 0.015519 | R^2 score | 0.001479 |

| Random Forest Regressor | | | |
| --- | --- | --- | --- |
| Mean/Std | 0.0004/0.0236 | MAPE | 1.337370 |
| MAE | 0.015076 | R^2 score | 0.046183 |

| Gradient Boosting Regressor | | | |
| --- | --- | --- | --- |
| Mean/Std | 0.0004/0.0236 | MAPE | 1.362748 |
| MAE | 0.015171 | R^2 score | 0.046377 |
From the evaluation of each baseline model shown above, the GradientBoostingRegressor has the best performance, with MAPE = 1.36 and R^2 score = 0.05. However, this performance is not impressive, nor is the R^2 score. That is partly because the model has not been tuned; selecting the right features and hyperparameters is critical in machine learning modeling.
One method of feature engineering is to keep only the features with high importance scores. In this dataset, most features have a low importance score, so you can first drop features with scores below 0.001, then split the dataset, fit the model, and evaluate it again using the selected features, as sketched below.
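A sketch of that selection, reusing the fitted gbr model and the evaluate helper from the earlier sketch rather than re-splitting:

```python
import pandas as pd

# Importance scores from the fitted gradient boosting model
importances = pd.Series(gbr.feature_importances_, index=X_train.columns)

# Keep only features with an importance score of at least 0.001
selected = importances[importances >= 0.001].index
X_train_sel, X_test_sel = X_train[selected], X_test[selected]

# Refit and re-evaluate on the reduced feature set
gbr_sel = GradientBoostingRegressor(random_state=42).fit(X_train_sel, y_train)
evaluate('Gradient Boosting (selected features)', y_test,
         gbr_sel.predict(X_test_sel))
```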
| Gradient Boosting Regressor | | | |
| --- | --- | --- | --- |
| Mean/Std | 0.0004/0.0235 | MAPE | 1.440217 |
| MAE | 0.015124 | R^2 score | 0.051796 |
None of the scores change much. This means that after removing the unimportant features the model stays about as good as before, which is preferable: with fewer features, the model should be more robust than the previous one, and the result suggests it may fit unseen data better.
Hyperparameters are like the settings of an algorithm that can be adjusted to optimize performance. Scikit-learn provides sensible default hyperparameters for all models, but these are not guaranteed to be optimal for a given problem. The best hyperparameters are usually impossible to determine ahead of time, and tuning a model is where machine learning turns from a science into trial-and-error engineering. The standard procedure for hyperparameter optimization accounts for overfitting through cross-validation.
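A sketch of such a search with GridSearchCV; the parameter grid is purely illustrative, not the grid behind the results below:

```python
from sklearn.model_selection import GridSearchCV

# A small illustrative grid (these values are assumptions)
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [2, 3, 4],
    'learning_rate': [0.01, 0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring='r2',  # optimize for R^2
    cv=5,          # 5-fold cross-validation guards against overfitting
    n_jobs=-1,
)
search.fit(X_train_sel, y_train)
print(search.best_params_)
evaluate('Gradient Boosting (tuned)', y_test,
         search.best_estimator_.predict(X_test_sel))
```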
Result table
| Gradient Boosting Regressor (tuned) | | | |
| --- | --- | --- | --- |
| Mean/Std | 0.0004/0.0233 | MAPE | 2.104869 |
| MAE | 0.015011 | R^2 score | 0.068747 |
The best result is shown above. Although MAPE worsens, MAE and the spread of the errors decrease slightly, and the increase in the R^2 score is noticeable. This suggests that hyperparameter tuning improves the performance of the model.
Kaggle.com. (2015). The Winton Stock Market Challenge. Retrieved from https://www.kaggle.com/c/the-winton-stock-market-challenge