# Automating Alpha Pt.2 - Best Practices

### Best Practices, Tips, and Tricks For Automating Alpha Discovery

#### Introduction

In the previous article, we provided a very high-level run-through of automated alpha generation. We'll continue with that bird's-eye view in this article, while the next will be much more code-focused. That one has taken a fairly long while precisely because of the heavy focus on code, hence its place later in the series.

I think a bit of code is useful for getting people experimenting faster, which is, of course, the only real way to grasp any of what I tell you to the fullest extent. But high-level ideas, such as mental frameworks for approaching the challenge itself, are hard to convey through code alone.

#### Index

- Introduction
- Index
- Input Data Normalization
- Tuning of Selection Likelihoods
- Known Strategies as Base Alphas
- Too Many Inputs & Unrelated Data
- Automated Alpha is Great For Bad Researchers
- Breadth Beats Depth
- Exponential Temporal Weighting
- The Many Outweigh The Few
- Holistic Perspective

#### Input Data Normalization

It is important to ensure that our data has been properly normalized before we use it as input data for our genetic algorithm. This means our data should be homogeneous and comparable. Raw price and financial data are not comparable: for `high - low`, stocks with a larger average price will produce a much larger spread, so the score scales with price level rather than with anything informative. This adds noise to the process and makes the data harder to work with.
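As a quick illustration (a hypothetical sketch with made-up prices), two stocks with the same relative intraday range produce wildly different raw spreads, while the close-normalized version puts them on the same scale:

```python
# Two hypothetical stocks with the same relative intraday range
# but very different price levels (made-up numbers).
stocks = {
    "CHEAP": {"high": 10.20, "low": 10.00, "close": 10.10},
    "EXPENSIVE": {"high": 1020.0, "low": 1000.0, "close": 1010.0},
}

for name, bar in stocks.items():
    raw = bar["high"] - bar["low"]   # scales with the price level
    norm = raw / bar["close"]        # comparable across stocks
    print(f"{name}: raw spread = {raw:.2f}, normalized = {norm:.4f}")
```

The raw spreads differ by two orders of magnitude; the normalized values are identical.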

We can choose to work only with returns, but this again limits us by ruling out perfectly acceptable formulas such as:

`(high - low) / close`

This is fine because we are normalizing to ensure that the result does not scale with the overall value of close, which is something we, as the researchers, know is not a place to find any real alpha at all. Thus, we embed this rule to ensure that the algorithm knows it as well.

For financial data, we can normalize by market capitalization, or simply by price if the figure is already on a per-share basis.
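For instance (a hypothetical sketch with made-up figures), dividing net income by market capitalization turns raw profits, which scale with company size, into a comparable earnings yield:

```python
# Hypothetical fundamentals (made-up numbers). Raw profits are not
# comparable across companies, but net income / market cap (an
# earnings yield) is unitless and cross-sectionally comparable.
companies = {
    "SMALL": {"net_income": 5e7, "market_cap": 1e9},
    "LARGE": {"net_income": 5e9, "market_cap": 1e11},
}

for name, f in companies.items():
    earnings_yield = f["net_income"] / f["market_cap"]
    print(f"{name}: earnings yield = {earnings_yield:.2%}")
```

Despite a 100x difference in raw profits, both companies show the same 5% yield.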

This leaves us with three options:

1. Unconstrained optimization, where we simply assume that alphas like `high - low` will evolve into a normalized version of themselves in the algorithm's attempt to reduce noise.
2. Forcefully normalize everything, and use close-to-close returns (instead of close) or the P/E ratio (instead of profits).
3. Use normalization-based alphas instead of the raw data. I.e. for `high`, we turn it into `high / VWAP`, where VWAP is our chosen normalization variable for all raw price-based input data sources. Under this model, all raw price inputs must be divided by VWAP, as it is our normalization variable.

Option 3 is by far the best option here, where we set a default normalization source for every single type of raw data that we have. If we have the equation:

`high - low`

It then becomes:

`high / VWAP - low / VWAP`

This is a great solution because the VWAP will eventually factor out if we introduce something like a division by close to the equation:

`(high / VWAP - low / VWAP) / (close / VWAP) = (high - low) / close`
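This cancellation is easy to check numerically (a sketch with made-up bar data; the variable names are illustrative):

```python
# Made-up bar data for a single stock.
high, low, close, vwap = 102.0, 98.0, 100.0, 99.5

# Option-3 form: every raw price input divided by the default
# normalization variable, VWAP.
spread = high / vwap - low / vwap

# Introduce a division by the (also VWAP-normalized) close...
expr = spread / (close / vwap)

# ...and VWAP cancels out, leaving (high - low) / close.
print(expr, (high - low) / close)  # both values agree
```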

Normally, this normalization would force us to increase the height of the tree from 2 layers (bottom layer: `high` and `low`; top layer: subtract) to 3 layers (a division by VWAP above each leaf), when all we are really doing is normalizing the equation, not adding any complexity. However, if we decided to deviate from the default standard normalization metric that all the raw price data inputs have to use (VWAP), and divided by close instead, then we would genuinely end up with 3 layers.

Thus, when we are counting the height of the tree, we should ignore a node if it meets all of the following criteria:

- It is a division operation.
- Its denominator is a default normalization variable.
- All of its numerator's inputs belong to the raw input group associated with that denominator's default-normalization status.

Thus, we can normalize our alphas without penalizing the search process for the normalization itself, while still not strictly forcing the algorithm into a single normalization choice if it decides a deviation is worth the added complexity.
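The counting rule above can be sketched in code (a hypothetical implementation; the tree representation, group names, and function names are my own, not from the article):

```python
# Hypothetical expression-tree sketch: leaves are input-name strings,
# internal nodes are (operator, [children]) tuples.
RAW_PRICE_INPUTS = {"open", "high", "low", "close"}
DEFAULT_NORMALIZER = "vwap"  # default normalizer for the raw price group

def leaves(node):
    """Set of input names appearing in a subtree."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node[1]:
        result |= leaves(child)
    return result

def is_free_normalization(node):
    """Division by the default normalizer whose numerator uses only
    inputs from the associated raw price group: adds no complexity."""
    if isinstance(node, str) or node[0] != "div":
        return False
    numerator, denominator = node[1]
    return denominator == DEFAULT_NORMALIZER and leaves(numerator) <= RAW_PRICE_INPUTS

def height(node):
    """Tree height, with free normalization divisions contributing 0."""
    if isinstance(node, str):
        return 1
    tallest_child = max(height(child) for child in node[1])
    return tallest_child if is_free_normalization(node) else 1 + tallest_child

# high/vwap - low/vwap: the divisions are free, so this still counts
# as 2 layers, exactly like high - low.
normalized = ("sub", [("div", ["high", "vwap"]), ("div", ["low", "vwap"])])
print(height(normalized))  # 2

# Dividing by close deviates from the default normalizer, so the
# extra layer is counted.
deviant = ("sub", [("div", ["high", "close"]), ("div", ["low", "close"])])
print(height(deviant))  # 3
```

The search process can then use this adjusted height wherever it penalizes tree complexity, so a default normalization is never discouraged while a deviating one still pays for its extra layer.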