Circular regression when you already know the period
This Stack Exchange answer reminded me of a useful data science trick. In short, if you try to model $y$ as a sinusoidal function of $x$, you obtain a regression formula that is nonlinear in the parameters. However, if you know the period, you can use a trig identity to linearize the formula and compute an exact least-squares fit.
Background
Suppose you are trying to create a regression model for data that looks like this:
$y$ is clearly a sinusoidal function of $x$ (plus noise), so there are three parameters we need to estimate:
- The period $T$ (distance between peaks)
- The amplitude $A$ (height of the peaks)
- The phase $\varphi$ (where in the cycle the peaks occur)
For some data, there’s really only one value for the period that makes sense. For example, if $x$ represents “day of the year” (as in the Stack Exchange question), then we are clearly expecting a yearly cycle (or else we would have collected exact dates). Thus, $T = 365$ (or however you want to handle leap years), and we have only to estimate the amplitude and phase that best fit the data.
Problem statement
Without loss of generality, we can assume that $T = 2\pi$ (if not, multiply each $x$ by $2\pi / T$). Then our objective is to minimize the error in the system of equations

$$y = A \sin(x + \varphi)$$

for each $(x, y)$ pair in the input data.
Unfortunately, this model is not linear in the parameters to be estimated ($A$ and $\varphi$), so we can’t estimate it using ordinary least-squares regression.
The trick
The trick is to estimate the following model instead, which is linear in the parameters $\beta_0$ and $\beta_1$:

$$y = \beta_0 \cos x + \beta_1 \sin x$$
To see that this model is equivalent, apply a trig identity to find that

$$A \sin(x + \varphi) = A \sin\varphi \cos x + A \cos\varphi \sin x.$$

Matching coefficients on $\cos x$ and $\sin x$, we have

$$\beta_0 = A \sin\varphi, \qquad \beta_1 = A \cos\varphi.$$

Dividing the first equation by the second, we recover $\tan\varphi = \beta_0 / \beta_1$, so $\varphi = \operatorname{atan2}(\beta_0, \beta_1)$. Then $A = \beta_0 / \sin\varphi = \beta_1 / \cos\varphi$.
Demo
The following Python script demonstrates our technique.
```python
#!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

if __name__ == "__main__":
    # Make fake training data
    size = 128  # Number of points
    x = rng.uniform(0, 10, size)
    A = rng.standard_exponential()
    phi = rng.uniform(0, 2 * np.pi)
    sigma = 0.20 * A  # Scale of random noise
    y = A * np.sin(x + phi) + rng.normal(size=size, scale=sigma)

    # First plot showing what the data looks like
    plt.scatter(x, y)
    plt.savefig("cyclical-data-before.svg", transparent=True, bbox_inches="tight")

    # Transform x to its cos and sin, then solve
    #     y = x_transformed @ beta
    # by least squares.
    x_transformed = np.stack([np.cos(x), np.sin(x)]).T
    (beta0, beta1), *_ = np.linalg.lstsq(x_transformed, y, rcond=None)  # Discard other outputs

    # Recover A and phi using the transformation we derived. Estimate A using
    # both expressions and take the average.
    phi_pred = np.arctan2(beta0, beta1)
    A_pred = (beta0 / np.sin(phi_pred) + beta1 / np.cos(phi_pred)) / 2

    # Plot the estimated model using regularly spaced values of x
    x_pred = np.linspace(x.min(), x.max(), 500)
    y_pred = A_pred * np.sin(x_pred + phi_pred)
    plt.plot(x_pred, y_pred)
    plt.savefig("cyclical-data-with-fit.svg", transparent=True, bbox_inches="tight")
```
Here is what the predicted model looks like:
Notes
When to retain the linearized parameterization: In the Python code above, we computed $A$ and $\varphi$ just to demonstrate that the technique actually works. But the code will throw a zero-division error at the computation of A_pred if $\varphi$ is a multiple of $\pi / 2$ (which happens when one of the $\beta_i = 0$). In practice, I recommend discarding the $A$ and $\varphi$ values and running predictions using the $y = \beta_0 \cos x + \beta_1 \sin x$ model instead. Experienced data scientists will recognize this transformation, but you could add a comment along the lines of “this is a linear reparameterization of $y = A \sin(x + \varphi)$.”
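For concreteness, here is a minimal sketch (with made-up data; the true amplitude and phase are arbitrary) of predicting directly from the linearized model, skipping the $A$/$\varphi$ recovery step entirely:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake training data; the true A = 1.5 and phi = 0.4 are arbitrary
x = rng.uniform(0, 10, 128)
y = 1.5 * np.sin(x + 0.4) + rng.normal(scale=0.1, size=128)

# Fit the linearized model y = beta0*cos(x) + beta1*sin(x)
X = np.stack([np.cos(x), np.sin(x)]).T
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(x_new):
    # This is a linear reparameterization of y = A*sin(x + phi).
    # No division is involved, so there is no zero-division hazard
    # regardless of the phase.
    return np.stack([np.cos(x_new), np.sin(x_new)]).T @ beta

y_hat = predict(np.array([0.0, 1.0, 2.0]))
```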
Linear combination of sinusoids: You can also use this technique if your $y$ variable is a linear combination of sine waves of known periods. For example, if $x$ is “minute of day” and $y$ is “volume of emails,” you might expect an overall daily trend (period of 24 hours) with smaller variations within each hour (due to scheduled emails going out at the top of each hour). Then you can set $x_1 = 2\pi x / 1440$, $x_2 = 2\pi x / 60$, and minimize the error in

$$y = A_1 \sin(x_1 + \varphi_1) + A_2 \sin(x_2 + \varphi_2)$$

using the transformation above.
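A sketch of this two-frequency fit, with hypothetical amplitudes and phases and $x$ measured in minutes (1440 per day, 60 per hour):

```python
import numpy as np

rng = np.random.default_rng(1)

# x is "minute of day"; rescale so each cycle has period 2*pi
x = rng.uniform(0, 1440, 500)
x1 = 2 * np.pi * x / 1440   # daily cycle
x2 = 2 * np.pi * x / 60     # hourly cycle

# Fake signal: amplitudes and phases chosen arbitrarily for this sketch
y = (3.0 * np.sin(x1 + 1.0) + 0.5 * np.sin(x2 + 2.0)
     + rng.normal(scale=0.2, size=x.size))

# One lstsq call fits both sinusoids: four columns, four coefficients
X = np.stack([np.cos(x1), np.sin(x1), np.cos(x2), np.sin(x2)]).T
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Each amplitude is sqrt(beta_cos**2 + beta_sin**2), since
# (A*sin(phi))**2 + (A*cos(phi))**2 = A**2
A1 = np.hypot(beta[0], beta[1])
A2 = np.hypot(beta[2], beta[3])
```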
Fourier transform: If you don’t know the period and your $x$ values are evenly spaced, then what you’re looking for is probably not a regression model but a discrete Fourier transform.
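A minimal sketch of that case, using synthetic data with an assumed 5 Hz signal and reading the dominant frequency off the peak of the amplitude spectrum:

```python
import numpy as np

# Evenly spaced samples of a sinusoid with (pretend) unknown period
n, dt = 1000, 0.01                  # 1000 samples, 0.01 s apart
t = np.arange(n) * dt
rng = np.random.default_rng(2)
y = 2.0 * np.sin(2 * np.pi * 5.0 * t) + rng.normal(scale=0.3, size=n)

# Peak of the amplitude spectrum gives the dominant frequency
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(n, d=dt)
f_peak = freqs[spectrum[1:].argmax() + 1]   # skip the DC bin

print(f_peak)  # 5.0 Hz, so the period is 1 / f_peak = 0.2 s
```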