The task of forecasting traffic

When building a banner advertising solution, it’s very important to give advertisers the ability to forecast their campaigns. Such a forecast tells the user:

  1. Whether the required number of impressions can be delivered in the planned period.
  2. The potential banner delivery on a given advertising space under given targeting conditions.

Practically all modern advertising platforms (Google AdWords, OpenX, Yandex.Direct) provide forecasts, although usually without mentioning the quality of the forecast. We will look in more detail at how a forecast system can be built and which factors influence the quality of the forecast.

A typical forecast system uses a cumulative advertising-space statistics system as a starting point.

prognoz

Here, capacity indicates the maximum amount of traffic a site or advertising space can deliver.

An advertising space’s overall traffic can usually be predicted well; in addition, we can analyze the growth rate of traffic and take it into account as a further factor (for example, by supposing that the growth rate remains more or less constant). In the end, we extrapolate existing statistics to make a forecast for every advertising space.
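The extrapolation described above can be sketched roughly as follows. This is a minimal illustration, not the method of any particular platform: the function name `extrapolate_traffic` and the choice of a constant absolute day-over-day growth are assumptions made for the example.

```python
def extrapolate_traffic(daily_impressions, days_ahead):
    """Forecast total impressions for the next `days_ahead` days.

    Assumption: growth stays constant, so the average day-over-day
    change seen in the history is simply carried forward.
    """
    if len(daily_impressions) < 2:
        raise ValueError("need at least two days of history")
    # Average absolute growth per day over the whole history.
    growth = (daily_impressions[-1] - daily_impressions[0]) / (len(daily_impressions) - 1)
    last = daily_impressions[-1]
    # Clamp at zero so a declining trend never yields negative traffic.
    return sum(max(0.0, last + growth * k) for k in range(1, days_ahead + 1))


history = [1000, 1010, 1020, 1030]          # impressions per day (made-up numbers)
forecast = extrapolate_traffic(history, 3)  # 1040 + 1050 + 1060
```

With the made-up history above, the average growth is 10 impressions per day, so the three-day forecast is 3150 impressions.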

An important element of advertising traffic is seasonality, which is to say that the time distribution of traffic strongly corresponds to calendar intervals.

  • Weekly traffic has the tendency to repeat on corresponding days of the week; weekend and holiday traffic usually differs from workday traffic.
  • Traffic at the same local time of day correlates across different days.
  • Traffic in corresponding months (and seasons) correlates across different years.

Seasonality manifests itself differently on different sites and even on different advertising spaces of one site. Some sites see surges in visits during working hours, others the opposite. To make a more accurate forecast, calendar adjustments must be introduced. When extrapolating data, we must consider which calendar “areas” the forecast covers. If it’s a non-working day, for example, then weekend traffic is the more representative basis. If the forecast covers morning hours, then it’s best to extrapolate the statistics for morning hours. The more seasonal factors we consider, the better the forecast will be.

Such improvements are, of course, not free. The forecasting algorithm gets more complicated, and additional “calendar” groupings of statistics are needed, which leads to rapid growth in the amount of historical data. At first glance, the amount of underlying historical data (operational statistics) stays the same: if 100 million events passed through the advertising system before, the same number passes through now. So where does the increase come from? It appears because all of these statistics must be grouped for use in the forecast. The more “calendar points” the forecast system uses, the more groups appear during grouping.
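As an illustration of calendar grouping, the sketch below averages historical traffic per (day of week, hour) bucket and sums the matching buckets over the forecast window. The function names and the hourly `(datetime, impressions)` input format are assumptions made for this example; a real system would likely add holidays, months and a trend component.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_seasonal_profile(events):
    """Average impressions per (weekday, hour) bucket.

    `events` is an iterable of (datetime, impressions) pairs, one per hour.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for ts, impressions in events:
        key = (ts.weekday(), ts.hour)  # Monday is 0, Sunday is 6
        totals[key] += impressions
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


def forecast_period(profile, start, hours):
    """Sum the bucket averages for every hour of the forecast window."""
    total = 0.0
    for h in range(hours):
        ts = start + timedelta(hours=h)
        # Buckets never seen in the history contribute nothing.
        total += profile.get((ts.weekday(), ts.hour), 0.0)
    return total


# Made-up history: two Mondays and two Saturdays, 09:00 local time.
history = [
    (datetime(2024, 1, 1, 9), 100),  # Monday
    (datetime(2024, 1, 8, 9), 120),  # Monday
    (datetime(2024, 1, 6, 9), 40),   # Saturday
    (datetime(2024, 1, 13, 9), 60),  # Saturday
]
profile = build_seasonal_profile(history)
monday_forecast = forecast_period(profile, datetime(2024, 1, 15, 9), 1)
saturday_forecast = forecast_period(profile, datetime(2024, 1, 20, 9), 1)
```

The profile holds one entry per calendar bucket, so grouping by weekday and hour alone can create up to 7 × 24 = 168 groups per advertising space, which is the data growth discussed above.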

It’s easy to see that breaking statistics down by month of the year multiplies the number of such groups by at least 12 (when data is kept for one year), breaking them down by day of the week multiplies it by at least 7 (when data is kept for only one week), and so on.

prognoz2

The general principle: the more exact we want to make a forecast, the more costly it becomes in terms of software and hardware resources.


Ad targeting

Until now, we have completely ignored the possibility of targeting ads. An advertising platform can support targeting on various variables, for example:

  • Geographic location (city, area, etc.)
  • A user’s browser
  • A user’s operating system

Targeting considerably complicates the scheme described above:

  • It is necessary to consider not only the capacity of the advertising space as a whole, but also its capacity under the given targeting conditions.
  • Targeting conditions can overlap one another in a nontrivial way.

Targeting conditions conceptually divide the advertising space’s general traffic into separate intersecting segments (which is why the actual targeting conditions are often called segmentation variables).

prognoz3

In the given example, 3 targeting variables form 4 intersections: Firefox & Moscow, Firefox & Windows, Moscow & Windows, Firefox & Moscow & Windows. There is also a grey zone, the rest, which holds the users who match none of the conditions. Together, these segments make up all of the advertising space’s traffic.

Above, we analyzed general statistics with the goal of forecasting an advertising space’s capacity. Targeting conditions can be handled similarly. For this, statistics are grouped by the set of targeting variables. A rough upper bound on the number of groups is the product of the number of values of each variable (this is the case in which every condition intersects every other).
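Grouping by targeting variables might look like the sketch below. The variable names (`city`, `browser`, `os`), the dictionary-per-impression log format and the `"undefined"` placeholder for missing values are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical set of targeting variables tracked for each impression.
TARGETING_VARIABLES = ("city", "browser", "os")


def group_by_targeting(impressions):
    """Count impressions per combination of targeting values."""
    counts = Counter()
    for imp in impressions:
        # Missing values fall into an explicit "undefined" bucket.
        key = tuple(imp.get(var, "undefined") for var in TARGETING_VARIABLES)
        counts[key] += 1
    return counts


def segment_capacity(counts, **conditions):
    """Impressions matching all given conditions, e.g. city="Moscow"."""
    index = {var: i for i, var in enumerate(TARGETING_VARIABLES)}
    return sum(
        n for key, n in counts.items()
        if all(key[index[var]] == value for var, value in conditions.items())
    )


log = [
    {"city": "Moscow", "browser": "Firefox", "os": "Windows"},
    {"city": "Moscow", "browser": "Firefox", "os": "Linux"},
    {"city": "Moscow", "browser": "Chrome", "os": "Windows"},
    {"city": "Kazan", "browser": "Firefox", "os": "Windows"},
    {"browser": "Chrome"},  # city and OS unknown
]
counts = group_by_targeting(log)
moscow = segment_capacity(counts, city="Moscow")
firefox_windows = segment_capacity(counts, browser="Firefox", os="Windows")
```

Only combinations that actually occur are stored, so the number of groups is known only after the data has been collected; the product of value counts is merely its upper bound.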

Here is an example that shows the amount of data involved. Let’s say that we have the following targeting variables:

  • Geographic location – 100 regions
  • Browser – 10 options
  • Sex – 2 options.

We assume that the values of a single variable are mutually exclusive (a user who is in Moscow cannot also be in St. Petersburg). Thus, for every advertising space, we need to process up to (100 + 1) × (10 + 1) × (2 + 1) = 3333 combinations (an “undefined” value is added to each variable).

If we imagine that we have not three variables but ten, and that each variable can have 100 values, the amount of data can grow to epic proportions. Of course, this is an upper-bound estimate: only a part of all those combinations will actually occur, and which part exactly can only be understood after the data has been collected.
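The arithmetic behind these numbers can be checked with a few lines of code. The helper name `max_combinations` is made up for this illustration; it simply multiplies the value counts, adding one “undefined” value per variable.

```python
from math import prod


def max_combinations(values_per_variable):
    """Upper bound on the number of groups per advertising space.

    Each variable gets one extra "undefined" value.
    """
    return prod(n + 1 for n in values_per_variable)


three_variables = max_combinations([100, 10, 2])  # regions, browsers, sex
ten_variables = max_combinations([100] * 10)      # ten variables, 100 values each
```

The first call gives the 3333 combinations from the example; the second gives 101 to the power of 10, which is on the order of 10^20 combinations per advertising space.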

As an alternative to storing a count for every combination of variables, the count can be approximated from shares of the general capacity. Let’s say we know the share of every targeting variable separately (which demands substantially fewer resources); take the browser and operating system, for example:

  • Browser – Internet Explorer makes up 20% of the general capacity
  • OS – Windows makes up 80% of the general capacity.

How can you approximate the share of the Internet Explorer & Windows combination? First of all, it obviously cannot be larger than the share of either variable, i.e. it does not exceed the minimum of the two, 20%. We can further suppose that the proportion of Internet Explorer users among Windows users equals the proportion of Internet Explorer users overall, that is, that the two variables are independent. Under this assumption, the Internet Explorer & Windows capacity is 20% × 80% = 16% (this estimate, of course, deviates from what we know to be the case: nearly every IE user is a Windows user). In the end, we get a rough estimate of 16-20% of the general capacity. A similar estimation scheme can be used for three or more variables. Naturally, such an estimate is not very accurate; it can be improved by additionally collecting statistics not only for separate variables but also for pairs and even triples of variables, for example. This is a trade-off: the number of combinations (and consequently the amount of data) is kept within bounds while the accuracy of the forecast rises.
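The two figures in this estimate can be computed as in the sketch below. The function name `estimate_intersection` is an assumption for this example. Note that the product is only an estimate under the independence assumption, not a guaranteed lower bound, while the minimum is a true upper bound.

```python
from math import prod


def estimate_intersection(shares):
    """Rough range for the share of traffic matching all conditions.

    `shares` holds each variable's share of the general capacity (0..1).
    Returns (independence_estimate, upper_bound).
    """
    independent = prod(shares)  # assumes the variables are independent
    upper_bound = min(shares)   # the intersection cannot exceed any single share
    return independent, upper_bound


# Internet Explorer: 20% of capacity, Windows: 80% of capacity.
independent, upper = estimate_intersection([0.20, 0.80])
```

For the example above this yields roughly 0.16 and 0.20. The same function accepts three or more shares, although the range it returns widens quickly as variables are added.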

Ad rotation

At this stage, we have an estimate for the capacity of some combinations of targeting variables. This data is not yet enough to compile a final forecast. As you know, several placements can rotate on one advertising space. Thus, in order to make a forecast for one placement, we must know all of the ads that will compete with it on the advertising space in the future. More importantly, adding a new ad to the system influences the forecasts of all other ads that intersect with it (an added placement “takes away” part of the remaining traffic).
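The “takes away” effect can be illustrated with a deliberately simplified sketch. It assumes that every competing ad targets exactly the same traffic as the new placement and that booked impressions are simply subtracted from the capacity; the function names are made up, and real rotation with partially overlapping targeting is more involved.

```python
def remaining_capacity(capacity, competing_bookings):
    """Impressions left for a new placement on one advertising space.

    capacity: forecast impressions for the period under the new ad's targeting.
    competing_bookings: impressions already promised to competing ads
    (assumed here to overlap fully with the new ad's targeting).
    """
    return max(0, capacity - sum(competing_bookings))


def can_deliver(capacity, competing_bookings, requested):
    """True if the requested impressions fit into what is left."""
    return requested <= remaining_capacity(capacity, competing_bookings)


left = remaining_capacity(100_000, [30_000, 50_000])
fits = can_deliver(100_000, [30_000, 50_000], 15_000)
too_many = can_deliver(100_000, [30_000, 50_000], 25_000)
```

In this made-up case 20,000 impressions remain, so a request for 15,000 fits and a request for 25,000 does not. Accepting the 15,000-impression placement would in turn reduce what is left for every ad added later.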

A sample system for rotation forecasting will be discussed in the following section.

Conclusion

So, to compile a proper forecast, we need to accumulate a sufficient amount of statistics on the advertising space. Moreover, the more accurate we want the forecast to be, the greater the memory consumption and processing time will be (the difference can be severalfold).

We’ve conditionally divided the formation of forecasts into two stages:

  • Statistical element – calculating advertising space capacity taking targeting into account
  • Dynamic element – calculating the final forecast for an advertising space taking ad rotation into account.