Testing methods are used to iteratively improve the performance of your website or app with regard to certain key performance indicators (KPIs).
Types of testing
There are various types of tests that can be used. The most common is the classic A/B test, a direct comparison between two versions of a page. Another method is the multivariate test, in which several page elements are varied at once and parts of the visitor flow are directed to the resulting combinations of the page. In addition, a simple before-and-after comparison can be used, but this is usually imprecise and not always conclusive.
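In practice, the split between versions is often done with a deterministic hash of a user ID, so each visitor consistently sees the same variant across visits. A minimal sketch, assuming hash-based bucketing (the function and variant names are illustrative, not from any particular tool):

```python
import hashlib

def assign_variant(user_id: str, variants=("A", "B")) -> str:
    """Deterministically assign a user to a test variant.

    Hashing the user ID keeps the assignment stable across visits,
    so each user always sees the same version of the page.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user always lands in the same variant:
assert assign_variant("user-42") == assign_variant("user-42")
```

With more entries in `variants`, the same scheme splits traffic across several page versions for a multivariate setup.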
Possibilities of testing
Since there are countless possibilities for testing, the KPIs used to evaluate the test result should always be defined first. A distinction should also be made between a primary (decisive) KPI and secondary KPIs. Frequently used metrics that can serve as the primary KPI are, for example, the click-through rate (CTR) and the conversion rate. An example of a secondary KPI is measuring clicks on individual elements to learn more about the test's impact on user behavior.
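Both metrics named above are simple ratios. A small sketch of how they might be computed (the function names are illustrative assumptions):

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """CTR: share of impressions that led to a click."""
    return clicks / impressions if impressions else 0.0

def conversion_rate(conversions: int, visitors: int) -> float:
    """Conversion rate: share of visitors who completed the goal."""
    return conversions / visitors if visitors else 0.0

# Example: 120 clicks out of 4,000 impressions is a CTR of 3 %.
ctr = click_through_rate(120, 4000)
cr = conversion_rate(30, 4000)
```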
Testing is usually not required for minor changes to the website, such as changing a font, or for a bug fix. However, even for changes to calls to action (CTAs) – for example, the button color or the CTA text – or when replacing a headline, it is worthwhile to set up a test. Page elements that users interact with generally hold great potential for improvement. In critical areas of the ordering process, on the other hand, changes should be made with caution. The traffic on the page is also decisive: if the visitor flow is too low, the test will not produce meaningful results. With low traffic, it may still be possible to switch to a more frequently occurring KPI, such as the CTR.
Significance of a test
Fundamentally, tests only provide a sound basis for decision-making once they become significant through a sufficiently large test group.
Even a test that initially trends negative should be allowed to run, since the result can change as user numbers grow. The reliability of a result can be determined with a significance calculator; a confidence level above 95% is generally regarded as a significant result. The decisive factors are the number of users who came into contact with the test and the difference in conversion rate between the variants. The smaller the number of users, the greater the difference between the variants has to be. Conversely, the larger the number of users, the faster a test becomes significant, even with smaller differences. Most testing tools offer significance calculators or have built-in indicators for the significance of a test.
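A common way such calculators work is a two-proportion z-test. The following is a simplified sketch using only the standard library, not the exact formula any particular tool uses; it illustrates the point above that the same one-percentage-point lift reaches a higher confidence level with ten times the users:

```python
from math import erf, sqrt

def significance(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided two-proportion z-test; returns the confidence level
    (1 - p-value) that the variants' conversion rates differ."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    z = abs(p_a - p_b) / se
    # Standard normal CDF via erf; two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
    return 1 - p_value

# Same 11% vs. 12% lift, different sample sizes:
small = significance(55, 500, 60, 500)      # 500 users per variant
large = significance(550, 5000, 600, 5000)  # 5,000 users per variant
```

With 500 users per variant the confidence level stays well below the 95% threshold; with 5,000 users per variant the same relative difference comes much closer to it.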
Further factors that are often disregarded are weekday and seasonal variations. Accordingly, a test should only be terminated before the end of a full week in exceptional cases, even if statistical significance has already been reached.
Risks and common errors
- Lack of comparability between the test and control group
In order to generate usable results, it is important to pay close attention to the comparability of the test groups. For example, if users in the test group must scroll to a certain position before an impression is counted, an impression in the control group should also only be counted from that position. Some users will probably leave the page before scrolling that far, which would otherwise distort the results.
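The scroll example above amounts to applying the same visibility rule to both groups when counting impressions. A sketch over hypothetical session records (the record layout and threshold are illustrative assumptions):

```python
# Hypothetical session records: (group, max_scroll_px, clicked)
sessions = [
    ("test", 900, True), ("test", 200, False),
    ("control", 950, False), ("control", 150, True),
]

THRESHOLD = 800  # scroll depth at which the element becomes visible

def impressions(group: str) -> int:
    """Count an impression only when the user scrolled far enough to
    see the element -- the rule is applied identically to both groups."""
    return sum(1 for g, scroll, _ in sessions
               if g == group and scroll >= THRESHOLD)
```

Counting every control-group page view as an impression, while the test group requires the scroll, would systematically favor one side.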
- Too global or wrong KPIs
Global KPIs (such as the conversion rate) are less suitable for tests on, for example, the homepage, because users are usually not directed from the homepage straight to the shopping cart. Unless traffic is very high, such indirect effects are difficult to prove with statistical significance. Alternatively, the CTR or the arrival rate could be used in this case.
- Several tests that influence each other
If several tests are performed simultaneously, they can influence each other and thus distort the results. You can counteract this by defining a separate control group (group zero) for each test. However, this method is limited by the amount of available traffic.
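One way to keep simultaneous tests from interfering is to slice traffic into mutually exclusive buckets, so each user is enrolled in at most one experiment with its own control group. A minimal sketch, assuming hash-based slicing (the experiment names and helper functions are illustrative, not from any specific tool):

```python
import hashlib

EXPERIMENTS = ["cta_color", "headline", "checkout_hint"]  # hypothetical tests

def _bucket(key: str, n: int) -> int:
    """Stable hash bucket in [0, n) for a given key."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % n

def enroll(user_id: str) -> tuple[str, str]:
    """Assign each user to exactly one experiment, then to a variant
    within it; the traffic slices of the experiments never overlap."""
    experiment = EXPERIMENTS[_bucket("slice:" + user_id, len(EXPERIMENTS))]
    variant = "B" if _bucket(experiment + ":" + user_id, 2) else "A"
    return experiment, variant
```

The trade-off mentioned above is visible here: with three parallel tests, each experiment only receives a third of the traffic, so every test takes longer to reach significance.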