App Store Optimization, at the top of the funnel, can be broken down into two areas: getting users to find your app and getting them to download it.  Efforts can be made to increase organic visibility and to drive additional users through paid UA, but whether that visibility comes mainly from search, third-party campaigns, or even a viral flash in the pan, all roads of discovery converge in one place: the app’s listing in the App Store or Play Store.  The elements a user sees once they get there must compel them to download.

Conversion optimization is critical to capturing potential users, wherever they come from.  While this can be done by launching new creatives and measuring the before-and-after impact, Google Play’s Experiments portal is especially useful when optimizing for conversion.  Tests are split across the same traffic at the same time, eliminating the need to account for outliers before and after a change.  As tests run, trends in conversion emerge, positive or negative, to guide what should be tested next.

But what if the trend is predicated on misleading data? Opportunities for further conversion optimization could be lost, or a false positive or false negative could be rolled out fully and eventually spell disaster for your app.  Below are some of the pitfalls of Google Play Experiments, what not to do, and how to evaluate why your Experiments may be failing.

Too many variables are in play

While not every element of a Product Page is available to test (the App Title, for example, is not), several key ones are.  These include:

  • Text assets:
    • Short Description
    • Full Description
  • Graphic assets:
    • App Icon
    • Screenshots
    • Feature Graphic
    • Promo Video

Beyond requiring text assets to be tied to an available language localization and capping you at five concurrent tests, Google does not impose many restrictions here.  You have the ability to test the icon, screenshots, and descriptions all at once, get a read on conversion metrics, and decide how to move forward.

The issue with testing this way, and why your experiment metrics may not hold up long-term, is the lack of clarity about what actually happened.  If a combined change to the icon, screenshots, and description shows a 3% uplift in conversion, where did it come from?  There is no way to know, since too many elements were tested at once.

It’s possible that the icon, if tested on its own, would have increased conversion by more than 3%, while the other elements were actually driving conversion down.  Not only can this lack of visibility into which element contributed to the change lower overall conversion, it also provides no guidance on what to test next.  The color of the icon, for example, can be carried through to the screenshots if it is shown to convert well, or avoided if the opposite is true.  By testing everything at once, it’s impossible to get this information and act upon it.
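As a rough numeric illustration (all figures below are hypothetical, and they assume, purely for simplicity, that individual effects roughly add up), a bundled test can report a net uplift while hiding the fact that one element carried the result and the others dragged it down:

    # Hypothetical illustration only: none of these numbers come from a real
    # Experiment, and real effects rarely combine this cleanly.
    baseline_cvr = 0.30  # conversion rate of the current listing

    # Assumed relative uplift if each element had been tested on its own
    individual_effects = {
        "icon": +0.05,         # new icon alone: +5%
        "screenshots": -0.01,  # new screenshots alone: -1%
        "description": -0.01,  # new description alone: -1%
    }

    # A bundled variant only ever reports the net result...
    net_uplift = sum(individual_effects.values())
    print(f"Bundled variant uplift: {net_uplift:+.0%}")  # +3%

    # ...so the stronger icon and the weaker screenshots/description are
    # indistinguishable, and there is nothing to guide the next test.
    for element, effect in individual_effects.items():
        new_cvr = baseline_cvr * (1 + effect)
        print(f"{element:12s} alone: {effect:+.0%} -> {new_cvr:.1%} conversion")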

This concept can be carried further into the individual elements themselves.  Screenshots, for instance, can be broken down into multiple elements:

  • Screenshot order
  • Background color or design
  • The text used in screenshots
  • In-app screenshot frame (device around the image, a simple border, etc.)
  • Many more design and composition elements

Ideally, these tests on individual elements should also be run with no other changes in place.  In doing so, you can see with more certainty what users respond to best, such as highlighting “social aspects” before “security aspects” by changing the screenshot order without changing the design.

Even if the outcome is positive, without breaking the change down into separate components and testing accordingly, all you will be able to understand is that “the new screenshots converted better.”

There isn’t enough data

The amount of time it takes to gather enough data to evaluate an Experiment varies from app to app, but it largely depends on your daily traffic volume.  If you are getting hundreds or thousands of downloads per day, Google will be able to establish a 90% confidence interval for the experiment much more quickly. With enough time, and depending on how the trend goes, Google may also declare one variant the winner of the test.

But be careful with this information – it could be one of the reasons you don’t see continued conversion growth as projected after the Experiment ends.

Some apps with low traffic volume may take weeks or months to see anything besides “Needs more data” in the Google Play Console.  Other apps with tens of thousands of daily downloads could see a confidence interval and a winner declared by Google in just one day.  But this declaration of a winner is simply added to the Experiments UI based on the latest numbers that have come in, and even Google recommends, at the top of the page, allowing tests to run for at least two days to eliminate statistical noise.
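To get a feel for why traffic volume matters, here is a minimal sketch of a standard two-proportion confidence interval with invented numbers.  Google does not publish the exact statistics behind the Experiments UI, so treat this purely as an illustration of why sample size matters, not as a reproduction of Google’s method:

    import math

    def conversion_ci(installs_a, visitors_a, installs_b, visitors_b, z=1.645):
        """90% confidence interval (z = 1.645) for the difference in conversion
        between variant B and variant A, using a normal approximation."""
        p_a = installs_a / visitors_a
        p_b = installs_b / visitors_b
        se = math.sqrt(p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b)
        diff = p_b - p_a
        return diff - z * se, diff + z * se

    # High-traffic app after a single day (hypothetical numbers): the interval
    # excludes zero, so a "winner" can be called almost immediately.
    print(conversion_ci(3000, 10000, 3150, 10000))

    # Low-traffic app after the same day: the interval straddles zero, which is
    # essentially what "Needs more data" is telling you.
    print(conversion_ci(30, 100, 32, 100))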

You may find that a test declared a winner one day flips back into “needs more data” territory, or even has a different winner declared.  Depending on the app, it is recommended to run tests for at least 7 days to account for any outlier days.  This is particularly critical for apps and games with peaks and troughs across weekends and weekdays; running a test from a Monday to a Wednesday when traffic is low, then expecting it to foretell how you will perform on peak Friday-to-Sunday days, may be why your Experiment is failing.
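A quick way to sanity-check whether a short test window can be trusted is to look at how much conversion swings across the week.  A minimal sketch, assuming you have exported daily visitors and installs from the Play Console (the figures below are invented):

    from collections import defaultdict
    from datetime import date

    # Hypothetical daily listing data: (date, store listing visitors, installs)
    daily = [
        (date(2024, 1, 1), 1200, 300),   # Monday
        (date(2024, 1, 2), 1150, 280),
        (date(2024, 1, 3), 1100, 270),
        (date(2024, 1, 4), 1300, 330),
        (date(2024, 1, 5), 2100, 610),   # Friday
        (date(2024, 1, 6), 2600, 800),
        (date(2024, 1, 7), 2500, 770),   # Sunday
    ]

    # Aggregate by weekday to see how much conversion moves across the week.
    totals = defaultdict(lambda: [0, 0])
    for day, visitors, installs in daily:
        totals[day.strftime("%A")][0] += visitors
        totals[day.strftime("%A")][1] += installs

    for weekday, (visitors, installs) in totals.items():
        print(f"{weekday:9s} {installs / visitors:.1%} conversion")

    # If weekend rates differ noticeably from midweek, a Monday-to-Wednesday
    # test says little about Friday-to-Sunday performance; run the full week.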

A shift in trends or traffic has happened

It is always important to test, gather findings, then continue to iterate.  But over the course of several rounds of testing, it’s possible that the initial test must be revisited due to changes in user trends, whether caused by seasonality, shifts in user behavior as a whole, or directly by your other App Store Optimization efforts.

One obvious example is seasonality.  An app icon that ran in December might not continue to perform in January; elements less obvious than Santa hats or snowflakes, such as background colors, may simply have fallen out of style.  This concept extends beyond seasonality into design trends in the Play Store as a whole, such as changes in competitor styles, Material Design styles, or a change to the UI of the Play Store itself.

One less obvious example is how a change you make to your App Store Optimization strategy can impact the results of a test.  It’s possible that during testing, a Short Description was starting to emerge as a winner, then suddenly shifted into underperforming territory.  If this happens, it may not mean that your Experiment failed on its own; look into anything you may have done that could cause a shift in traffic.  If, for example, you had primarily driven user acquisition through Facebook at the start of the test, then diverted budget into a TikTok campaign in the middle of it, the disparity in demographics may show through as a sudden shift in Experiment results.

Just like the pitfall of testing too many variables at once when setting up the test, you must consider the variable of your incoming traffic.  Some of these factors, like changes to your own user acquisition efforts, are largely within your control.  Some, such as seasonality, can be planned and accounted for.  Others, such as sudden virality or being featured, cannot be controlled but should be kept in mind when assessing the implications of an Experiment’s results.
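One way to catch this kind of shift is to compare the channel mix of your incoming traffic between the first and second halves of the test.  A minimal sketch with invented install counts; the channel names and the 10-point threshold are just examples:

    from collections import Counter

    # Hypothetical install counts per acquisition channel, e.g. pulled from
    # your UA or attribution reporting for each half of the test window.
    first_half = Counter({"facebook": 8200, "organic_search": 5100, "tiktok": 400})
    second_half = Counter({"facebook": 2100, "organic_search": 5000, "tiktok": 7800})

    def channel_share(counts):
        total = sum(counts.values())
        return {channel: installs / total for channel, installs in counts.items()}

    before, after = channel_share(first_half), channel_share(second_half)

    # Flag any channel whose share of traffic moved by more than 10 points;
    # a swing like this means the two halves of the Experiment were effectively
    # measured against different audiences.
    for channel in sorted(set(before) | set(after)):
        shift = after.get(channel, 0) - before.get(channel, 0)
        if abs(shift) > 0.10:
            print(f"{channel}: share shifted {shift:+.0%} during the test")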

Overall

Google Play Experiments are an essential tool when optimizing for conversion in the Play Store.  But like any tool, you need to know how to use it properly to get the job done, and in some cases not knowing how can cause damage.  If you ever ask yourself “why are my Experiments failing,” take a look at the factors above: how the tests are set up, how long they run for, and what is happening outside the test itself.  These can give you a better understanding of what the data means and a clearer picture of how to continue improving.