Automated UI testing and catching visual regressions

Automated UI testing is probably one of the most discussed - and most painful - areas in the quality assurance (QA) field. Automated UI tests tend to be flaky and produce false failures and, as such, cannot be relied on in the Continuous Integration (CI) process. And when QA engineers try to fix this flakiness, the tests tend to become more lenient and can even stop catching real issues.

Another common pain point is that UI development frameworks and tools evolve very quickly, but the same cannot be said of tools for automated UI testing. The majority of existing tools are Selenium-based and thus rely on the DOM. Other tools and services go with visual testing (e.g. screenshot comparison), but they are often based only on a far-too-restrictive pixel-to-pixel comparison.

At Showmax, we’ve decided to combine both approaches. Here’s a rundown of the advantages and disadvantages of each.

Selenium-based framework - advantages:

  • Support of major browsers
  • WebDriver bindings available in many languages
  • API allowing navigation through web pages and test business logic

Selenium-based framework - disadvantages:

  • Labor intensive
  • High maintenance costs
  • Stability issues
  • May not catch real issues, only tests inner structure (DOM) and not how a page looks

Image comparison tools - advantages:

  • Easy test development and maintenance
  • Solid stability (changes in DOM do not necessarily affect visualisation)
  • Ability to catch unexpected regressions (across browsers and resolutions)
  • Ability to test responsive design

Image comparison tools - disadvantages:

  • Dynamic regions cause issues with image comparison
  • Native tools usually use restrictive pixel-to-pixel comparison
  • Cloud solutions are expensive and less flexible

This comparison could certainly have more items, but these are the main points that convinced us to come up with our own solution for screenshot comparison.

Showmax visual regression testing tool (imagediff)

The main requirements for our new tool were stability, simple integration with existing tests, and simple maintenance. On top of that, we added a few more things to the list:

  • UI console (more below): We wanted a nice, easy-to-use UI that would show the final results (difference in comparison of baseline and tested images) and allow us to perform user actions such as:
    • Easy exclusion of unwanted regions
    • Replacement of current baseline with new images
  • Configuration of colour differences
  • Configuration of aliasing range

The last two points ended up being crucial. It turned out that browsers behave very differently, not just across platforms but also across versions. PNG images are a good example. They are lossless, but different browsers apply different color management and, as a result, the pixel comparison differs - sometimes quite significantly.

This can become a real headache if the comparison tool doesn’t allow the user to set color ranges to ignore such differences. The same applies to an aliasing range, which helps to overcome differences in fonts, rounded corners, and other rendering issues.
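
To illustrate what a configurable color range means in practice, here is a minimal sketch of a per-pixel comparison with a tolerance. It is not blink-diff's algorithm: the function name `countMismatchedPixels`, the flat RGBA array layout, and the Euclidean distance in RGB are all assumptions chosen for illustration.

```javascript
// Illustrative sketch only, NOT blink-diff's implementation.
// Assumption: both images are flat RGBA arrays of equal size (4 values per pixel).
function countMismatchedPixels(imageA, imageB, maxColorDelta) {
  if (imageA.length !== imageB.length) {
    throw new Error('Images must have the same dimensions');
  }
  let mismatches = 0;
  for (let i = 0; i < imageA.length; i += 4) {
    const dr = imageA[i] - imageB[i];
    const dg = imageA[i + 1] - imageB[i + 1];
    const db = imageA[i + 2] - imageB[i + 2];
    // Assumption: plain Euclidean distance in RGB; the alpha channel is ignored.
    const distance = Math.sqrt(dr * dr + dg * dg + db * db);
    if (distance > maxColorDelta) {
      mismatches += 1;
    }
  }
  return mismatches;
}

// Two pixels: the first differs slightly (a color-management shift),
// the second differs a lot (a real change).
const baselinePixels = new Uint8ClampedArray([255, 0, 0, 255, 10, 10, 10, 255]);
const capturedPixels = new Uint8ClampedArray([250, 2, 1, 255, 200, 10, 10, 255]);
console.log(countMismatchedPixels(baselinePixels, capturedPixels, 0));  // 2
console.log(countMismatchedPixels(baselinePixels, capturedPixels, 10)); // 1
```

With a tolerance of 0 this degenerates into strict pixel-to-pixel comparison, while a small positive tolerance absorbs slight color shifts and still flags the genuinely different pixel.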

Now, what happens if I want to test my changes and I don’t have the same environment (browser, installed fonts, …) on my machine? The answer is simple: it doesn’t work. You need to make sure the testing environment stays as consistent as possible at all times. We solved this problem with the BrowserStack service, but we’ll get to that later.

Architecture

We spent some time investigating possibilities and available libraries that could be used for implementation. Finally, we narrowed down the selection to OpenCV (Python) and blink-diff (JavaScript). Even though OpenCV is far superior in every way, we decided to go with Yahoo’s blink-diff.

Why blink-diff? Well, the main reason was that blink-diff actually does everything we need and is very simple to use. We didn’t want to involve complex libraries when everything could be done within a small, convenient package.

The blink-diff library itself only compares two images. We had to wrap it into something that could be easily used within our tests, to minimize the work required from users.

Implementation went through several iterations and feedback loops, resulting in the addition of several new options to the tool:

  • Ability to compare whole directories with images - baseline and screenshots directories
  • Easier configuration of comparison parameters - There are quite a lot of them, so only aliasing and color delta are allowed to be set by a user. The rest are hardcoded with empirically-found and tested values.
  • Quick comparison mode - Used predominantly by the baselining algorithm; in this mode the UI console is not started automatically.
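
As a rough illustration of the directory-level comparison, the sketch below pairs images from a baseline directory and a screenshots directory. Matching files by name, restricting to `.png` files, and the names `listPngNames` and `pairImages` are assumptions made for this example, not a description of how imagediff does it.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: returns the sorted .png file names found in a directory.
function listPngNames(dir) {
  return fs
    .readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith('.png'))
    .sort();
}

// Hypothetical helper: pairs baseline and captured images by file name and
// reports images that exist on only one side.
function pairImages(baselineDir, capturedDir) {
  const baselineNames = new Set(listPngNames(baselineDir));
  const capturedNames = new Set(listPngNames(capturedDir));
  const pairs = [];
  const missingBaseline = [];
  for (const name of capturedNames) {
    if (baselineNames.has(name)) {
      pairs.push({
        name,
        baseline: path.join(baselineDir, name),
        captured: path.join(capturedDir, name),
      });
    } else {
      missingBaseline.push(name); // new screenshot without a baseline yet
    }
  }
  const missingCapture = [...baselineNames].filter(
    (name) => !capturedNames.has(name)
  );
  return { pairs, missingBaseline, missingCapture };
}
```

Each entry in `pairs` could then be handed to an image comparison, while the two "missing" lists point at screenshots that still need a baseline or that were not captured at all.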

UI console

The UI console is a browser-based user interface running on Node.js. As mentioned above, it provides a user-friendly way to view test results and perform maintenance operations, and it’s started each time a comparison test fails. Here’s what it looks like:

[Figure: UI console showing one test.]

Clicking on an image in the UI console gives you a detailed full screen view:

[Figure: Detail of mismatches between baseline and selected test results.]

There is a baseline screenshot at the top, a new screenshot at the bottom, and the differences are highlighted in the middle.

Integration of imagediff with UI tests

Having a tool that can compare images is one thing, but using it in real-world testing projects is quite another. Our goal was to provide this tool, equipped with the latest updates, to any team at Showmax, independently and in the programming language the team uses. We didn’t want anyone to have to reimplement it from scratch for each platform.

We created a pipeline that builds libraries for the required platforms - currently, npm packages and Ruby gems. All libraries are built as very thin wrappers that start imagediff and gather the results.

Let’s have a look at some simple examples to better understand how to use it on different platforms.

Example of integration with Ruby

    require 'sm_imagediff'
    imagediff = ImageDiff.new('/baseline/path','/captured/path')
    result = imagediff.compare_images
    if result.failed?
      # handle comparison failure
    end

Example of integration with JavaScript

    const { compareImages, Status } = require('imagediff');

    const result = compareImages({
      baselinePath: '/baseline/path',
      capturedPath: '/captured/path'
    });
    if (result.status !== Status.IMAGES_SAME) {
      // handle comparison failure
    }

Website real world usage

We didn’t want to build a tool with too many complex features. The idea is that it should be as generic as possible, so as to empower each team to use the tool in its own way and to ensure that the common source code is not extended with platform-specific stuff. Here’s how this is done for the Showmax website.

The first task that had to be solved was baselining - taking screenshots that are used as a baseline for further testing. Let’s say one small test takes 5 screenshots and runs in 3 browsers at 3 resolutions. That gives us 5 × 3 × 3 = 45 screenshots for a single test. It’s certainly not feasible to do this manually.
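
The arithmetic can be made concrete with a small sketch that enumerates the screenshot matrix. The browser names, resolutions, step names, and file naming scheme below are hypothetical placeholders, not the values the website team uses.

```javascript
// Hypothetical helper: builds one entry per (browser, resolution, step) combination.
function screenshotMatrix(steps, browsers, resolutions) {
  const entries = [];
  for (const browser of browsers) {
    for (const resolution of resolutions) {
      for (const step of steps) {
        entries.push({
          browser,
          resolution,
          step,
          // Assumed naming scheme, for illustration only.
          fileName: `${browser}_${resolution}_${step}.png`,
        });
      }
    }
  }
  return entries;
}

// Placeholder values: 5 steps, 3 browsers, 3 resolutions.
const exampleMatrix = screenshotMatrix(
  ['step-1', 'step-2', 'step-3', 'step-4', 'step-5'],
  ['chrome', 'firefox', 'edge'],
  ['1920x1080', '1366x768', '375x667']
);
console.log(exampleMatrix.length); // 45
```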

Screenshots can easily be collected with any WebDriver binding. The website team uses webdriver.io because of its Node.js implementation.

Taking screenshots is not as simple as it initially seems. It turned out that WebDriver waits for the DOM to be ready and then takes a screenshot. Unfortunately, a page that lazily loads resources in the background may not be fully rendered when the screenshot is taken.

So, the tricky part was figuring out how to recognize that a page was fully rendered during automatic baseline collection. This is where imagediff comes into play. The idea is pretty easy and is best described with a flow diagram:

[Figure: flow diagram of the baselining algorithm]

A sliding window of screenshots is iteratively compared until the last two screenshots match, and the last screenshot in a successful window is saved as a baseline. The time between iterations increases linearly and, in case of a failure, the last screenshot is saved.

This makes test development very easy. All we needed to do was write the test code and run it for the first time. If there are no baseline screenshots, they will be collected. Then the developer needs to check them visually to see if they match expectations.

Once a baseline exists, further test executions run in real testing mode, comparing newly taken screenshots with the existing baseline. The algorithm ensuring test stability is much the same as for baselining: if a comparison fails, the screenshot is retaken, for up to five iterations, and the time between iterations again increases linearly.
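
The baselining loop can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual implementation: `captureStableScreenshot`, the injected `takeScreenshot` and `imagesMatch` callbacks, and the `baseDelayMs` value are hypothetical, and the default of five iterations is borrowed from the testing mode described above.

```javascript
// Hypothetical sketch of the "wait until two consecutive screenshots match" loop.
// takeScreenshot: async () => image        (e.g. backed by a WebDriver binding)
// imagesMatch:    async (a, b) => boolean  (e.g. backed by an image comparison)
async function captureStableScreenshot({
  takeScreenshot,
  imagesMatch,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  maxIterations = 5, // assumption for baselining
  baseDelayMs = 500, // assumption: not a value from the article
}) {
  let previous = await takeScreenshot();
  for (let iteration = 1; iteration <= maxIterations; iteration += 1) {
    // The wait grows linearly: 1x, 2x, 3x ... the base delay.
    await sleep(iteration * baseDelayMs);
    const current = await takeScreenshot();
    if (await imagesMatch(previous, current)) {
      // The last two screenshots match, so the page is considered rendered.
      return { image: current, stable: true, iterations: iteration };
    }
    previous = current; // slide the window forward by one screenshot
  }
  // No stable pair was found: fall back to the last screenshot taken.
  return { image: previous, stable: false, iterations: maxIterations };
}
```

Injecting the screenshot and comparison functions keeps the loop independent of any particular WebDriver binding, and the `stable` flag lets the caller decide how to treat a baseline that never settled.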

Infrastructure

The last technical task was to create a reliable infrastructure in which such test scenarios could run as a part of the CI pipeline. The solution was to extend our current infrastructure, based on GitLab and Jenkins, with BrowserStack and Git LFS.

Why BrowserStack?

As discussed, browser type, version, and operating system all have an outsized effect on screenshots.

This means that any team of more than one person needs to have a stable environment where all the baselining and subsequent testing happens. Rather than building it from scratch and maintaining it on our own, we decided to go with BrowserStack. This service provides several operating systems with dozens of versions for all major browsers. Parallel test execution is solved, as is request queueing, and it has nice tooling for test debugging.

Why Git LFS?

We demonstrated how many screenshots are taken for one simple test in the website example above. It is not a good idea to store a baseline of these screenshots in a standard Git repository because, with each new revision, the repository grows in size, and cloning a Git repository means downloading its entire history, including all revisions.

With that many screenshots and their revisions, the repository could easily end up with gigabytes of data. Git LFS replaces large files with text pointers inside Git, while storing the file contents on a remote server. This means a clone will only download a tiny piece of data for each revision.
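
To give a sense of how small such a text pointer is, the sketch below builds one for an in-memory file, following the pointer layout from the Git LFS specification (a version line, a SHA-256 object ID, and the size in bytes). The helper name `lfsPointerFor` is hypothetical, and the snippet is only an illustration: in practice the Git LFS client generates pointers itself.

```javascript
const crypto = require('crypto');

// Hypothetical helper: builds the text of a Git LFS pointer for a Buffer.
function lfsPointerFor(content) {
  const oid = crypto.createHash('sha256').update(content).digest('hex');
  return [
    'version https://git-lfs.github.com/spec/v1',
    `oid sha256:${oid}`,
    `size ${content.length}`, // Buffer length is the size in bytes
    '',
  ].join('\n');
}

// A stand-in for a 1 MiB screenshot: the pointer is well under 200 bytes.
const examplePointer = lfsPointerFor(Buffer.alloc(1024 * 1024));
console.log(examplePointer);
console.log(Buffer.byteLength(examplePointer), 'bytes');
```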

Conclusion

We have been using this approach to testing for the last several months, and the results have been largely positive. It’s helped us discover several bugs that we wouldn’t have noticed with Selenium-based tests. The red box in the image below highlights an issue that was caught on our sign-in page - one that could easily have been missed by the human eye.

[Figure: the issue caught on the sign-in page, highlighted by a red box]

The tests have been stable throughout, and we haven’t experienced a single failure due to a bad comparison.

On the negative side, we have been struggling with fine-tuning the blink-diff comparison parameters to ensure test stability and reliability. This has cost us a few round-trips with development teams. Their tests were either failing because the parameters were too strict, or they weren’t catching regressions because the parameters were too “forgiving.” That was also the reason why we decided to limit these options for users.

There is definitely a lot of room for improvement, and we’d love your feedback. Check out the open-sourced imagediff on our GitHub and feel free to comment here, find us on social media, or send an email, and let’s talk!
