Best of the Test: Using character recognition for browser automation

Introduction

I was venting recently to my colleague Chris Johnson about the frustration of working with yet another horrendous website that didn’t have any usable locators and he put to me the idea of “what if we had visual driven automation, where the automation treats the website as a complete black box, like a human user?”. It got me thinking, surely this could be possible already in some respect. As I am already very familiar with AWS (Amazon Web Services), I had a look if they had any services that could be useful for this and found their brand new Textract service, which lets you send image documents (.jpg, .png or .pdf) and analyses them for text. This then gave me the idea to try and use this service with Selenium to find elements on a page and click them without any need to interact or understand the HTML. This post is a report on my findings.

What is Textract and how does it work?

Textract is a type of OCR (Optical Character Recognition) service that detects text and data in image documents. It works by supplying it an image file and it responds with the results of its analysis, a list of words, sentences and objects (like forms and tables) that it has identified. Each word or sentence has a set of data including the location in terms a rectangular box and a percentage confidence of the accuracy of its findings. Behind the scenes it manages this via a pre-trained machine learning algorithm.

As with other AWS services, you can access Textract via the AWS CLI so for my research I used the boto3 library for Python3 which allows you to interact with the AWS CLI within Python very easily.

When you are using Textract, you receive JSON responses that look like this:

{
        "Blocks": [
            {
                "Geometry": {
                    "BoundingBox": {"Width": 1.0,"Top": 0.0,"Left": 0.0,"Height": 1.0},
                    "Polygon": [
                        {"Y": 0.0, "X": 0.0},
                        {"Y": 0.0, "X": 1.0},
                        {"Y": 1.0, "X": 1.0},
                        {"Y": 1.0,"X": 0.0}
                    ]
                },
            "Text": "Store",
            "Confidence": "99.1231233333",
            "BlockType": "WORD"
        }
        ]
}

My test

So the idea for my test was to create a simple script that would:

Use Selenium to load a website.
Take a screenshot.
Send the screenshot to Textract.
Find the location of some text on the website that would take us to a new page.
Click the location on the website.
Check that we had gone to the correct new page.

Pre-requisites

In order to get this test working, this is what I needed:

A computer with Python3 installed and the relevant libraries (boto3, Selenium).
Chromedriver and Chrome installed.
An AWS account with access to Textract. At the time of writing Textract is in preview and you need to ask Amazon nicely for access. It took about a week for me to get access.
An AWS IAM (Identity and Access Management) user on my AWS account that has the permissions to interact with Textract.
AWS CLI installed on my machine and configured with the IAM user and an AWS region that Textract is available in (its currently only available in a few regions).

I chose to use Python because I’m very familiar with it, and it allows me to write a simple script like this very rapidly with little setup needed. You can use any programming language, you just need to be able to interact with the AWS CLI.

My findings

Did it work? Yes, effectively. I was able to create a script that did all of the above and successfully navigate between pages...but with a very big caveat - only if the text appeared in the top left corner of the page in any resolution. Why? When you request a screenshot from web browsers, they do not take a screenshot of what you can see, but instead they take a full page screenshot. This means you cannot easily map the pixel locations from the screenshots to the browser window, particularly when the real browser window is a smaller resolution and the website dynamically changes visible location of elements. Even at larger resolutions there is a small difference in resolution sizes because you almost never display the full page in a window.

This means that my idea has limited usefulness until I could find a better way of generating screenshots that are just the visible area rendered by the browser.

However, in principle my idea did work if you can provide an image that maps 1:1 with the browser window, but there are still other issues. I tried testing against different websites just to see what other issues would crop up:

Naturally you need relatively clear text in order for Textract to confidently find it. However, this could be argued that unclear text would be an accessibility issue anyway and in the below screenshot you can see Textract does a reasonable job of even reading unusual fonts or text that isn’t perfectly straight.

Obviously if you want to find elements that are not text, Textract isn’t useful at all. Potentially other pattern recognition services could be useful though if you could provide it an image pattern to find rather than just a text string.
It could be tricky to figure out which is the right element if there is more than one example of some text on the page.
While Textract is quite fast (roughly 2 seconds or so for it to return a response), some dynamic elements of the page could change in that time. Some website have scrolling elements or pop-ups that only appear after a short time. This means you might not be able to totally run a black box test as you need to check for these elements before proceeding.

Due to these issues, I don’t think you could totally remove the need for HTML locators yet. But there is certainly some potential here to expand the abilities of automation tests and maybe make parts of them more powerful. Even if you don’t use this tech to replace locators for Selenium, there are other applications that might not have been easy before - such as testing PDF files generated by the systems you test.

Other considerations

If you were to consider using this tech in your automation, there are some other factors to consider:

Costs, AWS charge for the use of Textract. Depending on your context and how you use it, this may be a limiting factor. The pricing can be found here: https://aws.amazon.com/textract/pricing/
Data privacy, AWS state they may store and use the documents you upload to maintain and improve the machine learning algorithm behind Textract. You may work in a workplace where the screenshots contain sensitive information that AWS employees are not authorised to look at. You can request AWS delete documents you upload, but this is a manual process of contacting their support team.
While I did find Textract responded quite quickly (around 2 seconds), I don’t know its capabilities under significant load nor did I hit the API request limits. I was only running very low numbers of requests.
Textract is currently only available in certain AWS regions, so if you require data be held in particular geographic locations then this is may be a limiting factor.

Of course, other services similar to AWS Textract may be available which overcome these factors.

Summary

This was a fun little exploration into technology I’ve not used before and it was nice to find it so easy to use. I think there is some potential in using this service for automation tooling, particularly if you need to analyse image documents.

If you’re looking to totally remove the need for HTML locators, I think there is more work required though, if you know of a good solution to the browser screenshot resolution problem let me know!

My code

Here’s my code if you’d like to try it yourself:

https://github.com/Ardius/python-selenium-textract/blob/master/textract_test.py

Friday, 17 May 2019

Using character recognition for browser automation