How we combine different (AI-driven) techniques to detect passport scans

Our Nalytics Search & Discovery platform has been helping companies structure their data for many years. Over time, clients have asked whether our software can also help them find sensitive non-text files, such as scans and photos of passports, ID cards, driver’s licenses and residence permits.

At many companies, scans of identity documents are scattered across SharePoint environments, servers, shared disks and mailboxes. Under the GDPR, companies need to clean up these unsecured images. But to delete them, you must know which files contain them and how to access those files, which is often complex.

Seeing that this would be a problem for many of our customers, we developed a technique for detecting images of identity documents. In this blog post I will explain how we do it.

First, all of the organisation’s files are indexed. During this process, images and PDF files are OCR’d: the letters and punctuation marks in the images are converted to text. We only OCR images above a certain size, so we avoid OCRing every company logo or email signature the software discovers.
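To make the size filter concrete, here is a minimal sketch in Python. It is not our production code: the `ImageInfo` type, the function names and the pixel thresholds are all hypothetical, chosen only to illustrate the idea of skipping small images before the expensive OCR step.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical thresholds for illustration; the real cut-off is not stated here.
MIN_WIDTH_PX = 300
MIN_HEIGHT_PX = 200


@dataclass
class ImageInfo:
    """Minimal metadata about an image found during indexing (hypothetical type)."""
    path: str
    width: int
    height: int


def worth_ocr(image: ImageInfo) -> bool:
    """Return True when the image is large enough to plausibly hold a document scan."""
    return image.width >= MIN_WIDTH_PX and image.height >= MIN_HEIGHT_PX


def select_for_ocr(images: list[ImageInfo]) -> list[ImageInfo]:
    """Keep only the images that pass the size check, preserving their order."""
    return [image for image in images if worth_ocr(image)]
```

With these example thresholds, a 120 x 40 pixel logo would be skipped and a 1600 x 1100 pixel scan would be passed on to OCR.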

After all data has been indexed, we run an extensive search query. We search for all possible word combinations that occur in identity documents, while taking into account the mistakes that may have been made during the OCR process. If a scan or photo of a passport is of low quality, letters can be misread, but you obviously do not want to exclude those images.
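One way to tolerate OCR mistakes in a search term can be sketched as follows. This is an illustration only, not the actual Nalytics query: the confusion sets, the example terms and the function names are assumptions. The idea is to turn each search term into a regular expression in which every character also accepts the characters it is commonly misread as.

```python
from __future__ import annotations

import re

# Illustrative sets of characters that OCR often mixes up (an assumption,
# not the list used in the real query).
CONFUSIONS = {
    "O": "O0",
    "S": "S5",
    "I": "I1L|",
    "L": "L1I|",
    "B": "B8",
}


def ocr_tolerant_pattern(term: str) -> re.Pattern[str]:
    """Build a case-insensitive regex that accepts common OCR substitutions."""
    parts = []
    for ch in term.upper():
        if ch.isspace():
            parts.append(r"\s+")  # allow any run of whitespace between words
        else:
            options = CONFUSIONS.get(ch, ch)
            parts.append("[" + "".join(re.escape(c) for c in options) + "]")
    return re.compile("".join(parts), re.IGNORECASE)


def find_id_terms(text: str, terms: list[str]) -> list[str]:
    """Return the terms that occur in the OCR text, allowing for misread characters."""
    return [term for term in terms if ocr_tolerant_pattern(term).search(text)]
```

For example, `find_id_terms("DATE 0F B1RTH 01 JAN 1980", ["PASSPORT", "DATE OF BIRTH"])` returns `["DATE OF BIRTH"]`, even though the OCR output contains a zero and a one where letters should be. This sketch only covers substituted characters; missing or extra characters would need a fuzzier comparison, such as an edit-distance threshold.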

A lot of research has gone into compiling this query. You don’t want too many false positives, but you certainly don’t want any false negatives, because that would mean overlooking scans of identity documents.

This search query leaves us with a subset consisting largely of files in which an identity document has been found. We then perform a final check to see whether these files actually contain ID images.

The customer eventually receives a report listing all of these files, including the precise path of each file. The customer can clean up the files themselves based on the report, or we can remove the images for them. The image below shows an example of what a PDF file might look like before and after editing by Nalytics.

Organisational data can run to many terabytes, so the cost of an operation like this can quickly become high. Because we combine different techniques and narrow the original data set down to smaller subsets in several steps, we manage to keep costs low.
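The cost argument can be sketched as a simple funnel: each stage only sees the files that survived the previous one, so the cheap checks run on everything and the expensive checks run on very little. The code below is a hypothetical illustration with made-up stage names and toy checks, not our actual pipeline.

```python
from __future__ import annotations

from typing import Callable


def run_funnel(
    paths: list[str],
    stages: list[tuple[str, Callable[[str], bool]]],
) -> tuple[list[str], dict[str, int]]:
    """Apply each stage in order and record how many files survive it."""
    remaining = list(paths)
    survivors_per_stage: dict[str, int] = {}
    for name, keep in stages:
        remaining = [path for path in remaining if keep(path)]
        survivors_per_stage[name] = len(remaining)
    return remaining, survivors_per_stage


# Toy stand-ins for the real checks, ordered from cheapest to most expensive.
example_stages = [
    ("image or PDF", lambda p: p.lower().endswith((".jpg", ".png", ".pdf"))),
    ("query hit", lambda p: "scan" in p),       # stands in for the search query
    ("final check", lambda p: "draft" not in p),  # stands in for the last check
]

example_files = ["notes.docx", "logo.png", "scan_01.pdf", "scan_draft.jpg"]
flagged, counts = run_funnel(example_files, example_stages)
```

In this toy run, four files enter the funnel, three pass the file-type stage, two pass the query stage and one reaches the report. The most expensive check therefore runs on two files instead of four.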

Want to know what we can do for your organisation? Book a demo or contact us for more information.