How we combine different (AI-driven) techniques to detect passport scans

Our Nalytics Search & Discovery platform has been helping companies structure their data for many years. Over time, clients have asked whether our software can also help them find sensitive non-text files, such as scans and photos of passports, ID cards, driver’s licences and residence permits.

At many companies, scans of identity documents are scattered across SharePoint environments, shared disks and mailboxes. Under the GDPR, companies need to clean these up. But to delete these unsecured images, you must first know which files contain them and how to access those files, which is often complex.

Seeing that this was a problem for many of our customers, we developed a technique that detects all images of identity documents. In this blog post I will explain how we do it.

First, all of the organisation’s files are indexed. During this process, images and PDF files are run through OCR (optical character recognition), which converts the letters and punctuation marks in an image to text. We only OCR images above a certain size, so we avoid processing every company logo or email signature the software discovers.
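As a rough illustration of the size filter, here is a minimal Python sketch. The pixel thresholds, the `ImageInfo` type and the `worth_ocr` function are assumptions made for this example; they are not the cut-off or the code Nalytics actually uses.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the real cut-off is not stated in this post.
MIN_SHORT_SIDE_PX = 200
MIN_LONG_SIDE_PX = 300


@dataclass
class ImageInfo:
    path: str
    width: int
    height: int


def worth_ocr(image: ImageInfo) -> bool:
    """Return True when the image is large enough to plausibly be a scan."""
    # Sort the sides so that rotated scans are treated the same way.
    short_side, long_side = sorted((image.width, image.height))
    return short_side >= MIN_SHORT_SIDE_PX and long_side >= MIN_LONG_SIDE_PX


images = [
    ImageInfo("logo.png", 120, 40),
    ImageInfo("signature.jpg", 250, 80),
    ImageInfo("scan_0001.jpg", 1240, 1754),
]

# Only images that pass the size check are sent on to the OCR step.
to_ocr = [img.path for img in images if worth_ocr(img)]
print(to_ocr)
```

In this sketch the logo and the email signature are skipped and only the page-sized scan is queued for OCR.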

After all data has been indexed, we run an extensive search query. We search for the word combinations that occur in identity documents, while taking into account the mistakes that may have been made during the OCR process. If a passport photo is of low quality, letters can be misread, but you do not want to exclude these images.
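To give an idea of what OCR-tolerant matching can look like, here is a small Python sketch that uses the standard library’s `difflib` to compare OCR tokens with a keyword list. The keyword list, the similarity threshold and the function names are assumptions for illustration; the actual Nalytics query is far more extensive and is not reproduced here.

```python
import re
from difflib import SequenceMatcher

# Hypothetical keyword list; a real query would cover many more terms.
ID_KEYWORDS = ["passport", "nationality", "surname", "birth", "expiry"]


def fuzzy_hits(ocr_text, keywords=ID_KEYWORDS, threshold=0.8):
    """Return the keywords that approximately occur in the OCR text."""
    tokens = re.findall(r"[a-z0-9]+", ocr_text.lower())
    hits = set()
    for keyword in keywords:
        for token in tokens:
            # ratio() is 1.0 for identical strings and drops as they diverge,
            # so "passp0rt" still counts as a match for "passport".
            if SequenceMatcher(None, keyword, token).ratio() >= threshold:
                hits.add(keyword)
                break
    return hits


def looks_like_id(ocr_text, min_hits=2):
    """Flag a file when enough distinct keywords are (fuzzily) present."""
    return len(fuzzy_hits(ocr_text)) >= min_hits


print(fuzzy_hits("PASSP0RT  Nati0nality: Dutch"))
print(looks_like_id("Quarterly sales report for the board"))
```

Requiring several distinct keywords is one way to keep false positives down while still accepting misread letters; lowering `threshold` or `min_hits` trades precision for recall.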

A lot of research has gone into compiling this query. You don’t want too many false positives, but you certainly don’t want any false negatives, because that would mean overlooking scans of identity documents.

This search query leaves us with a subset that consists largely of files in which an ID has been found. We then perform a final check to confirm whether these files actually contain ID images.

The customer then receives a report listing all these files, including the precise path of each file. The customer can clean up the files themselves based on the report, or we can remove the images for them. The image below shows an example of what a PDF file might look like before and after editing by Nalytics.

Organisational data can run to many terabytes, so the cost of an operation like this can quickly become high. Because we combine different techniques and narrow the original data set down to smaller subsets in successive steps, we manage to keep costs low.
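The idea of narrowing the data set step by step can be sketched as a simple funnel in Python. The stage functions below are hypothetical stand-ins (a file-extension check and a placeholder for the search query), not the real Nalytics stages; the point is only that each cheap filter shrinks the set before a more expensive step runs.

```python
from typing import Callable, Iterable, List, Tuple

Stage = Tuple[str, Callable[[str], bool]]

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".tif", ".pdf")


def run_funnel(paths: Iterable[str], stages: List[Stage]):
    """Apply the stages in order and record how many files survive each one."""
    remaining = list(paths)
    counts = [("input", len(remaining))]
    for name, keep in stages:
        remaining = [path for path in remaining if keep(path)]
        counts.append((name, len(remaining)))
    return remaining, counts


def is_image_or_pdf(path: str) -> bool:
    # Cheap check on the file name only.
    return path.lower().endswith(IMAGE_EXTENSIONS)


def matches_id_query(path: str) -> bool:
    # Placeholder for the search query over OCR text (assumption for the demo).
    return "scan" in path.lower()


files = ["report.docx", "logo.png", "scan_passport.pdf", "Scan_0002.JPG", "notes.txt"]
candidates, counts = run_funnel(
    files,
    [("image or PDF", is_image_or_pdf), ("query match", matches_id_query)],
)
print(counts)
print(candidates)
```

Ordering the stages from cheapest to most expensive means the costly final check only has to look at the small set of candidates that survived the earlier filters.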

Want to know what we can do for your organisation? Book a demo or contact us for more information.
