Using Google Cloud Vision to Build ML Training Dataset

While building an OCR system for passports, I needed to crop out MRZ (Machine Readable Zone) from passports. In order to build the training set, I needed ground truth values. MRZ of the passports needed to be annotated. The two popular ways for this were either labellilng manually using tools like LabelImg or outsourcing using something like Amazon Mechanical Turk. I used Google Cloud Vision API rather because I knew it does well with labelling such things and returning bounds within such documents. The following script uses document text annotation example from and improves over it. The code provided by Google needed a fix nonetheless since it hits API with same request multiple times which would increase our cost. API requests were separately made for Page, Paragraph and Word detection while the same could be done by reusing the response from single request. However, we are only using Paragraph detection to find the bounding box for MRZ. The modified script also optionally uses Redis to store the serialized vertices of the bounds and reuses them if the same image needs to processed again. The script is only used for building training data but not for production. The output of this script is used for training for pattern matching which is then used in production. This same technique can be used to detect other sections in passport or other document types. Just adjust the index in the assignment of `target_bound` variable. Before running the script, you need to download your GCP credentails set the path to the credentials json as GOOGLE_APPLICATION_CREDENTIALS environment variable. export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/creditials_file_name.json" import argparse from enum import Enum import io from import vision from import types from import BoundingPoly from PIL import Image, ImageDraw use_redis = False try: import redis try: cache = redis.Redis(host='localhost', port=6379, db=0) # execute a command to test connection cache.client_list() use_redis = True except redis.exceptions.ConnectionError: pass except ImportError: pass class FeatureType(Enum): PAGE = 1 BLOCK = 2 PARA = 3 WORD = 4 SYMBOL = 5 def draw_boxes(image, bounds, color): """Draw a border around the image using the hints in the vector list.""" draw = ImageDraw.Draw(image) for bound in bounds: draw.polygon([ bound.vertices[0].x - 10, bound.vertices[0].y, bound.vertices[1].x + 10, bound.vertices[1].y, bound.vertices[2].x + 5, bound.vertices[2].y, bound.vertices[3].x, bound.vertices[3].y], None, color) return image def get_document_bounds(document, feature): """Returns document bounds given an image.""" bounds = [] # Collect specified feature bounds by enumerating all document features for page in document.pages: for block in page.blocks: for paragraph in block.paragraphs: for word in paragraph.words: for symbol in word.symbols: if feature == FeatureType.SYMBOL: bounds.append(symbol.bounding_box) if feature == FeatureType.WORD: bounds.append(word.bounding_box) if feature == FeatureType.PARA: bounds.append(paragraph.bounding_box) if feature == FeatureType.BLOCK: bounds.append(block.bounding_box) if feature == FeatureType.PAGE: # noinspection PyUnboundLocalVariable bounds.append(block.bounding_box) # The list `bounds` contains the coordinates of the bounding boxes. return bounds def render_doc_text(directory): from os import listdir, path dir_content = [f for f in listdir(directory) if not '_bounded' in f] files = [f for f in dir_content if path.isfile(path.join(directory, f))] for file_name in files: file = path.join(directory, file_name) client = vision.ImageAnnotatorClient() with, 'rb') as image_file: content = target_bound_str = cache.get(file) if use_redis else None if target_bound_str: target_bound = BoundingPoly().FromString(target_bound_str) else: # noinspection PyUnresolvedReferences image_type = types.Image(content=content) response = client.document_text_detection(image=image_type) document = response.full_text_annotation bounds = get_document_bounds(document, FeatureType.PARA) if len(bounds): target_bound = bounds[-1] target_bound_str = target_bound.SerializeToString() if use_redis: cache.set(file, target_bound_str) # target_bound_str exists implies target_bound exists if target_bound_str: image = # noinspection PyUnboundLocalVariable draw_boxes(image, [target_bound], 'green') sans_ext, ext = path.splitext(file) out_name = sans_ext + '_bounded' + ext if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('directory', help='Directory where image files are.') args = parser.parse_args() parser = argparse.ArgumentParser() render_doc_text(