Processing RVL-CDIP dataset using Spark OCR on Databricks

Mykola Melnyk
3 min read · Mar 25, 2021

We decided to validate some of our models on the RVL-CDIP dataset. To use it, we need to extract text from the images in HOCR format. Let's do it using Spark OCR on Databricks.

RVL-CDIP

The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels.

Downloading and decompressing

The dataset is hosted on Google Drive as rvl-cdip.tar.gz. To download it we can use the gdown utility:
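A minimal sketch using gdown's Python API (the Google Drive file id below is a placeholder, not the real one):

```python
def drive_url(file_id: str) -> str:
    """Build the direct-download URL that gdown understands."""
    return f"https://drive.google.com/uc?id={file_id}"


def download_rvl_cdip(output: str = "rvl-cdip.tar.gz") -> str:
    """Download the archive to the driver's local disk (pip install gdown).

    FILE_ID is a hypothetical placeholder -- substitute the actual
    Google Drive file id of rvl-cdip.tar.gz.
    """
    import gdown  # imported lazily so the URL helper has no dependencies

    FILE_ID = "<google-drive-file-id>"
    return gdown.download(drive_url(FILE_ID), output, quiet=False)
```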

Decompress it:
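The original post used the tar command; the same step can be sketched with Python's standard tarfile module:

```python
import tarfile


def extract_archive(archive: str = "rvl-cdip.tar.gz", dest: str = "rvl-cdip") -> None:
    """Extract the archive as-is, preserving its nested folder structure."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```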

The dataset has a nested folder structure, and when I tried to load the data using Spark it was very slow: the file listing did not complete within an hour. So I cancelled the job and decided to flatten the structure using the --strip-components option of the tar command. It took about 10 minutes.
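A Python equivalent of the --strip-components trick, assuming the image basenames are unique (rename collisions would silently overwrite files):

```python
import os
import tarfile


def extract_flat(archive: str = "rvl-cdip.tar.gz", dest: str = "rvl-cdip-flat") -> None:
    """Re-extract every file into a single flat directory, dropping the
    leading path components (a sketch of what tar --strip-components does)."""
    os.makedirs(dest, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            member.name = os.path.basename(member.name)  # drop the directories
            tar.extract(member, dest)
```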

As a result we have all the images in one directory:

Load into Spark

To process it on a Spark cluster, we need to put the dataset on distributed storage. We have a few options here: DBFS, S3, Azure Blob Storage.

I tried to copy the dataset to DBFS and S3, but it was very slow, so I decided to use a single-node cluster and process the data from the local file system.

Read the images dataset as binary files into a Spark DataFrame:
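A sketch using Spark's built-in binaryFile data source (the `spark` session is provided by the Databricks notebook; the local path is an assumption for a single-node cluster):

```python
# Read the flattened images from the driver's local disk.
# Each row carries path, modificationTime, length and the raw bytes (content).
df = spark.read.format("binaryFile") \
    .option("pathGlobFilter", "*.tif") \
    .load("file:/databricks/driver/rvl-cdip-flat/")

df.count()  # the dataset has 400,000 images
```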

With the current flat file structure and the local file system this was very fast.

Display a few images using display_images:
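A minimal sketch, assuming the binary DataFrame is named `df` and using the `display_images` helper from `sparkocr.utils`, which expects rows in Spark OCR's image schema (produced here by `BinaryToImage`):

```python
from sparkocr.transformers import BinaryToImage
from sparkocr.utils import display_images

# Convert the raw bytes to the Spark OCR image schema, then render a few.
image_df = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .transform(df)

display_images(image_df, limit=3)
```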

Define OCR pipeline

We need only two transformers in this case: BinaryToImage and ImageToHocr:
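A sketch of the two-stage pipeline; the column names are assumptions following Spark OCR conventions, and `PipelineModel` is used directly because neither transformer needs fitting:

```python
from pyspark.ml import PipelineModel
from sparkocr.transformers import BinaryToImage, ImageToHocr

# Stage 1: decode the raw file bytes into the Spark OCR image schema.
binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

# Stage 2: run Tesseract-based OCR and emit the result as HOCR markup.
image_to_hocr = ImageToHocr() \
    .setInputCol("image") \
    .setOutputCol("hocr")

pipeline = PipelineModel(stages=[binary_to_image, image_to_hocr])
```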

Call the pipeline:
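Assuming the binary DataFrame is named `df`, applying the pipeline is one call (Spark evaluates lazily, so nothing runs until an action is triggered):

```python
# Lazily attach the OCR stages to the DataFrame; execution happens
# only when an action (count, show, write, ...) is invoked.
result = pipeline.transform(df)
```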

Process 10 records and display the results:
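A quick smoke test on a small sample before committing to the full run (again assuming the DataFrame `df` and the pipeline defined earlier):

```python
# OCR only 10 images and inspect the HOCR output inline.
sample = pipeline.transform(df.limit(10))
sample.select("path", "hocr").show(truncate=100)
```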

Process the whole dataset and store the results to S3:
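The full run can be sketched as a single write job; the bucket name and output layout are placeholders, and Parquet is one reasonable choice of format:

```python
# OCR the full dataset and persist path + HOCR to S3 as Parquet.
# "<your-bucket>" is a placeholder for a real S3 bucket name.
pipeline.transform(df) \
    .select("path", "hocr") \
    .write \
    .mode("overwrite") \
    .parquet("s3a://<your-bucket>/rvl-cdip-hocr/")
```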

Done!
