Processing RVL-CDIP dataset using Spark OCR on Databricks

Mykola Melnyk
3 min read · Mar 25, 2021

We decided to validate some of our models on the RVL-CDIP dataset. To use it, we need to extract text from the images in HOCR format. Let's do it using Spark OCR on Databricks.

RVL-CDIP

The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels.

Downloading and decompressing

The dataset is hosted on Google Drive as rvl-cdip.tar.gz. To download it we can use the gdown utility:
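A minimal sketch using gdown's Python API (the Google Drive file id below is a placeholder, not the real one):

```python
def drive_url(file_id: str) -> str:
    """Build the direct-download URL that gdown understands."""
    return f"https://drive.google.com/uc?id={file_id}"


def download_rvl_cdip(output: str = "rvl-cdip.tar.gz") -> str:
    """Download the archive to the driver's local disk (pip install gdown).

    FILE_ID is a hypothetical placeholder -- substitute the actual
    Google Drive file id of rvl-cdip.tar.gz.
    """
    import gdown  # imported lazily so the URL helper has no dependencies

    FILE_ID = "<google-drive-file-id>"
    return gdown.download(drive_url(FILE_ID), output, quiet=False)
```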

Decompress it:
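The original post used the tar command; the same step can be sketched with Python's standard tarfile module:

```python
import tarfile


def extract_archive(archive: str = "rvl-cdip.tar.gz", dest: str = "rvl-cdip") -> None:
    """Extract the archive as-is, preserving its nested folder structure."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```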

The dataset has a nested folder structure, and when I tried to load the data using Spark it was very slow: the file listing did not complete within an hour. So I cancelled the job and decided to flatten the structure using the --strip-components option of the tar command. It took about 10 minutes.
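A Python equivalent of the --strip-components trick, assuming the image basenames are unique (rename collisions would silently overwrite files):

```python
import os
import tarfile


def extract_flat(archive: str = "rvl-cdip.tar.gz", dest: str = "rvl-cdip-flat") -> None:
    """Re-extract every file into a single flat directory, dropping the
    leading path components (a sketch of what tar --strip-components does)."""
    os.makedirs(dest, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            member.name = os.path.basename(member.name)  # drop the directories
            tar.extract(member, dest)
```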

As a result we have all the images in one directory:

Load into Spark

To process it on a Spark cluster, we need to put the dataset on distributed storage. We have a few options here: DBFS, S3, Azure Blob Storage.

I tried to copy the dataset to DBFS and S3, but it was very slow, so I decided to use a single-node cluster and process the data from the local file system.

Read the images dataset as binary files into a Spark DataFrame:
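A sketch using Spark's built-in binaryFile data source (the `spark` session is provided by the Databricks notebook; the local path is an assumption for a single-node cluster):

```python
# Read the flattened images from the driver's local disk.
# Each row carries path, modificationTime, length and the raw bytes (content).
df = spark.read.format("binaryFile") \
    .option("pathGlobFilter", "*.tif") \
    .load("file:/databricks/driver/rvl-cdip-flat/")

df.count()  # the dataset has 400,000 images
```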

With the current flat file structure and the local file system this was very fast.

Display a few images using display_images:
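A minimal sketch, assuming the binary DataFrame is named `df` and using the `display_images` helper from `sparkocr.utils`, which expects rows in Spark OCR's image schema (produced here by `BinaryToImage`):

```python
from sparkocr.transformers import BinaryToImage
from sparkocr.utils import display_images

# Convert the raw bytes to the Spark OCR image schema, then render a few.
image_df = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .transform(df)

display_images(image_df, limit=3)
```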

Define OCR pipeline

We need only two transformers in this case: BinaryToImage and ImageToHocr:
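A sketch of the two-stage pipeline; the column names are assumptions following Spark OCR conventions, and `PipelineModel` is used directly because neither transformer needs fitting:

```python
from pyspark.ml import PipelineModel
from sparkocr.transformers import BinaryToImage, ImageToHocr

# Stage 1: decode the raw file bytes into the Spark OCR image schema.
binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

# Stage 2: run Tesseract-based OCR and emit the result as HOCR markup.
image_to_hocr = ImageToHocr() \
    .setInputCol("image") \
    .setOutputCol("hocr")

pipeline = PipelineModel(stages=[binary_to_image, image_to_hocr])
```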

Call the pipeline:
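Assuming the binary DataFrame is named `df`, applying the pipeline is one call (Spark evaluates lazily, so nothing runs until an action is triggered):

```python
# Lazily attach the OCR stages to the DataFrame; execution happens
# only when an action (count, show, write, ...) is invoked.
result = pipeline.transform(df)
```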

Process 10 records and display the results:
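A quick smoke test on a small sample before committing to the full run (again assuming the DataFrame `df` and the pipeline defined earlier):

```python
# OCR only 10 images and inspect the HOCR output inline.
sample = pipeline.transform(df.limit(10))
sample.select("path", "hocr").show(truncate=100)
```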

Process the whole dataset and store the results to S3:
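The full run can be sketched as a single write job; the bucket name and output layout are placeholders, and Parquet is one reasonable choice of format:

```python
# OCR the full dataset and persist path + HOCR to S3 as Parquet.
# "<your-bucket>" is a placeholder for a real S3 bucket name.
pipeline.transform(df) \
    .select("path", "hocr") \
    .write \
    .mode("overwrite") \
    .parquet("s3a://<your-bucket>/rvl-cdip-hocr/")
```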

Done!
