Spark OCR: How to calculate cluster size for processing 1 million documents

Input:
1 million documents, 4 pages per document on average, processed with a typical Spark OCR pipeline from John Snow Labs.

Processing one page on a single CPU core takes from 2 to 10 seconds, depending on:

  • format of document
  • resolution of image
  • quality of image
  • number of steps in pipeline

You can find examples of different pipelines in the workshop repo.

So let's take 5 seconds per page as the average and calculate the total time in seconds and hours:
1,000,000 × 4 × 5 = 20,000,000 seconds ≈ 5,555 hours
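This back-of-the-envelope estimate can be sketched in a few lines of Python (the 5 s/page figure is the assumed average from above; adjust it for your own pipeline):

```python
# Total single-CPU processing time, assuming 5 s/page on average.
docs = 1_000_000
pages_per_doc = 4
seconds_per_page = 5  # assumed average; real pipelines range from 2 to 10 s

total_seconds = docs * pages_per_doc * seconds_per_page
total_hours = total_seconds // 3600
print(total_seconds, total_hours)  # 20000000 seconds, 5555 hours
```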

To start, let's use a single-node cluster with 32 CPUs:

5555 / 32 ≈ 174 hours

With 8 worker nodes of 32 CPUs each:
5555 / (32 × 8) ≈ 22 hours

So to process 1 million documents with Spark OCR, 8 worker nodes with 32 CPUs each are enough to finish in about 1 day, or a single-node cluster in about 1 week.
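The cluster-size arithmetic above generalizes to any node count. A minimal sketch, assuming perfect linear scaling across CPU cores (an idealization — real jobs lose some time to scheduling and I/O):

```python
# Wall-clock estimate for different cluster sizes, assuming the
# ~5555 single-CPU hours computed earlier and ideal parallelism.
TOTAL_HOURS_ONE_CPU = 20_000_000 / 3600  # ~5555 h for 1M docs x 4 pages x 5 s

def wall_clock_hours(nodes: int, cpus_per_node: int) -> float:
    """Estimated wall-clock hours when work is split evenly over all CPUs."""
    return TOTAL_HOURS_ONE_CPU / (nodes * cpus_per_node)

print(round(wall_clock_hours(1, 32)))  # single 32-CPU node: ~174 h (~1 week)
print(round(wall_clock_hours(8, 32)))  # 8 nodes x 32 CPUs: ~22 h (~1 day)
```

The same function lets you answer the inverse question, e.g. how many 32-CPU nodes you would need to finish within a given deadline.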