Spark OCR: How to calculate cluster size for processing 1 million documents
Input:
1 million documents, 4 pages per document on average, a typical Spark OCR pipeline from John Snow Labs.
Processing one page on one CPU takes from 2 to 10 seconds, depending on the following factors:
- document format
- image resolution
- image quality
- number of steps in the pipeline
Examples of different pipelines can be found in the workshop repo.
So let's take 5 seconds per page as the average and calculate the total CPU time in seconds and hours:
1M * 4 * 5 = 20M seconds ≈ 5,555 hours
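The back-of-envelope estimate above can be sketched as a small Python calculation (the 5 seconds per page is the assumed average from the range given earlier):

```python
# Total CPU time for the OCR workload.
DOCS = 1_000_000          # total documents
PAGES_PER_DOC = 4         # average pages per document
SECONDS_PER_PAGE = 5      # assumed average per-page processing time

total_seconds = DOCS * PAGES_PER_DOC * SECONDS_PER_PAGE
total_hours = total_seconds / 3600

print(total_seconds)      # 20000000 seconds
print(int(total_hours))   # 5555 hours of single-CPU work
```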
To start, let's use a single-node cluster with 32 CPUs:
5555 / 32 ≈ 174 hours (about a week)
With 8 worker nodes of 32 CPUs each:
5555 / (32 * 8) ≈ 22 hours
So, to process 1 million documents with Spark OCR, 8 worker nodes with 32 CPUs each are enough to finish in about a day, or a single-node cluster in about a week.
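The sizing logic above can be generalized into a small helper. This is a sketch under the simplifying assumption of perfect parallelism (every core stays busy, no scheduling or I/O overhead); the function name and parameters are illustrative, not part of the Spark OCR API:

```python
def hours_to_process(docs: int, pages_per_doc: float,
                     seconds_per_page: float, total_cores: int) -> float:
    """Estimated wall-clock hours, assuming all cores stay fully utilized."""
    return docs * pages_per_doc * seconds_per_page / total_cores / 3600

# Single 32-CPU node vs. 8 worker nodes with 32 CPUs each.
single_node = hours_to_process(1_000_000, 4, 5, 32)
eight_nodes = hours_to_process(1_000_000, 4, 5, 32 * 8)

print(round(single_node))  # 174 hours, about a week
print(round(eight_nodes))  # 22 hours, about a day
```

In practice real throughput will be somewhat lower than this ideal, so treat the result as a lower bound when provisioning.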