I am working on a project to digitize customs documents, and I can say that it was inspired by Thunk. I’m almost done. I am writing it in Python.
Step 1: Function1 downloads the attachments of emails in Microsoft Outlook to disk. It automatically creates folders of the form ‘\Address\Domain\Sender’sMailAddress’ and saves the attachments there.
Step 2: Function2 scans these folders (or the subfolders of specified folders) and uses Tesseract-OCR to decide whether a PDF document is a customs document. The code adds the customs documents it finds to another folder.
Step 3: Function3 names each file, using the information within the customs document, as ‘date_exported-companyName_DeclarationID.PDF’.
Step 4: Function4 reads a manually edited CSV file that lists which fields are stored at which coordinates, and retrieves that information from the PDF with PDFMiner (a Python library). (If the PDF file is an image rather than text, another library is used.)
“CSV structure is: fieldName, coordinates”
fieldName, x0, y0, x1, y1
Declaration type, 344, 728, 358, 741
Declaration ID, 420, 681, 513, 692
Declaration DATE, 330, 150, 390, 165
Currency of the declaration, 418, 726, 485, 740
…
The function loops over the CSV file, writes the field names to an Excel file as column headers, and writes the text found at each field’s coordinates to the matching column. Another function then sends the data to an ERP program that has win32com support.
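The CSV-driven lookup described above can be sketched roughly as follows. This is only an illustration, not my actual code: the function names, the sample text items, and their values are made up, and it assumes the PDF text has already been pulled out as (x0, y0, x1, y1, text) tuples (PDFMiner layout objects expose a bounding box like this, but the extraction call itself is left out here).

```python
import csv
import io

# Sample rows in the CSV layout shown above (header, then one box per field).
SAMPLE_CSV = """fieldName, x0, y0, x1, y1
Declaration type, 344, 728, 358, 741
Declaration ID, 420, 681, 513, 692
…
"""

# Hypothetical text items, as if already extracted from one PDF page.
SAMPLE_ITEMS = [
    (345.0, 729.0, 357.0, 740.0, "EX"),
    (421.0, 682.0, 510.0, 691.0, "24TR0001"),
    (10.0, 10.0, 20.0, 20.0, "noise outside every box"),
]


def load_field_boxes(csv_text):
    """Parse 'fieldName, x0, y0, x1, y1' rows into {field: (x0, y0, x1, y1)}."""
    boxes = {}
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) != 5:
            continue  # ignore blank or incomplete rows
        name, *coords = row
        try:
            boxes[name.strip()] = tuple(float(c) for c in coords)
        except ValueError:
            continue  # ignore rows whose coordinates are not numbers
    return boxes


def text_in_box(items, box):
    """Join the text of every item whose bounding box lies fully inside `box`."""
    bx0, by0, bx1, by1 = box
    hits = [
        (x0, y0, text)
        for (x0, y0, x1, y1, text) in items
        if x0 >= bx0 and y0 >= by0 and x1 <= bx1 and y1 <= by1
    ]
    # PDF y coordinates grow upward, so sort top to bottom, then left to right.
    hits.sort(key=lambda h: (-h[1], h[0]))
    return " ".join(text.strip() for _, _, text in hits)


def extract_fields(items, boxes):
    """Return {fieldName: text found at that field's coordinates}."""
    return {name: text_in_box(items, box) for name, box in boxes.items()}
```

Calling `extract_fields(SAMPLE_ITEMS, load_field_boxes(SAMPLE_CSV))` gives one dictionary per document, which can then be written out as one Excel row with the field names as column headers.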
Now let’s get to the main event. From what I have read on Yahoo and various Medium blogs, the OCR market is projected to reach 40 billion dollars by 2030.
Here is a link on the economic scale of OCR by 2030: “Smart OCR – Advancing the Use of Artificial Intelligence with Open Data” (New Jersey State Policy Lab), which quotes an estimate by Straits Research.
So the cake is big. You decide if it’s worth the effort.
But what I want to ask is this:
Add a new feature to Thunk (I don’t think AI will do this):
a- Let’s upload the PDF file to Thunk for training
b- Let’s choose which information is stored in which coordinates. (It can be with the help of the mouse. Or another way.)
c- Let’s match the coordinates with our own database table.
d- Then, let’s send the document to Thunk via API or by e-mail.
e- Thunk sends the PDF file, fields, and coordinates it received via API or e-mail to the AI.
f- Let AI send the data to the application using our application’s API.
Now, this is a model: there are n document types, n databases, n customers. Using such a model, can we turn Thunk into an OCR server?
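To make the model a bit more concrete, here is a minimal sketch of how steps b and c (coordinates per document type, matched to a customer’s database table) might be stored. Everything in it is hypothetical: the class names, the customer, the table, and the column names are invented for illustration, and nothing here is an existing Thunk API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FieldMapping:
    name: str        # field label on the document, e.g. "Declaration ID"
    box: tuple       # (x0, y0, x1, y1) chosen by the user in step b
    db_column: str   # column in the customer's own table (step c)


@dataclass
class Template:
    customer: str
    doc_type: str
    table: str
    fields: list = field(default_factory=list)


class TemplateRegistry:
    """Keeps one template per (customer, document type) pair."""

    def __init__(self):
        self._templates = {}

    def register(self, template):
        self._templates[(template.customer, template.doc_type)] = template

    def to_row(self, customer, doc_type, extracted):
        """Rename extracted field values to the customer's column names."""
        template = self._templates[(customer, doc_type)]
        row = {m.db_column: extracted.get(m.name) for m in template.fields}
        return template.table, row


# Hypothetical customer and column names, for illustration only.
registry = TemplateRegistry()
registry.register(
    Template(
        customer="acme",
        doc_type="export_declaration",
        table="declarations",
        fields=[
            FieldMapping("Declaration type", (344, 728, 358, 741), "decl_type"),
            FieldMapping("Declaration ID", (420, 681, 513, 692), "declaration_id"),
        ],
    )
)
```

With n customers and n document types, each pair simply gets its own `Template`, and `registry.to_row(...)` turns the values read from a document into a row that could be sent to that customer’s application in step f.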