AWS extract text from PDF

11/23/2023

Textract offers a number of alternatives for using OCR to extract structured text, forms and tabular data. One option is to manually upload up to 10 pages and get back a response. A second option allows up to 1,000 pages a month in PNG format for the first three months. This option also doesn’t require uploading to an S3 bucket.

Extracting from bulk PDFs, which we used, costs $0.015 per page up to 1 million pages using the asynchronous API on documents which are in an S3 bucket. A logical workflow seemed to be to try tabulizer for free, where possible, and then pay for cases where the document can’t be extracted with tabulizer or the error rate is expected to be high. In this case, we will show how to subset five tables from the Attleboro CAFR, which failed to scrape three out of five desired fields, and from Hudson, MA, where the PDF couldn’t be found on the town’s website and is probably an image. In our full project, we aggregated five pages from every CAFR in Massachusetts (a 30 MB file for 700 pages) for a total cost of $11.

    # Extract 5 pages from Attleboro and Hudson CAFR PDFs with pdftools
    path <- "/Users/davidlucey/Desktop/David/Projects/mass_munis/"
    as.integer(names(readRDS(paste0(path, "mass.RDS"))]]))

Setting up an S3 Bucket and Uploading a PDF

We then input our AWS credentials and establish an S3 response object (s3), which we use to instruct AWS to create an S3 bucket, and then upload our subset file of PDFs to S3. When setting bucket names, it is important to stick to lowercase letters, numbers, periods and hyphens, because names with other punctuation or uppercase letters will be rejected. Another mistake we made was uploading the PDF from outside our current working directory: S3 created a directory structure to match our disk, and Textract seemed to be unable to navigate the file structure to find the document. We named our bucket “munisubset”.

Setting up Textract Object and Calling Start Document Analysis

Next, we set up a Textract response object (svc) and use start_document_analysis() to process the pages. Note that we select TABLES, but other parameters are FORMS or FORMS | TABLES. It is possible to get help by using ?textract or ?start_document_analysis just like any other function in R. The docs say that start_document_analysis() uses asynchronous analysis to look for relationships between key-value pairs. Running Textract on our 700 pages took more than an hour, so another step would be to figure out how to be notified of the completion with AWS SNS. The call returns a JobId (shown below) right away, and this is required to get the analysis in the next step once the job has finished.

    # Set up Amazon Textract object
    access_key_id = key_get("AWS_ACCESS_KEY_ID"),
    secret_access_key = key_get("AWS_SECRET_ACCESS_KEY")
    # Textract function is "start_document_analysis", which runs asynchronously for PDF
    # Output is JobId used for "get_document_analysis"

    "fee4fb4042e7b5d21949d17b211e6bdbc6a3441939d28480f0e858ac98f1e0a5"

We then call paws get_document_analysis() using the JobId we received back from Textract above. A few things to mention: Textract stores pages from all of the documents together in “Blocks” when called in bulk from a PDF. We searched around, but it doesn’t seem possible to download the whole job with all of the pages at one time. The only way we could figure out to get the data back into our environment was to while loop over get_document_analysis() in increments of 1,000, the maximum. This also took time, and our 700 pages of tables came back as over 400,000 blocks commingled together in a json object. To give an idea of the sizes involved, the full aggregated PDF resulted in a 210 MB json once we had called for all the Blocks.

In the next step, we have our work cut out to extract the key elements from the json. A json is a common object for API calls, but when we were introduced to one a few years ago, it seemed to be a hopelessly impenetrable data structure, and one to be avoided if at all possible. Fortunately, time has moved on, and like many things, it might be possible now.

In the next post, we will attempt to reconstitute the Blocks into their original pages, and then parse out the desired elements for comparison.
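The paging loop over get_document_analysis() described above can be sketched roughly as follows. This is a minimal sketch and not the code from our project: collect_blocks() and fetch_page are hypothetical names, and the Blocks and NextToken fields are assumed to follow the shape of the Textract response. With paws, fetch_page would presumably wrap svc$get_document_analysis(JobId = job_id, MaxResults = 1000, NextToken = token), leaving NextToken out on the first call. Here a small stand-in serves fake blocks so the loop can be run without an AWS account, and repeat is used in place of while only to keep the stopping rule in one place.

```r
# Collect every Block from a paginated result.
# `fetch_page` is a hypothetical function: given a token (NULL on the first
# call) it returns a list with `Blocks` and, if more pages remain, `NextToken`.
collect_blocks <- function(fetch_page) {
  blocks <- list()
  token <- NULL
  repeat {
    resp <- fetch_page(token)
    blocks <- c(blocks, resp$Blocks)
    token <- resp$NextToken
    # Stop when no continuation token comes back (NULL, zero-length, or "").
    if (is.null(token) || length(token) == 0 || !nzchar(token)) break
  }
  blocks
}

# Stand-in for the API (an assumption for illustration only):
# 2,500 fake blocks served 1,000 at a time.
fake_blocks <- lapply(seq_len(2500), function(i) list(Id = i, BlockType = "CELL"))
fake_fetch <- function(token) {
  start <- if (is.null(token)) 1L else as.integer(token)
  end <- min(start + 999L, length(fake_blocks))
  out <- list(Blocks = fake_blocks[start:end])
  # Only hand back a token while there is something left to fetch.
  if (end < length(fake_blocks)) out$NextToken <- as.character(end + 1L)
  out
}

all_blocks <- collect_blocks(fake_fetch)
length(all_blocks)
```

At 1,000 blocks per call, our 400,000 blocks would need at least 400 calls. Growing the list with c() keeps the sketch short, but it is probably not the fastest way to accumulate that many elements.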
