Friday, 7 June 2024

Extract blob data (PDF) from CAPM using python library of Document information extraction service

Introduction:


Document Information Extraction helps you to process large amounts of business documents that have content in headers and tables. You can use the extracted information, for example, to automatically process payables, invoices, or payment notes while making sure that invoices and payables match. After you upload a document file to the service, it returns the extraction results from header fields and line items. 

Use case:


1. Extract documents (invoice details) from an application in which they are maintained as attachments and stored as BLOB objects in HANA database tables.
2. Transform the information retrieved from the BLOB object into a format suitable for further analysis before loading it into a HANA database.

Key services used in this solution: 


1. SAP Document Information Extraction – AI Business Service.
2. SAP Cloud Foundry – Runtime Environment.
3. SAP Business Application Studio – Development Environment. 
4. SAP HANA Cloud – Database to store extracted information. 

This blog focuses primarily on how to read the invoice file stored as a BLOB and extract the required information using the Python client library for the SAP AI Business Service Document Information Extraction.

CAPM (Cloud Application Programming Model) Application: 


Create a simple CAPM application with a UI to upload and maintain an invoice file as an attachment. The objective of this application is to show how to define a field as an attachment, so that an uploaded file is stored as a BLOB object in the backend HANA table.

Prerequisite: 

  • Log on to the BTP trial cockpit -> click "Go to Your Trial Home" -> click the subaccount "trial".
  • Click "Services" in the left-hand panel, then click "Instances and Subscriptions".
  • Under "Subscriptions", you can now see SAP Business Application Studio. Click the link to open it. Business Application Studio (BAS) opens in another browser tab.
  • Log in to BAS with your credentials and click “Create Dev Space”. Here I use “Local” as the dev space name and “Full Stack Cloud Application” as the application type.
  • Once the dev space is up and running, Business Application Studio is ready for application development.

Step1: Create Project 

  • Click the three-line (menu) button -> choose File -> select “New Project from Template”.
  • Select the “CAP Project” template and click Next.
  • Enter the project name (CAPMDOCEXT), add the features “Configuration for HANA deployment” and “MTA based BTP deployment”, and click Finish to create the CAPM project.

Step2: Create DB, Service and UI Artifacts

  • Create a file with the extension .cds under the db folder to hold the database-related content. Here I use “docext_schema” as the file name.
  • Add the code shown in the image below to the file “docext_schema.cds”.

  • Document_uploaded is the column/attribute that holds the file uploaded via the UI as a BLOB in the HANA table.
  • The Filename column holds the name of the uploaded file.
  • The Mediatype column holds the media type (format) of the uploaded file.
  • Add the code shown in the screenshot below to the docext_service.cds file under the srv folder to create the service for the application.

  • capmdocext-db is the HANA HDI service instance created for this application.
  • Bind the application to the HANA HDI service instance.
  • Create the Fiori application as follows: right-click the mta.yaml file -> select “Create MTA Module from Template” -> click “SAP Fiori application” -> select “List Report Page”.
  • Configure the data source and deployment target for the Fiori application as shown in the screenshot below.

Step3: Run and Test the CAPM application from Local. 

◉ Run the command cds watch --profile hybrid to launch the application locally. This starts the CAP service on your machine and binds it to the remote HANA instance.

◉ Click the Create button to upload the invoice file into the CAPM application, as shown in the screenshot below. Here sampleinvoice.pdf is used for testing.

◉ The screenshot below shows the file uploaded via Fiori, stored as a BLOB in the backend table of the HANA HDI container.

Add the deployment configuration for the CAPM application and deploy it to Cloud Foundry.

Document Information Extraction using python library: 

Step4: Setup Document extraction service, upload sample file and validate the fields. 

  • Go to the BTP account and click Boosters in the navigation sidebar.
  • Select “Set up account for Document Information Extraction” and click Start to create the service.

  • Confirm that the Document Information Extraction service and the Document Information Extraction Trial UI are available in the subaccount.
  • Assign the Document Information Extraction roles (Document_Information_Extraction_UI_End_User_trial, Document_Information_Extraction_UI_Document_Viewer_trial and Document_Information_Extraction_UI_Templates_Admin_trial) to the user.

  • The following steps show how to upload a file manually and validate the extracted information using the Document Information Extraction UI.
  • Click “Document Information Extraction Trial” to open the UI.
  • Click the + button at the top right of the UI to upload the invoice file selected for validation.
  • Choose Invoice as the document type and upload the file (sampleinvoice.pdf).
  • Select the header fields and line-item fields to be extracted from the invoice and click Confirm.
  • Once the status changes from Pending to Ready, click “Extraction Results” to preview the values extracted from the file and confirm that they match the PDF content.

Step5: Get the value from Document extraction service key to establish connectivity.  

  • The DOX API Python library is used to establish connectivity to the Document Information Extraction service. Import the client in your Python program with the statement “from sap_business_document_processing import DoxApiClient”.
  • The following four values are needed to communicate with the Document Information Extraction REST API:
    • url: The URL of the service deployment, provided at the outermost level of the service key JSON file.
    • uaa_url: The URL of the UAA server used for authentication, provided in the uaa part of the service key JSON file.
    • clientid: The client ID used for authentication to the UAA server, provided in the uaa part of the service key JSON file.
    • clientsecret: The client secret used for authentication to the UAA server, provided in the uaa part of the service key JSON file.
  • Click “View Credentials” on the Document Information Extraction service instance to get the parameter values from the service key.
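
As a minimal sketch of where the four values sit, the snippet below reads them from a service key saved as JSON. The key layout (url at the top level; url, clientid and clientsecret under uaa) follows the description above. The helper name read_dox_credentials and all sample values are made up for illustration.

```python
import json


def read_dox_credentials(service_key_json: str) -> dict:
    """Pick the four connection values out of a service key JSON string."""
    key = json.loads(service_key_json)
    uaa = key["uaa"]
    return {
        "url": key["url"],                    # service URL, outermost level
        "uaa_url": uaa["url"],                # UAA server URL
        "client_id": uaa["clientid"],         # OAuth client ID
        "client_secret": uaa["clientsecret"], # OAuth client secret
    }


# Made-up service key with placeholder values, for illustration only.
sample_key = json.dumps({
    "url": "https://dox.example.invalid",
    "uaa": {
        "url": "https://uaa.example.invalid",
        "clientid": "my-client-id",
        "clientsecret": "my-client-secret",
    },
})

creds = read_dox_credentials(sample_key)
print(creds["url"], creds["uaa_url"])
```

In a deployed application the secret would normally come from a service binding or an environment variable rather than from a file checked into the project.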

Step6: Create a python application to read the invoice file maintained as blob in application db. 

  • Create a folder in your CAPM project to hold the Python microservice artifacts. Here I use the “pythonapp” folder for all artifacts related to the Python app.
  • Create a manifest.yml file as shown in the screenshot below. The HANA HDI service created for the CAPM application is configured as a service in the yml file, and the application name is set to “blobextract”.

◉ Create the blobextract.py file and add the Python code that reads the BLOB object and extracts the invoice details from the file.

1. Import the required libraries: the Document Information Extraction client (to connect and upload the file), the HANA database client, the Flask web framework, pandas, and so on.

2. Add the code below to connect to the HANA HDI container (capmdocext-db), query the table column in which the uploaded files are stored as BLOB objects, and preview the file in a web browser.
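
The original code is only available as a screenshot, so here is a rough sketch of the database part under a few assumptions: the HDI credentials are taken from the VCAP_SERVICES environment variable that Cloud Foundry sets for bound services, the connection is opened with hdbcli, and the BLOB value comes back as a bytes-like object. The helper names and the table name passed by the caller are hypothetical; the column names follow the attributes from step 2. The Flask route that returns the bytes to the browser is not shown.

```python
import json


def hana_credentials(vcap_services_json: str, instance_name: str) -> dict:
    """Return the credentials block of the bound service instance with this name."""
    services = json.loads(vcap_services_json)
    for instances in services.values():
        for instance in instances:
            if instance.get("name") == instance_name:
                return instance["credentials"]
    raise LookupError(f"service instance {instance_name!r} is not bound")


def connect_hana(credentials: dict):
    """Open a HANA connection with hdbcli (imported lazily, needs the hdbcli package)."""
    from hdbcli import dbapi

    return dbapi.connect(
        address=credentials["host"],
        port=int(credentials["port"]),
        user=credentials["user"],
        password=credentials["password"],
        currentSchema=credentials["schema"],  # assumed: HDI container schema
        encrypt=True,
    )


def fetch_document(connection, table: str, filename: str):
    """Return (content_bytes, media_type) for the stored file, or None if not found."""
    # The table name is inserted into the SQL text, so it must come from trusted
    # configuration; the file name is passed as a bound parameter.
    sql = f'SELECT DOCUMENT_UPLOADED, MEDIATYPE FROM "{table}" WHERE FILENAME = ?'
    cursor = connection.cursor()
    try:
        cursor.execute(sql, (filename,))
        row = cursor.fetchone()
    finally:
        cursor.close()
    if row is None:
        return None
    return bytes(row[0]), row[1]
```

On Cloud Foundry this could be wired up as connect_hana(hana_credentials(os.environ["VCAP_SERVICES"], "capmdocext-db")). The actual table name depends on the namespace and entity name used in docext_schema.cds.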

◉ Create a runtime.txt file and specify the Python runtime version on which your application will run.

◉ Create requirements.txt and list all dependencies as shown in the screenshot below.

◉ Deploy the Python application to Cloud Foundry by running the command “cf push” from the pythonapp root folder.

◉ https://********.cfapps.eu10.hana.ondemand.com is the URL of the application deployed in Cloud Foundry.
◉ To preview the file, open a browser and enter the URL below, which adds the path preview and passes the name of the file uploaded in CAPM as the input parameter: https://********.cfapps.eu10.hana.ondemand.com/preview/filename=sampleinvoice.pdf

Step7: Extend the python code to upload the invoice file maintained as blob into document information extraction service and load extracted information into HANA schema. 

◉ Add the code shown in the screenshots below to open the file stored as a BLOB, connect to the Document Information Extraction service, upload the file to the service, extract the defined header and line-item fields, connect to the HANA staging schema, and load the extracted information into a HANA table.
◉ Establish the connection to the Document Information Extraction service by passing (url, client_id, client_secret, uaa_url) to DoxApiClient. (Refer to step 5 for how to obtain these parameters.)
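
A minimal sketch of this connection step is shown below. It passes the four values to DoxApiClient in the order given above and reads them from environment variables so that the secret does not end up in the source code. The environment variable names and the helper name create_dox_client are my own choice for illustration, not something the library prescribes.

```python
import os


def create_dox_client():
    """Build a DoxApiClient from environment variables (variable names are hypothetical)."""
    # Imported inside the function so this module still loads where the
    # sap-business-document-processing package is not installed.
    from sap_business_document_processing import DoxApiClient

    return DoxApiClient(
        os.environ["DOX_URL"],            # url from the service key
        os.environ["DOX_CLIENT_ID"],      # uaa.clientid
        os.environ["DOX_CLIENT_SECRET"],  # uaa.clientsecret
        os.environ["DOX_UAA_URL"],        # uaa.url
    )
```

On Cloud Foundry the variables could be set with cf set-env or in the env section of manifest.yml.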

◉ Define the fields (columns) to be extracted as shown in the screenshot below.

◉ Pass the file name, header fields, line-item fields and document type as shown in the screenshot below to extract the information from the invoice file.
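
Since the screenshot is not available, the sketch below illustrates one way this call could look. It assumes that the client method is extract_information_from_document and that it takes a file path plus the keyword arguments shown, which is how I understand the library's README; please check this against the version you install. Because the invoice is held in memory as BLOB bytes, the sketch first writes it to a temporary file. The field lists are examples only, and the helper names are hypothetical.

```python
import os
import tempfile

# Example field names; use the ones you selected for your invoices.
HEADER_FIELDS = ["documentNumber", "documentDate", "senderName", "grossAmount", "currencyCode"]
LINE_ITEM_FIELDS = ["description", "quantity", "unitPrice", "netAmount"]


def write_temp_pdf(content: bytes) -> str:
    """Write the BLOB bytes to a temporary .pdf file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
        handle.write(content)
        return handle.name


def extract_invoice(dox_client, content: bytes) -> dict:
    """Upload the BLOB content to the service and return the extraction result."""
    path = write_temp_pdf(content)
    try:
        # Assumed method name and keyword arguments; verify against the library.
        return dox_client.extract_information_from_document(
            path,
            client_id="default",      # assumed: a DOX client with ID "default" exists
            document_type="invoice",
            header_fields=HEADER_FIELDS,
            line_item_fields=LINE_ITEM_FIELDS,
        )
    finally:
        os.remove(path)  # remove the temporary copy once the call has returned
```

Note that client_id here is the Document Information Extraction client (a data partition inside the service), not the OAuth clientid from the service key.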

◉ Connect to the HANA schema to load the extracted information into a HANA table.

◉ Here only the extracted header information is loaded; the same logic can be applied to load the invoice line-item data.
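
To illustrate how the header and line-item data can be pulled out of the result, here is a small sketch. It assumes the result is a dictionary with an extraction entry that contains headerFields (a list of fields) and lineItems (one list of fields per item), each field carrying a name and a value. The sample data is made up and the helper names are hypothetical.

```python
def header_row(result: dict) -> dict:
    """Map each extracted header field name to its value."""
    return {field["name"]: field.get("value")
            for field in result["extraction"]["headerFields"]}


def line_item_rows(result: dict) -> list:
    """Return one name-to-value dictionary per extracted line item."""
    return [
        {field["name"]: field.get("value") for field in item}
        for item in result["extraction"]["lineItems"]
    ]


# Made-up result in the assumed shape, for illustration only.
sample_result = {
    "status": "DONE",
    "extraction": {
        "headerFields": [
            {"name": "documentNumber", "value": "INV-001", "confidence": 0.98},
            {"name": "grossAmount", "value": 119.0, "confidence": 0.95},
        ],
        "lineItems": [
            [{"name": "description", "value": "Widget"}, {"name": "quantity", "value": 2}],
            [{"name": "description", "value": "Gadget"}, {"name": "quantity", "value": 1}],
        ],
    },
}

print(header_row(sample_result))
print(line_item_rows(sample_result))
```

Each line-item dictionary can be loaded in the same way as the header row, typically together with the document number as a key.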

◉ Add the code below to load the extracted data into the HANA schema.
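
The original load code is not reproduced here, so the following is only a sketch of one possible approach: a plain DB-API insert with bound parameters instead of a pandas-based load. It works with any connection that uses ? placeholders, which I assume holds for the hdbcli connection. The function name, table name and column names are hypothetical.

```python
def insert_row(connection, table: str, row: dict) -> None:
    """Insert one row; values are bound as parameters, never pasted into the SQL text."""
    columns = list(row)
    # Table and column names are quoted identifiers and must come from trusted code.
    column_list = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    sql = f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})'
    cursor = connection.cursor()
    try:
        cursor.execute(sql, tuple(row[name] for name in columns))
    finally:
        cursor.close()
    connection.commit()
```

A call could look like insert_row(connection, "INVOICE_HEADER", {"DOCUMENT_NUMBER": "INV-001", "GROSS_AMOUNT": 119.0}). Quoted identifiers are case-sensitive in HANA, so the names must match the table definition exactly.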

◉ Refer to the screenshots below for the complete code that extracts the file information using the Document Information Extraction service, loads the extracted data into the HANA table, and returns the data stored in the table as output.

◉ Push the Python application with the newly added code, which uploads the document to the Document Information Extraction service, extracts the invoice details, and loads them into the HANA DB.

◉ To upload the document to the Document Information Extraction service and load the extracted data into the invoice table in the HANA DB, open a browser and enter the URL below, which adds the path extract and passes the name of the file uploaded in CAPM as the input parameter: https://********.cfapps.eu10.hana.ondemand.com/extract/filename=sampleinvoice.pdf

◉ The sampleinvoice.pdf file maintained as an attachment in the CAPM application is read and uploaded to the Document Information Extraction service by the Python microservice.

◉ The information extracted by the Document Information Extraction service is loaded into the HANA DB by the Python code.
