
LlamaParse: Incredibly good at parsing PDFs

 What is LlamaParse?

LlamaParse is a proprietary parsing service that is incredibly good at parsing PDFs with complex tables into a well-structured markdown format.

It integrates directly with LlamaIndex ingestion and retrieval, letting you build retrieval over complex, semi-structured documents. The service is intended to help answer complex questions that were not answerable before. It is available in public preview: open to everyone, with a usage limit of 1,000 pages per day and 7,000 free pages per week; beyond that, it costs $0.003 per page ($3 per 1,000 pages). It operates as a standalone service and can also be plugged into the managed ingestion and retrieval API.
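To make the pricing concrete, here is a rough sketch of the arithmetic. It assumes the 7,000 free pages are subtracted from the weekly total before billing starts and it ignores the daily limit; the function name and parameters are made up for illustration and are not part of any LlamaParse API.

```python
# Rough cost estimate based on the figures quoted above.
# Assumption: the free weekly allowance is deducted first, then every
# remaining page is billed at the per-page price.

FREE_PAGES_PER_WEEK = 7_000
PRICE_PER_PAGE_USD = 0.003


def estimate_weekly_cost(pages_this_week: int,
                         free_pages: int = FREE_PAGES_PER_WEEK,
                         price_per_page: float = PRICE_PER_PAGE_USD) -> float:
    """Return the estimated cost in USD for one week of parsing."""
    billable_pages = max(0, pages_this_week - free_pages)
    return billable_pages * price_per_page


if __name__ == "__main__":
    for pages in (5_000, 7_000, 10_000):
        print(f"{pages} pages -> ${estimate_weekly_cost(pages):.2f}")
```

Under these assumptions, anything up to 7,000 pages in a week costs nothing, and only the pages beyond that are charged.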

Currently, LlamaParse primarily supports PDFs with tables. The team is also building better support for figures and for an expanded set of popular document types (.docx, .pptx, .html) as part of the next enhancements.

Code Implementation:

  1. Install required dependencies:
    a) Create requirements.txt in the root of your project and add these dependencies:
    o llama-index
    o llama-parse
    o python-dotenv
    b) From the folder that contains “requirements.txt”, run “pip install -r requirements.txt” in a terminal to download and install these dependencies.
  2. Set up the environment variables:
    a) Create a “.env” file in the root of your project.
    b) Add these variables there, replacing the placeholder values with your own keys:
    LLAMA_CLOUD_API_KEY="PassYourLLAMACloudAPIKey"
    OPENAI_API_KEY="PassYourOpenAIAPIKey"
  3. Create a folder named “data” in the root of your project and add the PDF file that you would like to parse, for example “MLFwk.pdf”.

  4. Create a file “demo.py” and add the code below. Note*: the two queries at the end relate to the PDF that I placed under the “data” folder; update the file path and the queries to match the PDF that you keep under your own “data” folder.

 

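# Load LLAMA_CLOUD_API_KEY and OPENAI_API_KEY from the .env file
from dotenv import load_dotenv
load_dotenv()

# Allow LlamaParse's async calls to run even when an event loop
# is already active (for example, inside a notebook)
import nest_asyncio
nest_asyncio.apply()

from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex

# Parse the PDF into markdown documents
documents = LlamaParse(result_type="markdown").load_data("./data/MLFwk.pdf")

# Build a vector index over the parsed documents and create a query engine
llama_parse_index = VectorStoreIndex.from_documents(documents)
llama_parse_query_engine = llama_parse_index.as_query_engine()

# Queries specific to MLFwk.pdf; replace them with questions about your own PDF
print(llama_parse_query_engine.query(
    "What are the three key factors that drive AI booming?"
))

print(llama_parse_query_engine.query(
    "Is Neural Networks allowed for 'ML framework (PyTorch, TensorFlow) framework?'"
))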

 

 
  5. Execute demo.py using the command “python .\demo.py”.
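If the script fails with an authentication error, the usual cause is a key that was not picked up from “.env”. The sketch below is one way to check that before running the demo; the helper name missing_keys is hypothetical, and it only relies on the two variable names used in step 2.

```python
import os

# The two variables that step 2 asks you to define in .env
REQUIRED_KEYS = ("LLAMA_CLOUD_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key, "").strip()]


if __name__ == "__main__":
    # In demo.py you would call load_dotenv() first, then run this check.
    missing = missing_keys()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required keys are set.")
```

Passing a plain dictionary as env makes the helper easy to try out without touching your real environment.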

For reference, both queries returned the correct answers, since the information was present in the “MLFwk.pdf” file that I kept under the “data” folder.
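To judge the parsing quality yourself, it can help to look at the markdown that came back before it is indexed. The sketch below joins the text of the parsed documents into one string; it assumes each document exposes its content through a .text attribute, as LlamaIndex Document objects do, and it uses stand-in objects so that it runs without an API key. The helper name combine_markdown is made up for illustration.

```python
from types import SimpleNamespace


def combine_markdown(documents, separator="\n\n"):
    """Join the .text of each parsed document into one markdown string."""
    return separator.join(doc.text for doc in documents)


if __name__ == "__main__":
    # Stand-ins for the objects returned by load_data(); in demo.py you
    # would pass the real parsed documents instead.
    fake_documents = [
        SimpleNamespace(text="# Page 1\n\nSome heading text"),
        SimpleNamespace(text="| Framework | Supported |\n|---|---|\n| PyTorch | Yes |"),
    ]
    print(combine_markdown(fake_documents))
```

Printing or saving this string shows whether the tables in your PDF came through as markdown tables.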





Conclusion:

From the above implementation, we can see that LlamaParse parsed the sample PDF into well-structured markdown from which the query engine answered both questions correctly. This supports the claim that it is very good at parsing PDFs with complex tables.

