
LlamaParse: Incredibly good at parsing PDFs

 What is LlamaParse?

LlamaParse is a proprietary parsing service that is incredibly good at parsing PDFs with complex tables into a well-structured markdown format.

It integrates directly with LlamaIndex ingestion and retrieval, so you can build retrieval over complex, semi-structured documents. The service is promoted as able to answer complex questions that were not answerable before. It is available in public preview: anyone can use it, subject to a usage limit of 1,000 pages per day, with 7,000 free pages per week. Beyond the free allowance, it costs $0.003 per page ($3 per 1,000 pages). It operates as a standalone service that can also be plugged into the managed ingestion and retrieval API.
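To make the pricing concrete, here is a rough sketch of the arithmetic. It assumes that only pages beyond the 7,000 free pages in a week are billed, at $3 per 1,000 pages; `estimate_weekly_cost` is a hypothetical helper written for this post, not part of LlamaParse, and the actual billing rules may differ.

```python
# Figures quoted in the text above; the billing model itself is an assumption.
FREE_PAGES_PER_WEEK = 7_000
PRICE_PER_1000_PAGES_USD = 3  # equivalent to $0.003 per page


def estimate_weekly_cost(pages_parsed):
    """Estimate the USD cost for one week, billing only pages past the free allowance."""
    billable_pages = max(0, pages_parsed - FREE_PAGES_PER_WEEK)
    return billable_pages * PRICE_PER_1000_PAGES_USD / 1000
```

Under these assumptions, parsing 10,000 pages in a week leaves 3,000 billable pages, so `estimate_weekly_cost(10_000)` returns 9.0.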

Currently, LlamaParse primarily supports PDFs with tables. Better support for figures and for an expanded set of the most popular document types (.docx, .pptx, .html) is planned as part of the next enhancements.
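Since PDFs are the primary supported format for now, it can help to pick out only the PDF files from an input folder before parsing. The following is a minimal sketch using only the Python standard library; `find_pdfs` is a hypothetical helper name, not something LlamaParse provides.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the sorted paths of the .pdf files directly inside `folder`."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Each returned path could then be handed to the parser in turn, while other file types are skipped until they are supported.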

Code Implementation:

  1. Install the required dependencies:
    a) Create “requirements.txt” in the root of your project and add these dependencies:
    o llama-index
    o llama-parse
    o python-dotenv
    b) From the folder that contains “requirements.txt”, run “pip install -r requirements.txt” in a terminal to download and install these dependencies.
  2. Set up the environment variables:
    a) Create a “.env” file in the root of your project.
    b) Add these variables to it, replacing the placeholder values with your own keys:
    LLAMA_CLOUD_API_KEY="PassYourLLAMACloudAPIKey"
    OPENAI_API_KEY="PassYourOpenAIAPIKey"
  3. Create a folder named “data” in the root of your project and add the PDF file that you would like to parse (the code in step 4 reads “./data/MLFwk.pdf”).

  4. Create a file named “demo.py” and add the code below. Note: the two queries at the end relate to the PDF that I placed in the “data” folder; update them to match the PDF you keep in your own “data” folder.

 

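# Load LLAMA_CLOUD_API_KEY and OPENAI_API_KEY from the .env file.
from dotenv import load_dotenv
load_dotenv()

# Allow nested event loops (needed when a loop is already running, e.g. in a notebook).
import nest_asyncio
nest_asyncio.apply()

from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex

# Parse the PDF into markdown documents.
documents = LlamaParse(result_type="markdown").load_data("./data/MLFwk.pdf")

# Build a vector index over the parsed documents and create a query engine.
llama_parse_index = VectorStoreIndex.from_documents(documents)
llama_parse_query_engine = llama_parse_index.as_query_engine()

# Ask questions about the PDF; update these to match your own document.
print(llama_parse_query_engine.query(
    "What are the three key factors that drive AI booming?"
))

print(llama_parse_query_engine.query(
    "Is Neural Networks allowed for 'ML framework (PyTorch, TensorFlow) framework?'"
))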

 

 
  5. Run demo.py with the command “python .\demo.py”.
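Before running the script, it may be worth confirming that both API keys from the “.env” step are actually set, since a missing key typically only surfaces as an error partway through. This is a minimal sketch using only the standard library; `missing_keys` is a hypothetical helper, and it assumes the variables have already been loaded into the environment (for example by `load_dotenv()`).

```python
import os

# The two variable names used in the .env file above.
REQUIRED_KEYS = ("LLAMA_CLOUD_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names in `required` that are absent or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_keys()` right after `load_dotenv()` in demo.py and stopping early when the list is non-empty gives a clearer message than a failed API call.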

For reference, the script gave the correct response, as the relevant information was present in the “MLFwk.pdf” file that I kept under the “data” folder.
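If you want to try more than two questions, the repeated print calls can be folded into a small loop. The sketch below assumes only that the query engine exposes a `query(text)` method whose result converts to a string, as used in demo.py; `ask_all` is a hypothetical helper name.

```python
def ask_all(query_engine, questions):
    """Run each question through `query_engine` and return {question: answer text}."""
    return {question: str(query_engine.query(question)) for question in questions}
```

In demo.py this could be called as `ask_all(llama_parse_query_engine, [...])` with your own list of questions, printing each entry of the returned dictionary.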





Conclusion:

From the above implementation, we can conclude that LlamaParse is incredibly good at parsing PDFs with complex tables into a well-structured markdown format.

