
How to create a custom Document Loader

Overview

Applications based on LLMs frequently entail extracting data from databases or files, like PDFs, and converting it into a format that LLMs can utilize. In LangChain, this usually involves creating Document objects, which encapsulate the extracted text (page_content) along with metadata, a dictionary containing details about the document, such as the author's name or the date of publication.

Document objects are often formatted into prompts that are fed into an LLM, allowing the LLM to use the information in the Document to generate a desired response (e.g., summarizing the document). Documents can be either used immediately or indexed into a vectorstore for future retrieval and use.
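
For example, a Document's page_content can be interpolated into a prompt string before it is sent to a model. The snippet below is a minimal sketch; the prompt wording and metadata values are purely illustrative.

from langchain_core.documents import Document

doc = Document(
    page_content="LangChain helps you build applications with LLMs.",
    metadata={"source": "example.txt", "author": "Jane Doe"},
)

# Format the document into a prompt; any chat model could consume this string.
prompt = f"Summarize the following document:\n\n{doc.page_content}"
print(prompt)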

The main abstractions for Document Loading are:

  • Document: Contains text and metadata.
  • BaseLoader: Used to convert raw data into Documents.
  • Blob: A representation of binary data located either in a file or in memory.
  • BaseBlobParser: Logic to parse a Blob to yield Document objects.

This guide will demonstrate how to write custom document loading and file parsing logic; specifically, we'll see how to:

  1. Create a standard document Loader by sub-classing from BaseLoader.
  2. Create a parser using BaseBlobParser and use it in conjunction with Blob and BlobLoaders. This is useful primarily when working with files.

Standard Document Loader

A document loader can be implemented by sub-classing from a BaseLoader which provides a standard interface for loading documents.

Interface

  • lazy_load: Used to load documents one by one lazily. Use for production code.
  • alazy_load: Async variant of lazy_load.
  • load: Used to load all the documents into memory eagerly. Use for prototyping or interactive work.
  • aload: Used to load all the documents into memory eagerly. Use for prototyping or interactive work. Added to LangChain in 2024-04.
  • The load method is a convenience method meant solely for prototyping work -- it just invokes list(self.lazy_load()).
  • The alazy_load method has a default implementation that delegates to lazy_load. If you're using async, we recommend overriding the default implementation and providing a native async implementation (see the sketch below).
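
These defaults behave roughly like the sketch below. It is simplified and illustrative only, not the actual langchain_core source; see the BaseLoader reference for the exact behavior.

# Illustrative sketch of the inherited defaults -- not the library source.
class _DefaultsSketch:
    def lazy_load(self):
        raise NotImplementedError

    def load(self) -> list:
        # The default load() simply drains the lazy iterator into a list.
        return list(self.lazy_load())

    async def alazy_load(self):
        # The default async variant iterates lazy_load(); the real implementation
        # runs the blocking iteration in a thread executor so the event loop is not blocked.
        for document in self.lazy_load():
            yield document
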
important

When implementing a document loader do NOT provide parameters via the lazy_load or alazy_load methods.

All configuration is expected to be passed through the initializer (__init__). This was a design choice made by LangChain to make sure that once a document loader has been instantiated, it has all the information needed to load documents.

Installation

Install langchain-core and langchain-community.

%pip install -qqU langchain_core langchain_community
Note: you may need to restart the kernel to use updated packages.

Implementation

Let's create an example of a standard document loader that loads a file and creates a document from each line in the file.

from typing import AsyncIterator, Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class CustomDocumentLoader(BaseLoader):
    """An example document loader that reads a file line by line."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """A lazy loader that reads a file line by line.

        When you're implementing lazy load methods, you should use a generator
        to yield documents one by one.
        """
        with open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1

    # alazy_load is OPTIONAL.
    # If you leave out the implementation, a default implementation which delegates to lazy_load will be used!
    async def alazy_load(
        self,
    ) -> AsyncIterator[Document]:  # <-- Does not take any arguments
        """An async lazy loader that reads a file line by line."""
        # Requires aiofiles
        # Install with `pip install aiofiles`
        # https://github.com/Tinche/aiofiles
        import aiofiles

        async with aiofiles.open(self.file_path, encoding="utf-8") as f:
            line_number = 0
            async for line in f:
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": self.file_path},
                )
                line_number += 1
API Reference:BaseLoader | Document

Test 🧪

To test out the document loader, we need a file with some quality content.

with open("./meow.txt", "w", encoding="utf-8") as f:
    quality_content = "meow meow🐱 \n meow meow🐱 \n meow😻😻"
    f.write(quality_content)

loader = CustomDocumentLoader("./meow.txt")
%pip install -q aiofiles
Note: you may need to restart the kernel to use updated packages.
## Test out the lazy load interface
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)

<class 'langchain_core.documents.base.Document'>
page_content='meow meow🐱
' metadata={'line_number': 0, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow meow🐱
' metadata={'line_number': 1, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow😻😻' metadata={'line_number': 2, 'source': './meow.txt'}
## Test out the async implementation
async for doc in loader.alazy_load():
    print()
    print(type(doc))
    print(doc)

<class 'langchain_core.documents.base.Document'>
page_content='meow meow🐱
' metadata={'line_number': 0, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow meow🐱
' metadata={'line_number': 1, 'source': './meow.txt'}

<class 'langchain_core.documents.base.Document'>
page_content=' meow😻😻' metadata={'line_number': 2, 'source': './meow.txt'}
tip

load() can be helpful in an interactive environment such as a Jupyter notebook.

Avoid using it for production code since eager loading assumes that all the content can fit into memory, which is not always the case, especially for enterprise data.

loader.load()
[Document(metadata={'line_number': 0, 'source': './meow.txt'}, page_content='meow meow🐱 \n'),
Document(metadata={'line_number': 1, 'source': './meow.txt'}, page_content=' meow meow🐱 \n'),
Document(metadata={'line_number': 2, 'source': './meow.txt'}, page_content=' meow😻😻')]
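
If the content does not fit into memory, consume lazy_load incrementally instead. The sketch below batches documents before handing them off; index_batch is a hypothetical stand-in for real work such as embedding and upserting into a vector store.

def index_batch(docs: list) -> None:
    # Hypothetical stand-in for real work, e.g. upserting a batch into a vector store.
    print(f"Indexed {len(docs)} documents")


batch = []
for doc in loader.lazy_load():
    batch.append(doc)
    if len(batch) >= 2:  # tiny batch size just for this demo file
        index_batch(batch)
        batch = []
if batch:
    index_batch(batch)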

Working with Files

Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use open to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.

As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.
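
As a rough illustration of the idea in plain Python (the parse_markdown and parse_pdf helpers here are hypothetical stand-ins for real parsers, not part of LangChain):

def load_bytes(path: str) -> bytes:
    # The loading logic is identical regardless of file type.
    with open(path, "rb") as f:
        return f.read()


def parse_markdown(data: bytes) -> str:
    # Hypothetical stand-in: Markdown is usually just UTF-8 text.
    return data.decode("utf-8")


def parse_pdf(data: bytes) -> str:
    # Hypothetical stand-in: a real implementation would use a PDF library.
    raise NotImplementedError("use a PDF parsing library here")


# The same loading code can feed either parser:
# text = parse_markdown(load_bytes("README.md"))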

BaseBlobParser

A BaseBlobParser is an interface that accepts a blob and outputs a list of Document objects. A blob is a representation of data that lives either in memory or in a file. LangChain has a Blob primitive, which is inspired by the Blob WebAPI spec.

from langchain_core.document_loaders import BaseBlobParser, Blob


class MyParser(BaseBlobParser):
    """A simple parser that creates a document from each line."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Parse a blob into a document line by line."""
        line_number = 0
        with blob.as_bytes_io() as f:
            for line in f:
                line_number += 1
                yield Document(
                    page_content=line,
                    metadata={"line_number": line_number, "source": blob.source},
                )
API Reference:BaseBlobParser | Blob
blob = Blob.from_path("./meow.txt")
parser = MyParser()
list(parser.lazy_parse(blob))
[Document(metadata={'line_number': 1, 'source': './meow.txt'}, page_content='meow meow🐱 \n'),
Document(metadata={'line_number': 2, 'source': './meow.txt'}, page_content=' meow meow🐱 \n'),
Document(metadata={'line_number': 3, 'source': './meow.txt'}, page_content=' meow😻😻')]

Using the blob API also allows one to load content directly from memory without having to read it from a file!

blob = Blob(data=b"some data from memory\nmeow")
list(parser.lazy_parse(blob))
[Document(metadata={'line_number': 1, 'source': None}, page_content='some data from memory\n'),
Document(metadata={'line_number': 2, 'source': None}, page_content='meow')]

Blob

Let's take a quick look at some of the Blob API.

blob = Blob.from_path("./meow.txt", metadata={"foo": "bar"})
blob.encoding
'utf-8'
blob.as_bytes()
b'meow meow\xf0\x9f\x90\xb1 \n meow meow\xf0\x9f\x90\xb1 \n meow\xf0\x9f\x98\xbb\xf0\x9f\x98\xbb'
blob.as_string()
'meow meow🐱 \n meow meow🐱 \n meow😻😻'
blob.as_bytes_io()
<contextlib._GeneratorContextManager at 0x74b8d42e9940>
blob.metadata
{'foo': 'bar'}
blob.source
'./meow.txt'

Blob Loaders

While a parser encapsulates the logic needed to parse binary data into documents, blob loaders encapsulate the logic that's necessary to load blobs from a given storage location.

At the moment, LangChain supports FileSystemBlobLoader and CloudBlobLoader.

You can use the FileSystemBlobLoader to load blobs and then use the parser to parse them.

from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader

filesystem_blob_loader = FileSystemBlobLoader(
path=".", glob="*.mdx", show_progress=True
)
API Reference:FileSystemBlobLoader
%pip install -q tqdm
Note: you may need to restart the kernel to use updated packages.
parser = MyParser()
for blob in filesystem_blob_loader.yield_blobs():
    for doc in parser.lazy_parse(blob):
        print(doc)
        break
100%|██████████| 7/7 [00:00<00:00, 4229.35it/s]
page_content='# Text embedding models
' metadata={'line_number': 1, 'source': 'embed_text.mdx'}
page_content='---
' metadata={'line_number': 1, 'source': 'installation.mdx'}
page_content='---
' metadata={'line_number': 1, 'source': 'index.mdx'}
page_content='# How to load Microsoft Office files
' metadata={'line_number': 1, 'source': 'document_loader_office_file.mdx'}
page_content='# How to create and query vector stores
' metadata={'line_number': 1, 'source': 'vectorstores.mdx'}
page_content='---
' metadata={'line_number': 1, 'source': 'toolkits.mdx'}
page_content='# How to load JSON
' metadata={'line_number': 1, 'source': 'document_loader_json.mdx'}

Or, you can use CloudBlobLoader to load blobs from a cloud storage location (supports s3://, az://, gs://, and file:// schemes).

%pip install -q 'cloudpathlib[s3]'
Note: you may need to restart the kernel to use updated packages.
from cloudpathlib import S3Client, S3Path
from langchain_community.document_loaders.blob_loaders import CloudBlobLoader

client = S3Client(no_sign_request=True)
client.set_as_default_client()

path = S3Path(
"s3://bucket-01", client=client
) # Supports s3://, az://, gs://, file:// schemes.

cloud_loader = CloudBlobLoader(path, glob="**/*.pdf", show_progress=True)

for blob in cloud_loader.yield_blobs():
    print(blob)
API Reference:CloudBlobLoader
 17%|█▋        | 1/6 [00:04<00:20,  4.20s/it]
metadata={} mimetype='application/pdf' path='s3://bucket-01/Annual-Report-2016.pdf'
 33%|███▎      | 2/6 [00:05<00:09,  2.28s/it]
metadata={} mimetype='application/pdf' path='s3://bucket-01/ComingHomeToNature_ActivityBooklet.pdf'
 50%|█████     | 3/6 [00:06<00:06,  2.01s/it]
metadata={} mimetype='application/pdf' path='s3://bucket-01/ComingHomeToNature_ActivityBookletFoyles.pdf'
 67%|██████▋   | 4/6 [00:07<00:02,  1.44s/it]
metadata={} mimetype='application/pdf' path='s3://bucket-01/EVENTS E-POSTER_DAYS OF AWE.pdf'
 83%|████████▎ | 5/6 [00:07<00:01,  1.11s/it]
metadata={} mimetype='application/pdf' path='s3://bucket-01/MH.pdf'
100%|██████████| 6/6 [00:08<00:00,  1.02s/it]
metadata={} mimetype='application/pdf' path='s3://bucket-01/SRT Annual Report 2018.pdf'
100%|██████████| 6/6 [00:11<00:00,  1.87s/it]

Generic Loader

LangChain has a GenericLoader abstraction which composes a BlobLoader with a BaseBlobParser.

GenericLoader is meant to provide standardized classmethods that make it easy to use existing BlobLoader implementations. At the moment, the FileSystemBlobLoader and CloudBlobLoader are supported.
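
For example, GenericLoader.from_filesystem wires a FileSystemBlobLoader to a parser in a single call. The sketch below reuses the MyParser defined earlier; the keyword arguments shown (path, glob, show_progress, parser) reflect the current signature, but check the API reference for your installed version.

from langchain_community.document_loaders.generic import GenericLoader

# Compose a filesystem blob loader with the line-based parser from earlier.
fs_loader = GenericLoader.from_filesystem(
    path=".",
    glob="*.mdx",
    show_progress=True,
    parser=MyParser(),
)

for doc in fs_loader.lazy_load():
    print(doc)
    break  # show just the first document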

from langchain_community.document_loaders.generic import GenericLoader

generic_loader_filesystem = GenericLoader(
    blob_loader=filesystem_blob_loader, blob_parser=parser
)
for idx, doc in enumerate(generic_loader_filesystem.lazy_load()):
    if idx < 5:
        print(doc)

print("... output truncated for demo purposes")
API Reference:GenericLoader
100%|██████████| 7/7 [00:00<00:00, 1224.82it/s]
page_content='# Text embedding models
' metadata={'line_number': 1, 'source': 'embed_text.mdx'}
page_content='
' metadata={'line_number': 2, 'source': 'embed_text.mdx'}
page_content=':::info
' metadata={'line_number': 3, 'source': 'embed_text.mdx'}
page_content='Head to [Integrations](/docs/integrations/text_embedding/) for documentation on built-in integrations with text embedding model providers.
' metadata={'line_number': 4, 'source': 'embed_text.mdx'}
page_content=':::
' metadata={'line_number': 5, 'source': 'embed_text.mdx'}
... output truncated for demo purposes
%pip install -q pypdf2
Note: you may need to restart the kernel to use updated packages.
from io import BytesIO

from langchain_core.document_loaders.base import BaseBlobParser
from langchain_core.documents import Document
from PyPDF2 import PdfReader


class MyPDFParser(BaseBlobParser):
    def lazy_parse(self, blob) -> Iterator[Document]:
        reader = PdfReader(BytesIO(blob.as_bytes()))
        for i, page in enumerate(reader.pages):
            text = page.extract_text()
            yield Document(
                page_content=text, metadata={"page": i, "source": blob.source}
            )
API Reference:BaseBlobParser | Document
from langchain_community.document_loaders.generic import GenericLoader

pdf_parser = MyPDFParser()
generic_loader_cloud = GenericLoader(blob_loader=cloud_loader, blob_parser=pdf_parser)
for idx, doc in enumerate(generic_loader_cloud.lazy_load()):
    if idx < 5:
        print(doc)

print("... output truncated for demo purposes")
API Reference:GenericLoader
  0%|          | 0/6 [00:00<?, ?it/s]
page_content='THE SIGRID RAUSING TRUST
Annual Report 2015' metadata={'page': 0, 'source': 's3://bucket-01/Annual-Report-2016.pdf'}
page_content='' metadata={'page': 1, 'source': 's3://bucket-01/Annual-Report-2016.pdf'}
page_content='1
SIGRID RAUSING TRUSTContents
Preface
The Sigrid Rausing Trust
Supporting the International Human Rights Movement
Advocacy, Research and LitigationDetention, Torture and the Death Penalty
Human Rights Defenders
Free ExpressionTransitional Justice
Women’s Rights
SRT Grantmaking in 2015 – The Statistics
LGBTI Rights
Xenophobia and Intolerance
Transparency and Accountability
Regional Funds
Miscellaneous Fund
List of Grants
Trustees and Staff
02
04
05
060810
12
1416
1820
2224
26
28
3036
Front cover: Ukraine, Nikishino, Donetsk Oblast:
Destroyed houses in the village of Nikishino,
which was the scene of intense fighting between
pro-Russian separatists and the Ukrainian
army. The armed conflict resulted in the deaths
of thousands of civilians and over one million
internally displaced people. SRT grantee
Ukrainian Women’s Fund has adapted its work in
response to the conflict with a rapid response
grants programme to support women’s rights
organisations to address the impact of the
conflict on women.
©Panos Pictures/Iva ZimovaBack cover. Inside front and back cover :
In November 2015, a dam holding toxic waste from
an iron ore mine in Mariana, Brazil, burst creating
a devastating mud slide. The mud engulfed the
local town of Bento Rodrigues, leaving 17 people
dead and hundreds homeless and waterless, and
causing untold damage to the Rio Doce River and
agricultural land, depriving numerous communities
of their livelihoods. SRT grantee Justiça Global
launched a report on the disaster in January
2016 accusing two of the world’s largest mining
companies of negligence in this case. The
organisation also collaborated with Conectas,
another grantee, to arrange a visit of UN human
rights experts, who denounced the government
and companies involved for ‘insufficient’ efforts to protect the community.
©Daniela Fichino/Justiça Global' metadata={'page': 2, 'source': 's3://bucket-01/Annual-Report-2016.pdf'}
page_content='PREFACESIGRID RAUSING TRUST
22015 was the 20th anniversary of the Sigrid Rausing Trust. In that
period the Trust has awarded £250 million in grants, supporting 860
organisations across the world. We have seen many significant changes
in these years. There is now greater accountability of human rights abusers, with the creation of the International Criminal Court and strengthened legal remedies in many countries. There is a stronger and
more geographically dispersed human rights movement, a much greater
awareness of rights, and human rights activism among communities in the Global South supported by local and national organisations, many of which did not exist twenty years ago. The lesbian and gay rights movement
has turned global, and marriage equality has become established in many
countries. The adoption of the Ruggie principles has signaled a stronger emphasis on transparency of government and business. Finally there is an
increased focus upon the people who endure the most egregious human
rights abuses. In supporting this work the Trust has become a crucial foundation stone for human rights voices in the world.
For twenty years the Trust has supported the human rights movement in a
way that is atypical of many donors: by providing long term core funding,
and placing a special emphasis on leadership and clarity of thought in
our grantees. Providing core support to civil society groups that is often a lifeline makes the Trust a unique part of the funding landscape. We
believe funding is most effective when grantees are free to determine how
they use our funds to promote human rights goals. Too often we see the imagination and energy of civil society leaders eroded by the constant demand to produce project grant applications, predict the future through
the use of log frames and extensive reports - processes that can absorb
valuable time.' metadata={'page': 3, 'source': 's3://bucket-01/Annual-Report-2016.pdf'}
page_content='SIGRID RAUSING TRUSTSIGRID RAUSING TRUST
3Our costs, as a proportion of our total funds are less than 4% annually
- a percentage that we are proud of, and which we achieve without any
sacrifice of our due diligence. We are lucky to have a strong staff team
whose analysis and judgments create a sound platform for the Trustees to make decisions. And of course we have a strong leader in Sigrid herself, whose insights and values have set the course for the Trust over the last
twenty years.
But we also face significant challenges. States are pushing back against
human rights norms with arguments based on traditional values and state sovereignty. Human rights organisations sometimes struggle
to communicate with the public and to bring human rights out of
the courtroom and into the broader public domain. They are often accountable to international funders, and may forget the need to build broad coalitions in their own countries.
When the Arab Spring erupted the Trust invested in supporting the fast
developing human rights movement in the region. Some governments in the region are seeking to crack down on independent expression, and
grants have become much more difficult to process as organisations find
themselves facing increasing legal and bureaucratic hurdles. A similar pattern has been emerging in Russia over the last few years, with its so-called Foreign Agents law. This attempt to close down the civil space
in which NGOs can operate represents a grave threat to the human
rights movement and to broader democratic ideals. It is a threat we are determined to confront as we find ways of continuing our valued support to beleaguered groups in a number of increasingly repressive societies.
Andrew Puddephatt
December 2015' metadata={'page': 4, 'source': 's3://bucket-01/Annual-Report-2016.pdf'}
100%|██████████| 6/6 [00:14<00:00,  2.42s/it]
... output truncated for demo purposes

Custom Generic Loader

If you really like creating classes, you can sub-class GenericLoader to bundle the loading and parsing logic together in a single class.

Sub-classing lets you load content using an existing blob loader while associating a default parser with your class.

from typing import Any


class MyCustomLoader(GenericLoader):
    @staticmethod
    def get_parser(**kwargs: Any) -> BaseBlobParser:
        """Override this method to associate a default parser with the class."""
        return MyParser()
loader = MyCustomLoader.from_filesystem(path=".", glob="*.mdx", show_progress=True)

for idx, doc in enumerate(loader.lazy_load()):
    if idx < 5:
        print(doc)

print("... output truncated for demo purposes")
100%|██████████| 7/7 [00:00<00:00, 814.86it/s]
page_content='# Text embedding models
' metadata={'line_number': 1, 'source': 'embed_text.mdx'}
page_content='
' metadata={'line_number': 2, 'source': 'embed_text.mdx'}
page_content=':::info
' metadata={'line_number': 3, 'source': 'embed_text.mdx'}
page_content='Head to [Integrations](/docs/integrations/text_embedding/) for documentation on built-in integrations with text embedding model providers.
' metadata={'line_number': 4, 'source': 'embed_text.mdx'}
page_content=':::
' metadata={'line_number': 5, 'source': 'embed_text.mdx'}
... output truncated for demo purposes
