Python Code Explanation

process_pdf.py

load_and_split_pdfs Function

The load_and_split_pdfs function is responsible for loading the content of multiple PDF files and splitting them into smaller chunks. It accepts the following arguments: pdf_paths, chunk_size, and chunk_overlap.

First, the function initializes an empty list called documents to collect the Document objects loaded from the PDF files. It then iterates over each file path in the pdf_paths list.

For each PDF file, it checks if the file exists using the os.path.exists() function. If the file does not exist, a FileNotFoundError is raised to inform the user.

If the file exists, the function uses the PyPDFLoader from the langchain library to load the PDF's content. The loader extracts the text and returns one Document per page, and these are added to the documents list with extend().

After all PDF files are loaded, the function uses RecursiveCharacterTextSplitter to split the content into smaller chunks. The chunk_size parameter sets the maximum chunk length (default: 1000 characters), and the chunk_overlap parameter sets how many characters consecutive chunks may share (default: 200 characters), so that text cut at a chunk boundary still appears with some context in the next chunk.

Finally, the function returns a list of split documents for further processing, such as summarization or translation.

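import os

# Import paths assume an older LangChain release; newer releases expose these
# classes from langchain_community.document_loaders and langchain_text_splitters.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_and_split_pdfs(pdf_paths, chunk_size=1000, chunk_overlap=200):
    documents = []
    for pdf_path in pdf_paths:
        # Fail early with a clear message if a path is wrong.
        if not os.path.exists(pdf_path):
            raise FileNotFoundError(f"The file {pdf_path} does not exist.")
        loader = PyPDFLoader(pdf_path)
        documents.extend(loader.load())  # one Document per page
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return text_splitter.split_documents(documents)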
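
To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not how RecursiveCharacterTextSplitter works internally (the real splitter prefers to break on separators such as paragraph and line breaks). The helper name split_with_overlap is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size chunks whose edges overlap (illustrative helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Tiny sizes so the overlap is easy to see by eye.
chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

With a chunk size of 4 and an overlap of 2, each chunk repeats the last two characters of the previous one.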

save_processing_results Function

The save_processing_results function saves the results from processing the PDF documents into a text file. It accepts a list of results and an optional file name for output.

The function opens the output file in write mode, which overwrites any existing file of the same name. It then iterates over the results list, where each item is a dictionary holding the processed data for one document.

For each document, the function writes the document number followed by the values stored under the Summary, Translation, and Emails keys; a missing key raises a KeyError. After each document's results, a line of 50 dashes separates it from the next entry.

Note that process_document in main.py returns dictionaries with a single Result key, so those results would need to be reshaped before they could be passed to this function.

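def save_processing_results(results, output_file="results.txt"):
    # Write mode overwrites an existing file; UTF-8 keeps non-ASCII text intact.
    with open(output_file, "w", encoding="utf-8") as f:
        for i, result in enumerate(results, start=1):
            f.write(f"Document {i}:\n")
            # Each result is expected to be a dict with these three keys.
            f.write(f"Summary: {result['Summary']}\n")
            f.write(f"Translation: {result['Translation']}\n")
            f.write(f"Emails: {result['Emails']}\n")
            f.write("-" * 50 + "\n")  # separator between documents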
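
A quick way to see the output format is to write sample results to a temporary file and read them back. The sketch below restates the writer in a slightly more defensive form: it uses dict.get so that a missing key is written as "N/A" instead of raising KeyError. The function name save_results_safely, the "N/A" placeholder, and the sample data are illustrative assumptions, not part of the original module.

```python
import os
import tempfile

FIELDS = ("Summary", "Translation", "Emails")


def save_results_safely(results, output_file="results.txt"):
    """Write one block per result; missing fields are written as 'N/A'."""
    with open(output_file, "w", encoding="utf-8") as f:
        for i, result in enumerate(results, start=1):
            f.write(f"Document {i}:\n")
            for field in FIELDS:
                f.write(f"{field}: {result.get(field, 'N/A')}\n")
            f.write("-" * 50 + "\n")


# Made-up sample data: one complete result and one with missing keys.
sample = [
    {"Summary": "Short overview.", "Translation": "Short overview in English.",
     "Emails": ["a@example.com"]},
    {"Summary": "Only a summary."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.txt")
    save_results_safely(sample, path)
    with open(path, encoding="utf-8") as f:
        written = f.read()

print(written)
```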

Summary

In summary, the load_and_split_pdfs function loads PDF files and splits their content into smaller text chunks. These chunks are easier for language models to handle in tasks such as summarization, translation, or question answering. The save_processing_results function writes the results to a text file in a fixed, readable layout so they can be reviewed later.

question_handler.py

get_question_answer_chain Function

The get_question_answer_chain function is responsible for setting up a chain to answer questions based on a given text. It takes one argument, llm, which represents the language model used to generate the answers.

Inside the function, the question_answer_prompt is defined, containing a template where a question and a text are dynamically inserted. This template serves as a prompt for the language model.

The PromptTemplate class from langchain is used to create a template that takes two input variables: question and text. These variables are injected into the prompt when running the chain.

Finally, the function returns an instance of LLMChain, which combines the language model and the question-answer prompt. This chain is ready to process inputs for answering questions.

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

def get_question_answer_chain(llm):
    question_answer_prompt = """
    Answer the following question based on the provided text:
    Question: {question}
    Text: {text}
    """
    question_prompt_template = PromptTemplate(template=question_answer_prompt, input_variables=["question", "text"])
    return LLMChain(llm=llm, prompt=question_prompt_template)
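
The placeholder substitution can be tried without a model. The sketch below uses Python's built-in str.format as a stand-in for what the prompt template does with question and text. It is an analogy for illustration, not LangChain's implementation; the render_prompt helper and the sample inputs are made up.

```python
import textwrap

# Same wording as the prompt above, with the indentation removed.
QUESTION_ANSWER_PROMPT = textwrap.dedent("""\
    Answer the following question based on the provided text:
    Question: {question}
    Text: {text}
    """)


def render_prompt(question, text):
    """Fill the two placeholders, as happens before the prompt reaches the model."""
    return QUESTION_ANSWER_PROMPT.format(question=question, text=text)


prompt = render_prompt("Who wrote the report?", "The report was written by Dana.")
print(prompt)
```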

answer_question Function

The answer_question function takes three arguments: question_chain, question, and text. This function runs the language model chain to generate an answer based on the provided question and text.

The function calls question_chain.run() with a dictionary containing the question and text as key-value pairs. The chain processes the input and returns an answer generated by the language model.

This function is a simple interface that allows users to pass a question and corresponding text to the chain, obtaining a language model-generated answer.

def answer_question(question_chain, question, text):
    return question_chain.run({"question": question, "text": text})
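
Because answer_question only relies on the chain having a run() method, it can be exercised without a real model by passing a stand-in object. The EchoChain class below is a hypothetical test double, not a LangChain class; it records the inputs it receives and returns a canned answer.

```python
class EchoChain:
    """Hypothetical stand-in for a chain: records inputs, returns a fixed answer."""

    def __init__(self, canned_answer):
        self.canned_answer = canned_answer
        self.calls = []

    def run(self, inputs):
        self.calls.append(inputs)
        return self.canned_answer


def answer_question(question_chain, question, text):
    # Same body as the function above, repeated so the sketch is self-contained.
    return question_chain.run({"question": question, "text": text})


fake_chain = EchoChain("Paris")
answer = answer_question(
    fake_chain,
    "What is the capital of France?",
    "Paris is the capital of France.",
)
print(answer)
print(fake_chain.calls)
```

A test double like this makes it possible to check that the right keys are passed to the chain without starting a model server.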

Summary

In summary, the get_question_answer_chain function creates a question-answering system by using a language model and a structured prompt. The answer_question function runs this chain by passing in a question and text, returning an answer generated by the model.

summarizer.py

get_summary_chain Function

The get_summary_chain function creates a chain for summarizing a given text using a language model. It accepts one argument, llm, which stands for the language model responsible for generating summaries.

Inside the function, a summary_prompt is defined as a template for summarization. The prompt asks the language model to summarize the provided text within a 100-word limit.

The PromptTemplate class from the langchain library is used to define the template, with text being the only input variable. This allows dynamic injection of different texts into the prompt.

The function returns an instance of LLMChain, which connects the language model with the summarization prompt. This chain can be used to generate concise summaries of input texts.

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

def get_summary_chain(llm):
    summary_prompt = """
    Summarize the following text (maximum 100 words):
    Text: {text}
    """
    summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
    return LLMChain(llm=llm, prompt=summary_prompt_template)

summarize_document Function

The summarize_document function runs the summarization chain to generate a summary for a given document. It accepts two arguments: summary_chain, which is the chain created using the get_summary_chain function, and doc_content, which is the content of the document to be summarized.

This function calls the run() method of the summary_chain, passing the document's content as a dictionary with the text key. The chain processes the input and returns a summarized version of the document.

By using this function, users can easily generate concise summaries of large documents, making the content easier to review or share.

def summarize_document(summary_chain, doc_content):
    return summary_chain.run({"text": doc_content})
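
The 100-word limit is only an instruction in the prompt, so a model may exceed it. If a hard cap matters, the output can be checked after the fact. The sketch below is one possible post-processing step; the helper name enforce_word_limit and the choice to truncate and append " ..." are assumptions, not part of summarizer.py.

```python
def enforce_word_limit(summary, max_words=100):
    """Return the summary unchanged if short enough, else cut it at max_words."""
    words = summary.split()
    if len(words) <= max_words:
        return summary
    # Keep the first max_words words and mark the cut.
    return " ".join(words[:max_words]) + " ..."


short = enforce_word_limit("A brief summary.")
long_text = " ".join(f"word{i}" for i in range(150))
capped = enforce_word_limit(long_text)
print(short)
print(capped)
```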

Summary

In summary, the get_summary_chain function sets up a summarization process using a language model, while the summarize_document function runs this process to generate summaries for any provided text. This approach helps users create concise and meaningful overviews of larger documents.

translator.py

get_translation_chain Function

The get_translation_chain function creates a chain for translating text into English using a language model. It accepts one argument, llm, which stands for the language model responsible for the translation task.

Inside the function, a translation_prompt is defined as a template. This prompt asks the language model to translate the provided text into English.

The PromptTemplate class from the langchain library is used to define the translation prompt template, with text being the only input variable. This allows different texts to be passed into the translation prompt dynamically.

The function returns an LLMChain that links the language model with the translation prompt. This chain will handle the translation task for any given text.

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

def get_translation_chain(llm):
    translation_prompt = """
    Translate the following text into English:
    Text: {text}
    """
    translation_prompt_template = PromptTemplate(template=translation_prompt, input_variables=["text"])
    return LLMChain(llm=llm, prompt=translation_prompt_template)

translate_text Function

The translate_text function translates a given piece of text using the translation chain created by the get_translation_chain function. It takes two arguments: translation_chain, which is the chain returned by get_translation_chain, and text, the content to be translated.

The function calls the run() method of the translation_chain, passing the input text in a dictionary format with the text key. The model processes the input and returns the translated text.

This function allows easy translation of any non-English text into English using a language model, making multilingual documents accessible.

def translate_text(translation_chain, text):
    return translation_chain.run({"text": text})

Summary

In summary, the get_translation_chain function sets up the translation process using a language model, while the translate_text function leverages this chain to translate any text into English. This system is useful for translating multilingual documents into a common language for easier understanding and processing.

Document Processing and Question Answering Application

The main.py file sets up a user interface for document processing with the Streamlit framework. It allows users to upload PDFs, select a task (summarization, translation, or question answering), and choose a language model for text processing. Uploaded files are written to a temporary directory so they can be loaded from disk, and processing is dispatched to a thread pool so that several pieces of work can run concurrently.

1. UI Components

The user interface is divided into two columns: one for uploading PDF files and the other for selecting tasks and models. The layout is created using Streamlit's column feature. It also includes custom CSS for a modern and intuitive look.

col1, col2 = st.columns(2)

with col1:
    st.markdown("<h2>📂 Upload PDF(s)</h2>", unsafe_allow_html=True)
    pdf_files = st.file_uploader("Upload PDF(s)", accept_multiple_files=True, type=['pdf'], label_visibility="collapsed")

with col2:
    st.markdown("<h2>🔧 Choose Action</h2>", unsafe_allow_html=True)
    action = st.selectbox(
        "Choose the action you want to perform:",
        ("Summarize", "Translate", "Ask a Question")
    )

    model_choice = st.selectbox(
        "Choose the model to use:",
        ("Llama 3.1", "Llama 2", "Mistral", "CodeLlama")
    )

2. Model Selection

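The language model is initialized based on the user's selection. Options include "Llama 3.1", "Llama 2", "Mistral", and "CodeLlama". Each option maps to an Ollama model name, and the model is accessed through the Ollama class, which sends requests to a local Ollama server at http://localhost:11434 (Ollama's default port). The server must be running and the selected model must already be available on it.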

# Map the label shown in the UI to the Ollama model name.
MODEL_NAMES = {
    "Llama 3.1": "llama3.1",
    "Llama 2": "llama2",
    "Mistral": "mistral",
    "CodeLlama": "codellama",
}
llm = Ollama(model=MODEL_NAMES[model_choice], base_url="http://localhost:11434")

3. PDF Upload and Processing

Once the user uploads PDF files, they are written to a temporary directory and processed. The content is split into manageable chunks using the load_and_split_pdfs function, which keeps each piece small enough for the language model to handle. The call to load_and_split_pdfs must happen inside the with block: the temporary directory and its files are deleted as soon as the block exits, and loading after that point would raise FileNotFoundError.

if pdf_files:
    st.markdown("<h2>🛠 Processing PDFs...</h2>", unsafe_allow_html=True)

    with tempfile.TemporaryDirectory() as temp_dir:
        pdf_paths = []
        for pdf_file in pdf_files:
            temp_pdf_path = f"{temp_dir}/{pdf_file.name}"
            with open(temp_pdf_path, "wb") as f:
                f.write(pdf_file.read())
            pdf_paths.append(temp_pdf_path)

        # Load while the temporary files still exist; they are deleted on exit.
        docs = load_and_split_pdfs(pdf_paths)
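
Files written into a tempfile.TemporaryDirectory exist only while its with block is active, so anything that reads them has to run inside the block. The following standalone sketch demonstrates that lifetime with a dummy file; the file name and contents are placeholders.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as temp_dir:
    temp_path = os.path.join(temp_dir, "example.pdf")
    with open(temp_path, "wb") as f:
        f.write(b"%PDF-1.4 placeholder bytes")
    exists_inside = os.path.exists(temp_path)  # the file is readable here

# Leaving the block deletes the directory and everything in it.
exists_after = os.path.exists(temp_path)

print(exists_inside, exists_after)
```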

4. Task Execution

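The application then performs the task selected by the user. The process_document function takes one item from docs (a chunk returned by load_and_split_pdfs) and either summarizes it, translates it, or answers the user's question about it. Note that the Translate action first summarizes the chunk and then translates that summary rather than the full text. Any exception is caught and returned as an error string, so one failing chunk does not abort the rest. The calls are dispatched through a ThreadPoolExecutor (not shown in this excerpt), which allows several chunks to be processed concurrently.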

def process_document(doc):
    try:
        if action == "Summarize":
            summary = summarize_document(summary_chain, doc.page_content)
            return {"Result": summary}
        elif action == "Translate":
            summary = summarize_document(summary_chain, doc.page_content)
            translation = translate_text(translation_chain, summary)
            return {"Result": translation}
        elif action == "Ask a Question" and question:
            answer = answer_question(question_chain, question, doc.page_content)
            return {"Result": answer}
        else:
            return {"Result": "No valid input provided."}
    except Exception as e:
        return {"Result": f"Error: {str(e)}"}
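
The excerpt does not show how process_document is dispatched. A common pattern, sketched here with a stand-in worker so it runs without a model, is ThreadPoolExecutor.map, which runs the calls concurrently and yields results in input order. The fake_process function, the sample chunks, and max_workers=4 are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_process(chunk):
    """Stand-in for process_document: returns a dict shaped like its output."""
    try:
        return {"Result": chunk.upper()}
    except Exception as e:
        # Mirror the original design: report the error instead of raising.
        return {"Result": f"Error: {e}"}


# The None entry simulates a chunk that fails during processing.
chunks = ["first chunk", None, "third chunk"]

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() preserves input order even if later tasks finish first.
    results = list(executor.map(fake_process, chunks))

for i, result in enumerate(results, start=1):
    print(f"Document {i}: {result['Result']}")
```

Because the worker catches its own exceptions, the failing entry produces an error message in its slot while the other entries complete normally.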

5. Displaying Results

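Once the documents have been processed, the results are displayed in expandable sections, one per entry in results. This keeps the interface clean and organized when there are many entries. If results holds one entry per chunk, as the process_document function suggests, the "Document N" labels number chunks rather than whole PDF files whenever a PDF is split into several chunks.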

for i, result in enumerate(results):
    with st.expander(f"Document {i + 1} - Result"):
        st.write(result["Result"])

Summary

In summary, the main.py script provides a complete workflow for processing PDF documents using different language models. It enables summarization, translation, and question-answering, offering a user-friendly interface through Streamlit, while leveraging powerful language models for natural language processing tasks.