This project is a PDF parser that extracts text from a PDF exam file and returns the PaperExam ID.
To use this project, follow these steps:

    git clone [email protected]:dica-solution/PDF_Parser.git
    cd PDF_Parser
    poetry install
    poetry shell

- Set up the environment variables in the `.env.dev` file.
- Prepare the PDF exam and save it in a folder, for example: `pdf_parser/pdf_exams/exam_001/exam.pdf`
- Run the following command and wait for the text extraction to finish. When it completes, you will get the ID of a PaperExam (a system for storing structured exams).

      python scripts/pipeline.py --pdf_file_path `path to pdf file` --prompt_collection_path `path to prompt collection file`
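
If you prefer to launch the pipeline from Python rather than typing the command by hand, the sketch below shows one way to assemble the same command line. It uses only the two flags shown above (`--pdf_file_path` and `--prompt_collection_path`). The helper name `build_pipeline_command` and its `python_executable` parameter are hypothetical, not part of this project, and the sketch assumes you run it from the repository root inside the Poetry environment.

```python
import sys
from pathlib import Path


def build_pipeline_command(pdf_file_path, prompt_collection_path,
                           python_executable=sys.executable):
    """Return the argument list for scripts/pipeline.py.

    Hypothetical helper: it only assembles the command shown in this
    README and does not call into the project's own code.
    """
    pdf = Path(pdf_file_path)
    prompts = Path(prompt_collection_path)

    # Catch an obviously wrong input before starting a long extraction run.
    if pdf.suffix.lower() != ".pdf":
        raise ValueError(f"expected a .pdf file, got: {pdf}")

    return [
        python_executable,
        "scripts/pipeline.py",
        "--pdf_file_path", str(pdf),
        "--prompt_collection_path", str(prompts),
    ]


# Example usage (uncomment to run from the repository root):
# import subprocess
# cmd = build_pipeline_command(
#     "pdf_parser/pdf_exams/exam_001/exam.pdf",
#     "path/to/prompt_collection_file",  # placeholder, replace with your file
# )
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a path contains spaces.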