Download the complete set of arXiv PDFs and their corresponding source files from Amazon S3 as described here. Our dataset contains papers up through December 2018.
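As a rough sketch of how the download could be limited to papers up through December 2018, the snippet below filters a bulk-data manifest by date. It assumes the manifest is an XML file whose `file` elements each have `filename` and `yymm` children; that layout, and the helper names `yymm_to_year_month` and `files_through`, are illustrative assumptions and not part of the AutoMATES repo.

```python
import xml.etree.ElementTree as ET


def yymm_to_year_month(yymm):
    """Convert a 'yymm' string to a (year, month) tuple.

    Assumption: two-digit years 91-99 mean 1991-1999, all others mean 20xx.
    """
    yy, mm = int(yymm[:2]), int(yymm[2:])
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return year, mm


def files_through(manifest_xml, cutoff=(2018, 12)):
    """Return filenames of manifest entries dated at or before the cutoff."""
    root = ET.fromstring(manifest_xml)
    selected = []
    for entry in root.iter("file"):
        filename = entry.findtext("filename")
        yymm = entry.findtext("yymm")
        if filename is None or yymm is None:
            continue  # skip entries that do not match the assumed layout
        if yymm_to_year_month(yymm.strip()) <= cutoff:
            selected.append(filename.strip())
    return selected
```

The returned list can then be handed to whichever S3 client is used for the actual download.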
Download and install Docker
Clone the AutoMATES repo
Expand and normalize the directory structure
Expand the src files:
cd automates/equation_extraction
python expand_arxiv.py
Pass --keepall if you want to keep intermediate files. We use this with the pdf directory expansion because otherwise the newly expanded pdfs would be deleted!
Expand the pdf files (optional, but makes subsequent steps WAY faster!)
python expand_arxiv.py <path_to_arxiv_pdf_dir> data/arxiv/pdf --keepall
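Before moving on, it can help to confirm that the expansion actually produced files. The following is a minimal sketch that counts files with a given suffix under a directory; `count_files` is a hypothetical helper, and the assumption that the expanded PDFs end in `.pdf` somewhere below `data/arxiv/pdf` is ours, not something the repo guarantees.

```python
from pathlib import Path


def count_files(root, suffix):
    """Count regular files under root (recursively) whose names end with suffix."""
    root = Path(root)
    if not root.is_dir():
        return 0  # a missing directory is treated as empty
    return sum(1 for p in root.rglob("*" + suffix) if p.is_file())
```

For example, `count_files("data/arxiv/pdf", ".pdf")` returning 0 would suggest the expansion did not run or wrote to a different directory.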
Build the Docker image for processing the LaTeX documents
docker build -t clulab/equations .
Collect the data for the separate equation detection and decoding tasks:
./docker.sh ./run_data_collection.sh --indir=/data/arxiv/src --outdir=/data/arxiv/output --pdfdir=/data/arxiv/pdf --rescale-factor=0.5 --dump-pages --nproc=2 --logfile=/data/arxiv/logfile
The script takes several arguments: