Make you have access to a GPU
Install Singularity with GPU support:
Pull the singularity container from singularity hub OR build it yourself
singularity pull shub://ml4ai/UA-hpc-containers:im2markup
Download Data & Scripts from Deng et al., (2017):
Python and Lua scripts can be cloned from the im2markup GitHub repository:
git clone https://github.com/harvardnlp/im2markup.git
cd im2markup
Preprocess Data:
This library expects to have the gold formulas (formulas.norm.lst
in the script below), even during inference because the evaluation code is always called. We can avoid this requirement by provided random strings as the gold formulas so that the input is in the right format, but we don’t cheat (and we don’t have to know the answer before we ask the question)!
The format of this random string formula file we generate is:
x x x
x x x
x x x
...
The library also wants to have list of the images to be decoded (test_filter.lst
in the script below).
The format of this file is:
equation0.png 0
equation1.png 1
equation2.png 2
equation3.png 3
...
We generate these files in the process of formatting the images with ImageMagick using this bash script (please modify as needed).
Then, preprocess the images:
cd im2markup
python scripts/preprocessing/preprocess_images.py --input-dir <path_to_images> --output-dir <path_to_output_images_processed>
Download the Trained Model:
mkdir -p model/latex
wget -P model/latex/ http://lstm.seas.harvard.edu/latex/model/latex/final-model
###Running the model on your equation images
At the University of Arizona, we run the code on the HPC. This is the PBS script we use to do so. Please modify to meet your needs.
script: ~/script_test.pbs
# Your job will use 1 node, 28 cores, and 224gb of memory total.
#PBS -q standard
#PBS -l select=1:ncpus=28:mem=224gb:ngpus=1
### Specify a name for the job
#PBS -N im2latex
### Specify the group name
#PBS -W group_list=claytonm
### Used if job requires partial node only
#PBS -l place=pack:exclhost
### CPUtime required in hhh:mm:ss.
### Leading 0's can be omitted e.g 48:0:0 sets 48 hours
#PBS -l cput=00:28:00
### Walltime is how long your job will run
#PBS -l walltime=00:02:00
### Joins standard error and standard out
#PBS -o im2latex.o
#PSB -e im2latex.e
### Sends e-mail when job aborts/ends
#PBS -m ae
#PBS -M username@email.arizona.edu
### Checkpoints job after c (integer min) CPU time
#PBS -c c=2
##########################################################
module load singularity
### DIRECTORIES / NAMES ###
USER=$(whoami)
CONTAINER=containers/UA-hpc-containers_torch.sif
MODEL=model/latex
DATA=data/sample
IMAGES=$DATA/images_processed
RESULTS=$DATA/results
LOGS=$DATA/logs
LOGNAME=im2latex
cd /extra/$USER/im2markup/
mkdir --parents $RESULTS $LOGS
##########################################################
date +"Start - %a %b %e %H:%M:%S %Z %Y"
singularity exec --nv $CONTAINER th src/train.lua \
-phase test \
-gpu_id 1 \
-load_model \
-visualize \
-model_dir $MODEL \
-data_base_dir $IMAGES \
-output_dir $RESULTS \
-data_path $DATA/test_filter.lst \
-label_path $DATA/formulas.norm.lst \
-max_num_tokens 500 \
-max_image_width 800 \
-max_image_height 800 \
-batch_size 5 \
-beam_size 5
date +"End - %a %b %e %H:%M:%S %Z %Y"
# Renames Log Files
export STAMP=$(date +"%Y%m%d_%H%M%S")
mv log.txt $LOGS/$LOGNAME_$STAMP.log
We submit this job using the above script on the HPC:
qsub script_test.pbs
The results (i.e., predicted LaTeX equations) will be in the RESULTS
directory specificed in the pbs script.