====== Tesseract OCR ======
  * https://github.com/tesseract-ocr/tesseract
  * [[pdf:ocrmypdf|ocrmypdf]]

===== Installation =====
==== Homebrew ====
  * Homebrew can install the most recent release.

<code sh>
brew install tesseract tesseract-lang
</code>

==== Ubuntu ====

<code sh>
sudo apt install tesseract-ocr-kor tesseract-ocr-kor-vert gscan2pdf
</code>

  * Used together with ''ocrmypdf'' or ''gscan2pdf'', Tesseract can run OCR on [[:pdf|PDF]] files and images.
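As a minimal sketch of the workflow above: the two helper functions below are hypothetical wrappers, and the example assumes ''ocrmypdf'' has been installed separately, since the ''apt'' command above does not include it. ''tesseract %%--list-langs%%'' and ''ocrmypdf -l'' are existing options of those tools.

```sh
# Print the languages Tesseract can find; "kor" should be listed
# once the Korean language data is installed.
check_langs() {
    tesseract --list-langs
}

# Add a Korean text layer to a scanned PDF.
# Arguments: input PDF, output PDF.
ocr_pdf_kor() {
    ocrmypdf -l kor "$1" "$2"
}

# Usage (scan.pdf and scan_ocr.pdf are hypothetical file names):
# check_langs
# ocr_pdf_kor scan.pdf scan_ocr.pdf
```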

===== Recognition data =====
  * [[https://github.com/tesseract-ocr/tessdata_fast|tessdata_fast]] : data for fast recognition. This appears to be what the [[linux:debian|Debian Linux]] / [[linux:ubuntu|Ubuntu Linux]] packages install by default.
  * [[https://github.com/tesseract-ocr/tessdata_best|tessdata_best]] : LSTM models. These seem to give better results.
  * Select the data directory with the ''%%--tessdata-dir%% <PATH>'' option or the ''TESSDATA_PREFIX'' environment variable.
  * Example using tessdata_best:

<code sh>
# Option 1 (commented out): clone the whole repository
# cd ~/.config
# git clone --recursive --depth=1 https://github.com/tesseract-ocr/tessdata_best.git

# Option 2: download only the models you need
mkdir -p ~/.config/tessdata_best

wget -O ~/.config/tessdata_best/kor.traineddata https://github.com/tesseract-ocr/tessdata_best/raw/main/kor.traineddata
wget -O ~/.config/tessdata_best/eng.traineddata https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata

# Point Tesseract at the downloaded models
export TESSDATA_PREFIX="$HOME/.config/tessdata_best"
</code>
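The other way to select the data, the ''%%--tessdata-dir%%'' option, can be sketched as follows. This is only an illustration: ''ocr_best_kor'' is a hypothetical helper name, and the directory is assumed to be the one created in the download step above.

```sh
# Directory holding the downloaded tessdata_best models (assumed location).
best_dir="$HOME/.config/tessdata_best"

# Recognize one image with the "best" Korean model, without relying on
# TESSDATA_PREFIX. Arguments: input image, output base name
# (Tesseract appends .txt to the base name).
ocr_best_kor() {
    tesseract "$1" "$2" --tessdata-dir "$best_dir" -l kor
}

# Usage (page.png is a hypothetical file name):
# ocr_best_kor page.png page   # writes page.txt
```

Passing the directory per call leaves the system-wide default data untouched, which may be convenient when comparing the fast and best models on the same image.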
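===== Recognition accuracy =====
  * Recognition accuracy for Korean (Hangul) is not very good.
  * Recognizing with a single language gives better accuracy than specifying several languages.
  * [[https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html|Improving the quality of the output]]
  * A resolution of 300 DPI or higher is recommended.
  * Crop unnecessary borders before recognition to improve accuracy.
  * Deskewing: straighten pages that were scanned at an angle before recognition to improve accuracy.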
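
The tips above could be combined roughly as in the sketch below. It assumes ImageMagick is installed and provides the ''convert'' command (ImageMagick 7 uses ''magick'' instead); ImageMagick is not mentioned in these notes and is only one possible tool. ''preprocess_and_ocr'' is a hypothetical helper, and the ''40%'' deskew threshold is the value commonly suggested in the ImageMagick documentation, not a tuned setting. Note that ''%%--dpi%%'' only tells Tesseract the resolution of the scan; it does not resample a low-resolution image.

```sh
# Deskew and trim a scan with ImageMagick, then run single-language OCR.
# Arguments: input image, output base name.
preprocess_and_ocr() {
    cleaned="${2}_clean.png"

    # -deskew straightens a tilted page, -trim crops the uniform border,
    # +repage resets the canvas geometry after trimming.
    convert "$1" -deskew 40% -trim +repage "$cleaned" || return 1

    # --dpi declares the scan resolution; -l kor restricts OCR to one language.
    tesseract "$cleaned" "$2" --dpi 300 -l kor
}

# Usage (scan.png is a hypothetical file name):
# preprocess_and_ocr scan.png scan   # writes scan_clean.png and scan.txt
```

On noisy scans ''-trim'' may leave part of the border in place, so the intermediate image is worth checking before trusting the result.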