[KR_OCR] Annotation for Data Collection

new_challenge 2019. 4. 22. 23:26

Annotation for data collection

After crawling the images, each one needs text boxes and labels added.

Collecting image data

Download the tourism image dataset from AI Hub

http://www.aihub.or.kr/

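>> From the link above, take the storefront dataset within the tourism dataset.

>> Of those files, I used the storefront sets from both the directly photographed and the crawled collections.

>> The annotations shipped with those files describe the store rather than its signboard, so the images have to be annotated again.

Getting the automatic crawler from Git (git clone)

- Downloads the images returned when searching for signboards on Google and Naver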

https://github.com/qpark99/AutoCrawler

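>> Clone the GitHub repository above and use its Google and Naver image crawler.

>> The search terms were 간판 (signboard) and 음식점 간판 (restaurant signboard), and roughly 1,700 images were crawled for each.

>> The crawled images are saved in the download folder.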
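
To confirm the rough per-keyword counts mentioned above, the download folder can be tallied with a few lines of Python. This is only a sketch: it assumes the crawler writes one subfolder per search term under download, which the post does not state, and count_images and IMAGE_EXTS are hypothetical names introduced here.

```python
import os

# Extensions treated as images in this sketch (an assumption, adjust as needed).
IMAGE_EXTS = {'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.webp'}

def count_images(root):
    """Return {subfolder name: number of image files} for each folder under root."""
    counts = {}
    for entry in sorted(os.listdir(root)):
        folder = os.path.join(root, entry)
        if not os.path.isdir(folder):
            continue  # ignore loose files directly under root
        counts[entry] = sum(
            1 for name in os.listdir(folder)
            if os.path.splitext(name)[1].lower() in IMAGE_EXTS
        )
    return counts
```

Calling count_images('download') after a crawl would then give one count per keyword folder.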

Creating the data annotations

Download the annotation program

http://www.robots.ox.ac.uk/~vgg/software/via/

VGG Image Annotator (VIA)

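>> An image annotator provided by the University of Oxford.

>> Download version 2.0 from the link above.

>> Unzipping the download gives an HTML file for creating annotations.

>> Annotation is done in via.html.

>> Load the images to annotate, then draw a text box around each piece of text and enter its label.

>> When annotation is finished, save the result with export annotations (JSON).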
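
For orientation, the exported JSON is a dictionary keyed by image, where each entry carries a filename and a list of regions. The sketch below reads rectangular boxes and their labels out of such a structure. The sample entry is made up for illustration, the attribute name annotation follows the preprocessing code later in this post, and iter_boxes is a hypothetical helper.

```python
import json

# Hypothetical export containing one image with one rectangular region.
SAMPLE = '''
{
  "sign01.jpg12345": {
    "filename": "sign01.jpg",
    "size": 12345,
    "regions": [
      {
        "shape_attributes": {"name": "rect", "x": 40, "y": 25, "width": 200, "height": 60},
        "region_attributes": {"annotation": "간판"}
      }
    ],
    "file_attributes": {}
  }
}
'''

def iter_boxes(via):
    """Yield (filename, (x, y, width, height), label) for every rectangular region."""
    for entry in via.values():
        for region in entry['regions']:
            shape = region['shape_attributes']
            if shape.get('name') != 'rect':
                continue  # polygons and other shapes are skipped in this sketch
            box = (shape['x'], shape['y'], shape['width'], shape['height'])
            label = region['region_attributes'].get('annotation', '')
            yield entry['filename'], box, label

boxes = list(iter_boxes(json.loads(SAMPLE)))
print(boxes)
```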

Preprocessing after annotation

Loading Library

import json
import os

Open the annotation file

with open('data/korean_data/annotation.json',encoding='utf-8-sig') as f:
    data = json.load(f)

>> The encoding is specified so that a file containing Korean text opens correctly.
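
One reason to prefer utf-8-sig over plain utf-8 when reading is that it also tolerates a byte-order mark at the start of the file. Whether a given export actually contains one is not stated in this post, so the sketch below writes its own small file with a BOM, in a temporary location, purely to show the difference.

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'annotation.json')

# Write a small JSON file with a UTF-8 byte-order mark in front.
with open(path, 'w', encoding='utf-8-sig') as f:
    f.write('{"label": "간판"}')

# Plain utf-8 leaves the BOM in the decoded text, and json.load rejects it.
try:
    with open(path, encoding='utf-8') as f:
        json.load(f)
    plain_utf8_ok = True
except json.JSONDecodeError:
    plain_utf8_ok = False

# utf-8-sig strips the BOM when one is present and behaves like utf-8 otherwise.
with open(path, encoding='utf-8-sig') as f:
    data = json.load(f)

print(plain_utf8_ok, data)
```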

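Preprocessing the JSON file

# Work on a shallow copy so entries can be removed while iterating over data
data2 = data.copy()

for k in data.keys():
    if len(data[k]['regions']) == 0:
        print(k)
        data2.pop(k)

- The box values are stored under regions in the JSON, so any entry whose regions is empty is removed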

for value in data.values():
    d = value['regions']
    # Drop duplicate regions, keeping the last occurrence of each
    d_uniq = [i for n, i in enumerate(d) if i not in d[n + 1:]]
    # Keep only regions that have a non-empty 'annotation' label
    value['regions'] = [i for i in d_uniq if i['region_attributes'].get('annotation', '') != '']

- Regions in the JSON that were left without an annotation value are removed
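
The two cleanup steps above can also be combined into a single pass. The sketch below is one possible arrangement, not the code used in this post: it filters the regions first and then drops any image left with none, so an image whose only regions were unlabeled is removed as well. clean_annotations and the sample data are hypothetical.

```python
def clean_annotations(data, key='annotation'):
    """Return a new dict with duplicate and unlabeled regions removed,
    and with images that end up with no regions dropped."""
    cleaned = {}
    for image_id, entry in data.items():
        kept = []
        for region in entry['regions']:
            label = region.get('region_attributes', {}).get(key, '')
            if label != '' and region not in kept:
                kept.append(region)
        if kept:
            # Copy the entry so the input dict is left unchanged
            cleaned[image_id] = dict(entry, regions=kept)
    return cleaned

shape = {'name': 'rect', 'x': 1, 'y': 2, 'width': 3, 'height': 4}
box = {'shape_attributes': shape, 'region_attributes': {'annotation': '간판'}}
empty_label = {'shape_attributes': shape, 'region_attributes': {'annotation': ''}}
no_attrs = {'shape_attributes': shape, 'region_attributes': {}}

sample = {
    'a.jpg100': {'filename': 'a.jpg', 'regions': [box, dict(box), empty_label]},
    'b.jpg200': {'filename': 'b.jpg', 'regions': [no_attrs]},
    'c.jpg300': {'filename': 'c.jpg', 'regions': []},
}

result = clean_annotations(sample)
print(list(result))
print(len(result['a.jpg100']['regions']))
```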

Saving the JSON file again

with open('data/korean_data/annotation.json','w', encoding='utf-8-sig') as file:
    file.write(json.dumps(data2, ensure_ascii=False))

- When saving again, pass ensure_ascii=False to json.dumps so that the Korean text is written as-is rather than as \uXXXX escapes.
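
The effect of ensure_ascii is easiest to see side by side. In this small sketch the record is made up for illustration. Both outputs decode back to the same object, so the setting only affects how readable the saved file is.

```python
import json

record = {'filename': 'sign01.jpg', 'annotation': '간판'}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps the Hangul characters

print(escaped)
print(readable)

# Both forms load back to the same Python object.
same = json.loads(escaped) == json.loads(readable) == record
print(same)
```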

Reloading the newly saved JSON

with open('data/korean_data/annotation.json','r', encoding='utf-8-sig') as file:
    json_file = json.load(file)

Checking the annotated file names

json_name = []
for i in json_file.keys():
    json_name.append(json_file[i]['filename'])

- Collect the filename values found in the annotation file into a list

Getting the list of files in the image folder

path_dir = 'data/korean_data/'
file_list = os.listdir(path_dir)

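- os.listdir(path): returns the names of the entries in the given directory

Comparing the number of files in the JSON and in the folder

# Number of files currently in the folder
print(len(file_list))
# Number of images actually listed in the JSON file
print(len(json_name))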
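
The two counts only show that there is a mismatch, not where it is. A set comparison, sketched below with made-up names, lists which files lack annotations and which annotated names are missing from the folder. compare_names is a hypothetical helper.

```python
def compare_names(folder_names, json_names):
    """Report names that appear on only one side."""
    folder_set, json_set = set(folder_names), set(json_names)
    return {
        'unannotated': sorted(folder_set - json_set),      # in the folder, not in the JSON
        'missing_on_disk': sorted(json_set - folder_set),  # in the JSON, not in the folder
    }

report = compare_names(['a.jpg', 'b.jpg', 'annotation.json'], ['a.jpg', 'c.jpg'])
print(report)
```

Note that a non-image file stored alongside the images, such as annotation.json here, shows up as unannotated, which matters for the deletion step.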

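Images without annotations are deleted.

for name in file_list:  # loop over every file in the folder
    # Keep annotated images, and keep annotation.json itself, which lives in the same folder
    if name in json_name or name == 'annotation.json':
        continue
    r_image = os.path.join(path_dir, name)
    os.remove(r_image)
    print(r_image, '>> deleted')

- During annotation, images that contain no text are left unannotated

- Images without annotations are deleted from the folder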
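
Because os.remove cannot be undone, it can help to preview the deletions first. The following is a cautious sketch rather than the procedure used in this post: find_unannotated, delete_files, and IMAGE_EXTS are hypothetical names, and only files with an image extension are ever considered, so the annotation file itself is left alone.

```python
import os

# Extensions treated as images in this sketch (an assumption, adjust as needed).
IMAGE_EXTS = {'.jpg', '.jpeg', '.png', '.gif', '.bmp'}

def find_unannotated(folder, annotated_names):
    """Return image files in folder whose names are not in annotated_names."""
    annotated = set(annotated_names)
    return sorted(
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
        and os.path.splitext(name)[1].lower() in IMAGE_EXTS
        and name not in annotated
    )

def delete_files(folder, names, dry_run=True):
    """Print what would be deleted; remove the files only when dry_run is False."""
    for name in names:
        path = os.path.join(folder, name)
        if dry_run:
            print('would delete', path)
        else:
            os.remove(path)
            print('deleted', path)
```

A typical use would be to call delete_files(path_dir, find_unannotated(path_dir, json_name)) once to inspect the list, then repeat it with dry_run=False.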

Checking the final remaining files

# Number of files left at the end
final_file_list = os.listdir(path_dir)
print(len(final_file_list))