

Reading and Classifying Scanned Documents with Deep Learning


Author | 小白


Source | 小白学视觉


Collecting Data


The first thing we need to do is create a simple dataset so that we can test each part of our workflow. Ideally, this dataset would contain scanned documents of varying legibility and from different time periods, along with the high-level topic each document belongs to. I could not find a dataset with these exact specifications, so I set about building my own. The high-level topics I settled on were government, letters, smoking, and patents, chosen largely at random because a wide variety of scanned documents is available for each of these areas.



From each of these sources I picked roughly 20 reasonably sized documents and placed them into separate folders defined by topic.




After nearly a full day of searching for and cataloging all the images, we resized them all to 600 x 800 and converted them to PNG format.




A simple resize-and-convert script is shown below:


import os
from PIL import Image

img_folder = r'F:\Data\Imagery\OCR'              # Folder containing topic folders (i.e. "News", "Letters", etc.)

for subfol in os.listdir(img_folder):            # For each of the topic folders
    sfpath = os.path.join(img_folder, subfol)
    for imgfile in os.listdir(sfpath):           # Get all images in the topic
        imgpath = os.path.join(sfpath, imgfile)
        img = Image.open(imgpath)                # Read in the image with Pillow
        img = img.resize((600, 800))             # Resize the image
        newip = imgpath[0:-4] + ".png"           # Swap the extension for .png
        img.save(newip)                          # Save as PNG


Building the OCR Pipeline


Optical character recognition (OCR) is the process of extracting text from images. It is usually done with machine learning models, most commonly pipelines built around convolutional neural networks. While we could train a custom OCR model for our application, it would require far more training data and compute. Instead, we will use the excellent Microsoft Computer Vision API, which includes a module dedicated to OCR. The API call takes an image (as a PIL image) and returns several pieces of information, including the location and orientation of the text on the image as well as the text itself. The following function takes a list of PIL images and outputs an equally sized list of extracted text:


import io
import json
import http.client
import urllib.parse

def image_to_text(imglist, ndocs=10):
    ''' Take in a list of PIL images and return a list of extracted text using OCR '''
    headers = {
        # Request headers
        'Content-Type': 'application/octet-stream',
        'Ocp-Apim-Subscription-Key': 'YOUR_KEY_HERE',
    }
    params = urllib.parse.urlencode({
        # Request parameters
        'language': 'en',
        'detectOrientation': 'true',
    })

    outtext = []
    docnum = 0
    for cropped_image in imglist:
        print("Processing document -- ", str(docnum))
        # Image must have both height and width > 50 px to run Computer Vision API
        ocr_image = cropped_image
        imgByteArr = io.BytesIO()
        ocr_image.save(imgByteArr, format='PNG')     # Serialize the image to PNG bytes
        imgByteArr = imgByteArr.getvalue()

        curr_text = []
        try:
            conn = http.client.HTTPSConnection('westus.api.cognitive.microsoft.com')
            conn.request("POST", "/vision/v1.0/ocr?%s" % params, imgByteArr, headers)
            response = conn.getresponse()
            data = json.loads(response.read().decode("utf-8"))
            # Collect every recognized word, region by region and line by line
            for r in data['regions']:
                for l in r['lines']:
                    for w in l['words']:
                        curr_text.append(str(w['text']))
            conn.close()
        except Exception as e:
            print("Could not process image:", e)

        outtext.append(' '.join(curr_text))
        docnum += 1

    return(outtext)
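As a quick sanity check, the function can be called on a few of the resized scans. This is only a minimal sketch: the folder path and the choice of three files below are illustrative, and it assumes a valid Computer Vision subscription key has been filled into the headers above.

import os
from PIL import Image

sample_folder = r'F:\Data\Imagery\OCR\Letters'            # Hypothetical topic folder from the collection step
sample_paths = [os.path.join(sample_folder, f)
                for f in os.listdir(sample_folder) if f.endswith('.png')][:3]
sample_images = [Image.open(p) for p in sample_paths]

extracted = image_to_text(sample_images)                  # One string of extracted text per image
for p, text in zip(sample_paths, extracted):
    print(os.path.basename(p), '->', text[:80])           # Peek at the first few words of each document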


后期處理


Since in some cases we may want to end our workflow here, rather than keeping the extracted text in memory as one giant list, we can also write the extracted text to individual .txt files with the same names as the original input files. Microsoft's OCR is good, but it occasionally makes mistakes. We can reduce some of these errors with the SpellChecker module. The following script accepts an input folder and an output folder, reads all the scanned documents in the input folder, runs them through our OCR script, runs a spell check to correct misspelled words, and finally writes the raw .txt files to the output directory.


'''
Read in a list of scanned images (as .png files > 50x50 px) and output a set of
.txt files containing the text content of these scans
'''

import os
from functions import preprocess, image_to_text
from PIL import Image
from spellchecker import SpellChecker

INPUT_FOLDER = r'F:\Data\Imagery\OCR2\Images'
OUTPUT_FOLDER = r'F:\Research\OCR\Outputs\AllDocuments'

## First, read in all the scanned document images into PIL images
scanned_docs_path = os.listdir(INPUT_FOLDER)
scanned_docs_path = [x for x in scanned_docs_path if x.endswith('.png')]
scanned_docs = [Image.open(os.path.join(INPUT_FOLDER, path)) for path in scanned_docs_path]

## Second, utilize Microsoft CV API to extract text from these images using OCR
scanned_docs_text = image_to_text(scanned_docs)

## Third, remove mis-spellings that might have occurred from bad OCR readings
spell = SpellChecker()
for i in range(len(scanned_docs_text)):
    clean = scanned_docs_text[i].split(" ")               # Work word by word
    misspelled = spell.unknown(clean)                     # Words not found in the dictionary
    for word in range(len(clean)):
        if clean[word] in misspelled:
            clean[word] = spell.correction(clean[word])   # Get the one `most likely` answer
    scanned_docs_text[i] = ' '.join(clean)

## Fourth, write the extracted text to individual .txt files with the same name as input files
for k in range(len(scanned_docs_text)):                   # For each scanned document
    text = scanned_docs_text[k]
    path = scanned_docs_path[k]                           # Get the corresponding input filename
    text_file_path = os.path.join(OUTPUT_FOLDER, path[:-4] + ".txt")   # Create the output text file
    text_file = open(text_file_path, "wt")
    n = text_file.write(text)                             # Write the text to the output text file
    text_file.close()

print("Done")
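To see what the spell-correction step does in isolation, here is a minimal sketch using the SpellChecker API on a made-up string; the garbled words are invented for illustration and the corrections are only the checker's best guesses.

from spellchecker import SpellChecker

spell = SpellChecker()
words = "the qick brown fox jumps over the lazy dog".split(" ")   # Hypothetical OCR output
misspelled = spell.unknown(words)                                 # Words not found in the dictionary
corrected = [spell.correction(w) if w in misspelled else w for w in words]
print(' '.join(corrected))                                        # Hopefully the corrected sentence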


Preparing the Text for Modeling


If our collection of scanned documents is large enough, writing them all into one big folder makes them hard to sort through, and there is probably already some implicit grouping among the documents. If we have a rough idea of how many different "types" or topics of documents we have, we can use topic modeling to help identify these automatically. This gives us the infrastructure to split the text recognized by OCR into separate folders based on document content; the topic model we will use is called LDA (Latent Dirichlet Allocation). To run this model we need to do some more preprocessing and organizing of our data, so to keep our scripts from getting long and crowded, we will assume the scanned documents have already been read and converted to .txt files using the workflow above. The topic model will then read in these .txt files, classify them into however many topics we specify, and place them into the appropriate folders.



We will start with a simple function that reads all of the output .txt files in a folder and places them into a list of tuples of (filename, text).


def read_and_return(foldername, fileext='.txt'):
    ''' Read all text files with fileext from foldername, and place them into a
        list of tuples as [(filename, text), ... , (filename, text)] '''
    allfiles = os.listdir(foldername)
    allfiles = [os.path.join(foldername, f) for f in allfiles if f.endswith(fileext)]
    alltext = []
    for filename in allfiles:
        with open(filename, 'r') as f:
            alltext.append((filename, f.read()))
    return(alltext)   # Returns list of tuples [(filename, text), ... (filename, text)]

Next, we need to make sure all useless words (those that do not help us distinguish the topic of a particular document) are removed. We will do this using three different methods:


  1. Removing stopwords


  2. Removing tags, punctuation, numbers, and multiple whitespaces


  3. TF-IDF filtering



To implement all of these (as well as our topic model), we will use the Gensim package. The script below runs the necessary preprocessing steps on a list of texts (the output of the function above) and trains an LDA model.


from gensim import corpora, models, similarities
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string

def preprocess(document):
    clean = remove_stopwords(document)
    clean = preprocess_string(clean)     # Strip tags, punctuation, numbers and multiple whitespaces
    return(clean)

def run_lda(textlist, num_topics=10, preprocess_docs=True):
    ''' Train and return an LDA model against a list of documents '''
    if preprocess_docs:
        doc_text = [preprocess(d) for d in textlist]
    else:
        doc_text = textlist
    dictionary = corpora.Dictionary(doc_text)
    corpus = [dictionary.doc2bow(text) for text in doc_text]
    tfidf = models.tfidfmodel.TfidfModel(corpus)
    transformed_tfidf = tfidf[corpus]
    lda = models.ldamulticore.LdaMulticore(transformed_tfidf, num_topics=num_topics, id2word=dictionary)
    return(lda, dictionary)
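As a sketch of how these pieces fit together (the folder path and topic count below are illustrative), the .txt files produced by the OCR step can be read with read_and_return and fed straight into run_lda:

texts = read_and_return(r'F:\Research\OCR\Outputs\AllDocuments')   # [(filename, text), ...]
lda, dictionary = run_lda([t[1] for t in texts], num_topics=4)

# Inspect the most heavily weighted words in each discovered topic
for topic_id, words in lda.show_topics(num_topics=4, formatted=False):
    print(topic_id, [w for w, weight in words])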


Classifying Documents with the Model


Once we have trained our LDA model, we can use it to classify our training set of documents (and any future documents that may come in) into topics and then place them into the appropriate folders.


Using a trained LDA model on a new string of text takes a bit of work; all of that complexity is wrapped up in the function below:


from gensim.utils import tokenize

def find_topic(textlist, dictionary, lda):
    '''
    https://stackoverflow.com/questions/16262016/how-to-predict-the-topic-of-a-new-query-using-a-trained-lda-model-using-gensim
    For each query (document in the test file), tokenize the query, create a feature vector
    just like how it was done while training, and create text_corpus
    '''
    text_corpus = []
    for query in textlist:
        temp_doc = list(tokenize(query.strip()))
        current_doc = []
        for word in range(len(temp_doc)):
            current_doc.append(temp_doc[word])
        text_corpus.append(current_doc)
    '''
    For each feature vector text, lda[doc_bow] gives the topic distribution, which can be
    sorted in descending order to print the very first topic
    '''
    tops = []
    for text in text_corpus:
        doc_bow = dictionary.doc2bow(text)
        topics = sorted(lda[doc_bow], key=lambda x: x[1], reverse=True)[0]
        tops.append(topics)
    return(tops)

Finally, we need one more helper that returns the actual name of a topic given its topic index:


def topic_label(ldamodel, topicnum):
    alltopics = ldamodel.show_topics(formatted=False)
    topic = alltopics[topicnum]
    topic = dict(topic[1])
    return(max(topic, key=lambda key: topic[key]))   # The highest-weighted word labels the topic
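Putting the two helpers together, classifying a single OCR'd document might look like the sketch below; texts, dictionary, and lda are assumed to come from the read_and_return and run_lda calls shown earlier.

doc_text = texts[0][1]                               # One document's extracted text
top = find_topic([doc_text], dictionary, lda)        # [(topic_id, probability)]
topic_id, prob = top[0]
print("Most likely topic:", topic_id, "->", topic_label(lda, topic_id))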


Now we can glue all of the functions written above into a single script that accepts an input folder, an output folder, and a topic count. The script will read all the scanned document images in the input folder, write them to .txt files, build an LDA model to find high-level topics in the documents, and sort the output .txt files into folders according to document topic.


##################################################################
# This script takes in an input folder of scanned documents     #
# and reads these documents, separates them into topics         #
# and outputs raw .txt files into the output folder, separated  #
# by topic                                                       #
##################################################################
import os
import io
import json
import shutil
import http.client
import urllib.request, urllib.parse, urllib.error
import requests
import tqdm
from PIL import Image
from spellchecker import SpellChecker
from gensim import corpora, models, similarities
from gensim.utils import tokenize
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string

def filter_for_english(text):
    ''' Keep only tokens that appear in a list of common English words '''
    dict_url = 'https://raw.githubusercontent.com/first20hours/' \
               'google-10000-english/master/20k.txt'
    dict_words = set(requests.get(dict_url).text.splitlines())
    english_words = tokenize(text)
    english_words = [w for w in english_words if w in dict_words]
    english_words = [w for w in english_words if (len(w) > 1 or w.lower() == 'i')]
    return(' '.join(english_words))


def preprocess(document):
    clean = filter_for_english(document)     # Remove non-english words
    clean = remove_stopwords(clean)
    clean = preprocess_string(clean)
    return(clean)

def read_and_return(foldername, fileext='.txt', delete_after_read=False):
    allfiles = os.listdir(foldername)
    allfiles = [os.path.join(foldername, f) for f in allfiles if f.endswith(fileext)]
    alltext = []
    for filename in allfiles:
        with open(filename, 'r') as f:
            alltext.append((filename, f.read()))
        if delete_after_read:
            os.remove(filename)
    return(alltext)   # Returns list of tuples [(filename, text), ... (filename, text)]

def image_to_text(imglist, ndocs=10):
    ''' Take in a list of PIL images and return a list of extracted text '''
    headers = {
        # Request headers
        'Content-Type': 'application/octet-stream',
        'Ocp-Apim-Subscription-Key': 'YOUR_KEY_HERE',
    }
    params = urllib.parse.urlencode({
        # Request parameters
        'language': 'en',
        'detectOrientation': 'true',
    })

    outtext = []
    docnum = 0
    for cropped_image in tqdm.tqdm(imglist, total=len(imglist)):
        # Image must have both height and width > 50 px to run Computer Vision API
        ocr_image = cropped_image
        imgByteArr = io.BytesIO()
        ocr_image.save(imgByteArr, format='PNG')
        imgByteArr = imgByteArr.getvalue()

        curr_text = []
        try:
            conn = http.client.HTTPSConnection('westus.api.cognitive.microsoft.com')
            conn.request("POST", "/vision/v1.0/ocr?%s" % params, imgByteArr, headers)
            response = conn.getresponse()
            data = json.loads(response.read().decode("utf-8"))
            for r in data['regions']:
                for l in r['lines']:
                    for w in l['words']:
                        curr_text.append(str(w['text']))
            conn.close()
        except Exception as e:
            print("Could not process image:", e)

        outtext.append(' '.join(curr_text))
        docnum += 1

    return(outtext)


def run_lda(textlist, num_topics=10, return_model=False, preprocess_docs=True):
    ''' Train and return an LDA model against a list of documents '''
    if preprocess_docs:
        doc_text = [preprocess(d) for d in textlist]
    else:
        doc_text = textlist
    dictionary = corpora.Dictionary(doc_text)
    corpus = [dictionary.doc2bow(text) for text in doc_text]
    tfidf = models.tfidfmodel.TfidfModel(corpus)
    transformed_tfidf = tfidf[corpus]
    lda = models.ldamulticore.LdaMulticore(transformed_tfidf, num_topics=num_topics, id2word=dictionary)
    input_doc_topics = lda.get_document_topics(corpus)   # Topic distribution of the training documents
    return(lda, dictionary)


def find_topic(textlist, dictionary, lda):
    '''
    https://stackoverflow.com/questions/16262016/how-to-predict-the-topic-of-a-new-query-using-a-trained-lda-model-using-gensim
    For each query (document in the test file), tokenize the query, create a feature vector
    just like how it was done while training, and create text_corpus
    '''
    text_corpus = []
    for query in textlist:
        temp_doc = list(tokenize(query.strip()))
        current_doc = []
        for word in range(len(temp_doc)):
            current_doc.append(temp_doc[word])
        text_corpus.append(current_doc)
    '''
    For each feature vector text, lda[doc_bow] gives the topic distribution, which can be
    sorted in descending order to print the very first topic
    '''
    tops = []
    for text in text_corpus:
        doc_bow = dictionary.doc2bow(text)
        topics = sorted(lda[doc_bow], key=lambda x: x[1], reverse=True)[0]
        tops.append(topics)
    return(tops)

def topic_label(ldamodel, topicnum):
    alltopics = ldamodel.show_topics(formatted=False)
    topic = alltopics[topicnum]
    topic = dict(topic[1])
    return(max(topic, key=lambda key: topic[key]))

INPUT_FOLDER = r'F:/Research/OCR/Outputs/AllDocuments'
OUTPUT_FOLDER = r'F:/Research/OCR/Outputs/AllDocumentsByTopic'
TOPICS = 4
if __name__ == '__main__':

    print("Reading scanned documents")
    ## First, read in all the scanned document images into PIL images
    scanned_docs_path = os.listdir(INPUT_FOLDER)
    scanned_docs_path = [os.path.join(INPUT_FOLDER, p) for p in scanned_docs_path if p.endswith('.png')]
    scanned_docs = [Image.open(x) for x in scanned_docs_path]

    ## Second, utilize Microsoft CV API to extract text from these images using OCR
    scanned_docs_text = image_to_text(scanned_docs)

    print("Post-processing extracted text")
    ## Third, remove mis-spellings that might have occurred from bad OCR readings
    spell = SpellChecker()
    for i in range(len(scanned_docs_text)):
        clean = scanned_docs_text[i].split(" ")               # Work word by word
        misspelled = spell.unknown(clean)                     # Words not found in the dictionary
        for word in range(len(clean)):
            if clean[word] in misspelled:
                clean[word] = spell.correction(clean[word])   # Get the one `most likely` answer
        scanned_docs_text[i] = ' '.join(clean)

    print("Writing read text into files")
    ## Fourth, write the extracted text to individual .txt files with the same name as input files
    for k in range(len(scanned_docs_text)):                   # For each scanned document
        text = scanned_docs_text[k]
        text = filter_for_english(text)
        path = os.path.basename(scanned_docs_path[k])         # Get the corresponding input filename
        text_file_path = os.path.join(OUTPUT_FOLDER, path[0:-4] + ".txt")   # Create the output text file
        text_file = open(text_file_path, "wt")
        n = text_file.write(text)                             # Write the text to the output text file
        text_file.close()

    # First, read all the output .txt files
    print("Reading files")
    texts = read_and_return(OUTPUT_FOLDER)

    print("Building LDA topic model")
    # Second, train the LDA model (pre-processing is internally done)
    print("Preprocessing Text")
    textlist = [t[1] for t in texts]
    ldamodel, dictionary = run_lda(textlist, num_topics=TOPICS)

    # Third, extract the top topic for each document
    print("Extracting Topics")
    topics = []
    for t in texts:
        topics.append((t[0], find_topic([t[1]], dictionary, ldamodel)))

    # Convert topic indices to topic names
    for i in range(len(topics)):
        topnum = topics[i][1][0][0]
        topics[i][1][0] = topic_label(ldamodel, topnum)
    # topics is now [(filename, [topic_name]), ..., (filename, [topic_name])]

    # Create folders for the topics
    print("Copying Documents into Topic Folders")
    foundtopics = []
    for t in topics:
        foundtopics += t[1]
    foundtopics = set(foundtopics)
    topicfolders = set(os.path.join(OUTPUT_FOLDER, f) for f in foundtopics)
    [os.makedirs(m, exist_ok=True) for m in topicfolders]

    # Copy files into appropriate topic folders
    for t in topics:
        filename, topic = t
        src = filename
        dest = os.path.join(OUTPUT_FOLDER, topic[0], os.path.basename(filename))
        shutil.copyfile(src, dest)
        os.remove(src)

    print("Done")


The code for this article is available on GitHub.



