Word2Vec dictionary for 30 million Danish newspaper pages

Egense, Thomas
About 30 million Danish newspaper pages from 1880 to 2005 have been digitized in the mediestream.dk project. Over 98% of the pages are in Danish, but a few other languages are present in the corpus as well, including German, English and Icelandic. The word2vec algorithm was run on the corpus to improve search in the newspapers by identifying OCR misspellings of words. See: https://sbdevel.wordpress.com/2017/02/02/automated-improvement-of-search-in-low-quality-ocr-using-word2vec/

Corpus size: 270 GB of text split into 2.8 billion sentences, with all punctuation removed. Building the dictionary with Word2Vec takes about 2 months of CPU time.

Word2Vec parameters:
Software implementation: Google
Model: Skip-Gram
Word window size: 5
Iterations: 10
Minimum word frequency: 100
Dimensions: 300
Output format: text

Word2vec dictionary file: 2.4 million different words, most of which are OCR errors.

Requirements: Opening the dictionary in a word2vec implementation requires 16 GB of memory.
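Because the dictionary is published in text format, it can in principle be explored without a full word2vec installation. The following is a minimal sketch, not the project's own tooling: it assumes the usual word2vec text layout (a header line with word count and dimensions, then one word and its vector per line), and the function names, the `limit` parameter and the tiny three-dimensional sample are hypothetical, made up for illustration. It ranks the nearest neighbours of a word by cosine similarity, which is one plausible way to surface OCR variants of that word.

```python
import io
import math
from typing import Dict, List, TextIO, Tuple


def load_text_vectors(handle: TextIO, limit: int = 0) -> Dict[str, List[float]]:
    """Parse word2vec text format: a 'count dims' header, then 'word v1 ... vN' lines."""
    header = handle.readline().split()
    dims = int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dims + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit and len(vectors) >= limit:
            break  # optional cap, useful when memory is tight
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(vectors: Dict[str, List[float]], word: str,
            top_n: int = 5) -> List[Tuple[str, float]]:
    """Return the top_n words most similar to `word`, best match first."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy data standing in for the real 300-dimensional file.
SAMPLE = """4 3
avis 0.9 0.1 0.0
avls 0.88 0.12 0.01
aviz 0.85 0.15 0.02
hest 0.0 0.2 0.9
"""

vectors = load_text_vectors(io.StringIO(SAMPLE))
neighbours = nearest(vectors, "avis", top_n=2)
for word, score in neighbours:
    print(word, round(score, 4))
```

For the real file, the `io.StringIO` handle would be replaced by an open file handle. A brute-force scan over 2.4 million words is slow in pure Python, so a dedicated word2vec library is likely the better choice for anything beyond a quick inspection.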
word2vec, machine learning, NLP