Word2Vec dictionary for 30 million Danish newspaper pages
About 30 million Danish newspaper pages from 1880 to 2005 that have been digitized in the mediestream.dk project. Over 98% of the pages are in Danish, but a few other languages, including German, English, and Icelandic, are present in the corpus as well. The word2vec algorithm was applied to the corpus to improve search in the newspapers by identifying OCR misspellings of words. See: https://sbdevel.wordpress.com/2017/02/02/automated-improvement-of-search-in-low-quality-ocr-using-word2vec/

Corpus size: 270 GB of text split into 2.8 billion sentences with all punctuation removed. Building the dictionary with word2vec takes about two months of CPU time.

Word2Vec parameters:
Software implementation: Google
Model: Skip-Gram
Word window size: 5
Iterations: 10
Minimum word frequency: 100
Dimensions: 300
Output format: text

Word2vec dictionary file: 2.4 million different words, most of them OCR errors.

Requirements: Opening the dictionary in a word2vec implementation requires 16 GB of memory.
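To illustrate how such a dictionary surfaces OCR misspellings: words that are OCR variants of each other occur in the same contexts, so word2vec places them close together in vector space, and a nearest-neighbour lookup by cosine similarity retrieves them. The following is a minimal self-contained sketch of that lookup; the three-dimensional toy vectors and the example words are made up for illustration (the real dictionary has 300-dimensional vectors, one per line in word2vec text format: "word v1 v2 ... v300").

```python
import math

# Toy stand-ins for entries from the dictionary. The vectors below are
# invented for illustration only; real entries are 300-dimensional.
vectors = {
    "avis": [0.90, 0.10, 0.20],  # Danish for "newspaper"
    "avls": [0.88, 0.12, 0.20],  # plausible OCR error: "i" misread as "l"
    "hest": [0.10, 0.90, 0.30],  # "horse" -- an unrelated word
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similar(word, topn=2):
    """Rank all other dictionary words by cosine similarity to `word`."""
    ranked = sorted(
        ((other, cosine(vectors[word], vectors[other]))
         for other in vectors if other != word),
        key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]

# OCR variants rank above unrelated words, so the misspelling "avls"
# comes back as the nearest neighbour of "avis".
print(similar("avis"))
```

In practice one would load the full text-format dictionary into a word2vec implementation (which, as noted above, needs about 16 GB of memory) and run the same kind of nearest-neighbour query there.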
word2vec, machine learning, NLP