Word2Vec dictionary for 30million Danish newspaper pages

dc.contributor.author: Egense, Thomas
dc.date.accessioned: 2018-07-03T10:09:06Z
dc.date.available: 2018-07-03T10:09:06Z
dc.description.abstract: About 30 million Danish newspaper pages from 1880 to 2005 that have been digitized in the mediestream.dk project. Over 98% of the pages are in Danish, but a few other languages are present in the corpus as well, including German, English and Icelandic. The word2vec algorithm was used on the corpus to improve search in the newspapers by identifying OCR misspellings of words. See: https://sbdevel.wordpress.com/2017/02/02/automated-improvement-of-search-in-low-quality-ocr-using-word2vec/ Corpus size: 270 GB of text split into 2.8 billion sentences with all punctuation removed. Building the dictionary with Word2Vec takes about 2 months of CPU time. Word2Vec parameters: Software implementation: Google; Model: Skip-Gram; Word window size: 5; Iterations: 10; Minimum word frequency: 100; Dimensions: 300; Output format: text. Word2vec dictionary file: 2.4 million different words, most of which are OCR errors. Requirements: Opening the dictionary in a word2vec implementation will require 16 GB of memory. (en_US)
dc.identifier.uri: https://loar.kb.dk/handle/1902/329
dc.identifier.uri: http://dx.doi.org/10.21994/loar159
dc.language.iso: dk (en_US)
dc.relation.isreferencedby: https://sbdevel.wordpress.com/2017/02/02/automated-improvement-of-search-in-low-quality-ocr-using-word2vec/ (en_US)
dc.rights: CC Public Domain (*)
dc.rights.uri: https://creativecommons.org/publicdomain/mark/1.0/deed.en (*)
dc.subject: word2vec (en_US)
dc.subject: machine learning (en_US)
dc.subject: NLP (en_US)
dc.title: Word2Vec dictionary for 30million Danish newspaper pages (en_US)
dc.type: Learning Object (en_US)
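The abstract gives a vocabulary of about 2.4 million words at 300 dimensions and a 16 GB memory requirement. As a rough sanity check, the sketch below computes only the raw size of the vector matrix. It assumes 4-byte (float32) or 8-byte (float64) values, which the record does not specify, and the function name is illustrative.

```python
def raw_matrix_bytes(vocab_size: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the bare vector matrix, with no loader overhead."""
    return vocab_size * dimensions * bytes_per_value


VOCAB_SIZE = 2_400_000  # "2.4 million different words" (approximate)
DIMENSIONS = 300        # "Dimensions: 300"

# Assumption: the loader stores values as float32 (4 bytes) or float64 (8 bytes).
for label, width in (("float32", 4), ("float64", 8)):
    size = raw_matrix_bytes(VOCAB_SIZE, DIMENSIONS, width)
    print(f"{label}: {size / 1e9:.2f} GB")
```

Either figure is a lower bound that sits well under the stated 16 GB. The rest presumably goes to vocabulary structures and parsing overhead, which depend on the implementation and are not detailed in the record.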
Files
Original bundle
Name: danish_newspapers_description.txt
Size: 1004 B
Format: Plain Text
Description: description
Name: danish_newspapers_1880To2013.txt
Size: 6.4 GB
Format: Plain Text
Description: Word2Vec dictionary file in text format for 30million Danish newspaper pages
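The dictionary is distributed in text format. Below is a minimal sketch of reading such a file and looking up the nearest neighbours of a word, which is the basic operation behind finding OCR variants. It assumes the usual layout written by Google's word2vec tool: a header line with vocabulary size and dimension count, then one word per line followed by its vector values. The function names, sample words and sample vectors are made up for illustration and are not taken from the actual file.

```python
import io
import math
from typing import Dict, Iterator, List, TextIO, Tuple


def read_vectors(stream: TextIO) -> Iterator[Tuple[str, List[float]]]:
    """Yield (word, vector) pairs from a word2vec text-format stream."""
    header = stream.readline().split()
    dims = int(header[1])  # header is "<vocab_size> <dimensions>"
    for line in stream:
        parts = line.split()
        if len(parts) != dims + 1:
            continue  # skip lines that do not match the declared dimension
        yield parts[0], [float(x) for x in parts[1:]]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query: str, vectors: Dict[str, List[float]], top_n: int = 3) -> List[str]:
    """Return the top_n words most similar to query, excluding query itself."""
    target = vectors[query]
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word != query]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]


# Hypothetical 2-dimensional sample standing in for the real 300-dimensional file.
SAMPLE = "3 2\navis 1.0 0.0 \navls 0.9 0.1 \nhest 0.0 1.0 \n"

vectors = dict(read_vectors(io.StringIO(SAMPLE)))
print(nearest("avis", vectors, top_n=1))
```

Holding all vectors in a plain Python dictionary is only practical for small samples like this one. For the full 6.4 GB file, a dedicated word2vec implementation is the more realistic choice.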
License bundle
Name: license.txt
Size: 4.47 KB
Format: Item-specific license agreed upon to submission