Machine learning

Datasets for machine learning algorithms

News

Two word2vec dictionaries added, built from the following corpora: 1) 65,000 Gutenberg e-books 2) 32 million Danish newspaper pages

Recent Submissions

  • Item
    Word2Vec dictionary for 30 million Danish newspaper pages
    Egense, Thomas
    About 30 million Danish newspaper pages from 1880 to 2005 that have been digitized in the mediestream.dk project. Over 98% of the pages are in Danish, but a few other languages, including German, English, and Icelandic, are present in the corpus as well. The word2vec algorithm was used on the corpus to improve search in the newspapers by identifying OCR misspellings of words. See: https://sbdevel.wordpress.com/2017/02/02/automated-improvement-of-search-in-low-quality-ocr-using-word2vec/
    Corpus size: 270GB of text split into 2.8 billion sentences with all punctuation removed. Word2Vec takes about 2 months of CPU time to build the dictionary.
    Word2Vec parameters:
      Software implementation: Google
      Model: Skip-Gram
      Word window size: 5
      Iterations: 10
      Minimum word frequency: 100
      Dimensions: 300
      Output format: text
    Word2vec dictionary file: 2.4 million different words, most of them OCR errors.
    Requirements: opening the dictionary in a word2vec implementation will require 16GB of memory.
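Since the dictionary is distributed in word2vec's text output format (a header line with vocabulary size and dimensions, then one word per line followed by its vector), it can be read without any special library. The sketch below, using only the Python standard library, shows how one might load the file and use cosine similarity to surface likely OCR variants of a word; the file name is a placeholder, not the actual distribution name.

```python
import math

def load_word2vec_text(path, limit=None):
    """Load a word2vec dictionary in text format: a header line
    "vocab_size dimensions", then one word per line followed by its
    vector components. `limit` caps the number of entries read, which
    helps when the full 2.4M-word file would not fit in memory."""
    vectors = {}
    with open(path, encoding="utf-8", errors="replace") as f:
        vocab_size, dims = map(int, f.readline().split())
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(" ")
            word, vec = parts[0], [float(x) for x in parts[1:]]
            if len(vec) == dims:  # skip malformed lines
                vectors[word] = vec
    return vectors

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(vectors, word, k=10):
    """The k words closest to `word` - for this corpus, typically
    OCR misspellings of it."""
    target = vectors[word]
    scored = ((cosine(target, v), w) for w, v in vectors.items() if w != word)
    return sorted(scored, reverse=True)[:k]
```

A production setup would use an optimized loader (e.g. gensim's `KeyedVectors.load_word2vec_format`), but the format itself is this simple.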
  • Item
    Word2Vec dictionary for 65,000 Gutenberg E-books
    Egense, Thomas
    Description: 55,000 e-books from Project Gutenberg (http://www.gutenberg.org/). About 35,000 books are English, but over 50 different languages are represented. The word2vec algorithm does a good job of separating the different languages, so it is almost as if it were 50 different word2vec dictionaries.
    Corpus size: 30GB of text split into 230 million sentences with all punctuation removed. Word2Vec takes about 1.5 weeks of CPU time to build the dictionary.
    Word2Vec parameters:
      Software implementation: Google
      Model: Skip-Gram
      Word window size: 5
      Iterations: 10
      Minimum word frequency: 100
      Dimensions: 300
      Output format: text
    Word2vec dictionary file: 1.4 million different words from over 50 different languages.
    Requirements: opening the dictionary in a word2vec implementation will require 16GB of memory.
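A quick back-of-envelope calculation shows why the memory requirement is far larger than the raw vector data: 1.4 million words at 300 dimensions of 4-byte floats is under 2 GiB, so the stated 16GB mostly covers parsing the text format and the implementation's own vocabulary and index structures (the multiplier varies by implementation; this is an estimate, not a spec).

```python
# Raw size of the vector data alone for the Gutenberg dictionary:
# 1.4 million words x 300 dimensions x 4-byte floats.
VOCAB = 1_400_000
DIMS = 300
BYTES_PER_FLOAT = 4

raw_gib = VOCAB * DIMS * BYTES_PER_FLOAT / 2**30
print(f"raw vectors alone: {raw_gib:.2f} GiB")  # ~1.56 GiB
```

The gap between ~1.6 GiB of raw floats and 16GB of working memory is normal for text-format loaders, which hold both the parsed strings and the float arrays during loading.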