Machine learning
Datasets for machine learning algorithms
News
Two word2vec dictionaries added, built from the following corpora:
1) 65,000 Project Gutenberg e-books
2) 32 million Danish newspaper pages
Recent Submissions
- Item: Word2Vec dictionary for 30 million Danish newspaper pages
Egense, Thomas
About 30 million Danish newspaper pages from 1880 to 2005, digitized in the mediestream.dk project. Over 98% of the pages are in Danish, but a few other languages, including German, English, and Icelandic, are present in the corpus as well. The word2vec algorithm was run on the corpus to improve search in the newspapers by identifying OCR misspellings of words. See: https://sbdevel.wordpress.com/2017/02/02/automated-improvement-of-search-in-low-quality-ocr-using-word2vec/
Corpus size: 270 GB of text, split into 2.8 billion sentences with all punctuation removed. Building the dictionary took about 2 months of CPU time.
Word2Vec parameters:
Software implementation: Google
Model: Skip-Gram
Word window size: 5
Iterations: 10
Minimum word frequency: 100
Dimensions: 300
Output format: text
Word2vec dictionary file: 2.4 million distinct words, most of them OCR errors.
Requirements: Opening the dictionary in a word2vec implementation will require 16 GB of memory.
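The OCR-misspelling use case above relies on the fact that OCR variants of a word end up close to it in the vector space. Below is a minimal sketch of how a dictionary in word2vec text format (a "vocab_size dims" header line, then one word and its vector per line) can be loaded and queried for nearest neighbors. The sample words and 3-dimensional vectors are invented purely for illustration; the real dictionaries use 300 dimensions.

```python
import io
import math

def load_vectors(fh):
    """Parse word2vec text format: 'vocab_size dims' header, then 'word v1 ... vN' lines."""
    vocab_size, dims = map(int, fh.readline().split())
    vecs = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        vecs[parts[0]] = [float(x) for x in parts[1:]]
    return vecs

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def most_similar(vecs, word, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    target = vecs[word]
    scored = [(cosine(target, v), w) for w, v in vecs.items() if w != word]
    return [w for _, w in sorted(scored, reverse=True)[:topn]]

# Tiny hand-made file in the same format (invented data, not from the real dictionary).
sample = io.StringIO(
    "4 3\n"
    "huset 0.9 0.1 0.0\n"
    "hnset 0.88 0.12 0.01\n"  # hypothetical OCR variant of "huset"
    "avis 0.0 0.9 0.4\n"
    "bog 0.1 0.2 0.9\n"
)
vecs = load_vectors(sample)
print(most_similar(vecs, "huset", topn=1))  # → ['hnset']: the OCR variant ranks closest
```

In practice one would load the full dictionary with a library such as gensim rather than hand-rolled parsing, but the format and the nearest-neighbor query are exactly this simple.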
- Item: Word2Vec dictionary for 65000 Gutenberg E-books
Egense, Thomas
Description: 55,000 e-books from Project Gutenberg (http://www.gutenberg.org/). About 35,000 of the books are in English, but over 50 different languages are represented. The word2vec algorithm does a good job of separating the different languages, so the result behaves almost like 50 separate word2vec dictionaries.
Corpus size: 30 GB of text, split into 230 million sentences with all punctuation removed. Building the dictionary took about 1.5 weeks of CPU time.
Word2Vec parameters:
Software implementation: Google
Model: Skip-Gram
Word window size: 5
Iterations: 10
Minimum word frequency: 100
Dimensions: 300
Output format: text
Word2vec dictionary file: 1.4 million distinct words, from over 50 different languages.
Requirements: Opening the dictionary in a word2vec implementation will require 16 GB of memory.
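The 16 GB memory requirement stated for both dictionaries is noticeably larger than the raw embedding matrix alone, since a word2vec implementation also holds the vocabulary table, string keys, and parsing buffers. A rough back-of-the-envelope check of the matrix itself (assuming 32- or 64-bit floats; these precisions are an assumption, not stated in the dataset descriptions):

```python
def raw_matrix_gb(vocab, dims, bytes_per_float):
    """Size in GB of the dense vocab x dims embedding matrix alone."""
    return vocab * dims * bytes_per_float / 1e9

# Newspaper dictionary: 2.4 million words x 300 dimensions
print(round(raw_matrix_gb(2_400_000, 300, 4), 2))  # float32 → 2.88 GB
print(round(raw_matrix_gb(2_400_000, 300, 8), 2))  # float64 → 5.76 GB

# Gutenberg dictionary: 1.4 million words x 300 dimensions
print(round(raw_matrix_gb(1_400_000, 300, 8), 2))  # float64 → 3.36 GB
```

So even at double precision the vectors account for well under 16 GB; the stated requirement leaves headroom for per-word overhead in the loading implementation.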