---
title: "Streaming Notebook"
output:
  pdf_document: default
  html_notebook: default
---

This is the Streaming Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter* (*Ctrl+Shift+Enter* on Windows).

```{r introduction}
print("Hello Streaming team!")

```

Now you understand how to use the notebook. Proceed by executing the next chunks.

# Part 1: Setup and Import

In this part we set up our environment and import the data from the Streaming network folder.

## Setup

First we load the required packages.

```{r setup}
# tidyverse: data manipulation (dplyr, tidyr), plotting (ggplot2) and CSV import (readr)
library(tidyverse)
# urltools: URL parsing, e.g. puny_decode(), domain() and suffix_extract()
library(urltools)
```

Then we specify the column types for our CSV files. This is required because some fields contain no values at all; without a specification, readr would guess the type 'logical' for the field 'arc_harvest'.
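
As a minimal sketch of the problem (made-up inline data, assuming readr >= 2.0 for the `I()` literal-data wrapper): a column whose sampled values are all empty is guessed as 'logical', while an explicit `col_types` keeps it numeric.

```{r type_guessing_demo}
# Without a column specification, the empty 'arc_harvest' column is guessed as <lgl>
read_csv(I("id,arc_harvest\na,\nb,\n"))

# With an explicit specification it stays <dbl>
read_csv(I("id,arc_harvest\na,\nb,\n"),
         col_types = cols(id = col_character(), arc_harvest = col_double()))
```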

```{r data_format}
# First we specify all columns of our metadata
metadata_format <- cols(
  id = col_character(),
  arc_harvest = col_double(),
  arc_job = col_double(),
  crawl_date = col_character(),
  wayback_date = col_double(),
  hash = col_character(),
  domain = col_character(),
  url = col_character(),
  title = col_character(),
  content_type_norm = col_character()
)

```

Then we specify the paths to the CSV files. Please note that the paths below are for Mac computers. If you are on a Windows PC, you need to paste in the corresponding Windows paths: right-click a file in Windows File Explorer and choose Properties to reveal its location, then replace the path for the data file with that Windows path.
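
As a hypothetical sketch (the drive letter and folder names below are placeholders, not the real location): backslashes pasted from File Explorer must be doubled in an R string, or replaced with forward slashes, which R also accepts on Windows.

```{r windows_path_hint, eval=FALSE}
# Hypothetical Windows path -- replace with the path shown under Properties
datafile_st <- 'N:\\Depot\\...\\metadata.csv'  # backslashes doubled
datafile_st <- 'N:/Depot/.../metadata.csv'     # or forward slashes
```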

```{r data_path}
# Path to metadata files 
# Pages with streamingtjeneste. This metadata file is 2.7 GB on disk.
datafile_st <- '//kb.dk//Depot//...//metadata.csv'
# English pages. This metadata file is 12 MB on disk.
datafile_se <- '//kb.dk//Depot//...//text_streamingservice//metadata.csv'


```

## Import Data

We can now import data from the CSV files into R.

```{r import_data}
# Read metadata for all streamingtjeneste pages
df_metadata_st <- read_csv(datafile_st, quote = "\"", n_max = Inf, col_types = metadata_format)

# Read metadata for all streaming service pages
df_metadata_se <- read_csv(datafile_se, quote = "\"", n_max = Inf, col_types = metadata_format)




```

Combine the data into one data frame.

```{r}
# Mark each data frame with a language variable called 'lang',
# so the rows can still be told apart after combining
df_st <- df_metadata_st %>% 
  mutate(lang = "DA")
  
df_se <- df_metadata_se %>% 
  mutate(lang = "EN")

# Bind rows to one dataframe
df_st_se <- bind_rows(df_st, df_se)

```

Add extra columns to the data frame.

```{r add_columns}
# Make language a factor and save the data frame as 'df1'
df1 <- df_st_se %>% 
  filter(!is.na(lang)) %>%
  mutate(lang = fct_lump(lang, n = 2)) 

df2 <- df1 %>% 
  mutate(
    # decode punycode (IDNA) in the 'url' field first
    url = puny_decode(url), 
    domain = domain(url),
    # new field 'sha1' holding only the digest: characters 6-37 skip what we
    # assume is a 'sha1:' style prefix in the 'hash' field
    sha1 = substr(hash, 6, 37)
  )

```
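
For instance, `puny_decode()` turns a punycode-encoded (IDNA) domain back into its Unicode form; here is the classic example from the punycode specification:

```{r puny_decode_demo}
# "xn--bcher-kva.ch" is the punycode form of "bücher.ch"
puny_decode("xn--bcher-kva.ch")
```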

## Remove duplicates

A duplicate is a binary-identical document, whether it is an HTML document, an image, a stylesheet or any other file from the website. If a document has not changed between harvests, it will appear as a duplicate.

We wish to remove these duplicates from our data set so we can work on unique material.
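
A quick sketch of how `duplicated()` behaves: it flags every occurrence of a value after the first, so the first copy of each document is never itself flagged as a duplicate.

```{r duplicated_demo}
x <- c("a", "b", "a", "a")
duplicated(x)       # FALSE FALSE TRUE TRUE -- the first "a" is not flagged
x[!duplicated(x)]   # "a" "b" -- keeps one copy of each value
```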

We first save all duplicates in a data frame. How many are there in total?

```{r duplicates}
# Save all duplicates
all_duplicates <- df2[duplicated(df2$sha1),]

# How many?
all_duplicates %>% nrow()

```

How many of them are Danish pages?

```{r duplicates_danish}
all_duplicates %>% filter(lang == "DA") %>% nrow()

```

How many of them are English pages?

```{r duplicates_english}
all_duplicates %>% filter(lang == "EN") %>% nrow()

```

Now, remove the duplicates. How many unique records do we have then?

```{r remove_dup}
# Remove duplicates by keeping only the rows not flagged by duplicated()
df3 <- df2[!duplicated(df2$sha1),]
df3 %>% nrow()
```

Add a datetime field so we can work with time data.

```{r add_datetime}
# Add datetime field
df4 <- df3 %>% 
  mutate(    
    # split field 'wayback_date' (format YYYYMMDDHHMMSS) into its components:
    # date fields year, month, day
    year = (wayback_date %/% 1000000) %/% 10000,
    month = (wayback_date %/% 1000000) %% 10000 %/% 100,
    day = (wayback_date %/% 1000000) %% 100,
    
    # split time into fields hour, minute, second
    hour = (wayback_date %% 1000000) %/% 10000,
    minute = (wayback_date %% 1000000) %% 10000 %/% 100,
    second = (wayback_date %% 1000000) %% 100,
    
    # create datetime field
    crawl_dt = lubridate::make_datetime(year, month, day, hour, minute, second)
  ) %>%  
  # remove time fields we don't need
  select(-day, -hour, -minute, -second)


```
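
To check the arithmetic on a single made-up timestamp (15 March 2002, 12:30:45), integer division and modulo peel off each component:

```{r datetime_arithmetic_demo}
wd <- 20020315123045
(wd %/% 1000000) %/% 10000         # year:   2002
(wd %/% 1000000) %% 10000 %/% 100  # month:  3
(wd %/% 1000000) %% 100            # day:    15
(wd %% 1000000) %/% 10000          # hour:   12
(wd %% 1000000) %% 10000 %/% 100   # minute: 30
(wd %% 1000000) %% 100             # second: 45
```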

# Part 2: Analysis

Once you have loaded the data into R, you can proceed with selecting a year to analyze.

## Select a year

In part 1 we imported and prepared our data. Now it is time to work on it. Let's pick a year.

You should change the year here and then run all the chunks below to produce plots etc. for that year. When you switch to another year, you only need to come back here, change the value, and re-run the chunks below, not those above.

```{r select_year}
# Set the year for analysis. This is important as the value will be used in 
# all chunks below.
selected_year <- 2002

```

## Remove versions

A version is a document that shares its URL with another document. Note that this is not the same as a binary duplicate: a version often has almost the same content, although a newspaper front page, for example, will usually differ noticeably even between versions that are close in time.

We group our data by month and remove any versions within that month. We must choose one document among the versions, and we opt for the first in time.

```{r one_per_month}
# Here we select the first version in time
df_1version_monthly <- df4 %>%
  filter(year == selected_year) %>% 
  group_by(month, url) %>%
  filter(crawl_dt == min(crawl_dt, na.rm = TRUE)) %>% # pick the first in time
  # Very important to apply ungroup()
  ungroup() %>% 
  arrange(arc_job)
```
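
A small sketch (on made-up data) of why the `ungroup()` matters: after `group_by()`, later verbs keep operating within each group until the grouping is dropped.

```{r ungroup_demo}
d <- tibble(g = c(1, 1, 2), v = c(10, 20, 30))
d %>% group_by(g) %>% mutate(share = v / sum(v))               # sums within each group
d %>% group_by(g) %>% ungroup() %>% mutate(share = v / sum(v)) # sums over all rows
```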

## Plots

Plot the yearly graph showing both English and Danish documents.

```{r plot_year_both}
df_1version_monthly %>% 
  ggplot() + 
  geom_bar(mapping = aes(x = month, fill = lang)) +
  scale_x_continuous(name = "Month", breaks = seq(1, 12)) +
  scale_y_continuous(name = "No. of Documents") +
  labs(title = selected_year)

```

Plot the streamingtjeneste documents.

```{r plot_danish}
df_1version_monthly %>% 
  filter(lang == "DA") %>% 
  ggplot() +
  geom_bar(mapping = aes(x = month)) +
  scale_x_continuous(name = "Month", breaks = seq(1, 12)) +
  scale_y_continuous(name = "No. of Documents") +
  labs(title = paste(selected_year, "- Danish documents"))

```

Plot the streaming service documents.

```{r plot_english}
df_1version_monthly %>% 
  filter(lang == "EN") %>% 
  ggplot() +
  geom_bar(mapping = aes(x = month)) +
  scale_x_continuous(name = "Month", breaks = seq(1, 12)) +
  scale_y_continuous(name = "No. of Documents") +
  labs(title = paste(selected_year, "- English documents"))

```

## Write CSV file with domain statistics

Here we make a CSV file with the number of streaming pages per domain per month. We use `write_csv2()`, which writes semicolon-separated values with a comma as decimal mark, the common convention in Danish locales.

```{r csv_domain_per_month}
csv_file <- paste0("CSV\\domains_statistics_", selected_year, ".csv")
print(paste("Writing file", csv_file))

df_1version_monthly %>% 
  count(month, domain, lang) %>% 
  arrange(month, desc(n)) %>% 
  write_csv2(csv_file)
```

## Link Analysis

### Links from Streamingtjeneste Pages

Make a data frame with all links from streamingtjeneste pages.

```{r load_st_links}
# File path for the data file containing links (Danish pages)
datafile_links_st <- '//kb.dk//Depot//...//text_streamingtjeneste//links.csv'

# We read in the links dataset. We select only the fields `id` and `links_domains`.
# Links in Danish documents
df_input_links_st <- read_csv(datafile_links_st, col_names = TRUE,
                              col_types = cols_only(id = col_character(),
                                                    links_domains = col_character()))
```

Extract the links.

```{r extract_st_links}
df_links_st <- df_1version_monthly %>% 
  right_join(df_input_links_st, by = "id") %>% 
  select(id, year, month, links_domains, domain, url, crawl_dt) %>% 
  # we are not interested in documents without a URL
  filter(!is.na(url)) %>%
  # notable change from the parent script (mHealth): we re-parse the input as
  # latin1 to allow for special characters in the URLs such as ã, ¦ or ¥
  mutate(domain = parse_character(domain, locale = locale(encoding = "latin1"))) %>% 
  mutate(domain2 = paste(suffix_extract(domain)$domain, suffix_extract(domain)$suffix, sep = ".")) %>% 
  # remove extra "" that is present in the data from Netarkivet
  mutate(links_domains = str_replace_all(links_domains, "\"", "")) %>% 
  # remove the documents that have no links
  filter(links_domains != "") %>% 
  mutate(link = stringr::str_split(links_domains, " ")) %>% 
  # we don't want links_domains repeated for all rows, so we remove it
  select(-links_domains) %>%
  # If you have a list-column, unnest() makes each element of the list its own row
  unnest(cols = c(link)) %>% 
  select(id, year, month, domain, url, crawl_dt, source = domain2, target = link) %>%
  # remove internal links
  filter(source != target) 
```
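
A minimal sketch of the split-and-unnest step on made-up data: `str_split()` turns the space-separated `links_domains` string into a list-column, and `unnest()` expands it to one row per link.

```{r split_unnest_demo}
tibble(id = "doc1", links_domains = "dr.dk netflix.com") %>%
  mutate(link = stringr::str_split(links_domains, " ")) %>%
  unnest(cols = c(link))
```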

Write it to a CSV file.

```{r write_st_links}
csv_file <- paste0("CSV\\links_st_", selected_year, ".csv")
print(paste("Writing file", csv_file))

df_links_st %>% 
  write_csv2(csv_file)
```

Make a count of links.

```{r st_links_count}
# another notable change from the parent script (mHealth): the commented-out
# chunk below did not produce the expected count of links, so our adjusted
# version follows directly after it
# df_links_st_count <- df_links_st %>% 
#   group_by(month, source, target) %>% 
#   mutate(count = n()) %>% 
#   select(year, month, source, target, count) %>%
#   arrange(year, month, source) %>% 
#   distinct()

df_links_st_count <- df_links_st %>%
  select(-c(id, year, crawl_dt)) %>%
  separate_rows(target, sep = "\\t") %>%
  group_by(month, source, target) %>%
  mutate(count = n()) %>%
  select(month, source, target, count) %>%
  distinct()

```
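
A small sketch of the `separate_rows()` step on made-up data: any tab-separated values left in `target` are expanded to one row each before counting.

```{r separate_rows_demo}
tibble(month = 1, source = "a.dk", target = "b.dk\tc.dk") %>%
  separate_rows(target, sep = "\\t")
```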

Write the count of links to a CSV file.

```{r write_st_links_count}
csv_file <- paste0("CSV\\links_st_count_", selected_year, ".csv")
print(paste("Writing file", csv_file))

df_links_st_count %>% 
  write_csv2(csv_file)
```

### Links from Streaming Service Pages

Make a data frame with all links from English pages.

```{r load_se_links}
# File path for the data file containing links (English pages)
datafile_links_se <- '//kb.dk//Depot//...//links.csv'

# We read in the links dataset. We select only the fields `id` and `links_domains`.
# Links in English documents
df_input_links_se <- read_csv(datafile_links_se, col_names = TRUE,
                              col_types = cols_only(id = col_character(),
                                                    links_domains = col_character()))
```

Extract the links.

```{r extract_se_links}
df_links_se <- df_1version_monthly %>% 
  right_join(df_input_links_se, by = "id") %>% 
  select(id, year, month, links_domains, domain, url, crawl_dt) %>% 
  # we are not interested in documents without a URL
  filter(!is.na(url)) %>%
  # notable change from the parent script (mHealth): we re-parse the input as
  # latin1 to allow for special characters in the URLs such as ã, ¦ or ¥
  mutate(domain = parse_character(domain, locale = locale(encoding = "latin1"))) %>%
  mutate(domain2 = paste(suffix_extract(domain)$domain, suffix_extract(domain)$suffix, sep = ".")) %>% 
  # remove extra "" that is present in the data from Netarkivet
  mutate(links_domains = str_replace_all(links_domains, "\"", "")) %>% 
  # remove the documents that have no links
  filter(links_domains != "") %>% 
  mutate(link = stringr::str_split(links_domains, " ")) %>% 
  # we don't want links_domains repeated for all rows, so we remove it
  select(-links_domains) %>%
  # If you have a list-column, unnest() makes each element of the list its own row
  unnest(cols = c(link)) %>% 
  select(id, year, month, domain, url, crawl_dt, source = domain2, target = link) %>%
  # remove internal links
  filter(source != target) 
```

Write it to a CSV file.

```{r write_se_links}
csv_file <- paste0("CSV\\links_se_", selected_year, ".csv")
print(paste("Writing file", csv_file))

df_links_se %>% 
  write_csv2(csv_file)
```

Make a count of links.

```{r se_links_count}
# df_links_count_se <- df_links_se %>% 
#   group_by(month, source, target) %>% 
#   mutate(count = n()) %>% 
#   select(year, month, source, target, count) %>%
#   arrange(year, month, source) %>% 
#   distinct()

# adjusted chunk, as for the Danish links above
df_links_count_se <- df_links_se %>%
  select(-c(id, year, crawl_dt)) %>%
  separate_rows(target, sep = "\\t") %>%
  group_by(month, source, target) %>%
  mutate(count = n()) %>%
  select(month, source, target, count) %>%
  distinct()

```

Write the count of links to a CSV file.

```{r write_se_links_count}
csv_file <- paste0("CSV\\links_se_count_", selected_year, ".csv")
print(paste("Writing file", csv_file))

df_links_count_se %>% 
  write_csv2(csv_file)
```

### Links from both Streamingtjeneste and Streaming Service Pages

The following chunk makes a data frame with all links, from both Danish and English pages.

```{r load_all_links}
# File path for the data file containing links (Danish pages)
datafile_st_links <- '//kb.dk//...//links.csv'

# File path for the data file containing links (English pages)
datafile_se_links <- '//kb.dk//Depot//...//links.csv'

# We read in all documents from the links dataset. We select only the fields `id` and `links_domains`.
# Links in Danish documents
df_st_links <- read_csv(datafile_st_links, col_names = TRUE,
                        col_types = cols_only(id = col_character(),
                                              links_domains = col_character()))

# Links in English documents
df_se_links <- read_csv(datafile_se_links, col_names = TRUE,
                        col_types = cols_only(id = col_character(),
                                              links_domains = col_character()))

# Bind rows to one dataframe
df_all_links <- bind_rows(df_st_links, df_se_links)
```

Extract the links.

```{r extract_all_links}
df_links <- df_1version_monthly %>% 
  right_join(df_all_links, by = "id") %>% 
  select(id, year, month, links_domains, domain, url, crawl_dt) %>% 
  # we are not interested in documents without a URL
  filter(!is.na(url)) %>%
  # notable change from the parent script (mHealth): we re-parse the input as
  # latin1 to allow for special characters in the URLs
  mutate(domain = parse_character(domain, locale = locale(encoding = "latin1"))) %>% 
  mutate(domain2 = paste(suffix_extract(domain)$domain, suffix_extract(domain)$suffix, sep = ".")) %>% 
  # remove extra "" that is present in the data from Netarkivet
  mutate(links_domains = str_replace_all(links_domains, "\"", "")) %>% 
  # remove the documents that have no links
  filter(links_domains != "") %>% 
  mutate(link = stringr::str_split(links_domains, " ")) %>% 
  # we don't want links_domains repeated for all rows, so we remove it
  select(-links_domains) %>%
  # If you have a list-column, unnest() makes each element of the list its own row
  unnest(cols = c(link)) %>% 
  select(id, year, month, domain, url, crawl_dt, source = domain2, target = link) %>%
  # remove internal links
  filter(source != target) 
```

Write it to a CSV file.

```{r write_all_links}
csv_file <- paste0("CSV\\links_all_", selected_year, ".csv")
print(paste("Writing file", csv_file))

df_links %>% 
  write_csv2(csv_file)

```

Make a count of links.

```{r all_links_count}
# Here we use the adjusted counting chunk, as above.
# df_links_count <- df_links %>% 
#   group_by(month, source, target) %>% 
#   mutate(count = n()) %>% 
#   select(year, month, source, target, count) %>%
#   arrange(year, month, source) %>% 
#   distinct()

df_links_count <- df_links %>%
  select(-c(id, year, crawl_dt)) %>%
  separate_rows(target, sep = "\\t") %>%
  group_by(month, source, target) %>%
  mutate(count = n()) %>%
  select(month, source, target, count) %>%
  distinct()

```

Write the count of links to a CSV file.

```{r write_all_links_count}
csv_file <- paste0("CSV\\links_all_count_", selected_year, ".csv")
print(paste("Writing file", csv_file))

df_links_count %>% 
  write_csv2(csv_file)
```
