flusight-vignette.Rmd
The import_flusight_csv function uses several helper functions to convert forecasts from the FluSight-specific CSV format. The file name should contain the MMWR epidemiological week assigned to the forecast (e.g. “EW42”) followed by a dash (“-”), the team name (e.g. “Hist_Avg”) followed by a dash (“-”), and the submission date as an ISO date (YYYY-MM-DD, e.g. “2018-10-29”).
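As a rough illustration of this naming convention, the sketch below pulls the three components out of a file name using base R only. The helper parse_flusight_filename and its regular expression are assumptions made for this example; they are not the helper functions that import_flusight_csv uses internally.

```r
# Hypothetical helper (not part of predx): split a FluSight-style file name
# into MMWR week, team name, and submission date.
parse_flusight_filename <- function(path) {
  file <- basename(path)
  # "EW" + week, a dash, the team (which may itself contain dashes),
  # a dash, and an ISO date; anything after the date is ignored.
  pattern <- "^EW([0-9]{1,2})-(.+)-([0-9]{4}-[0-9]{2}-[0-9]{2})"
  parts <- regmatches(file, regexec(pattern, file))[[1]]
  if (length(parts) == 0) {
    stop("File name does not follow the EWxx-team-YYYY-MM-DD pattern: ", file)
  }
  list(
    mmwr_week = as.integer(parts[2]),
    team = parts[3],
    submission_date = as.Date(parts[4])
  )
}

parsed <- parse_flusight_filename("EW42-Hist-Avg-2018-10-29-National.csv")
parsed$team
```

Because the team name can contain dashes, the pattern anchors on the date at the end rather than splitting the name on every dash.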
This import process creates a special, embedded “predx_df” data frame (a tbl_df, “tibble” object) with a row for every individual prediction (defined by target, location, and prediction type), a column predx_class that defines the class of each prediction, and a column predx that is a list of predx objects. During import, all predictions are validated (e.g. binned predictions must sum to 1.0); for predictions that fail validation, (hopefully useful) error messages are returned in the predx column.
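To make the kind of validation described here concrete, the following is a minimal sketch of a sum-to-one check for binned probabilities. The function check_bin_probs and the tolerance of 1e-8 are assumptions for illustration; the checks and tolerance that predx actually applies may differ.

```r
# Hypothetical check (not the predx validator): binned probabilities must be
# numeric, non-negative, and sum to 1 within a small numerical tolerance.
# Returns TRUE on success, or an error message as a character string.
check_bin_probs <- function(prob, tol = 1e-8) {
  if (!is.numeric(prob) || anyNA(prob)) {
    return("probabilities must be numeric and non-missing")
  }
  if (any(prob < 0)) {
    return("probabilities must be non-negative")
  }
  if (abs(sum(prob) - 1) > tol) {
    return(sprintf("probabilities sum to %.6f, not 1.0", sum(prob)))
  }
  TRUE
}

ok  <- check_bin_probs(c(0.2, 0.5, 0.3))  # valid distribution
bad <- check_bin_probs(c(0.2, 0.5, 0.2))  # mass is missing
```

Returning a message instead of stopping mirrors the idea of storing the error alongside the prediction rather than aborting the whole import.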
For this example, we use a submission with only national-level forecasts.
library(predx)
fcast <- import_flusight_csv('EW42-Hist-Avg-2018-10-29-National.csv')
class(fcast)
#> [1] "tbl_df" "tbl" "data.frame"
fcast
#> # A tibble: 14 x 8
#> location target unit team mmwr_week submission_date predx_class predx
#> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <lis>
#> 1 US Natio… 1 wk ahead perce… Hist… 42 2018-10-29 Point <Poi…
#> 2 US Natio… 1 wk ahead perce… Hist… 42 2018-10-29 BinLwr <Bin…
#> 3 US Natio… 2 wk ahead perce… Hist… 42 2018-10-29 Point <Poi…
#> 4 US Natio… 2 wk ahead perce… Hist… 42 2018-10-29 BinLwr <Bin…
#> 5 US Natio… 3 wk ahead perce… Hist… 42 2018-10-29 Point <Poi…
#> 6 US Natio… 3 wk ahead perce… Hist… 42 2018-10-29 BinLwr <Bin…
#> 7 US Natio… 4 wk ahead perce… Hist… 42 2018-10-29 Point <Poi…
#> 8 US Natio… 4 wk ahead perce… Hist… 42 2018-10-29 BinLwr <Bin…
#> 9 US Natio… Season on… week Hist… 42 2018-10-29 Point <Poi…
#> 10 US Natio… Season on… week Hist… 42 2018-10-29 BinCat <Bin…
#> 11 US Natio… Season pe… perce… Hist… 42 2018-10-29 Point <Poi…
#> 12 US Natio… Season pe… perce… Hist… 42 2018-10-29 BinLwr <Bin…
#> 13 US Natio… Season pe… week Hist… 42 2018-10-29 Point <Poi…
#> 14 US Natio… Season pe… week Hist… 42 2018-10-29 BinCat <Bin…
The verify_expected function can be used to check that expected targets are included in the predx_df. For example, the national-level FluSight forecasts include forecasts for seven targets and 11 locations. The percentage forecasts (e.g. “Season peak percentage”, “1 wk ahead”) are classified as BinLwr predictions and the week targets (“Season onset”, “Season peak week”) are classified as BinCat predictions. This set can be specified in a list with one list for each combination of required characteristics. For example, the target “Season peak percentage” should have a prediction for each location and each predx bin, as specified in the first element of flusight_ilinet_expected(), below.
list(
list(
target = c("Season peak percentage", "1 wk ahead", "2 wk ahead",
"3 wk ahead", "4 wk ahead"),
location = c("HHS Region 1", "HHS Region 10", "HHS Region 2", "HHS Region 3",
"HHS Region 4", "HHS Region 5", "HHS Region 6", "HHS Region 7",
"HHS Region 8", "HHS Region 9", "US National"),
predx_class = "BinLwr",
lwr = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2,
1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2, 2.1, 2.2, 2.3, 2.4, 2.5,
2.6, 2.7, 2.8, 2.9, 3, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8,
3.9, 4, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5, 5.1,
5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6, 6.1, 6.2, 6.3, 6.4,
6.5, 6.6, 6.7, 6.8, 6.9, 7, 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7,
7.8, 7.9, 8, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 9,
9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9, 10, 10.1, 10.2,
10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 11, 11.1, 11.2, 11.3,
11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12, 12.1, 12.2, 12.3, 12.4,
12.5, 12.6, 12.7, 12.8, 12.9, 13)
),
list(
target = "Season peak week",
location = c("HHS Region 1", "HHS Region 10", "HHS Region 2", "HHS Region 3",
"HHS Region 4", "HHS Region 5", "HHS Region 6", "HHS Region 7",
"HHS Region 8", "HHS Region 9", "US National"),
predx_class = "BinCat",
cat = c("40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
"50", "51", "52", "1", "2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20")
),
list(
target = "Season onset",
location = c("HHS Region 1", "HHS Region 10", "HHS Region 2", "HHS Region 3",
"HHS Region 4", "HHS Region 5", "HHS Region 6", "HHS Region 7",
"HHS Region 8", "HHS Region 9", "US National"),
predx_class = "BinCat",
cat = c("40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
"50", "51", "52", "1", "2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
"none")
),
list(
target = c("Season peak percentage", "1 wk ahead", "2 wk ahead",
"3 wk ahead", "4 wk ahead", "Season onset", "Season peak week"),
location = c("HHS Region 1", "HHS Region 10", "HHS Region 2", "HHS Region 3",
"HHS Region 4", "HHS Region 5", "HHS Region 6", "HHS Region 7",
"HHS Region 8", "HHS Region 9", "US National"),
predx_class = "Point"
)
)
This specific expected list, available with flusight_ilinet_expected, is included in the predx package along with flusight_state_ilinet_expected (for state ILINet forecasts) and flusight_hospitalization_expected (for hospitalization forecasts). Other verification lists can be made in this format, and all can be used with the function verify_expected to validate that all expected predictions are included. The function prints missing and additional predictions, but those can also be returned as a data frame by including the argument return_df = TRUE.
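The idea behind such a list can be sketched in base R: each element is expanded into every combination of its target, location, and predx_class values, and the result is compared with the predictions actually present. The helpers expand_expected and missing_predictions below are hypothetical and only approximate the concept; they ignore the bin definitions (lwr, cat) and are not how verify_expected is implemented.

```r
# Hypothetical helpers (not part of predx).

# Expand one element of an expected list into all combinations of
# target, location, and predx_class.
expand_expected <- function(element) {
  expand.grid(element[c("target", "location", "predx_class")],
              stringsAsFactors = FALSE)
}

# Return the expected combinations that are absent from `found`.
missing_predictions <- function(found, expected) {
  wanted <- do.call(rbind, lapply(expected, expand_expected))
  key <- function(df) paste(df$target, df$location, df$predx_class, sep = "|")
  wanted[!key(wanted) %in% key(found), , drop = FALSE]
}

expected <- list(
  list(target = c("1 wk ahead", "2 wk ahead"),
       location = "US National", predx_class = "BinLwr"),
  list(target = c("1 wk ahead", "2 wk ahead"),
       location = "US National", predx_class = "Point")
)

# A toy set of predictions in which one binned forecast is absent
found <- data.frame(
  target = c("1 wk ahead", "1 wk ahead", "2 wk ahead"),
  location = "US National",
  predx_class = c("Point", "BinLwr", "Point"),
  stringsAsFactors = FALSE
)

gaps <- missing_predictions(found, expected)
```

Here gaps holds the single missing combination, the BinLwr prediction for “2 wk ahead”.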
For this example, we limit the expected forecasts to the national level.
national_expected <- list(
list(
target = c("Season peak percentage", "1 wk ahead", "2 wk ahead",
"3 wk ahead", "4 wk ahead"),
location = "US National",
predx_class = "BinLwr",
lwr = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2,
1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2, 2.1, 2.2, 2.3, 2.4, 2.5,
2.6, 2.7, 2.8, 2.9, 3, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8,
3.9, 4, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5, 5.1,
5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6, 6.1, 6.2, 6.3, 6.4,
6.5, 6.6, 6.7, 6.8, 6.9, 7, 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7,
7.8, 7.9, 8, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 9,
9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9, 10, 10.1, 10.2,
10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 11, 11.1, 11.2, 11.3,
11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12, 12.1, 12.2, 12.3, 12.4,
12.5, 12.6, 12.7, 12.8, 12.9, 13)
),
list(
target = "Season peak week",
location = "US National",
predx_class = "BinCat",
cat = c("40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
"50", "51", "52", "1", "2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20")
),
list(
target = "Season onset",
location = "US National",
predx_class = "BinCat",
cat = c("40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
"50", "51", "52", "1", "2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
"none")
),
list(
target = c("Season peak percentage", "1 wk ahead", "2 wk ahead",
"3 wk ahead", "4 wk ahead", "Season onset", "Season peak week"),
location = "US National",
predx_class = "Point"
)
)
verify_expected(fcast, national_expected)
#> All expected predictions found.
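The long literal vectors in national_expected follow regular patterns, so a list of the same shape can be built more compactly. The sketch below is one way to do that in base R; the object names are chosen for this example. Whether verify_expected compares bin values exactly or with a tolerance is not covered here, so it is worth comparing the generated bins with the literal ones before relying on this.

```r
# Percentage bins: lower bounds 0, 0.1, ..., 13 (131 values).
# Integer division by 10 avoids the accumulated error of seq(0, 13, by = 0.1).
pct_bins <- (0:130) / 10

# Week categories: MMWR weeks 40-52 followed by weeks 1-20, as strings.
week_cats <- as.character(c(40:52, 1:20))

pct_targets <- c("Season peak percentage", "1 wk ahead", "2 wk ahead",
                 "3 wk ahead", "4 wk ahead")

national_expected_compact <- list(
  list(target = pct_targets, location = "US National",
       predx_class = "BinLwr", lwr = pct_bins),
  list(target = "Season peak week", location = "US National",
       predx_class = "BinCat", cat = week_cats),
  list(target = "Season onset", location = "US National",
       predx_class = "BinCat", cat = c(week_cats, "none")),
  list(target = c(pct_targets, "Season onset", "Season peak week"),
       location = "US National", predx_class = "Point")
)
```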
To save space and facilitate sharing, transfer, and storage of predx_df objects, they can be exported as JSON objects using the function export_json. Called without a file name, it returns the JSON object, as shown below for the peak week predictions in fcast. If a file name is supplied as an argument, the JSON is written to that file instead.
peak_week_json <- export_json(fcast[13:14, ])
peak_week_json
#> [{"location":"US National","target":"Season peak week","unit":"week","team":"Hist-Avg","mmwr_week":42,"submission_date":"2018-10-29","predx_class":"Point","predx":{"point":5}},{"location":"US National","target":"Season peak week","unit":"week","team":"Hist-Avg","mmwr_week":42,"submission_date":"2018-10-29","predx_class":"BinCat","predx":{"cat":["40","41","42","43","44","45","46","47","48","49","50","51","52","1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20"],"prob":[2.71765295790689e-018,1.05098154210838e-017,5.62064673333472e-018,1.10633017835077e-018,6.14565616231136e-018,3.20292319880939e-015,2.73138424595409e-011,4.81833260313942e-008,1.78568393149462e-005,0.00141898430676559,0.0250993223739668,0.10365445494028,0.103891022982423,0.0292895523076136,0.0196608506730169,0.0349911259046295,0.0912038352325736,0.159595360227116,0.17640031092917,0.121576720796766,0.0435668625182061,0.0268396648526282,0.0348099949396376,0.0214596885860514,0.00441747999649527,0.000344280326810563,0.000251797579413183,0.000251797579413183,0.000251797579413183,0.000251797579413183,0.000251797579413183,0.000251797579413183,0.000251797579413183]}}]
jsonlite::prettify(peak_week_json)
#> [
#> {
#> "location": "US National",
#> "target": "Season peak week",
#> "unit": "week",
#> "team": "Hist-Avg",
#> "mmwr_week": 42,
#> "submission_date": "2018-10-29",
#> "predx_class": "Point",
#> "predx": {
#> "point": 5
#> }
#> },
#> {
#> "location": "US National",
#> "target": "Season peak week",
#> "unit": "week",
#> "team": "Hist-Avg",
#> "mmwr_week": 42,
#> "submission_date": "2018-10-29",
#> "predx_class": "BinCat",
#> "predx": {
#> "cat": [
#> "40",
#> "41",
#> "42",
#> "43",
#> "44",
#> "45",
#> "46",
#> "47",
#> "48",
#> "49",
#> "50",
#> "51",
#> "52",
#> "1",
#> "2",
#> "3",
#> "4",
#> "5",
#> "6",
#> "7",
#> "8",
#> "9",
#> "10",
#> "11",
#> "12",
#> "13",
#> "14",
#> "15",
#> "16",
#> "17",
#> "18",
#> "19",
#> "20"
#> ],
#> "prob": [
#> 2.71765295790689e-018,
#> 1.05098154210838e-017,
#> 5.62064673333472e-018,
#> 1.10633017835077e-018,
#> 6.14565616231136e-018,
#> 3.20292319880939e-015,
#> 2.73138424595409e-011,
#> 4.81833260313942e-008,
#> 1.78568393149462e-005,
#> 0.00141898430676559,
#> 0.0250993223739668,
#> 0.10365445494028,
#> 0.103891022982423,
#> 0.0292895523076136,
#> 0.0196608506730169,
#> 0.0349911259046295,
#> 0.0912038352325736,
#> 0.159595360227116,
#> 0.17640031092917,
#> 0.121576720796766,
#> 0.0435668625182061,
#> 0.0268396648526282,
#> 0.0348099949396376,
#> 0.0214596885860514,
#> 0.00441747999649527,
#> 0.000344280326810563,
#> 0.000251797579413183,
#> 0.000251797579413183,
#> 0.000251797579413183,
#> 0.000251797579413183,
#> 0.000251797579413183,
#> 0.000251797579413183,
#> 0.000251797579413183
#> ]
#> }
#> }
#> ]
#>
json_tempfile <- tempfile()
export_json(fcast, filename = json_tempfile, overwrite = TRUE)
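Because the export is plain JSON, it can also be read without predx. The sketch below parses a shortened, made-up record with the same layout as the export shown above, using jsonlite::fromJSON, and picks out the most probable category. The record contents are invented for illustration and are not taken from fcast.

```r
library(jsonlite)

# A shortened, made-up record laid out like the predx JSON shown above
json_txt <- '[{"location":"US National","target":"Season peak week",
  "predx_class":"BinCat",
  "predx":{"cat":["40","41","42"],"prob":[0.2,0.5,0.3]}}]'

# simplifyVector = FALSE keeps the nested list structure of each record
records <- fromJSON(json_txt, simplifyVector = FALSE)
first <- records[[1]]

cats  <- unlist(first$predx$cat)
probs <- unlist(first$predx$prob)

# Category with the highest forecast probability
most_likely <- cats[which.max(probs)]
```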
Alternatively, predx_df objects can be exported as predx CSV files or FluSight-formatted CSV files (those used for submissions on the FluSight website). These CSV formats differ in several ways:
- The predx CSV contains additional columns for MMWR epidemic week, team, and submission date. In the FluSight CSV these are included in the file name only.
- The predx CSV includes point predictions as individual rows whereas the FluSight CSV includes them in an additional column.
These differences are the result of making the predx JSON and CSV formats more generic than the specific FluSight implementation.
csv_tempfile <- tempfile()
export_csv(fcast, filename = csv_tempfile, overwrite = TRUE)
export_flusight_csv(fcast[1:2, ])
#> # A tibble: 132 x 7
#> location target type unit bin_start_incl bin_end_notincl value
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 US National 1 wk ahead Point percent <NA> <NA> 1.30e+ 0
#> 2 US National 1 wk ahead Bin percent 0.0 0.1 0.
#> 3 US National 1 wk ahead Bin percent 0.1 0.2 1.17e-17
#> 4 US National 1 wk ahead Bin percent 0.2 0.3 1.46e-17
#> 5 US National 1 wk ahead Bin percent 0.3 0.4 6.80e-18
#> 6 US National 1 wk ahead Bin percent 0.4 0.5 3.02e-18
#> 7 US National 1 wk ahead Bin percent 0.5 0.6 2.61e-18
#> 8 US National 1 wk ahead Bin percent 0.6 0.7 3.78e-18
#> 9 US National 1 wk ahead Bin percent 0.7 0.8 8.30e-12
#> 10 US National 1 wk ahead Bin percent 0.8 0.9 2.46e- 5
#> # … with 122 more rows
In addition to importing CSV files in the FluSight format (as described at the beginning of this vignette), predx can be used to import predx JSON or predx CSV files.
fcast <- import_json(json_tempfile)
head(fcast)
#> # A tibble: 6 x 8
#> location target unit team mmwr_week submission_date predx_class predx
#> <chr> <chr> <chr> <chr> <int> <chr> <chr> <list>
#> 1 US Nation… 1 wk ah… perce… Hist-… 42 2018-10-29 Point <Poin…
#> 2 US Nation… 1 wk ah… perce… Hist-… 42 2018-10-29 BinLwr <BinL…
#> 3 US Nation… 2 wk ah… perce… Hist-… 42 2018-10-29 Point <Poin…
#> 4 US Nation… 2 wk ah… perce… Hist-… 42 2018-10-29 BinLwr <BinL…
#> 5 US Nation… 3 wk ah… perce… Hist-… 42 2018-10-29 Point <Poin…
#> 6 US Nation… 3 wk ah… perce… Hist-… 42 2018-10-29 BinLwr <BinL…
fcast_csv <- import_csv(csv_tempfile)
head(fcast_csv)
#> # A tibble: 6 x 8
#> location target unit team mmwr_week submission_date predx_class predx
#> <chr> <chr> <chr> <chr> <int> <chr> <chr> <list>
#> 1 US Nation… 1 wk ah… perce… Hist-… 42 2018-10-29 Point <Poin…
#> 2 US Nation… 1 wk ah… perce… Hist-… 42 2018-10-29 BinLwr <BinL…
#> 3 US Nation… 2 wk ah… perce… Hist-… 42 2018-10-29 Point <Poin…
#> 4 US Nation… 2 wk ah… perce… Hist-… 42 2018-10-29 BinLwr <BinL…
#> 5 US Nation… 3 wk ah… perce… Hist-… 42 2018-10-29 Point <Poin…
#> 6 US Nation… 3 wk ah… perce… Hist-… 42 2018-10-29 BinLwr <BinL…
The FluSight package (https://github.com/jarad/FluSight) includes numerous functions for working with forecasts, such as scoring and visualization. A predx FluSight forecast can be converted to the format used by the FluSight package using to_flusight_pkg_format.
fcast_flusight <- to_flusight_pkg_format(fcast)
head(fcast_flusight)
#> # A tibble: 6 x 8
#> location target type unit bin_start_incl bin_end_notincl value
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 US Nati… 1 wk … Point perc… <NA> <NA> 1.30e+ 0
#> 2 US Nati… 1 wk … Bin perc… 0.0 0.1 0.
#> 3 US Nati… 1 wk … Bin perc… 0.1 0.2 1.17e-17
#> 4 US Nati… 1 wk … Bin perc… 0.2 0.3 1.46e-17
#> 5 US Nati… 1 wk … Bin perc… 0.3 0.4 6.80e-18
#> 6 US Nati… 1 wk … Bin perc… 0.4 0.5 3.02e-18
#> # … with 1 more variable: forecast_week <dbl>
This can then be used for scoring. Note that in this example, fcast only contains national-level forecasts.
library(FluSight)
truth_1819 <- FluSight::create_truth(year = 2018)
truth_1819 <- dplyr::filter(truth_1819, location == 'US National')
FluSight::score_entry(fcast_flusight, truth_1819)
#> # A tibble: 7 x 4
#> location target score forecast_week
#> <chr> <chr> <dbl> <dbl>
#> 1 US National Season onset -2.33 42
#> 2 US National Season peak week -2.11 42
#> 3 US National Season peak percentage -3.94 42
#> 4 US National 1 wk ahead -4.53 42
#> 5 US National 2 wk ahead -3.16 42
#> 6 US National 3 wk ahead -3.71 42
#> 7 US National 4 wk ahead -2.99 42
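The negative values above are what one would expect from a logarithmic score, i.e. the log of the probability a forecast assigned to the observed outcome. That reading is an assumption made here for illustration; the sketch below is a generic single-bin log score written in base R, with a hypothetical function name and invented probabilities, and it is not the FluSight package's scoring code.

```r
# Generic single-bin log score (hypothetical helper, not FluSight::score_entry):
# the log of the probability assigned to the bin that was observed.
log_score <- function(bins, prob, observed) {
  idx <- match(observed, bins)
  if (is.na(idx)) {
    stop("observed value is not one of the forecast bins")
  }
  log(prob[idx])
}

# Invented forecast over four peak-week categories
bins <- c("50", "51", "52", "1")
prob <- c(0.1, 0.2, 0.4, 0.3)

score <- log_score(bins, prob, observed = "52")
```

A forecast that places all probability on the observed bin scores 0, and lower probabilities give increasingly negative scores.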