Predicting the Best Times to Spot Wildlife

A scoring function for wildlife spotting opportunities in Australia

Author

Shreya Gupta

Published

March 31, 2026

Overview

The ecotourism package contains occurrence records for four Australian organisms alongside daily weather data from 2014–2024. A natural question for any ecotourist is:

“When and where should I go to have the best chance of spotting this animal?”

This document develops predict_best_times() - a function that analyses historical sighting patterns alongside weather conditions to recommend the top five month and time-of-day combinations for spotting each organism.

Rather than fitting a black-box model, the function is built on three transparent components:

  • Sighting frequency - which months and hours have historically produced the most sightings?
  • Weather favourability - what conditions were present on sighting days vs non-sighting days?
  • Data confidence - how much data backs up each recommendation?

Before building the function, we first explore the data carefully because raw occurrence records contain quality issues that meaningfully affect any prediction.

Data Quality

Plotting the recorded hour of each sighting shows what the data actually contains. Two issues stand out immediately.

Code
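# packages used throughout (presumably loaded in a setup chunk not shown here)
library(dplyr)
library(tidyr)       # replace_na()
library(ggplot2)
library(ecotourism)  # glowworms, gouldian_finch, manta_rays, orchids, weather

# look at hour distributions for each organism
# this tells us when sightings are recorded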
hour_summary <- bind_rows(
  glowworms |> mutate(organism = "Glowworm"),
  gouldian_finch |> mutate(organism = "Gouldian Finch"),
  manta_rays |> mutate(organism = "Manta Ray"),
  orchids |> mutate(organism = "Orchid")
) |>
  filter(!is.na(hour)) |>          # remove missing hours
  count(organism, hour) |>         # count sightings per hour per organism
  group_by(organism) |>
  mutate(pct = n / sum(n) * 100)   # convert to percentage

# plot hour distributions
hour_summary |>
  ggplot(aes(x = hour, y = pct, fill = organism)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~organism, scales = "free_y") +   # separate panel per organism
  labs(
    x = "Hour of Day",
    y = "% of Sightings",
    title = "When are sightings recorded?",
    subtitle = "Unusual patterns reveal data quality issues"
  ) +
  theme_minimal(base_size = 12)

Issue 1: Machine observations at hour 0

Manta rays have a large spike at hour 0 (midnight). This is not real behaviour; it reflects automated tracking devices that default to midnight when no exact timestamp is available.

Code
# show the record type breakdown for manta rays
manta_rays |>
  group_by(record_type) |>
  summarise(
    n = n(),                                    # total records
    pct_at_hour_0 = mean(hour == 0) * 100,      # % recorded at midnight
    median_hour = median(hour, na.rm = TRUE)     # typical hour
  )
# A tibble: 3 × 4
  record_type             n pct_at_hour_0 median_hour
  <chr>               <int>         <dbl>       <int>
1 HUMAN_OBSERVATION     273          5.86          11
2 MACHINE_OBSERVATION   675        100              0
3 OBSERVATION             5          0             12
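
The cleaning rule this breakdown motivates can be sketched on a toy table. The records below are invented for illustration; only the two filter steps mirror what predict_best_times() does later in this document. Note the trade-off: any genuine midnight sighting by a human observer is dropped along with the placeholder timestamps.

```r
library(dplyr)

# toy records: values are invented, not taken from the ecotourism data
toy <- tibble(
  record_type = c("HUMAN_OBSERVATION", "MACHINE_OBSERVATION",
                  "HUMAN_OBSERVATION", "HUMAN_OBSERVATION", "OBSERVATION"),
  hour        = c(11L, 0L, 0L, NA, 14L)
)

cleaned <- toy |>
  filter(record_type != "MACHINE_OBSERVATION") |>  # drop device records
  filter(!is.na(hour), hour != 0)                  # drop missing and placeholder hours

cleaned  # keeps only the 11:00 and 14:00 rows
```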

Issue 2: The glowworm dual peak

Glowworms show something more interesting: two distinct peaks, one in the afternoon (around 15:00) and one in the evening (19:00–22:00).

This is not a data error. It reflects two genuinely different spotting contexts:

  • Cave tourism - organised tours where artificial darkness makes glowworms visible regardless of outside light
  • Wild sightings - natural bioluminescence only visible after dark

Our function will acknowledge this distinction rather than treating all sightings as equivalent.

Code
# show the two contexts in glowworm data
glowworms |>
  filter(!is.na(hour), hour != 0) |>
  mutate(
    context = case_when(
      hour >= 19 | hour <= 4  ~ "Wild (evening/night)",   # natural sightings
      hour >= 12 & hour <= 18 ~ "Cave tourism (daytime)", # organised tours
      TRUE                    ~ "Morning"
    )
  ) |>
  count(obs_state, context) |>     # breakdown by state and context
  arrange(obs_state, context)
# A tibble: 9 × 3
  obs_state       context                    n
  <chr>           <chr>                  <int>
1 New South Wales Cave tourism (daytime)     3
2 New South Wales Morning                    4
3 New South Wales Wild (evening/night)      14
4 Queensland      Cave tourism (daytime)    13
5 Queensland      Morning                    8
6 Queensland      Wild (evening/night)      39
7 Tasmania        Cave tourism (daytime)    18
8 Tasmania        Morning                    9
9 Tasmania        Wild (evening/night)      13

The Function

predict_best_times() takes an occurrence dataset and weather data and returns the top five month × time-of-day combinations for spotting an organism.

How it works

The function combines three scores for every possible month × hour combination:

  1. Frequency score - how often have this month and this hour produced sightings historically? The month share and the hour share of cleaned sightings each carry a 40% weight.
  2. Weather score - how similar are typical conditions in this month to conditions on actual sighting days? (20% weight)
  3. Confidence score - how much data backs this recommendation? Computed as 1 - exp(-n / 5) from the number of sightings n in the cell, it multiplies the weighted composite rather than entering it as a weighted term, so sparse cells are scaled down.

These combine into a final score between 0 and 1. Higher = better time to visit.
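
As a rough illustration of how the pieces fit together, the sketch below scores a single month × hour cell using the weights and the confidence curve that appear in the code of predict_best_times(). The helper score_cell() and all input values are hypothetical and exist only for this example.

```r
# hypothetical helper mirroring the scoring arithmetic in predict_best_times()
score_cell <- function(month_score, hour_score, weather_score, n_cell) {
  # weighted composite of the two frequency shares and the weather score
  composite  <- 0.40 * month_score + 0.20 * weather_score + 0.40 * hour_score
  # confidence grows with the number of sightings in the cell
  confidence <- 1 - exp(-n_cell / 5)
  c(composite   = composite,
    confidence  = confidence,
    final_score = composite * confidence)
}

# same (invented) frequency and weather scores, different amounts of data
well_sampled <- score_cell(0.25, 0.10, 0.90, n_cell = 11)
sparse       <- score_cell(0.25, 0.10, 0.90, n_cell = 1)

round(rbind(well_sampled, sparse), 3)
```

Both cells share the same composite, but the sparse cell ends up with a much lower final score because its confidence multiplier is small.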

Data cleaning inside the function

Before scoring, the function automatically:

  • Removes MACHINE_OBSERVATION records (all recorded at hour 0)
  • Removes hour = 0 from human observations (likely missing timestamps)
  • Separates glowworm sightings into cave vs wild context
Code
predict_best_times <- function(occurrence_data,
                               weather_data,
                               organism_name,
                               context = "all",   # "all", "cave", "wild"
                               n = 5) {           # number of results to return
  
  # Step 1: Clean data
  
  # remove machine observations — they default to hour 0
  occ <- occurrence_data |>
    filter(record_type != "MACHINE_OBSERVATION")
  
  # remove hour 0: likely missing timestamps, not genuine midnight sightings
  occ <- occ |>
    filter(!is.na(hour), hour != 0)
  
  # for glowworms — split by context if requested
  if (organism_name == "glowworms" && context != "all") {
    if (context == "cave") {
      # daytime sightings = cave tourism context
      occ <- occ |> filter(hour >= 9, hour <= 18)
      message("Using cave tourism context (9am-6pm sightings)")
    } else if (context == "wild") {
      # evening/night sightings = wild bioluminescence context
      occ <- occ |> filter(hour >= 19 | hour <= 4)
      message("Using wild sighting context (7pm-4am sightings)")
    }
  }
  
  #Step 2: Frequency scores
  
  # how often is each month recorded?
  month_freq <- occ |>
    count(month, name = "n_month") |>
    mutate(month_score = n_month / sum(n_month))  # normalise to 0-1
  
  # how often is each hour recorded?
  hour_freq <- occ |>
    count(hour, name = "n_hour") |>
    mutate(hour_score = n_hour / sum(n_hour))     # normalise to 0-1
  
  #Step 3: Weather favourability
  
  # join occurrence dates with weather
  occ_weather <- occ |>
    select(date, month, hour) |>
    left_join(
      weather_data |> select(ws_id, date, temp, rainy, wind_speed, prcp),
      by = "date" ,                    # match by date
      relationship = "many-to-many"    # multiple stations per date expected
    ) |>
    filter(!is.na(temp))                          # remove missing weather
  
  # what were average conditions on sighting days?
  ideal_conditions <- occ_weather |>
    summarise(
      ideal_temp  = mean(temp,       na.rm = TRUE),
      ideal_rain  = mean(rainy,      na.rm = TRUE),
      ideal_wind  = mean(wind_speed, na.rm = TRUE)
    )
  
  # score each month by how close its avg weather is to ideal
  monthly_weather <- weather_data |>
    group_by(month) |>
    summarise(
      avg_temp  = mean(temp,       na.rm = TRUE),
      avg_rain  = mean(rainy,      na.rm = TRUE),
      avg_wind  = mean(wind_speed, na.rm = TRUE),
      .groups = "drop"
    ) |>
    mutate(
      # closeness to ideal - 1 = perfect match, 0 = very different
      weather_score = 1 - (
        0.5 * pmin(abs(avg_temp  - ideal_conditions$ideal_temp)  / 15, 1) +
        0.3 * pmin(abs(avg_rain  - ideal_conditions$ideal_rain),        1) +
        0.2 * pmin(abs(avg_wind  - ideal_conditions$ideal_wind)  / 10,  1)
      )
    ) |>
    select(month, weather_score)
  
  #Step 4: Confidence score 
  
  # count sightings per month x hour cell
  cell_counts <- occ |>
    count(month, hour, name = "n_cell")
  
  #Step 5: Build full grid and score
  
  # all possible month x hour combinations
  grid <- expand.grid(month = 1:12, hour = 1:23)
  
  results <- grid |>
    left_join(month_freq,      by = "month") |>
    left_join(monthly_weather, by = "month") |>
    left_join(hour_freq,       by = "hour")  |>
    left_join(cell_counts,     by = c("month", "hour")) |>
    replace_na(list(                              # cells with no data = 0
      month_score   = 0,
      weather_score = 0.5,
      hour_score    = 0,
      n_cell        = 0
    )) |>
    mutate(
      # composite score — weighted average
      composite = 0.40 * month_score +
                  0.20 * weather_score +
                  0.40 * hour_score,
      
      # confidence - exponential curve
      # 1 sighting = 0.18, 5 = 0.63, 10 = 0.86
      confidence = 1 - exp(-n_cell / 5),
      
      # final score - composite scaled by confidence
      final_score = composite * confidence,
      
      # readable labels
      month_name = month.name[month],
      time_label = sprintf("%02d:00", hour),
      period = case_when(
        hour < 12 ~ "Morning",
        hour < 17 ~ "Afternoon",
        TRUE      ~ "Evening/Night"
      ),
      
      # plain english confidence label
      confidence_label = case_when(
        confidence >= 0.63 ~ "High",
        confidence >= 0.18 ~ "Moderate",
        TRUE               ~ "Low"
      )
    ) |>
    arrange(desc(final_score)) |>
    slice_head(n = n) |>                          # top n results
    select(month_name, time_label, period,
           final_score, confidence, confidence_label, n_cell)
  
  results
}
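
To see how the confidence curve and its labels behave, the short sketch below evaluates the same formula and thresholds used in the function for a few cell counts. The counts are chosen for illustration only.

```r
# illustrative cell counts (not taken from the data)
n_cell <- c(0, 1, 4, 5, 10)

# same curve as in predict_best_times()
confidence <- 1 - exp(-n_cell / 5)

# same thresholds as the confidence_label case_when()
label <- ifelse(confidence >= 0.63, "High",
         ifelse(confidence >= 0.18, "Moderate", "Low"))

conf_table <- data.frame(n_cell, confidence = round(confidence, 3), label)
conf_table
```

One consequence worth keeping in mind: a single sighting already gives a confidence of about 0.181, which clears the 0.18 threshold, so with these cut-offs only cells with no sightings at all are labelled "Low".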

Results

We now apply predict_best_times() to each of the four organisms and interpret what the predictions tell us ecologically.

Glowworms

Glowworms show two distinct spotting contexts. We run the function three times, once on all sightings and once for each context, to show how the recommendations differ.

Code
glow_all <- predict_best_times(
  glowworms, weather, "glowworms", context = "all"
)

glow_cave <- predict_best_times(
  glowworms, weather, "glowworms", context = "cave"
)

glow_wild <- predict_best_times(
  glowworms, weather, "glowworms", context = "wild"
)

glow_all
  month_name time_label        period final_score confidence confidence_label
1   December      15:00     Afternoon   0.3102311  0.8891968             High
2   December      22:00 Evening/Night   0.3077742  0.8347011             High
3   December      11:00       Morning   0.1812007  0.5506710         Moderate
4   November      20:00 Evening/Night   0.1769426  0.5506710         Moderate
5      March      19:00 Evening/Night   0.1731293  0.6321206             High
  n_cell
1     11
2      9
3      4
4      4
5      5
Code
glow_cave
  month_name time_label        period final_score confidence confidence_label
1   December      15:00     Afternoon   0.3988595  0.8891968             High
2   December      11:00       Morning   0.2210959  0.5506710         Moderate
3   December      17:00 Evening/Night   0.1375388  0.3296800         Moderate
4   November      09:00       Morning   0.1335599  0.4511884         Moderate
5   December      16:00     Afternoon   0.1246102  0.3296800         Moderate
  n_cell
1     11
2      4
3      2
4      3
5      2
Code
glow_wild
  month_name time_label        period final_score confidence confidence_label
1   December      22:00 Evening/Night   0.3209724  0.8347011             High
2      March      19:00 Evening/Night   0.2149454  0.6321206             High
3   November      20:00 Evening/Night   0.2014099  0.5506710         Moderate
4   November      22:00 Evening/Night   0.1650238  0.4511884         Moderate
5       July      20:00 Evening/Night   0.1580352  0.5506710         Moderate
  n_cell
1      9
2      5
3      4
4      3
5      4

The two contexts produce meaningfully different recommendations. Cave tourism sightings peak in the afternoon when organised tours run, regardless of outside light. Wild sightings peak in the evening and night, when natural bioluminescence is visible. A tourist planning a cave tour and a wildlife photographer planning a night walk would therefore benefit from different advice.

Gouldian Finch

The Gouldian Finch is an endangered bird found in the Top End of Australia. During the dry season (June to October) surface water contracts and birds gather at remaining waterholes, making them easier to spot.

Code
finch_preds <- predict_best_times(
  gouldian_finch, weather, "gouldian_finch"
)

print(finch_preds)
  month_name time_label  period final_score confidence confidence_label n_cell
1  September      07:00 Morning   0.3366792          1             High    117
2        May      07:00 Morning   0.3281704          1             High     94
3     August      07:00 Morning   0.3244448          1             High    135
4  September      06:00 Morning   0.3235242          1             High    124
5       July      07:00 Morning   0.3229981          1             High    126

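Dry season months dominate as expected: four of the top five slots fall between July and September, with May just ahead of the season, and every slot is in the early morning (06:00–07:00). High confidence scores reflect the richness of the Gouldian Finch dataset compared to other organisms.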

Manta Rays

We use only human observations; machine observations are automatically excluded by the function because they default to hour 0 and do not reflect real behaviour.

Code
manta_preds <- predict_best_times(
  manta_rays, weather, "manta_rays"
)

print(manta_preds)
  month_name time_label        period final_score confidence confidence_label
1       June      11:00       Morning   0.3966887  0.9954834             High
2       June      22:00 Evening/Night   0.3453439  0.9666267             High
3       June      09:00       Morning   0.3434515  0.9257264             High
4       June      08:00       Morning   0.2680159  0.7534030             High
5       June      17:00 Evening/Night   0.2368578  0.6988058             High
  n_cell
1     27
2     17
3     13
4      7
5      6

Morning hours take three of the top five slots, which is consistent with divers and snorkellers operating during daylight, although 22:00 ranks second. June, a winter month, fills all five slots, reflecting feeding aggregations during cooler water temperatures.

Orchids

Orchids represent the largest dataset, with over 35,000 records. Spring months (September to November) are peak flowering season, especially in Western Australia.

Code
orchid_preds <- predict_best_times(
  orchids, weather, "orchids"
)

print(orchid_preds)
  month_name time_label    period final_score confidence confidence_label
1  September      11:00   Morning   0.4146638          1             High
2  September      10:00   Morning   0.4107552          1             High
3  September      14:00 Afternoon   0.4099291          1             High
4  September      12:00 Afternoon   0.4098058          1             High
5  September      13:00 Afternoon   0.4091029          1             High
  n_cell
1   1648
2   1529
3   1541
4   1514
5   1501

September, the first month of spring, fills all five slots, at late-morning to early-afternoon hours (10:00–14:00) when pollinators are most active. High confidence throughout reflects the data richness of the orchid dataset.

Visualising the Scoring Grid

The tables above show the top 5 results, but the full month × hour scoring grid reveals the complete picture: which combinations score well, which score poorly, and where the data is sparse.

Code
# modified version that returns full grid for visualisation
predict_full_grid <- function(occurrence_data,
                              weather_data,
                              organism_name,
                              context = "all") {
  
  # same cleaning as main function
  occ <- occurrence_data |>
    filter(record_type != "MACHINE_OBSERVATION") |>
    filter(!is.na(hour), hour != 0)
  
  if (organism_name == "glowworms" && context != "all") {
    if (context == "cave") {
      occ <- occ |> filter(hour >= 9, hour <= 18)
    } else if (context == "wild") {
      occ <- occ |> filter(hour >= 19 | hour <= 4)
    }
  }
  
  month_freq <- occ |>
    count(month, name = "n_month") |>
    mutate(month_score = n_month / sum(n_month))
  
  hour_freq <- occ |>
    count(hour, name = "n_hour") |>
    mutate(hour_score = n_hour / sum(n_hour))
  
  occ_weather <- occ |>
    select(date, month, hour) |>
    left_join(
      weather_data |> select(date, temp, rainy, wind_speed),
      by = "date",
      relationship = "many-to-many"   # multiple stations per date expected
    ) |>
    filter(!is.na(temp))
  
  ideal_conditions <- occ_weather |>
    summarise(
      ideal_temp = mean(temp,       na.rm = TRUE),
      ideal_rain = mean(rainy,      na.rm = TRUE),
      ideal_wind = mean(wind_speed, na.rm = TRUE)
    )
  
  monthly_weather <- weather_data |>
    group_by(month) |>
    summarise(
      avg_temp = mean(temp,       na.rm = TRUE),
      avg_rain = mean(rainy,      na.rm = TRUE),
      avg_wind = mean(wind_speed, na.rm = TRUE),
      .groups  = "drop"
    ) |>
    mutate(
      weather_score = 1 - (
        0.5 * pmin(abs(avg_temp - ideal_conditions$ideal_temp) / 15, 1) +
        0.3 * pmin(abs(avg_rain - ideal_conditions$ideal_rain),       1) +
        0.2 * pmin(abs(avg_wind - ideal_conditions$ideal_wind) / 10,  1)
      )
    ) |>
    select(month, weather_score)
  
  cell_counts <- occ |>
    count(month, hour, name = "n_cell")
  
  grid <- expand.grid(month = 1:12, hour = 1:23)
  
  grid |>
    left_join(month_freq,      by = "month") |>
    left_join(monthly_weather, by = "month") |>
    left_join(hour_freq,       by = "hour")  |>
    left_join(cell_counts,     by = c("month", "hour")) |>
    replace_na(list(
      month_score   = 0,
      weather_score = 0.5,
      hour_score    = 0,
      n_cell        = 0
    )) |>
    mutate(
      composite   = 0.40 * month_score +
                    0.20 * weather_score +
                    0.40 * hour_score,
      confidence  = 1 - exp(-n_cell / 5),
      final_score = composite * confidence,
      month_name  = factor(month.abb[month], levels = month.abb)
    )
}

Heatmap

Code
# build full grids for all four organisms
grids <- bind_rows(
  predict_full_grid(glowworms,      weather, "glowworms")      |> mutate(organism = "Glowworm"),
  predict_full_grid(gouldian_finch, weather, "gouldian_finch") |> mutate(organism = "Gouldian Finch"),
  predict_full_grid(manta_rays,     weather, "manta_rays")     |> mutate(organism = "Manta Ray"),
  predict_full_grid(orchids,        weather, "orchids")         |> mutate(organism = "Orchid")
)

# heatmap: x = month, y = hour, fill = final score
grids |>
  ggplot(aes(
    x    = month_name,
    y    = factor(hour),
    fill = final_score
  )) +
  geom_tile(colour = "white", linewidth = 0.3) +  # white grid lines
  scale_fill_viridis_c(
    option="magma",
    name = "Score",
    na.value = "#1a1a2e"
  ) +
  facet_wrap(~organism, ncol = 2) +               # 2x2 grid of organisms
  labs(
    x        = NULL,
    y        = "Hour of Day",
    title    = "Wildlife Spotting Opportunity Score",
    subtitle = "Lighter/warmer = better time to visit · Darkest = no historical data (score 0)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid  = element_blank(),
    legend.position = "bottom",
    strip.text  = element_text(face = "bold")
  )

Limitations and Future Improvements

No predictive function is complete without an honest account of what it cannot do. Several limitations are worth noting.

Data constraints

Observer effort is unaccounted for. More sightings on weekends or in popular tourist months may reflect more people looking, not more animals present. The function cannot distinguish between “this organism is more active in September” and “more people go wildlife spotting in September.”

Weather is historical, not forecast. The function scores months by average historical weather conditions, not actual upcoming forecasts. Integrating a live weather API such as Open-Meteo would give more precise short-term recommendations.
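
To make concrete what scoring months by average historical conditions means, the sketch below evaluates the same closeness formula used in predict_best_times() on two invented months. The helper weather_closeness(), the monthly averages, and the "ideal" values are all hypothetical; the units (temperature in °C, rainy as a proportion of rainy days) are assumptions about the weather data.

```r
# hypothetical helper reproducing the weather_score formula from the function
weather_closeness <- function(avg_temp, avg_rain, avg_wind,
                              ideal_temp, ideal_rain, ideal_wind) {
  1 - (
    0.5 * pmin(abs(avg_temp - ideal_temp) / 15, 1) +  # temperature gap, capped at 15
    0.3 * pmin(abs(avg_rain - ideal_rain),      1) +  # gap in rainy-day proportion
    0.2 * pmin(abs(avg_wind - ideal_wind) / 10, 1)    # wind gap, capped at 10
  )
}

# invented monthly averages, compared with invented sighting-day conditions
months <- data.frame(
  month    = c("January", "July"),
  avg_temp = c(25, 7),
  avg_rain = c(0.35, 0.20),
  avg_wind = c(14, 12)
)

months$weather_score <- weather_closeness(
  months$avg_temp, months$avg_rain, months$avg_wind,
  ideal_temp = 22, ideal_rain = 0.20, ideal_wind = 12
)

months
```

Because the formula only needs a temperature, a rain indicator, and a wind speed, forecast values could in principle be passed in place of the monthly averages, though that is not something the current function does.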

Sparse data affects confidence. Glowworms have only 124 records. Recommendations for sparse organisms should be treated as indicative rather than definitive; the confidence score communicates this honestly.

Modelling constraints

The function scores months and hours additively, so it does not model interactions: for example, whether warm temperatures matter more in certain months than others.
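
A small invented example shows what the additive frequency terms can miss. In the toy counts below, every month and every hour has the same marginal share, so the frequency part of the composite (0.40 × month share + 0.40 × hour share, as in the function) is identical for all four cells, even though the sightings are clearly concentrated in two of them. In the real function the confidence multiplier, which uses the cell count, recovers part of this information.

```r
# invented counts: June sightings cluster at 10:00, December sightings at 22:00
counts <- data.frame(
  month  = c(6, 6, 12, 12),
  hour   = c(10, 22, 10, 22),
  n_cell = c(8, 2, 2, 8)
)
total <- sum(counts$n_cell)

# marginal shares, as computed for month_score and hour_score
month_share <- tapply(counts$n_cell, counts$month, sum) / total
hour_share  <- tapply(counts$n_cell, counts$hour,  sum) / total

# additive frequency terms of the composite vs the joint share of each cell
counts$additive <- as.numeric(
  0.40 * month_share[as.character(counts$month)] +
  0.40 * hour_share[as.character(counts$hour)]
)
counts$joint_share <- counts$n_cell / total

counts
```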

Spatial variation is ignored. The function treats all sightings of an organism as equivalent regardless of location. A glowworm sighting in Tasmania and one in Queensland reflect very different ecological contexts, as the dual-peak analysis showed.

Natural extensions

These limitations point toward natural extensions that could form part of the full GSoC 2026 project scope:

  • Integrate a weather forecast API for real-time recommendations
  • Add a spatial component - recommend specific regions not just months and times
  • Account for observer effort using visit-based occupancy models
  • Build a zero-inflated model for sparse organisms like glowworms

Finishing Up

This document developed predict_best_times() - a transparent, confidence-aware function for recommending wildlife spotting opportunities. The key contributions are:

  • Honest confidence scoring that reflects data sparsity
  • Automatic data cleaning for known quality issues
  • Context-aware predictions for glowworms (cave vs wild)
  • A visual scoring grid showing the full month × hour landscape

The glowworm dual-peak finding (daytime cave tourism vs evening wild sightings) emerged directly from careful data exploration before any modelling, and produced genuinely different recommendations depending on what kind of spotting experience a visitor is planning.