The packages we’ll need:
https://d-rug.github.io/using-geoparquet-R
01:00
.parquet
file? (and why bother!?)The parquet
format is a file type that contains a table inside similar to a .csv
However these files are stored in binary form not as plain text
parquet
files are column-oriented (unlike csv) and each column is stored independently
parquet
embeds the schema or data types/structure within the data itself
{sf}
and arrow
parquet
files aren’t just about compression (though some savings here)
The real benefit is speed reading/operating/writing data!
nhdplusTools
} to get HUC12s for CA 1parquet
file with {sf
} class# create dir to save data
fs::dir_create("data")
# get CA and download HUC12s
ca <- tigris::states(progress_bar=FALSE) %>% filter(NAME=="California")
# can specify any given option for huc8, huc10,etc
huc12 <- nhdplusTools::get_huc(ca, type = "huc12") # this takes a minute or two
huc12 <- sf::st_cast(huc12, "MULTIPOLYGON") # fix geometry
# save out
geoarrow::write_geoparquet(huc12, here::here("data/nhd_huc12.parquet"))
geoson.zip
is 466MB, unzipped 3.6GB)library(tictoc)
tic()
df <- geojsonsf::geojson_sf("~/Downloads/California.geojson")
toc()
# 257.514 sec elapsed
# write it out
tic()
geoarrow::write_geoparquet(df, here::here("data/ca_bing_buildings.parquet"))
toc()
# writes in 11.745 seconds!
# read it back in
tic()
df_parq <- geoarrow::read_geoparquet_sf(here::here("data/ca_bing_buildings.parquet"))
toc()
# read in 14.724 seconds!
library(tigris)
# get building data
df_parq <- geoarrow::read_geoparquet_sf(here::here("data/ca_bing_buildings.parquet"))
cnty <- counties(state="CA")
sel_cnty <- cnty %>% filter(NAME=="Butte") %>% st_transform(4326)
st_crs(df_parq)$epsg
st_crs(sel_cnty)$epsg
# crop
sf_use_s2(FALSE)
sel_df <- df_parq[sel_cnty,]
sel_df <- geoarrow::read_geoparquet_sf(here::here("data/ca_bing_buildings_butte.parquet"))
# turn off s2
sf_use_s2(FALSE)
# add area
sel_df <- sel_df %>%
mutate(build_area = st_area(geometry))
# sort by largest and select top 5% largest
sel_df_lrg <- sel_df %>%
arrange(build_area) %>%
slice_max(order_by = build_area, prop = 0.05)
Buildings in the top 5% for area in the county