Education

Slides & Tutorials

Lecture Slides (in Portuguese)

Introdução ao Aprendizado de Máquina (Introduction to Machine Learning)
Regressões Lineares e Não-Lineares (Linear and Nonlinear Regressions)
Florestas Aleatórias (Random Forests)

Redes Neurais Artificiais (Artificial Neural Networks)
Máquinas de Vetores de Suporte (Support Vector Machines)
Validação de Modelos (Model Validation)

R Tutorials (in Portuguese)

Click here to download the tutorials (ZIP file)

The ZIP file includes the following tutorials:

Tutorial 1. Introdução ao R (Introduction to R)
Tutorial 2. Regressões (Regressions)
Tutorial 3. Florestas Aleatórias (Random Forests)

Tutorial 4.A Redes Neurais com Torch (Neural Networks with Torch)
Tutorial 4.B Redes Neurais com Keras (Neural Networks with Keras)
Tutorial 5. Support Vector Machines (SVM)
Tutorial 6. Validação de Modelos (Model Validation)

Deforestation Data

Click here to download the deforestation data

This dataset was constructed for the machine learning tutorials. It covers deforestation in the Brazilian Amazon in 2004, measured at the municipality level, together with related socioeconomic and biophysical factors. It includes 808 municipalities and 31 variables, described below:

codigo : unique municipality ID
state : acronym of the state where the municipality is located
def_annMB : deforestation in 2004 based on the MapBiomas data (km2)
area_km2 : size of the municipality (km2)
PAs : municipal area covered by protected areas (1000 km2)
env_fine_cancel : sum of environmental fines cancelled (R$)
env_fine : sum of environmental fines issued (R$)
env_fine_paid : sum of environmental fines paid (R$)
dist_ports : Euclidean distance to major ports
dist_manaus : Euclidean distance to the Manaus port
dist_parana : Euclidean distance to the Paraná port
dist_arc : Euclidean distance from the Arc of Deforestation region
dist_capital : Euclidean distance to state capitals
dist_seat : Euclidean distance to municipal seats
incra_ha : INCRA settlement cover (ha)
incra_family : number of families in INCRA settlements
incra_cap : INCRA settlement capacity in terms of families
gdp : gross domestic product (R$)
gdp_agr : gross domestic product from the agricultural sector (R$)
pop : population count
suitability_soy : suitability index for soybean production (0 to 1)
suitability_pas : suitability index for pasture (0 to 1)
soil : soil quality (ranging from 1 - poor to 5 - excellent)
flo2000 : forest cover in 2000 (km2)
road_density : road density (km/km2)
road_hway_density : highway density (km/km2)
road_hway_km : highway length (km)
road_km : road length (km)
mayor_party : political alignment of the mayor's party (left/center/right)
gov_party : political alignment of the governor's party (left/center/right)
def_category : dummy indicating whether the municipality is among the 25% with the highest deforestation in 2004 (=1) or not (=0)
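
As an illustration of how a dummy like def_category could be derived from def_annMB, here is a minimal sketch on a toy data frame. The values are made up, and the rule used (strictly above the 75th percentile) is an assumption; the dataset's actual construction may differ.

```r
# Toy data (hypothetical values, not taken from the real dataset)
toy <- data.frame(
  codigo    = 1:8,
  def_annMB = c(0, 2, 5, 1, 40, 3, 120, 7)
)

# Assumed rule: flag municipalities strictly above the 75th percentile
cutoff <- quantile(toy$def_annMB, probs = 0.75, names = FALSE)
toy$def_category <- as.integer(toy$def_annMB > cutoff)

# Two of the eight toy municipalities (25%) end up flagged with 1
print(toy)
```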

Some of my most used pieces of R code

For matching and copying a variable from dataframe d2 into dataframe d1:

# match() gives, for each d1$ID, the row of d2 with the same ID (NA if absent)
d1$variable <- d2$variable[match(d1$ID, d2$ID)]
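
To see what the matching idiom above does, here is a small self-contained sketch. The data frames and their values are hypothetical and exist only for illustration.

```r
# Toy data frames (hypothetical values, for illustration only)
d1 <- data.frame(ID = c(3, 1, 2, 4))
d2 <- data.frame(ID = c(1, 2, 3),
                 variable = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# For each d1$ID, find the position of the first equal ID in d2
idx <- match(d1$ID, d2$ID)

# Copy the values over; IDs missing from d2 (here, ID 4) receive NA
d1$variable <- d2$variable[idx]
print(d1)
```

Because match() returns only the first matching position, this works best when ID is unique in d2; with duplicated IDs, only the first matching row is used.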

To compute a specific statistic per category (sum per year, in this example):

library(plyr)

annual_sum <- ddply(.data = d1, .(year), .fun = summarise, total = sum(variable))
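
If plyr is not available, a similar per-category sum can be sketched in base R with aggregate(). The toy data below is hypothetical. One difference to keep in mind: the formula interface of aggregate() drops rows with missing values by default, whereas sum(variable) inside ddply would return NA for a group containing an NA.

```r
# Toy data (hypothetical values, for illustration only)
d1 <- data.frame(
  year     = c(2003, 2003, 2004, 2004, 2004),
  variable = c(1, 2, 3, 4, 5)
)

# Sum `variable` within each year using base R
annual_sum <- aggregate(variable ~ year, data = d1, FUN = sum)

# Rename the summed column to mirror the `total` column of the plyr version
names(annual_sum)[2] <- "total"
print(annual_sum)
```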