Land Development Risk

As each day passes, the amount of land on Earth available to store carbon naturally decreases. Land development competes with Net Zero goals, yet a growing population requires more housing and more energy. The planet's future depends on sustainable, climate-smart land management that balances growth with ecosystem resilience. Carbon emissions, food security, timber supply, water management, clean air, and biodiversity all depend on protecting land with valuable natural capital from increasing development pressures. Without proactive work to protect these lands, unsustainable development may irreversibly deplete these resources. Identifying the parcels of land that are most likely to be developed enables the proactive land management needed for a climate-positive future.

Leveraging AI to Identify At-Risk Land

Development risk is defined as the desirability of a piece of land for real estate development, including commercial, industrial, and residential purposes. Without proper mitigation planning, such development can permanently erase nature-based sources of carbon storage and removal while degrading the natural capital and climate resilience of a region. Development risk is only useful to know before a piece of land is developed, and it can be leveraged strategically before market factors substantially affect land prices. To measure it, Upstream Carbon has developed an ensemble of machine learning models that produces a Greenfield Index. A parcel with a Greenfield Index close to 1 faces high development pressure, and a parcel with a Greenfield Index close to 0 faces low development pressure.

Data Sources

This model uses publicly available datasets. These datasets are combined and prepared by the Upstream Carbon team to train the machine learning models:

Parcel, ownership, and use code data: MassGIS
Census Data: 2016 ACS block group data retrieved from Google BigQuery
Road and landmark data: OpenStreetMap API
Elevation data: USGS 1 arc-second Digital Elevation Model
Conservation Data: National Conservation Easement Database
Wetlands Polygons: National Wetlands Inventory
Soil and land cover data: NRCS SSURGO-certified soils data for Massachusetts and National Land Cover Database (NLCD) for 2021

Map Layers

MA Wetlands: Wetland polygons (see above)
Protected Land: Parcels with Conservation Restrictions from MA tax data; polygons from the National Conservation Easement Database; or parcels currently zoned as undevelopable
Municipal Land: Land owned by town or state government
Current Use Land: Land enrolled in Chapter 61 (MA current use) from MA tax data
Greenfield Properties: Undeveloped, vacant parcels under private ownership, colored by development probability
Aboveground Carbon: Current (2020) estimated carbon pool density in grams per square meter

What are the strongest predictors of development risk?

Factors with simple relationships

The strongest predictors of development risk relate to where a parcel is and what it is like: parcels with favorable physical attributes and closer proximity to local amenities face a higher risk of development. The strongest predictors are:

Frontage: The amount of a parcel's boundary facing a road is the most important indicator of development risk. Higher frontage means higher development risk.
Nearby Developed Parcels: The percent of area surrounding a parcel that consists of developed properties is a significant indicator of that parcel's development risk. Higher nearby development means higher development risk.
Nearby Current Use Parcels: Many states operate a Current Use program that rewards landowners with property tax incentives for medium-term (e.g. 10-year) commitments to use the land for agriculture, forestry, or conservation. More nearby land in current use programs means lower development risk.
Distance to transportation infrastructure: Parcels closer to highways, primary roads, secondary roads, train stations, airports, and harbors have higher development risk.
Slope: Parcels with lower average slope (flatter land) and more uniform surface across the parcel are more likely to be developed than high-slope parcels or "bumpy" parcels.
Wetlands: Parcels with higher ratios of wetland area are less likely to be developed.
Soils: Parcels with higher proportions of high quality soil are more likely to be developed.
Electrical Infrastructure: Parcels closer to substations and transmission lines are more likely to be developed.
Last sale date: The more recently a parcel was last sold, the more likely it is to be developed.
Population and Economics: Parcels in areas with high population density or higher income per capita are more likely to be developed.
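
As a rough illustration of how a few of these predictors could be derived for a single parcel, the sketch below computes a nearby-developed share and two slope statistics in Python. The function names, the input shapes, and the use of standard deviation as a measure of "bumpiness" are assumptions made for this example, not a description of Upstream Carbon's feature pipeline.

```python
from statistics import mean, pstdev


def nearby_developed_share(neighbors):
    """Share of surrounding area that is developed.

    `neighbors` is a hypothetical list of (area, is_developed) pairs
    describing the parcels around the parcel of interest.
    """
    total_area = sum(area for area, _ in neighbors)
    if total_area == 0:
        return 0.0
    developed_area = sum(area for area, developed in neighbors if developed)
    return developed_area / total_area


def slope_features(slope_samples):
    """Average slope and its spread across a parcel.

    `slope_samples` is a hypothetical list of slope values sampled across
    the parcel. A low mean suggests flatter land; a low standard deviation
    suggests a more uniform surface (an assumed proxy for "bumpiness").
    """
    return mean(slope_samples), pstdev(slope_samples)


# Example inputs (made up for illustration only).
neighbors = [(2.0, True), (6.0, False)]
share = nearby_developed_share(neighbors)
avg_slope, slope_spread = slope_features([2.0, 4.0])
print(share, avg_slope, slope_spread)
```

Under the relationships listed above, a higher developed share and lower slope values would both point toward higher development risk.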

Model Details

This prediction model is a blended model that uses a proprietary training dataset curated by Upstream Carbon. The team at Upstream Carbon leveraged automated machine learning to optimize across modeling approaches and ensure outcomes are risk adjusted based on customer needs. The model is refreshed whenever new parcel data are published, which currently happens every six months.

Data for model training were sampled in a manner that increased the sample size of large-acreage parcels (which are of greater interest for conservation planning), kept sampling proportional to the area of the counties in which parcels were located, and distributed sampling across all counties. The training target was set to 1 for parcels whose use codes indicated development and 0 for parcels whose use codes indicated undeveloped, vacant land. Additional care was taken to exclude already-conserved and municipally owned land from the training set.

The model was then trained using five-fold cross-validation with a 20% holdout, with the data partitioned so that the model was trained on parcels from one set of municipalities and validated on a separate set of municipalities. The model was trained to optimize LogLoss, weighted to reward accurate predictions on larger parcels, producing a weighted LogLoss of 0.583 and a weighted AUC of 0.745 on cross-validation test sets.

When parcels were ranked by prediction (from 0 to 1), the 10% of parcels with the highest predicted probability had an observed development rate of 0.943 against a predicted rate of 0.937; the 10% with the lowest predicted probability had an observed development rate of 0.329 against a predicted rate of 0.352. Additional accuracy and performance reporting is available on request from matt@upstreamcarbon.com.
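
The evaluation steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline: the function names are hypothetical, the round-robin fold assignment is just one way to keep municipalities separate, and using parcel acreage as the LogLoss weight is an assumption (the text says only that larger parcels are weighted more heavily).

```python
import math


def assign_folds_by_municipality(municipalities, n_folds=5):
    """Map each municipality to one fold so that no municipality is split
    between training and validation (simple round-robin assignment)."""
    unique = sorted(set(municipalities))
    return {name: i % n_folds for i, name in enumerate(unique)}


def weighted_log_loss(y_true, y_pred, weights, eps=1e-15):
    """Weighted binary log loss. `weights` is assumed to be parcel acreage."""
    total = 0.0
    for y, p, w in zip(y_true, y_pred, weights):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)


def top_decile_calibration(y_true, y_pred):
    """Return (observed development rate, mean predicted probability) for
    the 10% of parcels with the highest predictions."""
    ranked = sorted(zip(y_pred, y_true), reverse=True)
    k = max(1, len(ranked) // 10)
    top = ranked[:k]
    observed = sum(y for _, y in top) / k
    predicted = sum(p for p, _ in top) / k
    return observed, predicted


# Example with made-up municipalities and predictions.
folds = assign_folds_by_municipality(
    ["Amherst", "Boston", "Amherst", "Concord"], n_folds=2
)
loss = weighted_log_loss([1, 0], [0.5, 0.5], [1.0, 3.0])
print(folds, loss)
```

For reference, a model that predicts 0.5 for every parcel scores ln 2 (about 0.693) on LogLoss under any weighting, so lower values indicate predictions that carry real information. Comparing the observed and predicted rates within a decile, as in the figures reported above, checks whether the predicted probabilities are well calibrated.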

The current version of the model is an ensemble of XGBoost with Early Stopping, LightGBM on ElasticNet Predictions, and Nystroem Kernel SVM Classifier algorithms. Prediction explanations displayed in the application are produced using SHapley Additive exPlanations (SHAP). The predictions are then compared with other metrics of interest (e.g. Carbon, Soil Quality) to measure the amount of natural capital at risk of loss.
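
One simple way to combine a development prediction with a natural capital metric is to weight each parcel's carbon stock by its Greenfield Index and rank the results. The Python sketch below assumes this multiplicative form and uses hypothetical field names (`greenfield_index`, `carbon_tonnes`); the platform's actual comparison may differ.

```python
def rank_by_carbon_at_risk(parcels):
    """Rank parcels by Greenfield Index multiplied by carbon stock.

    `parcels` is a hypothetical list of dicts with the keys 'id',
    'greenfield_index' (0 to 1), and 'carbon_tonnes'. The product is an
    assumed proxy for carbon at risk, not a documented platform metric.
    """
    scored = [
        (p["id"], p["greenfield_index"] * p["carbon_tonnes"]) for p in parcels
    ]
    # Highest carbon at risk first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Example with made-up parcels.
parcels = [
    {"id": "A", "greenfield_index": 0.5, "carbon_tonnes": 100.0},
    {"id": "B", "greenfield_index": 0.25, "carbon_tonnes": 400.0},
]
print(rank_by_carbon_at_risk(parcels))
```

Under this assumed scoring, a parcel with a moderate Greenfield Index but a large carbon stock can rank above a higher-risk parcel that stores little carbon.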

Modeling Beyond Development Risk

Upstream Carbon's goal is to equip every organization for climate-smart and sustainable land management. Development risk is just one metric among many worth considering. Our team is excited to explore modeling and analytics opportunities with our customers beyond the current scope of the platform. Requests for custom models, comparison metrics, or other improvements can be sent to matt@upstreamcarbon.com.