Data Cleaning – Project: Barcelona – Real Estate

Python

Pandas, Requests, Beautiful Soup


JUPYTER LAB – Barcelona Real Estate Market – Property Rent Data



BCN_REAL_ESTATE_API_MARKDOWN



JUPYTER LAB

BARCELONA: Real Estate Market – Property Rent

METHODOLOGY

DATA

Data can be accessed via APIs. Data is stored by quarter. Each quarter has its own API and contains:

  • data aggregated at the city level
  • data aggregated at the neighbourhood level (for all 73 neighbourhoods comprising the city)

Relevant data to extract is at the Neighbourhood level – 73 neighburhoods in total, in particular:

  • DISTRICT_CODE
  • DISTRICT_NAME
  • NEIGHBOURHOOD_CODE
  • NEIGHBOURHOOD_NAME
  • RENT_QUARTER_DATE: datetype added by merging RENT_YEAR and RENT_QUARTER
  • RENT_YEAR
  • RENT_QUARTER
  • RENT_PRICE_EUROS_x_MQ: average rent per square metre by neighbourhood
  • RENT_PRICE_EUROS_x_MONTH: average rent per month by neighbourhood

Source: Adjuntamiento de Barcelona – https://opendata-ajuntament.barcelona.cat/data/es/dataset/est-mercat-immobiliari-lloguer-mitja-mensual

PROCESS

API

APIs are constructed and added to a list to be iterated later

DATA NORMALISATION:

District and Neighbourhood codes need to have a ‘0’ in front of all codes up to 10.

DATA EXPORT TO EXCEL

Results are exported onto an Excel file. The prefix enables to differenciate the file and needs to be set at the beginning. The full name of the file, the sheet_name and the index_label are assigned automatically. 

SEQUENCE

1) API
2) DATA NORMALISATION
3) DATA EXPORT TO EXCEL



BCN_REAL_ESTATE_API_CODE_QUARTER



JUPYTER LAB

BARCELONA: Real Estate Market – Property Rent

REQUIRED LIBRARIES

In [17]:

import pandas as pd
import requests
Average monthly rent (€/month) and per surface (€/m2) of the city of Barcelona
CREATE LIST OF RESOURCE IDS

In [18]:

#RESOURCE IDs:
R_ID_2022 = 'e96bf614-467b-40ab-91b9-e48a616ea775'
R_ID_2021 = 'cfc45f2b-62eb-4621-8486-1b90e36b4bfe'
R_ID_2020 = '47c9d64d-317a-45d0-8c45-45488df8601c'
R_ID_2019 = '004c76b1-6269-4136-89b2-89fd47046930'
R_ID_2018 = '3dc45b16-42a9-4f57-9863-e6d1a4f5869f'
R_ID_LIST = [R_ID_2022,R_ID_2021,R_ID_2020,R_ID_2019,R_ID_2018]
SET PREFIX AND PERIOD TO NAME COLUMNS AND EXCEL EXPORT FILE

In [19]:

#PREFIX for COLUMN NAMES
col_prefix = 'RENT'
#SET PERIOD for EXCEL FILE NAME
start_year = 2018
end_year = 2022

In [23]:

#NAME THE EXPORT EXCEL FILE
excel_file_name = 'T_REAL_ESTATE_{}_YEAR_QUARTER_{}-{}.xlsx'.format(col_prefix,end_year,start_year)
excel_sheet_name = '{}'.format(col_prefix)
excel_index_label = '{}_INDEX'.format(col_prefix)
#NAME COLUMNS
col_DISTRICT_CODE = '{}_DISTRICT_CODE'.format(col_prefix)
col_DISTRICT_NAME = '{}_DISTRICT_NAME'.format(col_prefix)
col_NEIGHBOURHOOD_CODE = '{}_NEIGHBOURHOOD_CODE'.format(col_prefix)
col_NEIGHBOURHOOD_NAME = '{}_NEIGHBOURHOOD_NAME'.format(col_prefix)
col_QUARTER_DATE = '{}_QUARTER_DATE'.format(col_prefix)
col_YEAR = '{}_YEAR'.format(col_prefix)
col_QUARTER = '{}_QUARTER'.format(col_prefix)
col_PRICE_EUROS_x_MONTH = '{}_PRICE_EUROS_x_MONTH'.format(col_prefix)
col_PRICE_EUROS_x_MQ = '{}_PRICE_EUROS_x_MQ'.format(col_prefix)
column_order = [col_DISTRICT_CODE,col_DISTRICT_NAME,col_NEIGHBOURHOOD_CODE,col_NEIGHBOURHOOD_NAME,col_YEAR,col_QUARTER,col_PRICE_EUROS_x_MQ,col_PRICE_EUROS_x_MONTH]
#CREATE API URL List:
API_URL_LIST = []
for R_ID in R_ID_LIST:
    url = 'https://opendata-ajuntament.barcelona.cat/data/api/action/datastore_search_sql?sql=SELECT%20*%20from%20%22{}%22'.format(R_ID)
    API_URL_LIST.append(url)
#CREATE A FINAL DATAFRAME WHERE TO CONCAT TOGETHER ALL SINGLE URL DATAFRAMES
df_final = pd.DataFrame()
#API
for url in API_URL_LIST:
    response = requests.get(url)
    if response.ok:
        data = response.json()
        result = data.get('result')
        records = result.get('records')
    else:
        print('Problem with: ', url)
    df = pd.DataFrame.from_dict(records)
    df = df.drop(df.columns[[3,8]],axis = 1)
    df_EUROS_MONTH = df.loc[df['Lloguer_mitja'] == 'Lloguer mitjà mensual (Euros/mes)']
    df_EUROS_MONTH = df_EUROS_MONTH.rename(columns={'Preu':'Preu_Euros_Month'})
    df_EUROS_MQ = df.loc[df['Lloguer_mitja'] == 'Lloguer mitjà per superfície (Euros/m2 mes)']
    df_merge = pd.merge(df_EUROS_MQ, df_EUROS_MONTH[['Codi_Barri','Any','Preu_Euros_Month']],on=['Codi_Barri','Any'], how='inner')
    df_merge = df_merge.drop(columns=['Lloguer_mitja'])
    #RENAME COLUMNS
    df_merge.columns = [col_DISTRICT_CODE, col_DISTRICT_NAME, col_QUARTER, col_PRICE_EUROS_x_MQ, col_NEIGHBOURHOOD_NAME, col_NEIGHBOURHOOD_CODE, col_YEAR, col_PRICE_EUROS_x_MONTH]
    #REORDER COLUMNS
    df_concat = df_merge[column_order]
    #CONCAT TO FINAL DATAFRAME
    df_final = pd.concat([df_final,df_concat], ignore_index=True)
#DISTRICT AND NEIGHBOURHOOD CODES: INSERT '0' IN FRONT OF DISTRICT CODES AND NEIGHBOURHOOD CODES UP TO 10
df_final[[col_DISTRICT_CODE]] = df_final[[col_DISTRICT_CODE]].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
df_final[[col_NEIGHBOURHOOD_CODE]] = df_final[[col_NEIGHBOURHOOD_CODE]].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
#REPLACE NA AND - WITH EMPTY CELL
df_final[col_PRICE_EUROS_x_MQ] = df_final[col_PRICE_EUROS_x_MQ].str.replace('-','').str.replace('NA','')
df_final[col_PRICE_EUROS_x_MONTH] = df_final[col_PRICE_EUROS_x_MONTH].str.replace('-','').str.replace('NA','')
#RESET DATA TYPES
df_final[col_DISTRICT_CODE] = df_final[col_DISTRICT_CODE].astype('string')
df_final[col_DISTRICT_NAME] = df_final[col_DISTRICT_NAME].astype('string')
df_final[col_NEIGHBOURHOOD_CODE] = df_final[col_NEIGHBOURHOOD_CODE].astype('string')
df_final[col_NEIGHBOURHOOD_NAME] = df_final[col_NEIGHBOURHOOD_NAME].astype('string')
df_final[col_YEAR] = df_final[col_YEAR].astype('int64')
df_final[col_QUARTER] = df_final[col_QUARTER].astype('int64')
#INSERT DATE COLUMN DERIVED FROM YEAR AND QUARTER COLUMNS
df_final.insert(4, col_QUARTER_DATE, pd.PeriodIndex(year=df_final[col_YEAR], quarter=df_final[col_QUARTER]).to_timestamp())
#DATAFRAME EXPORT TO EXCEL
df_final.to_excel(excel_file_name, sheet_name= excel_sheet_name, index_label=excel_index_label)

In [24]:

df_final

Out[24]:

RENT_DISTRICT_CODERENT_DISTRICT_NAMERENT_NEIGHBOURHOOD_CODERENT_NEIGHBOURHOOD_NAMERENT_QUARTER_DATERENT_YEARRENT_QUARTERRENT_PRICE_EUROS_x_MQRENT_PRICE_EUROS_x_MONTH
001Ciutat Vella01el Raval2022-01-012022115.5917.0
101Ciutat Vella01el Raval2022-01-012022115.5872.0
201Ciutat Vella01el Raval2022-04-012022214.7917.0
301Ciutat Vella01el Raval2022-04-012022214.7872.0
401Ciutat Vella02el Barri Gòtic2022-01-012022117.21181.6
495910Sant Martí73la Verneda i la Pau2018-07-012018311.42756.1
496010Sant Martí73la Verneda i la Pau2018-10-012018411.81715.37
496110Sant Martí73la Verneda i la Pau2018-10-012018411.81719.55
496210Sant Martí73la Verneda i la Pau2018-10-012018411.81767.1
496310Sant Martí73la Verneda i la Pau2018-10-012018411.81756.1

4964 rows × 9 columns

JUPYTER LAB – Barcelona Real Estate Market – Property Sale Data



BCN_REAL_ESTATE_WEBSCRAPING_MARKDOWN



JUPYTER LAB

BARCELONA: Real Estate Market – Property Sale

METHODOLOGY

DATA

Data is stored in html pages. Each page refers to a year and contains:

  • data aggregated at the city level
  • data aggregated at the neighbourhood level (for all 73 neighbourhoods comprising the city)

Relevant data to extract is at the Neighbourhood level – 73 neighburhoods in total, in particular:

  • DISTRICT_CODE
  • NEIGHBOURHOOD_CODE
  • NEIGHBOURHOOD_NAME
  • YEAR
  • NUMBER_NEW: number of newly built houses traded
  • NUMBER_PROTECTED: number of houses in protected regime traded
  • NUMBER_USED: number of preexisting houses traded
  • AVG_MQ_NEW: average size in metre squares of newly built houses traded
  • AVG_MQ_PROTECTED: average size in metre squares of houses in protected regime traded
  • AVG_MQ_USED: average size in metre squares of preexisting houses traded
  • PRICE_EUROS_X1000_NEW: average price in thousand euros of newly built houses traded
  • PRICE_EUROS_X1000_USED: average price in thousand euros of preexisting houses traded
  • PRICE_EUROS_MQ_NEW: average price in euros per metre square of newly built houses traded
  • PRICE_EUROS_MQ_USED: average price in euros per metre square of preexisting houses traded

Source: Adjuntamiento de Barcelona – https://ajuntament.barcelona.cat/estadistica/catala/Estadistiques_per_temes/Habitatge_i_mercat_immobiliari/Mercat_immobiliari/Compravenda_habitatges/index.htm

PROCESS

WEB SCRAPING

The first column of the html page contains merged data is this order:

  • District Code
  • Neighbourhood Code
  • Neighbourhood Name
    These codes and names are extracted and placed into separate columns for later reference in visualizations.
    This is done by replacing the 5 spaces xa0 between the two codes by with a to enable a column split to separate the codes into distinct columns.
    The
    character is later removed from the NEIGHBOURHOOD_NAME column.
    District and Neighbourhood codes need to have a ‘0’ in front of all codes up to 10.

DATA CLEANING, NORMALISATION AND TRANSLATION:

Data is cleaned of spaces (Xa0) and dots (.) in numbers during web scraping.

DATA EXPORT TO EXCEL

Results are exported onto an Excel file. The prefix enables to differenciate the file and needs to be set at the beginning. The full name of the file, the sheet_name and the index_label are assigned automatically. 

SEQUENCE

1) WEB SCRAPING
2) DATA CLEANING, NORMALISATION AND TRANSLATION
3) DISTRICT AND NEIGHBOURHOOD CODES: INSERT ‘0’ IN FRONT OF DISTRICT CODES AND NEIGHBOURHOOD CODES UP TO 10
4) DATA EXPORT TO EXCEL



BCN_REAL_ESTATE_WEBSCRAPING_CODE_QUARTER



JUPYTER LAB

BARCELONA: Real Estate Market – Property Sale

REQUIRED LIBRARIES

In [1]:

import requests
import pandas as pd
from bs4 import BeautifulSoup
SET THE PERIOD

In [2]:

start_year = 2018
end_year = 2022
SET PREFIX FOR COLUMNS AND EXCEL FILE NAME

In [3]:

#PREFIX for COLUMN NAMES
col_prefix = 'SALE'
WEB SCRAPING AND DATA EXPORT

In [6]:

#TO NAME THE EXPORT EXCEL FILE
excel_file_name = 'T_REAL_ESTATE_{}_YEAR_QUARTER_{}-{}.xlsx'.format(col_prefix,end_year,start_year)
excel_sheet_name = '{}'.format(col_prefix)
excel_index_label = '{}_INDEX'.format(col_prefix)
#TO NAME COLUMNS
col_DISTRICT_CODE = '{}_DISTRICT_CODE'.format(col_prefix)
col_NEIGHBOURHOOD_CODE = '{}_NEIGHBOURHOOD_CODE'.format(col_prefix)
col_NEIGHBOURHOOD_NAME = '{}_NEIGHBOURHOOD_NAME'.format(col_prefix)
col_QUARTER_DATE = '{}_QUARTER_DATE'.format(col_prefix)
col_YEAR = '{}_YEAR'.format(col_prefix)
col_QUARTER = '{}_QUARTER'.format(col_prefix)
col_NUMBER_NEW = '{}_NUMBER_NEW'.format(col_prefix)
col_NUMBER_PROTECTED = '{}_NUMBER_PROTECTED'.format(col_prefix)
col_NUMBER_USED = '{}_NUMBER_USED'.format(col_prefix)
col_AVG_MQ_NEW = '{}_AVG_MQ_NEW'.format(col_prefix)
col_AVG_MQ_PROTECTED = '{}_AVG_MQ_PROTECTED'.format(col_prefix)
col_AVG_MQ_USED = '{}_AVG_MQ_USED'.format(col_prefix)
col_PRICE_EUROS_x_1000_NEW = '{}_EUROS_x_1000_NEW'.format(col_prefix)
col_PRICE_EUROS_x_1000_USED = '{}_EUROS_x_1000_USED'.format(col_prefix)
col_PRICE_EUROS_x_MQ_NEW = '{}_PRICE_EUROS_x_MQ_NEW'.format(col_prefix)
col_PRICE_EUROS_x_MQ_USED = '{}_PRICE_EUROS_x_MQ_USED'.format(col_prefix)
#TO CREATE A LIST OF YEARS
years = list(reversed(range(start_year,end_year+1,1)))
#CREATE LIST OF QUARTERS
quarters = list(reversed(range(1,5,1)))
#TO CREATE A FINAL DATAFRAME WHERE TO CONCAT TOGETHER ALL SINGLE URL DATAFRAMES
df_final = pd.DataFrame()
#WEBSCRAPING ALL SINGLE URLs AND CONCAT SINGLE URL DATAFRAMES TO FINAL DATAFRAME
for year in years:
    df_year =pd.DataFrame()
    for quarter in quarters:
        url ='https://ajuntament.barcelona.cat/estadistica/catala/Estadistiques_per_temes/Habitatge_i_mercat_immobiliari/Mercat_immobiliari/Compravenda_habitatges/a{}/t0{}02.htm'.format(year,quarter)
        #DISCART 4th AND 3rd QUARTERS AS DATA NOT AVAILABLE YET
        if (year == 2022 and quarter == 4) | (year == 2022 and quarter == 3):
            continue
        response = requests.get(url).text
        soup = BeautifulSoup(response, 'html.parser')
        tables = soup.find_all('tr')
        columns = []
        for table in tables:
            headers = table.find_all('td', class_='WhadsColVar1')
            for header in headers:
                column = header.getText().strip().replace('xa0xa0xa0xa0xa0','').replace('xa0','').replace(' ','')#trim spaces at the beginning and at the end and remove 5X spaces xa0 in between and other spaces within strings
                columns.append(column)
        data = []
        for table in tables:
            row =[]
            neighbourhoods = table.find_all('td', class_='WhadsRowVar3')
            for n in neighbourhoods:
                neighbourhood = n.getText().strip().replace('xa0xa0xa0xa0xa0','*')#trim spaces at the beginning and at the end and remove 5X spaces xa0 in between and replace with * to use later for split
                row.append(neighbourhood)
            values = table.find_all('td', class_='WhadsDades')
            for v in values:
                    value = v.getText().replace('.','').replace(',','.').replace('-','').replace('NA','').strip()#trim spaces xa0 and remove . dots from numbers (to avoid problems with number conventions later)
                    row.append(value)
            data.append(row)
        #SINGLE URL DATAFRAME
        df_quarter = pd.DataFrame(data, columns = columns)
        #insert columns at the left of the dataframe: NAT_DISTRICT_CODE, NAT_NEIGHBOURHOOD_CODE,NAT_NEIGHBOURHOOD_NAME
        df_quarter.insert(0,'col_NEIGHBOURHOOD_NAME', df_quarter['Dte.Barris'].astype(str).str.replace('*','',regex=True).str.replace('d+', '',regex=True).str.replace('.','',regex=True).str.lstrip())#remove the split character *, digits, dot, and space left at the beginning
        df_quarter[['split1','split2']] = df_quarter['Dte.Barris'].str.split('*',expand=True)
        df_quarter.insert(0,'col_NEIGHBOURHOOD_CODE', df_quarter['split2'].str.extract('(d+)'))
        df_quarter.insert(0,'col_DISTRICT_CODE',df_quarter['split1'])
        #ADD YEAR COLUMN
        df_quarter.insert(3,'col_YEAR', year)
        #ADD QUARTER COLUMN
        df_quarter.insert(4,'col_QUARTER', quarter)
        #keep columns of interest
        df_quarter.drop(['Dte.Barris','Total','','split1','split2'],axis=1, inplace=True)
        #keep rows of interest
        r1 = []
        for i in range(10,83):
            r1.append(i)
        df_quarter = df_quarter.iloc[r1,:]
        #RENAME COLUMNS
        df_quarter.columns = [
            col_DISTRICT_CODE,
            col_NEIGHBOURHOOD_CODE,
            col_NEIGHBOURHOOD_NAME,
            col_YEAR,
            col_QUARTER,
            col_NUMBER_NEW,
            col_NUMBER_PROTECTED,
            col_NUMBER_USED,
            col_AVG_MQ_NEW,
            col_AVG_MQ_PROTECTED,
            col_AVG_MQ_USED,
            col_PRICE_EUROS_x_1000_NEW,
            col_PRICE_EUROS_x_1000_USED,
            col_PRICE_EUROS_x_MQ_NEW,
            col_PRICE_EUROS_x_MQ_USED
            ]
        #CONCAT SINGLE QUARTER URL DATAFRAMES INTO MERGED YEAR DATAFRAME
        df_year = pd.concat((df_year,df_quarter), ignore_index=True)
    #CONCAT SINGLE QUARTER URL DATAFRAMES INTO MERGED YEAR DATAFRAME
    df_final = pd.concat((df_final,df_year), ignore_index=True)
#DISTRICT AND NEIGHBOURHOOD CODES: INSERT '0' IN FRONT OF DISTRICT CODES AND NEIGHBOURHOOD CODES UP TO 10
df_final[[col_DISTRICT_CODE]] = df_final[[col_DISTRICT_CODE]].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
df_final[[col_NEIGHBOURHOOD_CODE]] = df_final[[col_NEIGHBOURHOOD_CODE]].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])   
#RESET DATA TYPES
df_final[col_DISTRICT_CODE] = df_final[col_DISTRICT_CODE].astype('string')
df_final[col_NEIGHBOURHOOD_CODE] = df_final[col_NEIGHBOURHOOD_CODE].astype('string')
df_final[col_NEIGHBOURHOOD_NAME] = df_final[col_NEIGHBOURHOOD_NAME].astype('string')
df_final[col_YEAR] = df_final[col_YEAR].astype('int64')
df_final[col_QUARTER] = df_final[col_QUARTER].astype('int64')
#INSERT DATE COLUMN DERIVED FROM YEAR AND QUARTER COLUMNS
df_final.insert(3, col_QUARTER_DATE, pd.PeriodIndex(year=df_final[col_YEAR], quarter=df_final[col_QUARTER]).to_timestamp())
#DATAFRAME EXPORT TO EXCEL
df_final.to_excel(excel_file_name, sheet_name= excel_sheet_name, index_label=excel_index_label)

In [7]:

df_final

Out[7]:

SALE_DISTRICT_CODESALE_NEIGHBOURHOOD_CODESALE_NEIGHBOURHOOD_NAMESALE_QUARTER_DATESALE_YEARSALE_QUARTERSALE_NUMBER_NEWSALE_NUMBER_PROTECTEDSALE_NUMBER_USEDSALE_AVG_MQ_NEWSALE_AVG_MQ_PROTECTEDSALE_AVG_MQ_USEDSALE_EUROS_x_1000_NEWSALE_EUROS_x_1000_USEDSALE_PRICE_EUROS_x_MQ_NEWSALE_PRICE_EUROS_x_MQ_USED
00101el Raval2022-04-0120222310966.067.0395.5254.95960.23632.6
10102el Barri Gòtic2022-04-012022265988.085.9326.4404.14297.54919.5
20103la Barceloneta2022-04-012022225344.5206.44704.7
30104Sant Pere, Santa Caterina i la Ribera2022-04-012022236661.366.9219.2307.73604.14371.1
40205el Fort Pienc2022-04-012022215692.090.5370.64208.9
13091069Diagonal Mar i el Front Marítim del Poblenou2018-01-01201811295.3509.75032.1
13101070el Besòs i el Maresme2018-01-0120181882102.070.4397.9236.93795.72985.9
13111071Provençals del Poblenou2018-01-01201811872.067.9251.33909.4
13121072Sant Martí de Provençals2018-01-012018113945.074.7270.33451.5
13131073la Verneda i la Pau2018-01-0120181118678.066.072.2153.22180.1

1314 rows × 16 columns