Python
Pandas, Requests, Beautiful Soup
JUPYTER LAB – Barcelona Real Estate Market – Property Rent Data
JUPYTER LAB¶
BARCELONA: Real Estate Market – Property Rent¶
METHODOLOGY¶
DATA¶
Data can be accessed via APIs. Data is stored by quarter. Each quarter has its own API and contains:
- data aggregated at the city level
- data aggregated at the neighbourhood level (for all 73 neighbourhoods comprising the city)
Relevant data to extract is at the Neighbourhood level – 73 neighburhoods in total, in particular:
- DISTRICT_CODE
- DISTRICT_NAME
- NEIGHBOURHOOD_CODE
- NEIGHBOURHOOD_NAME
- RENT_QUARTER_DATE: datetype added by merging RENT_YEAR and RENT_QUARTER
- RENT_YEAR
- RENT_QUARTER
- RENT_PRICE_EUROS_x_MQ: average rent per square metre by neighbourhood
- RENT_PRICE_EUROS_x_MONTH: average rent per month by neighbourhood
Source: Adjuntamiento de Barcelona – https://opendata-ajuntament.barcelona.cat/data/es/dataset/est-mercat-immobiliari-lloguer-mitja-mensual
PROCESS¶
API¶
APIs are constructed and added to a list to be iterated later
DATA NORMALISATION:¶
District and Neighbourhood codes need to have a ‘0’ in front of all codes up to 10.
DATA EXPORT TO EXCEL¶
Results are exported onto an Excel file. The prefix enables to differenciate the file and needs to be set at the beginning. The full name of the file, the sheet_name and the index_label are assigned automatically.
SEQUENCE¶
1) API
2) DATA NORMALISATION
3) DATA EXPORT TO EXCEL
JUPYTER LAB¶
BARCELONA: Real Estate Market – Property Rent¶
REQUIRED LIBRARIES¶
In [17]:
import pandas as pd import requests
Average monthly rent (€/month) and per surface (€/m2) of the city of Barcelona¶
CREATE LIST OF RESOURCE IDS¶
In [18]:
#RESOURCE IDs: R_ID_2022 = 'e96bf614-467b-40ab-91b9-e48a616ea775' R_ID_2021 = 'cfc45f2b-62eb-4621-8486-1b90e36b4bfe' R_ID_2020 = '47c9d64d-317a-45d0-8c45-45488df8601c' R_ID_2019 = '004c76b1-6269-4136-89b2-89fd47046930' R_ID_2018 = '3dc45b16-42a9-4f57-9863-e6d1a4f5869f' R_ID_LIST = [R_ID_2022,R_ID_2021,R_ID_2020,R_ID_2019,R_ID_2018]
SET PREFIX AND PERIOD TO NAME COLUMNS AND EXCEL EXPORT FILE¶
In [19]:
#PREFIX for COLUMN NAMES col_prefix = 'RENT' #SET PERIOD for EXCEL FILE NAME start_year = 2018 end_year = 2022
In [23]:
#NAME THE EXPORT EXCEL FILE excel_file_name = 'T_REAL_ESTATE_{}_YEAR_QUARTER_{}-{}.xlsx'.format(col_prefix,end_year,start_year) excel_sheet_name = '{}'.format(col_prefix) excel_index_label = '{}_INDEX'.format(col_prefix) #NAME COLUMNS col_DISTRICT_CODE = '{}_DISTRICT_CODE'.format(col_prefix) col_DISTRICT_NAME = '{}_DISTRICT_NAME'.format(col_prefix) col_NEIGHBOURHOOD_CODE = '{}_NEIGHBOURHOOD_CODE'.format(col_prefix) col_NEIGHBOURHOOD_NAME = '{}_NEIGHBOURHOOD_NAME'.format(col_prefix) col_QUARTER_DATE = '{}_QUARTER_DATE'.format(col_prefix) col_YEAR = '{}_YEAR'.format(col_prefix) col_QUARTER = '{}_QUARTER'.format(col_prefix) col_PRICE_EUROS_x_MONTH = '{}_PRICE_EUROS_x_MONTH'.format(col_prefix) col_PRICE_EUROS_x_MQ = '{}_PRICE_EUROS_x_MQ'.format(col_prefix) column_order = [col_DISTRICT_CODE,col_DISTRICT_NAME,col_NEIGHBOURHOOD_CODE,col_NEIGHBOURHOOD_NAME,col_YEAR,col_QUARTER,col_PRICE_EUROS_x_MQ,col_PRICE_EUROS_x_MONTH] #CREATE API URL List: API_URL_LIST = [] for R_ID in R_ID_LIST: url = 'https://opendata-ajuntament.barcelona.cat/data/api/action/datastore_search_sql?sql=SELECT%20*%20from%20%22{}%22'.format(R_ID) API_URL_LIST.append(url) #CREATE A FINAL DATAFRAME WHERE TO CONCAT TOGETHER ALL SINGLE URL DATAFRAMES df_final = pd.DataFrame() #API for url in API_URL_LIST: response = requests.get(url) if response.ok: data = response.json() result = data.get('result') records = result.get('records') else: print('Problem with: ', url) df = pd.DataFrame.from_dict(records) df = df.drop(df.columns[[3,8]],axis = 1) df_EUROS_MONTH = df.loc[df['Lloguer_mitja'] == 'Lloguer mitjà mensual (Euros/mes)'] df_EUROS_MONTH = df_EUROS_MONTH.rename(columns={'Preu':'Preu_Euros_Month'}) df_EUROS_MQ = df.loc[df['Lloguer_mitja'] == 'Lloguer mitjà per superfície (Euros/m2 mes)'] df_merge = pd.merge(df_EUROS_MQ, df_EUROS_MONTH[['Codi_Barri','Any','Preu_Euros_Month']],on=['Codi_Barri','Any'], how='inner') df_merge = df_merge.drop(columns=['Lloguer_mitja']) #RENAME COLUMNS df_merge.columns = [col_DISTRICT_CODE, col_DISTRICT_NAME, col_QUARTER, col_PRICE_EUROS_x_MQ, col_NEIGHBOURHOOD_NAME, col_NEIGHBOURHOOD_CODE, col_YEAR, col_PRICE_EUROS_x_MONTH] #REORDER COLUMNS df_concat = df_merge[column_order] #CONCAT TO FINAL DATAFRAME df_final = pd.concat([df_final,df_concat], ignore_index=True) #DISTRICT AND NEIGHBOURHOOD CODES: INSERT '0' IN FRONT OF DISTRICT CODES AND NEIGHBOURHOOD CODES UP TO 10 df_final[[col_DISTRICT_CODE]] = df_final[[col_DISTRICT_CODE]].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x]) df_final[[col_NEIGHBOURHOOD_CODE]] = df_final[[col_NEIGHBOURHOOD_CODE]].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x]) #REPLACE NA AND - WITH EMPTY CELL df_final[col_PRICE_EUROS_x_MQ] = df_final[col_PRICE_EUROS_x_MQ].str.replace('-','').str.replace('NA','') df_final[col_PRICE_EUROS_x_MONTH] = df_final[col_PRICE_EUROS_x_MONTH].str.replace('-','').str.replace('NA','') #RESET DATA TYPES df_final[col_DISTRICT_CODE] = df_final[col_DISTRICT_CODE].astype('string') df_final[col_DISTRICT_NAME] = df_final[col_DISTRICT_NAME].astype('string') df_final[col_NEIGHBOURHOOD_CODE] = df_final[col_NEIGHBOURHOOD_CODE].astype('string') df_final[col_NEIGHBOURHOOD_NAME] = df_final[col_NEIGHBOURHOOD_NAME].astype('string') df_final[col_YEAR] = df_final[col_YEAR].astype('int64') df_final[col_QUARTER] = df_final[col_QUARTER].astype('int64') #INSERT DATE COLUMN DERIVED FROM YEAR AND QUARTER COLUMNS df_final.insert(4, col_QUARTER_DATE, pd.PeriodIndex(year=df_final[col_YEAR], quarter=df_final[col_QUARTER]).to_timestamp()) #DATAFRAME EXPORT TO EXCEL df_final.to_excel(excel_file_name, sheet_name= excel_sheet_name, index_label=excel_index_label)
In [24]:
df_final
Out[24]:
RENT_DISTRICT_CODE | RENT_DISTRICT_NAME | RENT_NEIGHBOURHOOD_CODE | RENT_NEIGHBOURHOOD_NAME | RENT_QUARTER_DATE | RENT_YEAR | RENT_QUARTER | RENT_PRICE_EUROS_x_MQ | RENT_PRICE_EUROS_x_MONTH | |
---|---|---|---|---|---|---|---|---|---|
0 | 01 | Ciutat Vella | 01 | el Raval | 2022-01-01 | 2022 | 1 | 15.5 | 917.0 |
1 | 01 | Ciutat Vella | 01 | el Raval | 2022-01-01 | 2022 | 1 | 15.5 | 872.0 |
2 | 01 | Ciutat Vella | 01 | el Raval | 2022-04-01 | 2022 | 2 | 14.7 | 917.0 |
3 | 01 | Ciutat Vella | 01 | el Raval | 2022-04-01 | 2022 | 2 | 14.7 | 872.0 |
4 | 01 | Ciutat Vella | 02 | el Barri Gòtic | 2022-01-01 | 2022 | 1 | 17.2 | 1181.6 |
… | … | … | … | … | … | … | … | … | … |
4959 | 10 | Sant Martí | 73 | la Verneda i la Pau | 2018-07-01 | 2018 | 3 | 11.42 | 756.1 |
4960 | 10 | Sant Martí | 73 | la Verneda i la Pau | 2018-10-01 | 2018 | 4 | 11.81 | 715.37 |
4961 | 10 | Sant Martí | 73 | la Verneda i la Pau | 2018-10-01 | 2018 | 4 | 11.81 | 719.55 |
4962 | 10 | Sant Martí | 73 | la Verneda i la Pau | 2018-10-01 | 2018 | 4 | 11.81 | 767.1 |
4963 | 10 | Sant Martí | 73 | la Verneda i la Pau | 2018-10-01 | 2018 | 4 | 11.81 | 756.1 |
4964 rows × 9 columns
JUPYTER LAB – Barcelona Real Estate Market – Property Sale Data
JUPYTER LAB¶
BARCELONA: Real Estate Market – Property Sale¶
METHODOLOGY¶
DATA¶
Data is stored in html pages. Each page refers to a year and contains:
- data aggregated at the city level
- data aggregated at the neighbourhood level (for all 73 neighbourhoods comprising the city)
Relevant data to extract is at the Neighbourhood level – 73 neighburhoods in total, in particular:
- DISTRICT_CODE
- NEIGHBOURHOOD_CODE
- NEIGHBOURHOOD_NAME
- YEAR
- NUMBER_NEW: number of newly built houses traded
- NUMBER_PROTECTED: number of houses in protected regime traded
- NUMBER_USED: number of preexisting houses traded
- AVG_MQ_NEW: average size in metre squares of newly built houses traded
- AVG_MQ_PROTECTED: average size in metre squares of houses in protected regime traded
- AVG_MQ_USED: average size in metre squares of preexisting houses traded
- PRICE_EUROS_X1000_NEW: average price in thousand euros of newly built houses traded
- PRICE_EUROS_X1000_USED: average price in thousand euros of preexisting houses traded
- PRICE_EUROS_MQ_NEW: average price in euros per metre square of newly built houses traded
- PRICE_EUROS_MQ_USED: average price in euros per metre square of preexisting houses traded
Source: Adjuntamiento de Barcelona – https://ajuntament.barcelona.cat/estadistica/catala/Estadistiques_per_temes/Habitatge_i_mercat_immobiliari/Mercat_immobiliari/Compravenda_habitatges/index.htm
PROCESS¶
WEB SCRAPING¶
The first column of the html page contains merged data is this order:
- District Code
- Neighbourhood Code
- Neighbourhood Name
These codes and names are extracted and placed into separate columns for later reference in visualizations.
This is done by replacing the 5 spaces xa0 between the two codes by with a to enable a column split to separate the codes into distinct columns.
The character is later removed from the NEIGHBOURHOOD_NAME column.
District and Neighbourhood codes need to have a ‘0’ in front of all codes up to 10.DATA CLEANING, NORMALISATION AND TRANSLATION:¶
Data is cleaned of spaces (Xa0) and dots (.) in numbers during web scraping.
DATA EXPORT TO EXCEL¶
Results are exported onto an Excel file. The prefix enables to differenciate the file and needs to be set at the beginning. The full name of the file, the sheet_name and the index_label are assigned automatically.
SEQUENCE¶
1) WEB SCRAPING
2) DATA CLEANING, NORMALISATION AND TRANSLATION
3) DISTRICT AND NEIGHBOURHOOD CODES: INSERT ‘0’ IN FRONT OF DISTRICT CODES AND NEIGHBOURHOOD CODES UP TO 10
4) DATA EXPORT TO EXCEL
JUPYTER LAB¶
BARCELONA: Real Estate Market – Property Sale¶
REQUIRED LIBRARIES¶
In [1]:
import requests import pandas as pd from bs4 import BeautifulSoup
SET THE PERIOD¶
In [2]:
start_year = 2018 end_year = 2022
SET PREFIX FOR COLUMNS AND EXCEL FILE NAME¶
In [3]:
#PREFIX for COLUMN NAMES col_prefix = 'SALE'
WEB SCRAPING AND DATA EXPORT¶
In [6]:
#TO NAME THE EXPORT EXCEL FILE excel_file_name = 'T_REAL_ESTATE_{}_YEAR_QUARTER_{}-{}.xlsx'.format(col_prefix,end_year,start_year) excel_sheet_name = '{}'.format(col_prefix) excel_index_label = '{}_INDEX'.format(col_prefix) #TO NAME COLUMNS col_DISTRICT_CODE = '{}_DISTRICT_CODE'.format(col_prefix) col_NEIGHBOURHOOD_CODE = '{}_NEIGHBOURHOOD_CODE'.format(col_prefix) col_NEIGHBOURHOOD_NAME = '{}_NEIGHBOURHOOD_NAME'.format(col_prefix) col_QUARTER_DATE = '{}_QUARTER_DATE'.format(col_prefix) col_YEAR = '{}_YEAR'.format(col_prefix) col_QUARTER = '{}_QUARTER'.format(col_prefix) col_NUMBER_NEW = '{}_NUMBER_NEW'.format(col_prefix) col_NUMBER_PROTECTED = '{}_NUMBER_PROTECTED'.format(col_prefix) col_NUMBER_USED = '{}_NUMBER_USED'.format(col_prefix) col_AVG_MQ_NEW = '{}_AVG_MQ_NEW'.format(col_prefix) col_AVG_MQ_PROTECTED = '{}_AVG_MQ_PROTECTED'.format(col_prefix) col_AVG_MQ_USED = '{}_AVG_MQ_USED'.format(col_prefix) col_PRICE_EUROS_x_1000_NEW = '{}_EUROS_x_1000_NEW'.format(col_prefix) col_PRICE_EUROS_x_1000_USED = '{}_EUROS_x_1000_USED'.format(col_prefix) col_PRICE_EUROS_x_MQ_NEW = '{}_PRICE_EUROS_x_MQ_NEW'.format(col_prefix) col_PRICE_EUROS_x_MQ_USED = '{}_PRICE_EUROS_x_MQ_USED'.format(col_prefix) #TO CREATE A LIST OF YEARS years = list(reversed(range(start_year,end_year+1,1))) #CREATE LIST OF QUARTERS quarters = list(reversed(range(1,5,1))) #TO CREATE A FINAL DATAFRAME WHERE TO CONCAT TOGETHER ALL SINGLE URL DATAFRAMES df_final = pd.DataFrame() #WEBSCRAPING ALL SINGLE URLs AND CONCAT SINGLE URL DATAFRAMES TO FINAL DATAFRAME for year in years: df_year =pd.DataFrame() for quarter in quarters: url ='https://ajuntament.barcelona.cat/estadistica/catala/Estadistiques_per_temes/Habitatge_i_mercat_immobiliari/Mercat_immobiliari/Compravenda_habitatges/a{}/t0{}02.htm'.format(year,quarter) #DISCART 4th AND 3rd QUARTERS AS DATA NOT AVAILABLE YET if (year == 2022 and quarter == 4) | (year == 2022 and quarter == 3): continue response = requests.get(url).text soup = BeautifulSoup(response, 'html.parser') tables = soup.find_all('tr') columns = [] for table in tables: headers = table.find_all('td', class_='WhadsColVar1') for header in headers: column = header.getText().strip().replace('xa0xa0xa0xa0xa0','').replace('xa0','').replace(' ','')#trim spaces at the beginning and at the end and remove 5X spaces xa0 in between and other spaces within strings columns.append(column) data = [] for table in tables: row =[] neighbourhoods = table.find_all('td', class_='WhadsRowVar3') for n in neighbourhoods: neighbourhood = n.getText().strip().replace('xa0xa0xa0xa0xa0','*')#trim spaces at the beginning and at the end and remove 5X spaces xa0 in between and replace with * to use later for split row.append(neighbourhood) values = table.find_all('td', class_='WhadsDades') for v in values: value = v.getText().replace('.','').replace(',','.').replace('-','').replace('NA','').strip()#trim spaces xa0 and remove . dots from numbers (to avoid problems with number conventions later) row.append(value) data.append(row) #SINGLE URL DATAFRAME df_quarter = pd.DataFrame(data, columns = columns) #insert columns at the left of the dataframe: NAT_DISTRICT_CODE, NAT_NEIGHBOURHOOD_CODE,NAT_NEIGHBOURHOOD_NAME df_quarter.insert(0,'col_NEIGHBOURHOOD_NAME', df_quarter['Dte.Barris'].astype(str).str.replace('*','',regex=True).str.replace('d+', '',regex=True).str.replace('.','',regex=True).str.lstrip())#remove the split character *, digits, dot, and space left at the beginning df_quarter[['split1','split2']] = df_quarter['Dte.Barris'].str.split('*',expand=True) df_quarter.insert(0,'col_NEIGHBOURHOOD_CODE', df_quarter['split2'].str.extract('(d+)')) df_quarter.insert(0,'col_DISTRICT_CODE',df_quarter['split1']) #ADD YEAR COLUMN df_quarter.insert(3,'col_YEAR', year) #ADD QUARTER COLUMN df_quarter.insert(4,'col_QUARTER', quarter) #keep columns of interest df_quarter.drop(['Dte.Barris','Total','','split1','split2'],axis=1, inplace=True) #keep rows of interest r1 = [] for i in range(10,83): r1.append(i) df_quarter = df_quarter.iloc[r1,:] #RENAME COLUMNS df_quarter.columns = [ col_DISTRICT_CODE, col_NEIGHBOURHOOD_CODE, col_NEIGHBOURHOOD_NAME, col_YEAR, col_QUARTER, col_NUMBER_NEW, col_NUMBER_PROTECTED, col_NUMBER_USED, col_AVG_MQ_NEW, col_AVG_MQ_PROTECTED, col_AVG_MQ_USED, col_PRICE_EUROS_x_1000_NEW, col_PRICE_EUROS_x_1000_USED, col_PRICE_EUROS_x_MQ_NEW, col_PRICE_EUROS_x_MQ_USED ] #CONCAT SINGLE QUARTER URL DATAFRAMES INTO MERGED YEAR DATAFRAME df_year = pd.concat((df_year,df_quarter), ignore_index=True) #CONCAT SINGLE QUARTER URL DATAFRAMES INTO MERGED YEAR DATAFRAME df_final = pd.concat((df_final,df_year), ignore_index=True) #DISTRICT AND NEIGHBOURHOOD CODES: INSERT '0' IN FRONT OF DISTRICT CODES AND NEIGHBOURHOOD CODES UP TO 10 df_final[[col_DISTRICT_CODE]] = df_final[[col_DISTRICT_CODE]].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x]) df_final[[col_NEIGHBOURHOOD_CODE]] = df_final[[col_NEIGHBOURHOOD_CODE]].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x]) #RESET DATA TYPES df_final[col_DISTRICT_CODE] = df_final[col_DISTRICT_CODE].astype('string') df_final[col_NEIGHBOURHOOD_CODE] = df_final[col_NEIGHBOURHOOD_CODE].astype('string') df_final[col_NEIGHBOURHOOD_NAME] = df_final[col_NEIGHBOURHOOD_NAME].astype('string') df_final[col_YEAR] = df_final[col_YEAR].astype('int64') df_final[col_QUARTER] = df_final[col_QUARTER].astype('int64') #INSERT DATE COLUMN DERIVED FROM YEAR AND QUARTER COLUMNS df_final.insert(3, col_QUARTER_DATE, pd.PeriodIndex(year=df_final[col_YEAR], quarter=df_final[col_QUARTER]).to_timestamp()) #DATAFRAME EXPORT TO EXCEL df_final.to_excel(excel_file_name, sheet_name= excel_sheet_name, index_label=excel_index_label)
In [7]:
df_final
Out[7]:
SALE_DISTRICT_CODE | SALE_NEIGHBOURHOOD_CODE | SALE_NEIGHBOURHOOD_NAME | SALE_QUARTER_DATE | SALE_YEAR | SALE_QUARTER | SALE_NUMBER_NEW | SALE_NUMBER_PROTECTED | SALE_NUMBER_USED | SALE_AVG_MQ_NEW | SALE_AVG_MQ_PROTECTED | SALE_AVG_MQ_USED | SALE_EUROS_x_1000_NEW | SALE_EUROS_x_1000_USED | SALE_PRICE_EUROS_x_MQ_NEW | SALE_PRICE_EUROS_x_MQ_USED | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01 | 01 | el Raval | 2022-04-01 | 2022 | 2 | 3 | 109 | 66.0 | 67.0 | 395.5 | 254.9 | 5960.2 | 3632.6 | ||
1 | 01 | 02 | el Barri Gòtic | 2022-04-01 | 2022 | 2 | 6 | 59 | 88.0 | 85.9 | 326.4 | 404.1 | 4297.5 | 4919.5 | ||
2 | 01 | 03 | la Barceloneta | 2022-04-01 | 2022 | 2 | 2 | 53 | 44.5 | 206.4 | 4704.7 | |||||
3 | 01 | 04 | Sant Pere, Santa Caterina i la Ribera | 2022-04-01 | 2022 | 2 | 3 | 66 | 61.3 | 66.9 | 219.2 | 307.7 | 3604.1 | 4371.1 | ||
4 | 02 | 05 | el Fort Pienc | 2022-04-01 | 2022 | 2 | 1 | 56 | 92.0 | 90.5 | 370.6 | 4208.9 | ||||
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
1309 | 10 | 69 | Diagonal Mar i el Front Marítim del Poblenou | 2018-01-01 | 2018 | 1 | 12 | 95.3 | 509.7 | 5032.1 | ||||||
1310 | 10 | 70 | el Besòs i el Maresme | 2018-01-01 | 2018 | 1 | 8 | 82 | 102.0 | 70.4 | 397.9 | 236.9 | 3795.7 | 2985.9 | ||
1311 | 10 | 71 | Provençals del Poblenou | 2018-01-01 | 2018 | 1 | 1 | 8 | 72.0 | 67.9 | 251.3 | 3909.4 | ||||
1312 | 10 | 72 | Sant Martí de Provençals | 2018-01-01 | 2018 | 1 | 1 | 39 | 45.0 | 74.7 | 270.3 | 3451.5 | ||||
1313 | 10 | 73 | la Verneda i la Pau | 2018-01-01 | 2018 | 1 | 1 | 1 | 86 | 78.0 | 66.0 | 72.2 | 153.2 | 2180.1 |
1314 rows × 16 columns