Data Cleaning – Project: Barcelona – Tourism

Python + SQL

Pandas, Requests, Beautiful Soup, Google Big Query, DuckDB


SQL – BIGQUERY – JUPYTER LAB – Barcelona Number of Foreign Tourists by Neighbourhood per Day



BARCELONA-SQL-BIGQUERY_published




SQL – BIGQUERY

Number of National and Foreign TOURISTS per DAY, Lodging Category and Neighbourhood in 2022

REQUIRED LIBRARIES

In [103]:

import pandas as pd
from pandas.io import gbq
import duckdb

In [2]:

project_id=''

INFORMATION_SCHEMA.SCHEMATA

In [3]:

df_information_schema_schemata = gbq.read_gbq(query='SELECT schema_name FROM region-us.INFORMATION_SCHEMA.SCHEMATA;', project_id=project_id)
df_information_schema_schemata

Out[3]:

schema_name
0TOURISM
1REAL_ESTATE
2DICTIONARY
3DIVERSITY

INFORMATION_SCHEMA.TABLES – TOURISM

In [4]:

df_information_schema_tables_tourism = gbq.read_gbq(query='SELECT table_name FROM TOURISM.INFORMATION_SCHEMA.TABLES;')
df_information_schema_tables_tourism

Out[4]:

table_name
0TL
1TT
2FT

In [5]:

df_information_schema_tables_tourism_tl = gbq.read_gbq(query='SELECT table_schema,table_name,column_name,data_type FROM TOURISM.INFORMATION_SCHEMA.COLUMNS WHERE table_name = "TL";')
df_information_schema_tables_tourism_tl

Out[5]:

table_schematable_namecolumn_namedata_type
0TOURISMTLTOURIST_LODGINGS_PD_ONLY_INDEXINT64
1TOURISMTLn_practiceSTRING
2TOURISMTLrtcSTRING
3TOURISMTLnameSTRING
4TOURISMTLcategorySTRING
5TOURISMTLaddressSTRING
6TOURISMTLstreet_typeSTRING
7TOURISMTLstreetSTRING
8TOURISMTLstreet_number_1INT64
9TOURISMTLstreet_letter_1STRING
10TOURISMTLstreet_number_2INT64
11TOURISMTLstreet_letter_2STRING
12TOURISMTLblockSTRING
13TOURISMTLentranceSTRING
14TOURISMTLstairSTRING
15TOURISMTLfloorSTRING
16TOURISMTLdoorSTRING
17TOURISMTLdistrict_codeINT64
18TOURISMTLdistrict_nameSTRING
19TOURISMTLneighbourhood_codeINT64
20TOURISMTLneighbourhood_nameSTRING
21TOURISMTLlongitudeFLOAT64
22TOURISMTLlatitudeFLOAT64
23TOURISMTLn_placesINT64

In [6]:

df_information_schema_tables_tourism_tt = gbq.read_gbq(query='SELECT table_schema,table_name,column_name,data_type FROM TOURISM.INFORMATION_SCHEMA.COLUMNS WHERE table_name = "TT";')
df_information_schema_tables_tourism_tt

Out[6]:

table_schematable_namecolumn_namedata_type
0TOURISMTTyearINT64
1TOURISMTTmonthINT64
2TOURISMTTlodging_typeSTRING
3TOURISMTTn_touristsINT64
4TOURISMTTovernight_staysINT64
5TOURISMTTaverage_lenght_stayFLOAT64

In [7]:

df_information_schema_tables_tourism_ft = gbq.read_gbq(query='SELECT table_schema,table_name,column_name,data_type FROM TOURISM.INFORMATION_SCHEMA.COLUMNS WHERE table_name = "FT";')
df_information_schema_tables_tourism_ft

Out[7]:

table_schematable_namecolumn_namedata_type
0TOURISMFTFT_INDEXINT64
1TOURISMFTFT_YEARINT64
2TOURISMFTFT_MONTHINT64
3TOURISMFTFT_COUNTRYSTRING
4TOURISMFTFT_N_TOURISTSINT64

SOURCE DATAFRAMES

In [8]:

df_tourism_tl = gbq.read_gbq(query='SELECT * FROM TOURISM.TL;')
df_tourism_tl.head(1)

Out[8]:

TOURIST_LODGINGS_PD_ONLY_INDEXn_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
0940901-90-A-128HB-003893Hotel Reding CromaHotel 3 estrellesGRAVINA 5 7nanGRAVINA5NoneNoneNoneNone1Ciutat Vella1el Raval2.16566941.38532886

1 rows × 24 columns

In [9]:

df_tl = df_tourism_tl.copy()

In [10]:

df_tourism_tt = gbq.read_gbq(query='SELECT * FROM TOURISM.TT;')
df_tourism_tt.head(1)

Out[10]:

yearmonthlodging_typen_touristsovernight_staysaverage_lenght_stay
020221hotel3014747452242.471935

In [11]:

df_tt = df_tourism_tt.copy()

In [12]:

df_tourism_ft = gbq.read_gbq(query='SELECT * FROM TOURISM.FT;')
df_tourism_ft.head(1)

Out[12]:

FT_INDEXFT_YEARFT_MONTHFT_COUNTRYFT_N_TOURISTS
0020221France39625

In [13]:

df_ft = df_tourism_ft.copy()

SQL

ESTIMATING THE NUMBER OF NATIONAL AND FOREIGN TOURISTS PER DAY IN 2022

SOURCE:
https://www.observatoriturisme.barcelona/en/destination-barcelona-tourism-activity-latest-data

In [107]:

q = """
SELECT year as 'year_x',month as 'month_x',SUM(n_tourists) as 'TOTAL_N_TOURISTS_PER_MONTH'
FROM df_tt 
GROUP BY month, year
;"""
df_total_n_tourists_per_month_2022 = duckdb.query(q).df()
df_total_n_tourists_per_month_2022

Out[107]:

year_xmonth_xTOTAL_N_TOURISTS_PER_MONTH
020221427243.0
120222559050.0
220223694636.0
320224858276.0
420225907574.0
520226910784.0
6202271025155.0
720228982400.0
820229926717.0
9202210939317.0
10202211773737.0
11202212738847.0

In [108]:

q = """
SELECT FT_YEAR,FT_MONTH,SUM(FT_N_TOURISTS) as 'TOTAL_N_FOREIGN_TOURISTS_PER_MONTH'
FROM df_ft WHERE FT_YEAR=2022
GROUP BY FT_YEAR, FT_MONTH
ORDER BY FT_YEAR, FT_MONTH
;"""
df_n_foreign_tourists_per_month_2022 = duckdb.query(q).df()
df_n_foreign_tourists_per_month_2022

Out[108]:

FT_YEARFT_MONTHTOTAL_N_FOREIGN_TOURISTS_PER_MONTH
020221212111.0
120222239221.0
220223349409.0
320224485751.0
420225608404.0
520226588006.0
620227595319.0
720228553449.0
820229575061.0
9202210628197.0
10202211520641.0
11202212359615.0

In [110]:

q = """
SELECT year,month,lodging_type,n_tourists AS 'N_TOURISTS_PER_LODGING_TYPE_PER_MONTH',overnight_stays,average_lenght_stay,TOTAL_N_TOURISTS_PER_MONTH,
n_tourists/TOTAL_N_TOURISTS_PER_MONTH*100 as 'LODGING_TYPE_SHARE'
FROM df_tt LEFT JOIN df_total_n_tourists_per_month_2022 ON month=month_x
;"""
df_lodging_share_total_n_tourists_2022 = duckdb.query(q).df()
df_lodging_share_total_n_tourists_2022

Out[110]:

yearmonthlodging_typeN_TOURISTS_PER_LODGING_TYPE_PER_MONTHovernight_staysaverage_lenght_stayTOTAL_N_TOURISTS_PER_MONTHLODGING_TYPE_SHARE
020221hotel3014747452242.471935427243.070.562654
120222hotel42364810449432.466536559050.075.779984
220223hotel54159414283712.637346694636.077.968029
320224hotel66335417940842.704565858276.077.289124
420225hotel69648719024162.731445907574.076.741621
520226hotel70112819451682.774341910784.076.980711
620227hotel75224020919582.7809711025155.073.378172
720228hotel72094421148542.933451982400.073.385993
820229hotel71038618515812.606444926717.076.656196
9202210hotel73615319473452.645299939317.078.371093
10202211hotel59068615116022.559062773737.076.341961
11202212hotel54352513561662.495131738847.073.563945
1220221homes_for_tourist_use1257696185694.918295427243.029.437346
1320222homes_for_tourist_use1354026145434.538655559050.024.220016
1420223homes_for_tourist_use1530426250554.084206694636.022.031971
1520224homes_for_tourist_use1949228057064.133479858276.022.710876
1620225homes_for_tourist_use2110877913493.748923907574.023.258379
1720226homes_for_tourist_use2096568585584.095080910784.023.019289
1820227homes_for_tourist_use27291510386683.8058301025155.026.621828
1920228homes_for_tourist_use26145611366124.347240982400.026.614007
2020229homes_for_tourist_use2163318860614.095858926717.023.343804
21202210homes_for_tourist_use2031649007964.433837939317.021.628907
22202211homes_for_tourist_use1830518191594.475032773737.023.658039
23202212homes_for_tourist_use1953229668024.949785738847.026.436055

In [111]:

q = """
SELECT year,month,lodging_type,N_TOURISTS_PER_LODGING_TYPE_PER_MONTH,overnight_stays,average_lenght_stay,LODGING_TYPE_SHARE,TOTAL_N_TOURISTS_PER_MONTH,
TOTAL_N_TOURISTS_PER_MONTH-TOTAL_N_FOREIGN_TOURISTS_PER_MONTH AS 'TOTAL_N_NATIONAL_TOURISTS_PER_MONTH',
TOTAL_N_FOREIGN_TOURISTS_PER_MONTH,
(LODGING_TYPE_SHARE/100*TOTAL_N_NATIONAL_TOURISTS_PER_MONTH)  AS 'N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_MONTH',
(LODGING_TYPE_SHARE/100*TOTAL_N_FOREIGN_TOURISTS_PER_MONTH)  AS 'N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_MONTH'
FROM df_lodging_share_total_n_tourists_2022, df_n_foreign_tourists_per_month_2022
WHERE month=FT_MONTH AND year=FT_YEAR
;"""
df_lodging_share_total_n_tourists_n_foreign_tourists_per_month_2022 = duckdb.query(q).df()
df_lodging_share_total_n_tourists_n_foreign_tourists_per_month_2022

Out[111]:

yearmonthlodging_typeN_TOURISTS_PER_LODGING_TYPE_PER_MONTHovernight_staysaverage_lenght_stayLODGING_TYPE_SHARETOTAL_N_TOURISTS_PER_MONTHTOTAL_N_NATIONAL_TOURISTS_PER_MONTHTOTAL_N_FOREIGN_TOURISTS_PER_MONTHN_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_MONTHN_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_MONTH
020221hotel3014747452242.47193570.562654427243.0215132.0212111.0151802.848889149671.151111
120222hotel42364810449432.46653675.779984559050.0319829.0239221.0242366.364712181281.635288
220223hotel54159414283712.63734677.968029694636.0345227.0349409.0269166.688507272427.311493
320224hotel66335417940842.70456577.289124858276.0372525.0485751.0287921.308355375432.691645
420225hotel69648719024162.73144576.741621907574.0299170.0608404.0229587.907752466899.092248
520226hotel70112819451682.77434176.980711910784.0322778.0588006.0248476.799751452651.200249
620227hotel75224020919582.78097173.3781721025155.0429836.0595319.0315405.799747436834.200253
720228hotel72094421148542.93345173.385993982400.0428951.0553449.0314789.952915406154.047085
820229hotel71038618515812.60644476.656196926717.0351656.0575061.0269566.112649440819.887351
9202210hotel73615319473452.64529978.371093939317.0311120.0628197.0243828.144663492324.855337
10202211hotel59068615116022.55906276.341961773737.0253096.0520641.0193218.450011397467.549989
11202212hotel54352513561662.49513173.563945738847.0379232.0359615.0278978.019536264546.980464
1220221homes_for_tourist_use1257696185694.91829529.437346427243.0215132.0212111.063329.15111162439.848889
1320222homes_for_tourist_use1354026145434.53865524.220016559050.0319829.0239221.077462.63528857939.364712
1420223homes_for_tourist_use1530426250554.08420622.031971694636.0345227.0349409.076060.31149376981.688507
1520224homes_for_tourist_use1949228057064.13347922.710876858276.0372525.0485751.084603.691645110318.308355
1620225homes_for_tourist_use2110877913493.74892323.258379907574.0299170.0608404.069582.092248141504.907752
1720226homes_for_tourist_use2096568585584.09508023.019289910784.0322778.0588006.074301.200249135354.799751
1820227homes_for_tourist_use27291510386683.80583026.6218281025155.0429836.0595319.0114430.200253158484.799747
1920228homes_for_tourist_use26145611366124.34724026.614007982400.0428951.0553449.0114161.047085147294.952915
2020229homes_for_tourist_use2163318860614.09585823.343804926717.0351656.0575061.082089.887351134241.112649
21202210homes_for_tourist_use2031649007964.43383721.628907939317.0311120.0628197.067291.855337135872.144663
22202211homes_for_tourist_use1830518191594.47503223.658039773737.0253096.0520641.059877.549989123173.450011
23202212homes_for_tourist_use1953229668024.94978526.436055738847.0379232.0359615.0100253.98046495068.019536

In [112]:

q = """
SELECT *,
(N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_MONTH*average_lenght_stay/30) AS 'N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_DAY',
(N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_MONTH*average_lenght_stay/30) AS 'N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_DAY'
FROM df_lodging_share_total_n_tourists_n_foreign_tourists_per_month_2022
;"""
df_lodging_share_total_n_tourists_n_foreign_tourists_per_month_per_day_2022 = duckdb.query(q).df()
df_lodging_share_total_n_tourists_n_foreign_tourists_per_month_per_day_2022

Out[112]:

yearmonthlodging_typeN_TOURISTS_PER_LODGING_TYPE_PER_MONTHovernight_staysaverage_lenght_stayLODGING_TYPE_SHARETOTAL_N_TOURISTS_PER_MONTHTOTAL_N_NATIONAL_TOURISTS_PER_MONTHTOTAL_N_FOREIGN_TOURISTS_PER_MONTHN_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_MONTHN_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_MONTHN_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_DAYN_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_DAY
020221hotel3014747452242.47193570.562654427243.0215132.0212111.0151802.848889149671.15111112508.22362612332.576379
120222hotel42364810449432.46653675.779984559050.0319829.0239221.0242366.364712181281.63528819926.84463414904.588703
220223hotel54159414283712.63734677.968029694636.0345227.0349409.0269166.688507272427.31149323662.86013123949.506544
320224hotel66335417940842.70456577.289124858276.0372525.0485751.0287921.308355375432.69164525956.72961533846.070377
420225hotel69648719024162.73144576.741621907574.0299170.0608404.0229587.907752466899.09224820903.55881742510.307846
520226hotel70112819451682.77434176.980711910784.0322778.0588006.0248476.799751452651.20024922978.64392341860.289420
620227hotel75224020919582.78097173.3781721025155.0429836.0595319.0315405.799747436834.20025329237.81798040494.115342
720228hotel72094421148542.93345173.385993982400.0428951.0553449.0314789.952915406154.04708530780.69822939714.435108
820229hotel71038618515812.60644476.656196926717.0351656.0575061.0269566.112649440819.88735123420.29508838299.071575
9202210hotel73615319473452.64529978.371093939317.0311120.0628197.0243828.144663492324.85533721499.94717643411.552829
10202211hotel59068615116022.55906276.341961773737.0253096.0520641.0193218.450011397467.54998916481.93205433904.801287
11202212hotel54352513561662.49513173.563945738847.0379232.0359615.0278978.019536264546.98046423202.88885122002.644487
1220221homes_for_tourist_use1257696185694.91829529.437346427243.0215132.0212111.063329.15111162439.84888910382.38083910236.585827
1320222homes_for_tourist_use1354026145434.53865524.220016559050.0319829.0239221.077462.63528857939.36471211719.2065818765.560088
1420223homes_for_tourist_use1530426250554.08420622.031971694636.0345227.0349409.076060.31149376981.68850710354.86511410480.301554
1520224homes_for_tourist_use1949228057064.13347922.710876858276.0372525.0485751.084603.691645110318.30835511656.91951815199.947151
1620225homes_for_tourist_use2110877913493.74892323.258379907574.0299170.0608404.069582.092248141504.9077528695.26453117683.035471
1720226homes_for_tourist_use2096568585584.09508023.019289910784.0322778.0588006.074301.200249135354.79975110142.31087818476.289122
1820227homes_for_tourist_use27291510386683.80583026.6218281025155.0429836.0595319.0114430.200253158484.79974714516.72831420105.538353
1920228homes_for_tourist_use26145611366124.34724026.614007982400.0428951.0553449.0114161.047085147294.95291516542.84928121344.217386
2020229homes_for_tourist_use2163318860614.09585823.343804926717.0351656.0575061.082089.887351134241.11264911207.61667418327.749996
21202210homes_for_tourist_use2031649007964.43383721.628907939317.0311120.0628197.067291.855337135872.1446639945.36993420081.163397
22202211homes_for_tourist_use1830518191594.47503223.658039773737.0253096.0520641.059877.549989123173.4500118931.79750918373.502493
23202212homes_for_tourist_use1953229668024.94978526.436055738847.0379232.0359615.0100253.98046495068.01953616541.18990015685.543430

NUMBER OF PLACES PER CATEGORY AND NEIGHBOURHOOD

This table agregates the number of places per lodging category for each neighbourhood.

Data refers to the tourist lodgings active at the end of 2022.

Data on those tourist lodgings is here treated as constant throughout 2022.

In [113]:

q = """
SELECT district_code, district_name, neighbourhood_code, neighbourhood_name, category,
SUM(n_places) AS 'NEIGHBOURHOOD_N_PLACES'
FROM df_tl
GROUP BY 3,4,1,2,5
ORDER BY 3,5
;"""
df_n_places_per_neighbourhood_category_2022 = duckdb.query(q).df()
df_n_places_per_neighbourhood_category_2022

Out[113]:

district_codedistrict_nameneighbourhood_codeneighbourhood_namecategoryNEIGHBOURHOOD_N_PLACES
01Ciutat Vella1el RavalAlbergs277.0
11Ciutat Vella1el RavalApartaments Turístics170.0
21Ciutat Vella1el RavalHabitatges d’Ús Turístic1234.0
31Ciutat Vella1el RavalHotel 1 estrella416.0
41Ciutat Vella1el RavalHotel 2 estrelles430.0
26410Sant Martí70el Besòs i el MaresmeHotel 4 estrelles superior524.0
26510Sant Martí71Provençals del PoblenouHabitatges d’Ús Turístic180.0
26610Sant Martí71Provençals del PoblenouHotel 4 estrelles356.0
26710Sant Martí72Sant Martí de ProvençalsHabitatges d’Ús Turístic82.0
26810Sant Martí73la Verneda i la PauHabitatges d’Ús Turístic49.0

269 rows × 6 columns

In [114]:

q = """
SELECT *,
CASE
WHEN  CONTAINS(category, 'Turístic')  THEN 'homes_for_tourist_use'
ELSE 'hotel'
END 
AS "LODGING_TYPE"
FROM df_n_places_per_neighbourhood_category_2022
;"""
df_n_places_per_neighbourhood_category_lodging_type_2022 = duckdb.query(q).df()
df_n_places_per_neighbourhood_category_lodging_type_2022

Out[114]:

district_codedistrict_nameneighbourhood_codeneighbourhood_namecategoryNEIGHBOURHOOD_N_PLACESLODGING_TYPE
01Ciutat Vella1el RavalAlbergs277.0hotel
11Ciutat Vella1el RavalApartaments Turístics170.0homes_for_tourist_use
21Ciutat Vella1el RavalHabitatges d’Ús Turístic1234.0homes_for_tourist_use
31Ciutat Vella1el RavalHotel 1 estrella416.0hotel
41Ciutat Vella1el RavalHotel 2 estrelles430.0hotel
26410Sant Martí70el Besòs i el MaresmeHotel 4 estrelles superior524.0hotel
26510Sant Martí71Provençals del PoblenouHabitatges d’Ús Turístic180.0homes_for_tourist_use
26610Sant Martí71Provençals del PoblenouHotel 4 estrelles356.0hotel
26710Sant Martí72Sant Martí de ProvençalsHabitatges d’Ús Turístic82.0homes_for_tourist_use
26810Sant Martí73la Verneda i la PauHabitatges d’Ús Turístic49.0homes_for_tourist_use

269 rows × 7 columns

In [115]:

q = """
SELECT LODGING_TYPE as 'LODGING_TYPE_X',SUM(NEIGHBOURHOOD_N_PLACES) AS "TOTAL_N_PLACES_PER_LODGING_TYPE"
FROM df_n_places_per_neighbourhood_category_lodging_type_2022
GROUP BY LODGING_TYPE
;"""
df_total_n_places_per_lodging_type_2022 = duckdb.query(q).df()
df_total_n_places_per_lodging_type_2022

Out[115]:

LODGING_TYPE_XTOTAL_N_PLACES_PER_LODGING_TYPE
0hotel87041.0
1homes_for_tourist_use57564.0

In [116]:

q = """
SELECT year,month,lodging_type AS 'lodging_type_x',	N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_DAY,	N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_DAY,TOTAL_N_PLACES_PER_LODGING_TYPE,
N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_DAY/TOTAL_N_PLACES_PER_LODGING_TYPE*100 AS 'PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES',
N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_DAY/TOTAL_N_PLACES_PER_LODGING_TYPE*100 AS 'PERCENTAGE_FOREIGN_TOURISTS_PER_N_PLACES'
FROM df_lodging_share_total_n_tourists_n_foreign_tourists_per_month_per_day_2022, df_total_n_places_per_lodging_type_2022
WHERE LODGING_TYPE=LODGING_TYPE_X
;"""
df_n_national_n_foreign_tourists_per_n_places = duckdb.query(q).df()
df_n_national_n_foreign_tourists_per_n_places

Out[116]:

yearmonthlodging_type_xN_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_DAYN_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_DAYTOTAL_N_PLACES_PER_LODGING_TYPEPERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACESPERCENTAGE_FOREIGN_TOURISTS_PER_N_PLACES
020221hotel12508.22362612332.57637987041.014.37049614.168698
120222hotel19926.84463414904.58870387041.022.89363017.123641
220223hotel23662.86013123949.50654487041.027.18587827.515202
320224hotel25956.72961533846.07037787041.029.82126838.885204
420225hotel20903.55881742510.30784687041.024.01576148.839407
520226hotel22978.64392341860.28942087041.026.39979348.092611
620227hotel29237.81798040494.11534287041.033.59085746.523036
720228hotel30780.69822939714.43510887041.035.36344745.627273
820229hotel23420.29508838299.07157587041.026.90719944.001185
9202210hotel21499.94717643411.55282987041.024.70094249.874832
10202211hotel16481.93205433904.80128787041.018.93582638.952679
11202212hotel23202.88885122002.64448787041.026.65742425.278483
1220221homes_for_tourist_use10382.38083910236.58582757564.018.03623917.782965
1320222homes_for_tourist_use11719.2065818765.56008857564.020.35856915.227503
1420223homes_for_tourist_use10354.86511410480.30155457564.017.98843918.206347
1520224homes_for_tourist_use11656.91951815199.94715157564.020.25036426.405300
1620225homes_for_tourist_use8695.26453117683.03547157564.015.10538630.718914
1720226homes_for_tourist_use10142.31087818476.28912257564.017.61919132.096951
1820227homes_for_tourist_use14516.72831420105.53835357564.025.21841534.927278
1920228homes_for_tourist_use16542.84928121344.21738657564.028.73818637.079107
2020229homes_for_tourist_use11207.61667418327.74999657564.019.46983631.838910
21202210homes_for_tourist_use9945.36993420081.16339757564.017.27706534.884934
22202211homes_for_tourist_use8931.79750918373.50249357564.015.51629131.918391
23202212homes_for_tourist_use16541.18990015685.54343057564.028.73530327.248877

CONCATENATED TABLE: NUMBER OF NATIONAL AND FOREIGN TOURISTS PER DAY, CATEGORY AND NEIGHBOURHOOD IN 2022

In [123]:

q = """
SELECT year, month, district_code, district_name, neighbourhood_code, neighbourhood_name, category, LODGING_TYPE, NEIGHBOURHOOD_N_PLACES,
PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES+PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES AS 'PERCENTAGE_BOOKED_PLACES',
100-(PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES+PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES) AS 'PERCENTAGE_FREE_PLACES',
PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES AS 'PERCENTAGE_TOTAL_PLACES_BOOKED_BY_NATIONAL_TOURISTS',
PERCENTAGE_FOREIGN_TOURISTS_PER_N_PLACES AS 'PERCENTAGE_TOTAL_PLACES_BOOKED_BY_FOREIGN_TOURISTS',
(PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES/100*NEIGHBOURHOOD_N_PLACES) AS 'N_NATIONAL_TOURISTS_PER_DAY',	
(PERCENTAGE_FOREIGN_TOURISTS_PER_N_PLACES/100*NEIGHBOURHOOD_N_PLACES) AS 'N_FOREIGN_TOURISTS_PER_DAY',
N_NATIONAL_TOURISTS_PER_DAY+N_FOREIGN_TOURISTS_PER_DAY AS 'TOTAL_N_TOURISTS_PER_DAY'
FROM 
df_n_places_per_neighbourhood_category_lodging_type_2022, df_n_national_n_foreign_tourists_per_n_places
WHERE  lodging_type_x=LODGING_TYPE
;"""
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 = duckdb.query(q).df()
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022

Out[123]:

yearmonthdistrict_codedistrict_nameneighbourhood_codeneighbourhood_namecategoryLODGING_TYPENEIGHBOURHOOD_N_PLACESPERCENTAGE_BOOKED_PLACESPERCENTAGE_FREE_PLACESPERCENTAGE_TOTAL_PLACES_BOOKED_BY_NATIONAL_TOURISTSPERCENTAGE_TOTAL_PLACES_BOOKED_BY_FOREIGN_TOURISTSN_NATIONAL_TOURISTS_PER_DAYN_FOREIGN_TOURISTS_PER_DAYTOTAL_N_TOURISTS_PER_DAY
02022121Ciutat Vella1el RavalAlbergshotel277.053.31484946.68515126.65742425.27848373.84106670.021398143.862464
12022121Ciutat Vella1el RavalApartaments Turísticshomes_for_tourist_use170.057.47060642.52939428.73530327.24887748.85001546.32309195.173106
22022121Ciutat Vella1el RavalHabitatges d’Ús Turístichomes_for_tourist_use1234.057.47060642.52939428.73530327.248877354.593641336.251139690.844780
32022121Ciutat Vella1el RavalHotel 1 estrellahotel416.053.31484946.68515126.65742425.278483110.894886105.158490216.053376
42022121Ciutat Vella1el RavalHotel 2 estrelleshotel430.053.31484946.68515126.65742425.278483114.626925108.697477223.324403
32232022110Sant Martí70el Besòs i el MaresmeHotel 4 estrelles superiorhotel524.028.74099271.25900814.37049614.16869875.30140074.243977149.545377
32242022110Sant Martí71Provençals del PoblenouHabitatges d’Ús Turístichomes_for_tourist_use180.036.07247963.92752118.03623917.78296532.46523132.00933764.474567
32252022110Sant Martí71Provençals del PoblenouHotel 4 estrelleshotel356.028.74099271.25900814.37049614.16869851.15896750.440565101.599531
32262022110Sant Martí72Sant Martí de ProvençalsHabitatges d’Ús Turístichomes_for_tourist_use82.036.07247963.92752118.03623917.78296514.78971614.58203129.371747
32272022110Sant Martí73la Verneda i la PauHabitatges d’Ús Turístichomes_for_tourist_use49.036.07247963.92752118.03623917.7829658.8377578.71365317.551410

3228 rows × 16 columns

VERIFICATION

data tested on month 11

In [124]:

q = """
SELECT 
(SELECT SUM(N_FOREIGN_TOURISTS_PER_DAY) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE LODGING_TYPE='hotel' AND month=11)*30/2.559062 +
(SELECT SUM(N_FOREIGN_TOURISTS_PER_DAY) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE LODGING_TYPE='homes_for_tourist_use' AND month=11)*30/4.475032,
SUM(NEIGHBOURHOOD_N_PLACES)
FROM 
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022
WHERE month=11
;"""
duckdb.query(q).df()

Out[124]:

((((SELECT sum(“N_FOREIGN_TOURISTS_PER_DAY”) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE ((“LODGING_TYPE” = ‘hotel’) AND (“month” = 11))) * 30) / 2.559062) + (((SELECT sum(“N_FOREIGN_TOURISTS_PER_DAY”) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE ((“LODGING_TYPE” = ‘homes_for_tourist_use’) AND (“month” = 11))) * 30) / 4.475032))sum(“NEIGHBOURHOOD_N_PLACES”)
0520640.96227144605.0

In [125]:

q = """
SELECT SUM(FT_N_TOURISTS)
FROM 
df_tourism_ft
WHERE FT_MONTH=11 AND FT_YEAR=2022
;"""
duckdb.query(q).df()

Out[125]:

sum(“FT_N_TOURISTS”)
0520641.0

In [126]:

q = """
SELECT SUM(n_places)
FROM df_tourism_tl
;"""
duckdb.query(q).df()

Out[126]:

sum(n_places)
0144605.0

In [127]:

q = """
SELECT 
(SELECT SUM(TOTAL_N_TOURISTS_PER_DAY) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE LODGING_TYPE='hotel' AND month=11)*30/2.559062 +
(SELECT SUM(TOTAL_N_TOURISTS_PER_DAY) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE LODGING_TYPE='homes_for_tourist_use' AND month=11)*30/4.475032
FROM 
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022
WHERE month=11
LIMIT 1
;"""
duckdb.query(q).df()

Out[127]:

((((SELECT sum(“TOTAL_N_TOURISTS_PER_DAY”) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE ((“LODGING_TYPE” = ‘hotel’) AND (“month” = 11))) * 30) / 2.559062) + (((SELECT sum(“TOTAL_N_TOURISTS_PER_DAY”) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE ((“LODGING_TYPE” = ‘homes_for_tourist_use’) AND (“month” = 11))) * 30) / 4.475032))
0773736.943928

In [129]:

q = """
SELECT 
(SELECT SUM(N_TOURISTS_PER_LODGING_TYPE_PER_MONTH) FROM df_lodging_share_total_n_tourists_2022 WHERE LODGING_TYPE='hotel' AND month=11) +
(SELECT SUM(N_TOURISTS_PER_LODGING_TYPE_PER_MONTH) FROM df_lodging_share_total_n_tourists_2022 WHERE LODGING_TYPE='homes_for_tourist_use' AND month=11)
FROM 
df_lodging_share_total_n_tourists_2022
WHERE month=11
LIMIT 1
;"""
duckdb.query(q).df()

Out[129]:

((SELECT sum(“N_TOURISTS_PER_LODGING_TYPE_PER_MONTH”) FROM df_lodging_share_total_n_tourists_2022 WHERE ((“LODGING_TYPE” = ‘hotel’) AND (“month” = 11))) + (SELECT sum(“N_TOURISTS_PER_LODGING_TYPE_PER_MONTH”) FROM df_lodging_share_total_n_tourists_2022 WHERE ((“LODGING_TYPE” = ‘homes_for_tourist_use’) AND (“month” = 11))))
0773737.0

EXCEL EXPORT FILE

NORMALIZATION

In [130]:

#ROUNDING VALUES
q = """
SELECT year, month, district_code, district_name, neighbourhood_code, neighbourhood_name, category, LODGING_TYPE, NEIGHBOURHOOD_N_PLACES,
PERCENTAGE_BOOKED_PLACES,PERCENTAGE_FREE_PLACES,PERCENTAGE_TOTAL_PLACES_BOOKED_BY_NATIONAL_TOURISTS,PERCENTAGE_TOTAL_PLACES_BOOKED_BY_FOREIGN_TOURISTS,
ROUND(N_NATIONAL_TOURISTS_PER_DAY) AS 'N_NATIONAL_TOURISTS_PER_DAY',	
ROUND(N_FOREIGN_TOURISTS_PER_DAY) AS 'N_FOREIGN_TOURISTS_PER_DAY',
ROUND(TOTAL_N_TOURISTS_PER_DAY) AS 'TOTAL_N_TOURISTS_PER_DAY'
FROM 
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022
;"""
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022 = duckdb.query(q).df()
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022

Out[130]:

yearmonthdistrict_codedistrict_nameneighbourhood_codeneighbourhood_namecategoryLODGING_TYPENEIGHBOURHOOD_N_PLACESPERCENTAGE_BOOKED_PLACESPERCENTAGE_FREE_PLACESPERCENTAGE_TOTAL_PLACES_BOOKED_BY_NATIONAL_TOURISTSPERCENTAGE_TOTAL_PLACES_BOOKED_BY_FOREIGN_TOURISTSN_NATIONAL_TOURISTS_PER_DAYN_FOREIGN_TOURISTS_PER_DAYTOTAL_N_TOURISTS_PER_DAY
02022121Ciutat Vella1el RavalAlbergshotel277.053.31484946.68515126.65742425.27848374.070.0144.0
12022121Ciutat Vella1el RavalApartaments Turísticshomes_for_tourist_use170.057.47060642.52939428.73530327.24887749.046.095.0
22022121Ciutat Vella1el RavalHabitatges d’Ús Turístichomes_for_tourist_use1234.057.47060642.52939428.73530327.248877355.0336.0691.0
32022121Ciutat Vella1el RavalHotel 1 estrellahotel416.053.31484946.68515126.65742425.278483111.0105.0216.0
42022121Ciutat Vella1el RavalHotel 2 estrelleshotel430.053.31484946.68515126.65742425.278483115.0109.0223.0
32232022110Sant Martí70el Besòs i el MaresmeHotel 4 estrelles superiorhotel524.028.74099271.25900814.37049614.16869875.074.0150.0
32242022110Sant Martí71Provençals del PoblenouHabitatges d’Ús Turístichomes_for_tourist_use180.036.07247963.92752118.03623917.78296532.032.064.0
32252022110Sant Martí71Provençals del PoblenouHotel 4 estrelleshotel356.028.74099271.25900814.37049614.16869851.050.0102.0
32262022110Sant Martí72Sant Martí de ProvençalsHabitatges d’Ús Turístichomes_for_tourist_use82.036.07247963.92752118.03623917.78296515.015.029.0
32272022110Sant Martí73la Verneda i la PauHabitatges d’Ús Turístichomes_for_tourist_use49.036.07247963.92752118.03623917.7829659.09.018.0

3228 rows × 16 columns

In [131]:

#distric_code and neighbourhood_code need a '0' in front of all numbers below 10 to combine with spacial data
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022['district_code'] = df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022['district_code'].astype(str).str.strip()
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022['neighbourhood_code'] = df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022['neighbourhood_code'].astype(str).str.strip()
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022[['district_code']] = df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022[['district_code']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022[['neighbourhood_code']] = df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022[['neighbourhood_code']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])

In [132]:

df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022

Out[132]:

yearmonthdistrict_codedistrict_nameneighbourhood_codeneighbourhood_namecategoryLODGING_TYPENEIGHBOURHOOD_N_PLACESPERCENTAGE_BOOKED_PLACESPERCENTAGE_FREE_PLACESPERCENTAGE_TOTAL_PLACES_BOOKED_BY_NATIONAL_TOURISTSPERCENTAGE_TOTAL_PLACES_BOOKED_BY_FOREIGN_TOURISTSN_NATIONAL_TOURISTS_PER_DAYN_FOREIGN_TOURISTS_PER_DAYTOTAL_N_TOURISTS_PER_DAY
020221201Ciutat Vella01el RavalAlbergshotel277.053.31484946.68515126.65742425.27848374.070.0144.0
120221201Ciutat Vella01el RavalApartaments Turísticshomes_for_tourist_use170.057.47060642.52939428.73530327.24887749.046.095.0
220221201Ciutat Vella01el RavalHabitatges d’Ús Turístichomes_for_tourist_use1234.057.47060642.52939428.73530327.248877355.0336.0691.0
320221201Ciutat Vella01el RavalHotel 1 estrellahotel416.053.31484946.68515126.65742425.278483111.0105.0216.0
420221201Ciutat Vella01el RavalHotel 2 estrelleshotel430.053.31484946.68515126.65742425.278483115.0109.0223.0
32232022110Sant Martí70el Besòs i el MaresmeHotel 4 estrelles superiorhotel524.028.74099271.25900814.37049614.16869875.074.0150.0
32242022110Sant Martí71Provençals del PoblenouHabitatges d’Ús Turístichomes_for_tourist_use180.036.07247963.92752118.03623917.78296532.032.064.0
32252022110Sant Martí71Provençals del PoblenouHotel 4 estrelleshotel356.028.74099271.25900814.37049614.16869851.050.0102.0
32262022110Sant Martí72Sant Martí de ProvençalsHabitatges d’Ús Turístichomes_for_tourist_use82.036.07247963.92752118.03623917.78296515.015.029.0
32272022110Sant Martí73la Verneda i la PauHabitatges d’Ús Turístichomes_for_tourist_use49.036.07247963.92752118.03623917.7829659.09.018.0

3228 rows × 16 columns

In [133]:

df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022.to_excel('T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PER_DAY_2022.xlsx', sheet_name='sql_touristsXday')

DATABASE UPLOAD

In [134]:

df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022.to_gbq(
        destination_table="TOURISM.T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PER_DAY_2022",
        project_id=project_id)

In [135]:

gbq.read_gbq(query='SELECT table_name FROM TOURISM.INFORMATION_SCHEMA.TABLES;')

Out[135]:

table_name
0TL
1TT
2FT
3T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…

In [137]:

gbq.read_gbq(query='SELECT table_schema,table_name,column_name,data_type FROM TOURISM.INFORMATION_SCHEMA.COLUMNS WHERE table_name = "T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PER_DAY_2022";')

Out[137]:

table_schematable_namecolumn_namedata_type
0TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…yearINT64
1TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…monthINT64
2TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…district_codeSTRING
3TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…district_nameSTRING
4TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…neighbourhood_codeSTRING
5TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…neighbourhood_nameSTRING
6TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…categorySTRING
7TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…LODGING_TYPESTRING
8TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…NEIGHBOURHOOD_N_PLACESFLOAT64
9TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…PERCENTAGE_BOOKED_PLACESFLOAT64
10TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…PERCENTAGE_FREE_PLACESFLOAT64
11TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…PERCENTAGE_TOTAL_PLACES_BOOKED_BY_NATIONAL_TOU…FLOAT64
12TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…PERCENTAGE_TOTAL_PLACES_BOOKED_BY_FOREIGN_TOUR…FLOAT64
13TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…N_NATIONAL_TOURISTS_PER_DAYFLOAT64
14TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…N_FOREIGN_TOURISTS_PER_DAYFLOAT64
15TOURISMT_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE…TOTAL_N_TOURISTS_PER_DAYFLOAT64

JUPYTER LAB – Barcelona Tourist Lodgings



BCN_TOURISM_CODE – PD_ONLY




TROUBLESHOOTING VERSION

BARCELONA TOURISM

TOURIST LODGINGS – CAPACITY AND DISTRIBUTION BY NEIGHBOURHOOD
REQUIRED LIBRARIES

In [536]:

#FOR DATA EXTRACTION AND CLEANING:
import pandas as pd
import requests

DATA EXTRACTION AND CLEANING

DATAFRAME: DISTRICT_NEIGHBOURHOOD_TABLE

This file contains the codes and names of districts and neighbourhoods used in other projects.
The file is already cleaned and used here as a conversion table to make data compatible for visualizations with spatial files.

In [537]:

df_district_neighbourhood_table = pd.read_excel('F:DPortfolio ProjectsBARCELONA TOURISMT_NEIGHBOURHOODS.xlsx',converters={'District_Code':str,'Neighbourhood_Code':str})
df_district_neighbourhood_table

Out[537]:

District_CodeDistrict_NameNeighbourhood_CodeNeighbourhood_Name
001Ciutat Vella01el Raval
101Ciutat Vella02el Barri Gòtic
201Ciutat Vella03la Barceloneta
301Ciutat Vella04Sant Pere, Santa Caterina i la Ribera
402Eixample05el Fort Pienc
6810Sant Martí69Diagonal Mar i el Front Marítim del Poblenou
6910Sant Martí70el Besòs i el Maresme
7010Sant Martí71Provençals del Poblenou
7110Sant Martí72Sant Martí de Provençals
7210Sant Martí73la Verneda i la Pau

73 rows × 4 columns

DATAFRAME: NUMBER OF PLACES BY ESTABLISHMENT

SOURCE:

https://ajuntament.barcelona.cat/ecologiaurbana/ca/tramits/activitats/cens

http://w121.bcn.cat/APPS/censactivitats/cceatDef.do?reqCode=search

In [538]:

df_np = pd.read_excel('extraccio.xlsx')
df_np.head(1)

Out[538]:

Núm. ExpedientRTCDescripció categoriaEmplaçamentTipus carrerCarrerPrimer númeroPrimera lletraSegon númeroSegona lletraBlocPortalEscalaPisPortaBarriDistricteZonaDescripció zonaNúm. places allotjament turístic
001-90-A-128HB-003893Hotel 3 estrellesGRAVINA 5 7NaNGRAVINA5NaN7.0NaNNaNNaNNaNNaNNaNel RavalCIUTAT VELLAZE-1ZONA DE DECREIXEMENT NATURAL86.0

In [539]:

df_np1 = df_np.copy()

EXTRACT RELEVANT DATA

In [540]:

#DROP THE ZONE AND ZONE_DESCRIPTION COLUMNS
df_np1.drop(columns = ['Zona','Descripció zona'], inplace = True)

In [541]:

#RENAME COLUMNS IN ENGLISH
df_np1.columns = ['n_practice','rtc','category','address','street_type',
                       'street','street_number_1','street_letter_1','street_number_2', 'street_letter_2','block',
                       'entrance','stair','floor','door','neighbourhood_name','district_name','n_places']
df_np1.head(1)

Out[541]:

n_practicertccategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2street_letter_2blockentrancestairfloordoorneighbourhood_namedistrict_namen_places
001-90-A-128HB-003893Hotel 3 estrellesGRAVINA 5 7NaNGRAVINA5NaN7.0NaNNaNNaNNaNNaNNaNel RavalCIUTAT VELLA86.0

In [542]:

#ADD EMPTY COLUMNS FOR RELEVANT DATA TO ADD FROM OTHER TABLES:
df_np1['neighbourhood_code'] = None
df_np1['district_code'] = None
df_np1['longitude'] = None
df_np1['latitude'] = None
df_np1['name'] = None

In [543]:

#REARRANGE COLUMNS IN THE PREFERRED ORDER
df_np1.columns

Out[543]:

Index(['n_practice', 'rtc', 'category', 'address', 'street_type', 'street',
       'street_number_1', 'street_letter_1', 'street_number_2',
       'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door',
       'neighbourhood_name', 'district_name', 'n_places', 'neighbourhood_code',
       'district_code', 'longitude', 'latitude', 'name'],
      dtype='object')

In [544]:

#REARRANGE COLUMNS IN THE PREFERRED ORDER
df_np1 = df_np1[['n_practice', 'rtc', 'name', 'category', 'address', 'street_type', 'street',
       'street_number_1', 'street_letter_1', 'street_number_2',
       'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door',
        'district_code','district_name', 'neighbourhood_code','neighbourhood_name',
        'longitude', 'latitude', 'n_places']]
df_np1.head(1)

Out[544]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
001-90-A-128HB-003893NoneHotel 3 estrellesGRAVINA 5 7NaNGRAVINA5NaN7.0NaNNaNNaNNoneCIUTAT VELLANoneel RavalNoneNone86.0

1 rows × 23 columns

In [545]:

#CHECK THE CATEGORY ATTRIBUTE
df_np1['category'].value_counts()

Out[545]:

Habitatges d'Ús Turístic                 9409
Pensió                                    286
Hotel 4 estrelles                         144
Albergs                                   126
Hotel 3 estrelles                         119
Residències estudiants en sòl de zona      64
Hotel 1 estrella                           47
Hotel 2 estrelles                          42
Hotel 4 estrelles superior                 27
Hotel 5 estrelles                          24
Hotel gran luxe                            20
Hotel-Apart 4 estrelles                    12
Apartaments Turístics                      12
Hotel-Apart 3 estrelles                    10
Hotel-Apart 2 estrelles                     6
Hotel-Apart 1 estrella                      2
Hotel-Apart 4 estrelles superior            1
Name: category, dtype: int64

In [546]:

#SINCE THE STUDY FOCUSES ON ESTABLISHMENT FOR TOURIST, THE CATEGORY 'Residències estudiants en sòl de zona' (STUDENT RESIDENCIES) IS DROPPED
df_np1 = df_np1[df_np1['category']!='Residències estudiants en sòl de zona']
df_np1[df_np1['category']=='Residències estudiants en sòl de zona'].head(1)

Out[546]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places

0 rows × 23 columns

In [547]:

df_np2 = df_np1.copy()

In [548]:

df_np2.head(5)

Out[548]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
001-90-A-128HB-003893NoneHotel 3 estrellesGRAVINA 5 7NaNGRAVINA5NaN7.0NaNNaNNaNNoneCIUTAT VELLANoneel RavalNoneNone86.0
101-87-A-372248HB-003827NonePensióTALLERS 6 8NaNTALLERS6NaN8.0NaNNaNNaNNoneCIUTAT VELLANoneel RavalNoneNone11.0
200-2002-0134HB-004190NoneHotel 4 estrellesAV BOGATELL 64 66AVBOGATELL64NaN66.0NaNNaNNaNNoneSANT MARTINonela Vila Olímpica del PoblenouNoneNone440.0
307-2013-0168HUTB-007570NoneHabitatges d’Ús TurísticAV CAN BARO 22 3 1AVCAN BARO22NaNNaNNaN31NoneHORTA-GUINARDÓNoneCan BaróNoneNone3.0
407-2014-0121HUTB-009724NoneHabitatges d’Ús TurísticAV CAN BARO 3 1 2AVCAN BARO3NaNNaNNaN12NoneHORTA-GUINARDÓNoneCan BaróNoneNone4.0

5 rows × 23 columns

WHITESPACES

In [549]:

#REMOVING SPACES
# .replace(' ','', regex=True) - replace all spaces with nothing
# .str.strip() - replace 1 initial and 1 trailing space only
# .replace(r's+',' ', regex=True) - replace multiple spaces with one single space
# .replace(r'^s+|s+$','',regex=True) - replace all + spaces s starting ^ and trailing $
# .replace('nan','', regex=True) - replace pre-existing 'nan' strings into empty cells - not to be used for string columns potentially containing nan as subpart of string
# .replace('.0','',regex=True) - replace .0 with nothing - '' is required to assign '.' as a normal character and not as a special one
df_np2['n_practice'] = df_np2['n_practice'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_np2['rtc'] = df_np2['rtc'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_np2['category'] = df_np2['category'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_np2['address'] = df_np2['address'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_np2['street_type'] = df_np2['street_type'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_np2['street'] = df_np2['street'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_np2['street_number_1'] = df_np2['street_number_1'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_np2['street_letter_1'] = df_np2['street_letter_1'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_np2['street_number_2'] = df_np2['street_number_2'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_np2['street_letter_2'] = df_np2['street_letter_2'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_np2['block'] = df_np2['block'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_np2['entrance'] = df_np2['entrance'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_np2['stair'] = df_np2['stair'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_np2['floor'] = df_np2['floor'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_np2['door'] = df_np2['door'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_np2['district_name'] = df_np2['district_name'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_np2['neighbourhood_name'] = df_np2['neighbourhood_name'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_np2['n_places'] = df_np2['n_places'].astype(float)

In [550]:

#REPLACE CELL THAT IS ENTIRELY SPACE OR EMPTY with None
df_np2 = df_np2.applymap(lambda x: None if isinstance(x, str) and (x=='' or x.isspace()) else x)

In [551]:

df_np3 = df_np2.copy()

DUPLICATES

In [552]:

df_np3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10287 entries, 0 to 10350
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   n_practice          10287 non-null  object 
 1   rtc                 10287 non-null  object 
 2   name                0 non-null      object 
 3   category            10287 non-null  object 
 4   address             10287 non-null  object 
 5   street_type         10287 non-null  object 
 6   street              10287 non-null  object 
 7   street_number_1     10287 non-null  object 
 8   street_letter_1     125 non-null    object 
 9   street_number_2     886 non-null    object 
 10  street_letter_2     4 non-null      object 
 11  block               14 non-null     object 
 12  entrance            3 non-null      object 
 13  stair               690 non-null    object 
 14  floor               9691 non-null   object 
 15  door                8697 non-null   object 
 16  district_code       0 non-null      object 
 17  district_name       10287 non-null  object 
 18  neighbourhood_code  0 non-null      object 
 19  neighbourhood_name  10287 non-null  object 
 20  longitude           0 non-null      object 
 21  latitude            0 non-null      object 
 22  n_places            10270 non-null  float64
dtypes: float64(1), object(22)
memory usage: 1.9+ MB

In [553]:

#PRELIMINARY CHECK FOR DUPLICATES
df_np3.duplicated().value_counts()

Out[553]:

False    10287
dtype: int64
ADDRESS

In [554]:

#CHECK FOR DUPLICATES BY EXCLUDING: N_PRACTICE, RTC AND N_PLACES - TO SEE IF THERE ARE DUPLICATES ONLY IN TERMS OF CATEGORY AND ADDRESS
df_np3[df_np3.duplicated(subset=['category','address','street_type',
                       'street','street_number_1','street_letter_1','street_number_2', 'street_letter_2','block',
                       'entrance','stair','floor','door','neighbourhood_name','district_name'], keep=False)].sort_values('rtc')

Out[554]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
450205-2009-0237ALB-427NoneAlbergsC MAJOR DEL RECTORET 2CMAJOR DEL RECTORET2NoneNoneNoneNoneNoneNoneSARRIA-SANT GERVASINoneVallvidrera, el Tibidabo i les PlanesNoneNone247.0
450305-2004-0005ALB-427NoneAlbergsC MAJOR DEL RECTORET 2CMAJOR DEL RECTORET2NoneNoneNoneNoneNoneNoneSARRIA-SANT GERVASINoneVallvidrera, el Tibidabo i les PlanesNoneNoneNaN

2 rows × 23 columns

In [555]:

#FROM THE PREVIOUS CHECK, THERE APPEAR TO BE 1 CASE WITH 2 DISTINCT N_PRACTICE RECORDS WHERE THE VARIABLE OF INTEREST - N_PLACES - IS ONLY PRESENT IN ONE OF THE DUPLICATED VALUES. 
#TO SELECT THE DUPLICATED VALUES WITH NO VARIABLE OF INTEREST:
df_np3[df_np3.duplicated(subset=['category','address','street_type',
                       'street','street_number_1','street_letter_1','street_number_2', 'street_letter_2','block',
                       'entrance','stair','floor','door','neighbourhood_name','district_name'], keep=False) & (df_np2['n_places'].isnull() | (df_np2['n_places']==''))]

Out[555]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
450305-2004-0005ALB-427NoneAlbergsC MAJOR DEL RECTORET 2CMAJOR DEL RECTORET2NoneNoneNoneNoneNoneNoneSARRIA-SANT GERVASINoneVallvidrera, el Tibidabo i les PlanesNoneNoneNaN

1 rows × 23 columns

In [556]:

#DROP THE DUPLICATED CASES WITH NO VARIBLE OF INTEREST
df_np3.drop(df_np3[(df_np3.duplicated(subset=['category','address','street_type',
                       'street','street_number_1','street_letter_1','street_number_2', 'street_letter_2','block',
                       'entrance','stair','floor','door','neighbourhood_name','district_name'], keep=False)) 
                   & (df_np3['n_places'].isnull() | (df_np3['n_places']==''))].index, inplace=True)

In [557]:

#VERIFY THAT THE RIGHT RECORDS ARE GONE
df_np3[df_np3['n_practice']=='05-2004-0005']

Out[557]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places

0 rows × 23 columns

In [558]:

df_np4 = df_np3.copy()
N_PRACTICE

In [559]:

#CHECK IF N_PRACTICE CONTAINS NULL VALUES
df_np4[(df_np4['n_practice'].isnull()) | (df_np4['n_practice']== None) 
                    | (df_np4['n_practice']=='nan') | (df_np4['n_practice']=='')]

Out[559]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places

0 rows × 23 columns

In [560]:

#CHECK IF N_PRACTICE CONTAINS DUPLICATED VALUES
df_np4['n_practice'].value_counts()

Out[560]:

10-2001-0694    2
06-2017-0212    2
02-2015-0048    2
06-2010-0423    2
01-90-A-128     1
               ..
06-2013-0508    1
06-2012-0606    1
06-2012-0353    1
06-2014-0361    1
01-NT-0075      1
Name: n_practice, Length: 10282, dtype: int64

In [561]:

#FIND N_PRACTICE DUPLICATED VALUES
df_np4[df_np4.duplicated(subset=['n_practice'], keep=False)].sort_values('n_practice')

Out[561]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
873702-2015-0048ALB-625NoneAlbergsG.V. CORTS CATALANES 580 BJG.V.CORTS CATALANES580NoneNoneNoneBJNoneNoneL’EIXAMPLENoneSant AntoniNoneNone216.0
873802-2015-0048HB-004629NoneHotel 1 estrellaG.V. CORTS CATALANES 580 BJG.V.CORTS CATALANES580NoneNoneNoneBJNoneNoneL’EIXAMPLENoneSant AntoniNoneNone42.0
260906-2010-0423ALB-556NoneAlbergsC CORSEGA 373 375CCORSEGA373None375NoneNoneNoneNoneGRACIANonela Vila de GràciaNoneNone646.0
261006-2010-0423HB-004525NoneHotel 1 estrellaC CORSEGA 373 375CCORSEGA373None375NoneNoneNoneNoneGRACIANonela Vila de GràciaNoneNone81.0
903806-2017-0212ALB-565NoneAlbergsPG GRACIA 116PGGRACIA116NoneNoneNoneNoneNoneNoneGRACIANonela Vila de GràciaNoneNone446.0
903906-2017-0212HB-004682NoneHotel 1 estrellaPG GRACIA 116PGGRACIA116NoneNoneNoneNoneNoneNoneGRACIANonela Vila de GràciaNoneNone23.0
566710-2001-0694HB-004532NoneHotel 5 estrellesC PERE IV 272CPERE IV272NoneNoneNoneNoneNoneNoneSANT MARTINoneel PoblenouNoneNone86.0
566810-2001-0694HB-004358NoneHotel 4 estrelles superiorC PERE IV 272CPERE IV272NoneNoneNoneNoneNoneNoneSANT MARTINoneel PoblenouNoneNone430.0

8 rows × 23 columns

The records above appear to be related to cases where establishments have sections belonging to different categories within the same establishment – therefore, these are not dropped

In [562]:

df_np5 = df_np4.copy()
RTC

In [563]:

#CHECK IF RTC CONTAINS NULL VALUES
df_np5[(df_np5['rtc'].isnull()) | (df_np5['rtc']== None) 
                    | (df_np5['rtc']=='nan') | (df_np5['rtc']=='')]

Out[563]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places

0 rows × 23 columns

In [564]:

#CHECK IF RTC CONTAINS DUPLICATED VALUES
df_np5['rtc'].value_counts().head(6)

Out[564]:

Pendent      68
HB-000957     3
HB-001951     2
HB-004669     2
ALB-562       2
HB-003893     1
Name: rtc, dtype: int64

In [565]:

#RTC DUPLICATED VALUES WITH "Pendent" (PENDING) VALUES - CATEGORY TYPES
df_np5[df_np5['rtc'] == 'Pendent'].value_counts('category')

Out[565]:

category
Habitatges d'Ús Turístic    48
Pensió                       7
Hotel 3 estrelles            4
Hotel 4 estrelles            3
Albergs                      2
Hotel 5 estrelles            2
Hotel 1 estrella             1
Hotel 2 estrelles            1
dtype: int64

In [566]:

#RTC DUPLICATED VALUES - EXCLUDING "Pendent" (PENDING) VALUES
df_np5[(df_np5.duplicated(subset=['rtc'], keep=False)) & (df_np5['rtc'] != 'Pendent')].sort_values('rtc')

Out[566]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
14805-2013-0284ALB-562NoneAlbergsAV DIAGONAL 578 3AVDIAGONAL578NoneNoneNone3NoneNoneSARRIA-SANT GERVASINoneSant Gervasi – GalvanyNoneNone19.0
14905-2016-0268ALB-562NoneAlbergsAV DIAGONAL 578 5AVDIAGONAL578NoneNoneNone5NoneNoneSARRIA-SANT GERVASINoneSant Gervasi – GalvanyNoneNone19.0
342903-1999-0004HB-000957NonePensióC FONTRODONA 1 1 1CFONTRODONA1NoneNoneNone11NoneSANTS-MONTJUÏCNoneel Poble SecNoneNoneNaN
343003-1998-0472HB-000957NonePensióC FONTRODONA 1 2 3CFONTRODONA1NoneNoneNone23NoneSANTS-MONTJUÏCNoneel Poble SecNoneNoneNaN
343103-2002-0222HB-000957NonePensióC FONTRODONA 1 EN 3CFONTRODONA1NoneNoneNoneEN3NoneSANTS-MONTJUÏCNoneel Poble SecNoneNone23.0
752801-89-A-200HB-001951NonePensióC TALLERS 82 1º 1ªCTALLERS82NoneNoneNoneNoneCIUTAT VELLANoneel RavalNoneNone25.0
752901-91-A-116HB-001951NonePensióC TALLERS 82 2º 1ªCTALLERS82NoneNoneNoneNoneCIUTAT VELLANoneel RavalNoneNoneNaN
648502-2016-1310HB-004669NoneHotel 4 estrellesC ROGER DE LLURIA 17CROGER DE LLURIA17NoneNoneNoneNoneNoneNoneL’EIXAMPLENonela Dreta de l’EixampleNoneNone68.0
874002-2016-1367HB-004669NonePensióG.V. CORTS CATALANES 584 3 2G.V.CORTS CATALANES584NoneNoneNone32NoneL’EIXAMPLENoneSant AntoniNoneNone14.0

9 rows × 23 columns

The first 7 records refer to different floors within the same establishments.
However, only in one case the variable of interest – n_places – is indicated for all floors.
In the other cases, the varible of interest – n_places – is indicated only in one of the floors.

Records are not modified as it might be possible to correct them by crossreferencing them with other tables later.

The last 2 records might be a case where the RTC code was wrongly inputed as the same RTC code refers to 2 completely different establishments.
By crossreferencing with later tables, a match was found only for n_practice 02-2016-1310. Therefore, the rtc value for the other record n_practice 02-2016-1367 is modified with the suffix ERROR.

In [567]:

df_np5.loc[df_np5['n_practice']=='02-2016-1367', 'rtc'] = 'ERROR-HB-004669'
df_np5[df_np5['n_practice']=='02-2016-1367']

Out[567]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
874002-2016-1367ERROR-HB-004669NonePensióG.V. CORTS CATALANES 584 3 2G.V.CORTS CATALANES584NoneNoneNone32NoneL’EIXAMPLENoneSant AntoniNoneNone14.0

1 rows × 23 columns

In [568]:

df_np5[df_np5['n_practice']=='02-2016-1310']

Out[568]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
648502-2016-1310HB-004669NoneHotel 4 estrellesC ROGER DE LLURIA 17CROGER DE LLURIA17NoneNoneNoneNoneNoneNoneL’EIXAMPLENonela Dreta de l’EixampleNoneNone68.0

1 rows × 23 columns

In [569]:

df_np6 = df_np5.copy()

MISSING VALUES

In [570]:

#QUICK CHECK FOR MISSING VALUES
df_np6.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10286 entries, 0 to 10350
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   n_practice          10286 non-null  object 
 1   rtc                 10286 non-null  object 
 2   name                0 non-null      object 
 3   category            10286 non-null  object 
 4   address             10286 non-null  object 
 5   street_type         10286 non-null  object 
 6   street              10286 non-null  object 
 7   street_number_1     10286 non-null  object 
 8   street_letter_1     125 non-null    object 
 9   street_number_2     886 non-null    object 
 10  street_letter_2     4 non-null      object 
 11  block               14 non-null     object 
 12  entrance            3 non-null      object 
 13  stair               690 non-null    object 
 14  floor               9691 non-null   object 
 15  door                8697 non-null   object 
 16  district_code       0 non-null      object 
 17  district_name       10286 non-null  object 
 18  neighbourhood_code  0 non-null      object 
 19  neighbourhood_name  10286 non-null  object 
 20  longitude           0 non-null      object 
 21  latitude            0 non-null      object 
 22  n_places            10270 non-null  float64
dtypes: float64(1), object(22)
memory usage: 1.9+ MB
N_PLACES

In [571]:

#CHECK WHICH CATEGORIES HAVE MISSING VALUES OF INTEREST - N PLACES
df_np6[(df_np6['n_places'].isnull()) | (df_np6['n_places']== None) 
                    | (df_np6['n_places']=='nan') | (df_np6['n_places']=='')].value_counts('category')

Out[571]:

category
Habitatges d'Ús Turístic    13
Pensió                       3
dtype: int64

NO MODIFICATIONS MADE FOR NOW HOW AS IT MIGHT BE POSSIBLE TO RECOVER SOME OF THE DATA FROM THE OTHER TABLES

DISTRICT – NEIGHBOURHOOD

Comparing table with district and neighbourhood table

In [572]:

df_np6[(df_np6['district_name'].isnull()) | (df_np6['district_name']== None)
                    | (df_np6['district_name']=='nan') | (df_np6['district_name']=='')].shape[0]

Out[572]:

9

In [573]:

df_np6[(df_np6['neighbourhood_name'].isnull()) | (df_np6['neighbourhood_name']== None) 
                    | (df_np6['neighbourhood_name']=='nan') | (df_np6['neighbourhood_name']=='')].shape[0]

Out[573]:

9

In [574]:

df_np6['district_name'].isin(df_district_neighbourhood_table['District_Name']).value_counts()

Out[574]:

False    10286
Name: district_name, dtype: int64

In [575]:

df_np6['neighbourhood_name'].isin(df_district_neighbourhood_table['Neighbourhood_Name']).value_counts()

Out[575]:

True     9759
False     527
Name: neighbourhood_name, dtype: int64

In [576]:

df_np6[['district_name','neighbourhood_name']][(~df_np6['neighbourhood_name'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])) 
                                               & (df_np6['neighbourhood_name'].notnull())].sort_values('neighbourhood_name')

Out[576]:

district_nameneighbourhood_name
5279SANTS-MONTJUÏCel Poble Sec
6080SANTS-MONTJUÏCel Poble Sec
6079SANTS-MONTJUÏCel Poble Sec
6078SANTS-MONTJUÏCel Poble Sec
6034SANTS-MONTJUÏCel Poble Sec
6312nannan
4032nannan
7050nannan
4276nannan
318nannan

527 rows × 2 columns

In [577]:

df_district_neighbourhood_table[df_district_neighbourhood_table['Neighbourhood_Name'].str.contains('Poble')]

Out[577]:

District_CodeDistrict_NameNeighbourhood_CodeNeighbourhood_Name
1003Sants-Montjuïc11el Poble-sec
6510Sant Martí66el Parc i la Llacuna del Poblenou
6610Sant Martí67la Vila Olímpica del Poblenou
6710Sant Martí68el Poblenou
6810Sant Martí69Diagonal Mar i el Front Marítim del Poblenou
7010Sant Martí71Provençals del Poblenou

In [578]:

df_np6.loc[df_np6['neighbourhood_name']=='el Poble Sec','neighbourhood_name'] = 'el Poble-sec'

In [579]:

df_np6[['district_name','neighbourhood_name']][(~df_np6['neighbourhood_name'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])) 
                                               & (df_np6['neighbourhood_name'].notnull())].sort_values('neighbourhood_name')

Out[579]:

district_nameneighbourhood_name
318nannan
4032nannan
4276nannan
6310nannan
6311nannan
6312nannan
7050nannan
7617nannan
7686nannan

In [580]:

df_np6[(df_np6['neighbourhood_name'].isnull()) | (df_np6['neighbourhood_name']=='') | (df_np6['neighbourhood_name']==None)| (df_np6['neighbourhood_name']=='nan') 
       | (df_np6['district_name'].isnull()) | (df_np6['district_name']=='') | (df_np6['district_name']==None)| (df_np6['district_name']=='nan')]

Out[580]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
31810-2014-0475HUTB-011359NoneHabitatges d’Ús TurísticAV MERIDIANA 109 03 1AVMERIDIANA109NoneNoneNone031NonenanNonenanNoneNone4.0
403202-2013-0207HUTB-005185NoneHabitatges d’Ús TurísticC INDUSTRIA 175 02 2CINDUSTRIA175NoneNoneNone022NonenanNonenanNoneNone4.0
427606-2014-0281HUTB-009991NoneHabitatges d’Ús TurísticC JOSEP TORRES 26 04 2CJOSEP TORRES26NoneNoneNone042NonenanNonenanNoneNone6.0
631006-2013-0206HUTB-005620NoneHabitatges d’Ús TurísticC RIERA DE SANT MIQUEL 49 01 3CRIERA DE SANT MIQUEL49NoneNoneNone013NonenanNonenanNoneNone6.0
631106-2013-0207HUTB-005624NoneHabitatges d’Ús TurísticC RIERA DE SANT MIQUEL 49 02 1CRIERA DE SANT MIQUEL49NoneNoneNone021NonenanNonenanNoneNone6.0
631206-2013-0208HUTB-005622NoneHabitatges d’Ús TurísticC RIERA DE SANT MIQUEL 49 02 2CRIERA DE SANT MIQUEL49NoneNoneNone022NonenanNonenanNoneNone6.0
705006-2012-0286HUTB-002334NoneHabitatges d’Ús TurísticC SANT SALVADOR 20 02 1CSANT SALVADOR20NoneNoneNone021NonenanNonenanNoneNone6.0
761704-2014-0216HUTB-012140NoneHabitatges d’Ús TurísticC TAQUIGRAF SERRA 1 02 5CTAQUIGRAF SERRA1NoneNoneNone025NonenanNonenanNoneNone3.0
768602-2022-0524HUTB-064279NoneHabitatges d’Ús TurísticC TARRAGONA 84 B 5 1CTARRAGONA84NoneNoneB51NonenanNonenanNoneNone1.0

9 rows × 23 columns

MANUAL INPUTATION AS MISSING RECORDS ARE ONLY 7

AV MERIDIANA 109 – neighbourhood_name: el Clot neighbourhood_code: 65
C INDUSTRIA 175 – neighbourhood_name: la Sagrada Família neighbourhood_code: 06
C JOSEP TORRES 26 – neighbourhood_name: la Vila de Gràcia neighbourhood_code: 31
C RIERA DE SANT MIQUEL 49 – neighbourhood_name: la Vila de Gràcia neighbourhood_code: 31
C SANT SALVADOR 20 – neighbourhood_name: la Vila de Gràcia neighbourhood_code: 31
C TAQUIGRAF SERRA 1 – neighbourhood_name: les Corts neighbourhood_code: 19
C TARRAGONA 84 – neighbourhood_name: la Nova Esquerra de l’Eixample neighbourhood_code: 09

SOURCE:
https://ajuntament.barcelona.cat/estadistica/catala/Territori/div84/convertidors/barris73.htm
In [581]:

df_np7 = df_np6.copy()

In [582]:

#FILL IN MISSING VALUES
df_np7.loc[df_np7['n_practice']=='10-2014-0475','neighbourhood_code'] = "65"
df_np7.loc[df_np7['n_practice']=='10-2014-0475','neighbourhood_name'] = "el Clot"
df_np7.loc[df_np7['n_practice']=='02-2013-0207','neighbourhood_code'] = "06"
df_np7.loc[df_np7['n_practice']=='02-2013-0207','neighbourhood_name'] = "la Sagrada Família"
df_np7.loc[df_np7['n_practice']=='06-2014-0281','neighbourhood_code'] = "31"
df_np7.loc[df_np7['n_practice']=='06-2014-0281','neighbourhood_name'] = "la Vila de Gràcia"
df_np7.loc[df_np7['n_practice']=='06-2013-0206','neighbourhood_code'] = "31"
df_np7.loc[df_np7['n_practice']=='06-2013-0206','neighbourhood_name'] = "la Vila de Gràcia"
df_np7.loc[df_np7['n_practice']=='06-2013-0207','neighbourhood_code'] = "31"
df_np7.loc[df_np7['n_practice']=='06-2013-0207','neighbourhood_name'] = "la Vila de Gràcia"
df_np7.loc[df_np7['n_practice']=='06-2013-0208','neighbourhood_code'] = "31"
df_np7.loc[df_np7['n_practice']=='06-2013-0208','neighbourhood_name'] = "la Vila de Gràcia"
df_np7.loc[df_np7['n_practice']=='06-2012-0286','neighbourhood_code'] = '31'
df_np7.loc[df_np7['n_practice']=='06-2012-0286','neighbourhood_name'] = 'la Vila de Gràcia'
df_np7.loc[df_np7['n_practice']=='04-2014-0216','neighbourhood_code'] = '19'
df_np7.loc[df_np7['n_practice']=='04-2014-0216','neighbourhood_name'] = 'les Corts'
df_np7.loc[df_np7['n_practice']=='02-2022-0524','neighbourhood_code'] = "09"
df_np7.loc[df_np7['n_practice']=='02-2022-0524','neighbourhood_name'] = "la Nova Esquerra de l'Eixample"

In [583]:

#MAPPING/REPLACING REQUIRES NO NULL VALUES IN COLUMN LINKED TO set_index COLUMN
df_np7['neighbourhood_code'] = df_np7['neighbourhood_name'].replace(df_district_neighbourhood_table.set_index('Neighbourhood_Name')['Neighbourhood_Code'])
df_np7['district_code'] = df_np7['neighbourhood_name'].replace(df_district_neighbourhood_table.set_index('Neighbourhood_Name')['District_Code'])
df_np7['district_name'] = df_np7['neighbourhood_name'].replace(df_district_neighbourhood_table.set_index('Neighbourhood_Name')['District_Name'])

In [584]:

#CHECK
df_np7[(df_np7['neighbourhood_name'].isnull()) | (df_np7['neighbourhood_name']=='nan') |  (df_np7['neighbourhood_name']==None) |  (df_np7['neighbourhood_name']=='')]

Out[584]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places

0 rows × 23 columns

It might be possible to find and fill in the missing data later on when crossreferencing with other tables. Therefore, records kept incomplete for now.

In [585]:

df_np8 = df_np7.copy()

NORMALIZATION

In [586]:

df_np8.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10286 entries, 0 to 10350
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   n_practice          10286 non-null  object 
 1   rtc                 10286 non-null  object 
 2   name                0 non-null      object 
 3   category            10286 non-null  object 
 4   address             10286 non-null  object 
 5   street_type         10286 non-null  object 
 6   street              10286 non-null  object 
 7   street_number_1     10286 non-null  object 
 8   street_letter_1     125 non-null    object 
 9   street_number_2     886 non-null    object 
 10  street_letter_2     4 non-null      object 
 11  block               14 non-null     object 
 12  entrance            3 non-null      object 
 13  stair               690 non-null    object 
 14  floor               9691 non-null   object 
 15  door                8697 non-null   object 
 16  district_code       10286 non-null  object 
 17  district_name       10286 non-null  object 
 18  neighbourhood_code  10286 non-null  object 
 19  neighbourhood_name  10286 non-null  object 
 20  longitude           0 non-null      object 
 21  latitude            0 non-null      object 
 22  n_places            10270 non-null  float64
dtypes: float64(1), object(22)
memory usage: 1.9+ MB

In [587]:

df_np8['category'].value_counts()

Out[587]:

Habitatges d'Ús Turístic            9409
Pensió                               286
Hotel 4 estrelles                    144
Albergs                              125
Hotel 3 estrelles                    119
Hotel 1 estrella                      47
Hotel 2 estrelles                     42
Hotel 4 estrelles superior            27
Hotel 5 estrelles                     24
Hotel gran luxe                       20
Hotel-Apart 4 estrelles               12
Apartaments Turístics                 12
Hotel-Apart 3 estrelles               10
Hotel-Apart 2 estrelles                6
Hotel-Apart 1 estrella                 2
Hotel-Apart 4 estrelles superior       1
Name: category, dtype: int64

In [588]:

df_n_places = df_np8.copy()

DATAFRAME: COORDINATES FOR TOURIST ACCOMODATIONS IN PRIVATE HOUSES – hut – E.G. AIRBNB

SOURCE:

https://opendata-ajuntament.barcelona.cat/data/en/dataset/habitatges-us-turistic/resource/b32fa7f6-d464-403b-8a02-0292a64883bf

The csv file provided needs cleaning. The problem is created by the neighbourhood: “Sant Pere, Santa Caterina i la Ribera”. The comma within the neighbourhood name split the corresponding column into 2: “Sant Pere” and “Santa Caterina i la Ribera”. Data for that neighbourhood is therefore moved one column on the right, making the file appear to have one more column than it should. This prevents Pandas to allocate data within the apprpriate columns.
A way to clean the data is to:

  • save the file as an excel file
  • split values into columns by using comma as a delimiter
  • frame the data as a table
  • sort all results by including only those referring to “Sant Pere, Santa Caterina i la Ribera”
  • find and replace “Sant Pere” with “Sant Pere, Santa Caterina i la Ribera”
  • cut and paste all data on the right of the column “Santa Caterina i la Ribera” on the same column to eliminate it and reposition the data in the correct place

In [589]:

df_hut_coordinates = pd.read_excel('2022_Hut_comunicacio_cleaned.xlsx')
df_hut_coordinates.head(1)

Out[589]:

N_EXPEDIENTCODI_DISTRICTEDISTRICTECODI_BARRIBARRITIPUS_CARRERCARRERTIPUS_NUMNUM1LLETRA1LLETRA2BLOCPORTALESCALAPISPORTANUMERO_REGISTRE_GENERALITATNUMERO_PLACESLONGITUD_XLATITUD_Y
003-2010-04373SANTS-MONTJUÏC16.0la BordetaCarrerCONSTITUCIO1127NaNNaNNaNNaNNaN42HUTB-0035027.02.13221541.367195

1 rows × 21 columns

In [590]:

df_hut_coordinates1 = df_hut_coordinates.copy()

EXTRACT RELEVANT DATA

In [591]:

#DROP COLUMNS
df_hut_coordinates1.drop(columns=['TIPUS_NUM'], inplace=True)

In [592]:

#RENAME COLUMNS IN ENGLISH
df_hut_coordinates1.columns = ['n_practice_hut','district_code_hut','district_name_hut','neighbourhood_code_hut',
                               'neighbourhood_name_hut','street_type_hut','street_hut',
                               'street_number_1_hut','street_letter_1_hut','street_number_2_hut','street_letter_2_hut',
                               'block_hut','entrance_hut','stair_hut','floor_hut','door_hut','rtc_hut','n_places_hut',
                               'longitude_hut','latitude_hut']
df_hut_coordinates1.head(1)

Out[592]:

n_practice_hutdistrict_code_hutdistrict_name_hutneighbourhood_code_hutneighbourhood_name_hutstreet_type_hutstreet_hutstreet_number_1_hutstreet_letter_1_hutstreet_number_2_hutstreet_letter_2_hutblock_hutentrance_hutstair_hutfloor_hutdoor_hutrtc_hutn_places_hutlongitude_hutlatitude_hut
003-2010-04373SANTS-MONTJUÏC16.0la BordetaCarrerCONSTITUCIO127NaN129.0NaNNaNNaNNaN42HUTB-0035027.02.13221541.367195

In [593]:

#ADD COLUMN NAME WITH CATEGORY REPEATED AS THERE ARE NO NAMES FOR HUTS
df_hut_coordinates1['name_hut'] = "Habitatges d'Ús Turístic"
df_hut_coordinates1.head(1)

Out[593]:

n_practice_hutdistrict_code_hutdistrict_name_hutneighbourhood_code_hutneighbourhood_name_hutstreet_type_hutstreet_hutstreet_number_1_hutstreet_letter_1_hutstreet_number_2_hutblock_hutentrance_hutstair_hutfloor_hutdoor_hutrtc_hutn_places_hutlongitude_hutlatitude_hutname_hut
003-2010-04373SANTS-MONTJUÏC16.0la BordetaCarrerCONSTITUCIO127NaN129.0NaNNaNNaN42HUTB-0035027.02.13221541.367195Habitatges d’Ús Turístic

1 rows × 21 columns

WHITESPACES

In [594]:

#REMOVING SPACES
# .replace(' ','', regex=True) - replace all spaces with nothing
# .str.strip() - replace 1 initial and 1 trailing space only
# .replace(r's+',' ', regex=True) - replace multiple spaces with one single space
# .replace(r'^s+|s+$','',regex=True) - replace all + spaces s starting ^ and trailing $
# .replace('nan','', regex=True) - replace pre-existing 'nan' strings into empty cells - not to be used for string columns potentially containing nan as subpart of string
# .replace('.0','',regex=True) - replace .0 with nothing - '' is required to assign '.' as a normal character and not as a special one
df_hut_coordinates1['n_practice_hut'] = df_hut_coordinates1['n_practice_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['rtc_hut'] = df_hut_coordinates1['rtc_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['street_type_hut'] = df_hut_coordinates1['street_type_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_hut_coordinates1['street_hut'] = df_hut_coordinates1['street_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_hut_coordinates1['street_number_1_hut'] = df_hut_coordinates1['street_number_1_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['street_letter_1_hut'] = df_hut_coordinates1['street_letter_1_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['street_number_2_hut'] = df_hut_coordinates1['street_number_2_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['street_letter_2_hut'] = df_hut_coordinates1['street_letter_2_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['block_hut'] = df_hut_coordinates1['block_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['entrance_hut'] = df_hut_coordinates1['entrance_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['stair_hut'] = df_hut_coordinates1['stair_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['floor_hut'] = df_hut_coordinates1['floor_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['door_hut'] = df_hut_coordinates1['door_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['district_code_hut'] = df_hut_coordinates1['district_code_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['neighbourhood_code_hut'] = df_hut_coordinates1['neighbourhood_code_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['district_name_hut'] = df_hut_coordinates1['district_name_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_hut_coordinates1['neighbourhood_name_hut'] = df_hut_coordinates1['neighbourhood_name_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_hut_coordinates1['longitude_hut'] = df_hut_coordinates1['longitude_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['latitude_hut'] = df_hut_coordinates1['latitude_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hut_coordinates1['n_places_hut'] = df_hut_coordinates1['n_places_hut'].astype(float)
df_hut_coordinates1['name_hut'] = df_hut_coordinates1['name_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)

In [595]:

#DISTRICT AND NEIGHBOURHOOD CODES NEED TO BE IN STRING FORMAT AND REQUIRE AN ADDED '0' IN FRONT OF ALL NUMBERS BELOW 10
df_hut_coordinates1[['district_code_hut']] = df_hut_coordinates1[['district_code_hut']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
df_hut_coordinates1[['neighbourhood_code_hut']] = df_hut_coordinates1[['neighbourhood_code_hut']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])

In [596]:

#REPLACE CELL THAT IS ENTIRELY SPACE OR EMPTY with None
df_hut_coordinates1 = df_hut_coordinates1.applymap(lambda x: None if isinstance(x, str) and (x=='' or x.isspace()) else x)

In [597]:

df_hut_coordinates2 = df_hut_coordinates1.copy()

DUPLICATES

In [598]:

#PRELIMINARY CHECK FOR DUPLICATES
df_hut_coordinates2.duplicated().value_counts()

Out[598]:

False    9409
dtype: int64

In [599]:

df_hut_coordinates2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9409 entries, 0 to 9408
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   n_practice_hut          9409 non-null   object 
 1   district_code_hut       9409 non-null   object 
 2   district_name_hut       9409 non-null   object 
 3   neighbourhood_code_hut  9405 non-null   object 
 4   neighbourhood_name_hut  9409 non-null   object 
 5   street_type_hut         9409 non-null   object 
 6   street_hut              9409 non-null   object 
 7   street_number_1_hut     9409 non-null   object 
 8   street_letter_1_hut     116 non-null    object 
 9   street_number_2_hut     747 non-null    object 
 10  street_letter_2_hut     3 non-null      object 
 11  block_hut               10 non-null     object 
 12  entrance_hut            3 non-null      object 
 13  stair_hut               689 non-null    object 
 14  floor_hut               9378 non-null   object 
 15  door_hut                8528 non-null   object 
 16  rtc_hut                 9361 non-null   object 
 17  n_places_hut            9396 non-null   float64
 18  longitude_hut           9409 non-null   object 
 19  latitude_hut            9409 non-null   object 
 20  name_hut                9409 non-null   object 
dtypes: float64(1), object(20)
memory usage: 1.5+ MB

In [600]:

df_hut_coordinates2.columns

Out[600]:

Index(['n_practice_hut', 'district_code_hut', 'district_name_hut',
       'neighbourhood_code_hut', 'neighbourhood_name_hut', 'street_type_hut',
       'street_hut', 'street_number_1_hut', 'street_letter_1_hut',
       'street_number_2_hut', 'street_letter_2_hut', 'block_hut',
       'entrance_hut', 'stair_hut', 'floor_hut', 'door_hut', 'rtc_hut',
       'n_places_hut', 'longitude_hut', 'latitude_hut', 'name_hut'],
      dtype='object')

In [601]:

#CHECK DUPLICATES BASED ON ADDRESS ONLY - EXCLUDING ID COLUMNS - 'n_practice_hut', 'rtc_hut': 
#TO SEE IF THERE ARE DUPLICATES ONLY IN TERMS OF ADDRESS
df_hut_coordinates2[df_hut_coordinates2.duplicated(subset=['district_code_hut', 'district_name_hut', 'neighbourhood_code_hut', 'neighbourhood_name_hut',
                                                           'street_type_hut', 'street_hut', 'street_number_1_hut',
                                                           'street_letter_1_hut', 'street_number_2_hut', 'street_letter_2_hut','block_hut', 
                                                           'entrance_hut', 'stair_hut', 'floor_hut', 'door_hut','n_places_hut','longitude_hut', 'latitude_hut'], keep=False)].shape[0]

Out[601]:

0

In [602]:

#CHECK FOR DUPLICATES BY FOCUSING ON ID COLUMN: 'rtc_hut' - EXCLUDING NULL VALUES
df_hut_coordinates1[(df_hut_coordinates1.duplicated(subset=['rtc_hut'], keep=False)) & df_hut_coordinates1['rtc_hut'].notnull()].shape[0]

Out[602]:

0

In [603]:

#CHECK FOR DUPLICATES BY FOCUSING ON ID COLUMN: 'n_practice_hut'
df_hut_coordinates1[df_hut_coordinates1.duplicated(subset=['n_practice_hut'], keep=False)].shape[0]

Out[603]:

0

In [604]:

df_hut_coordinates3 = df_hut_coordinates2.copy()

MISSING VALUES

In [605]:

#CHECK FOR MISSING VALUES
df_hut_coordinates3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9409 entries, 0 to 9408
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   n_practice_hut          9409 non-null   object 
 1   district_code_hut       9409 non-null   object 
 2   district_name_hut       9409 non-null   object 
 3   neighbourhood_code_hut  9405 non-null   object 
 4   neighbourhood_name_hut  9409 non-null   object 
 5   street_type_hut         9409 non-null   object 
 6   street_hut              9409 non-null   object 
 7   street_number_1_hut     9409 non-null   object 
 8   street_letter_1_hut     116 non-null    object 
 9   street_number_2_hut     747 non-null    object 
 10  street_letter_2_hut     3 non-null      object 
 11  block_hut               10 non-null     object 
 12  entrance_hut            3 non-null      object 
 13  stair_hut               689 non-null    object 
 14  floor_hut               9378 non-null   object 
 15  door_hut                8528 non-null   object 
 16  rtc_hut                 9361 non-null   object 
 17  n_places_hut            9396 non-null   float64
 18  longitude_hut           9409 non-null   object 
 19  latitude_hut            9409 non-null   object 
 20  name_hut                9409 non-null   object 
dtypes: float64(1), object(20)
memory usage: 1.5+ MB
N_PLACES

In [606]:

#CHECK MISSING VALUES OF INTEREST - N PLACES
df_hut_coordinates2[(df_hut_coordinates2['n_places_hut'].isnull()) 
                    | (df_hut_coordinates2['n_places_hut']=='') 
                    | (df_hut_coordinates2['n_places_hut']=='nan') 
                    | ((df_hut_coordinates2['n_places_hut']=='None'))].shape[0]

Out[606]:

13

FILLING INTO MISSING VALUES WILL BE OPERATED AFTER MERGING TABLES AS IT MIGHT BE POSSIBLE TO RETRIEVE DATA FROM THE OTHER TABLE

In [607]:

df_hut_coordinates3 = df_hut_coordinates2.copy()
DISTRICT – NEIGHBOURHOOD

Comparing table with district and neighbourhood table

In [608]:

#CHECK NULL VALUES
df_hut_coordinates3[(df_hut_coordinates3['neighbourhood_code_hut'].isnull()) | (df_hut_coordinates3['neighbourhood_code_hut']== None) 
                    | (df_hut_coordinates3['neighbourhood_code_hut']=='') | (df_hut_coordinates3['neighbourhood_code_hut']=='')]

Out[608]:

n_practice_hutdistrict_code_hutdistrict_name_hutneighbourhood_code_hutneighbourhood_name_hutstreet_type_hutstreet_hutstreet_number_1_hutstreet_letter_1_hutstreet_number_2_hutblock_hutentrance_hutstair_hutfloor_hutdoor_hutrtc_hutn_places_hutlongitude_hutlatitude_hutname_hut
755610-2013-046710SANT MARTINoneel PoblenounanPIQUER29None37NoneNoneNoneNone37HUTB-00763411.02.2067670241.40025623Habitatges d’Ús Turístic
940106-2012-028606GRACIANonenannanSANT SALVADOR20NoneNoneNoneNoneNone21HUTB-0023346.02.15235733941.40509017Habitatges d’Ús Turístic
940610-2014-047510SANT MARTINonenannanMERIDIANA109NoneNoneNoneNoneNone31HUTB-0113594.02.18546192741.40601944Habitatges d’Ús Turístic
940804-2014-021604LES CORTSNonenannanTAQUIGRAF SERRA1NoneNoneNoneNoneNone25HUTB-0121403.02.13741870541.38397137Habitatges d’Ús Turístic

4 rows × 21 columns

In [609]:

#CHECK NULL VALUES
df_hut_coordinates3[(df_hut_coordinates3['neighbourhood_name_hut'].isnull()) | (df_hut_coordinates3['neighbourhood_name_hut']== None) 
                    | (df_hut_coordinates3['neighbourhood_name_hut']=='nan') | (df_hut_coordinates3['neighbourhood_name_hut']=='')].shape[0]

Out[609]:

9

In [610]:

#CHECK IF NEIGHBOURHOOD CODES ARE IN NEIGHBOURHOOD TABLE
df_hut_coordinates3['neighbourhood_code_hut'].isin(df_district_neighbourhood_table['Neighbourhood_Code']).value_counts()

Out[610]:

True     9405
False       4
Name: neighbourhood_code_hut, dtype: int64

In [611]:

#CHECK IF NEIGHBOURHOOD NAMES ARE IN NEIGHBOURHOOD TABLE
df_hut_coordinates3['neighbourhood_name_hut'].isin(df_district_neighbourhood_table['Neighbourhood_Name']).value_counts()

Out[611]:

True     8910
False     499
Name: neighbourhood_name_hut, dtype: int64

In [612]:

#CHECK WHICH NEIGHBOURHOOD CODES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_hut_coordinates3[(~df_hut_coordinates3['neighbourhood_code_hut'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])) 
                    & df_hut_coordinates3['neighbourhood_code_hut'].notnull()].value_counts('neighbourhood_code_hut')

Out[612]:

Series([], dtype: int64)

In [613]:

#CHECK WHICH NEIGHBOURHOOD CODES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_hut_coordinates3[df_hut_coordinates3['neighbourhood_code_hut'].isnull()]

Out[613]:

n_practice_hutdistrict_code_hutdistrict_name_hutneighbourhood_code_hutneighbourhood_name_hutstreet_type_hutstreet_hutstreet_number_1_hutstreet_letter_1_hutstreet_number_2_hutblock_hutentrance_hutstair_hutfloor_hutdoor_hutrtc_hutn_places_hutlongitude_hutlatitude_hutname_hut
755610-2013-046710SANT MARTINoneel PoblenounanPIQUER29None37NoneNoneNoneNone37HUTB-00763411.02.2067670241.40025623Habitatges d’Ús Turístic
940106-2012-028606GRACIANonenannanSANT SALVADOR20NoneNoneNoneNoneNone21HUTB-0023346.02.15235733941.40509017Habitatges d’Ús Turístic
940610-2014-047510SANT MARTINonenannanMERIDIANA109NoneNoneNoneNoneNone31HUTB-0113594.02.18546192741.40601944Habitatges d’Ús Turístic
940804-2014-021604LES CORTSNonenannanTAQUIGRAF SERRA1NoneNoneNoneNoneNone25HUTB-0121403.02.13741870541.38397137Habitatges d’Ús Turístic

4 rows × 21 columns

In [614]:

#FIND MATCHING RECORDS IN NEIGHBOURHOOD TABLE
df_district_neighbourhood_table[df_district_neighbourhood_table['Neighbourhood_Name'].str.contains('Poblenou')]

Out[614]:

District_CodeDistrict_NameNeighbourhood_CodeNeighbourhood_Name
6510Sant Martí66el Parc i la Llacuna del Poblenou
6610Sant Martí67la Vila Olímpica del Poblenou
6710Sant Martí68el Poblenou
6810Sant Martí69Diagonal Mar i el Front Marítim del Poblenou
7010Sant Martí71Provençals del Poblenou

In [615]:

df_hut_coordinates4 = df_hut_coordinates3.copy()

In [616]:

#REPLACE VALUES
df_hut_coordinates4.loc[df_hut_coordinates4['neighbourhood_name_hut']=='el Poblenou','neighbourhood_code_hut'] = '68'

In [617]:

#CHECK HOW MANY NEIGHBOURHOOD CODES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_hut_coordinates4[(~df_hut_coordinates4['neighbourhood_code_hut'].isin(df_district_neighbourhood_table['Neighbourhood_Code']))]

Out[617]:

n_practice_hutdistrict_code_hutdistrict_name_hutneighbourhood_code_hutneighbourhood_name_hutstreet_type_hutstreet_hutstreet_number_1_hutstreet_letter_1_hutstreet_number_2_hutblock_hutentrance_hutstair_hutfloor_hutdoor_hutrtc_hutn_places_hutlongitude_hutlatitude_hutname_hut
940106-2012-028606GRACIANonenannanSANT SALVADOR20NoneNoneNoneNoneNone21HUTB-0023346.02.15235733941.40509017Habitatges d’Ús Turístic
940610-2014-047510SANT MARTINonenannanMERIDIANA109NoneNoneNoneNoneNone31HUTB-0113594.02.18546192741.40601944Habitatges d’Ús Turístic
940804-2014-021604LES CORTSNonenannanTAQUIGRAF SERRA1NoneNoneNoneNoneNone25HUTB-0121403.02.13741870541.38397137Habitatges d’Ús Turístic

3 rows × 21 columns

MANUAL INPUTATION AS MISSING RECORDS ARE ONLY 3
AV MERIDIANA 109 – neighbourhood_name: el Clot neighbourhood_code: 65
C SANT SALVADOR 20 – neighbourhood_name: la Vila de Gràcia neighbourhood_code: 31
C TAQUIGRAF SERRA 1 – neighbourhood_name: les Corts neighbourhood_code: 19

SOURCE:
https://ajuntament.barcelona.cat/estadistica/catala/Territori/div84/convertidors/barris73.htm
In [618]:

#FILL IN MISSING VALUES
df_hut_coordinates4.loc[df_hut_coordinates4['n_practice_hut']=='10-2014-0475','neighbourhood_code_hut'] = '65'
df_hut_coordinates4.loc[df_hut_coordinates4['n_practice_hut']=='10-2014-0475','neighbourhood_name_hut'] = 'el Clot'
df_hut_coordinates4.loc[df_hut_coordinates4['n_practice_hut']=='06-2012-0286','neighbourhood_code_hut'] = '31'
df_hut_coordinates4.loc[df_hut_coordinates4['n_practice_hut']=='06-2012-0286','neighbourhood_name_hut'] = 'la Vila de Gràcia'
df_hut_coordinates4.loc[df_hut_coordinates4['n_practice_hut']=='04-2014-0216','neighbourhood_code_hut'] = '19'
df_hut_coordinates4.loc[df_hut_coordinates4['n_practice_hut']=='04-2014-0216','neighbourhood_name_hut'] = 'les Corts'

In [619]:

#CHECK HOW MANY NEIGHBOURHOOD CODES ARE NOT IN THE NEIGHBOURHOOD TABLE
df_hut_coordinates4[(~df_hut_coordinates4['neighbourhood_code_hut'].isin(df_district_neighbourhood_table['Neighbourhood_Code']))]

Out[619]:

n_practice_hutdistrict_code_hutdistrict_name_hutneighbourhood_code_hutneighbourhood_name_hutstreet_type_hutstreet_hutstreet_number_1_hutstreet_letter_1_hutstreet_number_2_hutblock_hutentrance_hutstair_hutfloor_hutdoor_hutrtc_hutn_places_hutlongitude_hutlatitude_hutname_hut

0 rows × 21 columns

In [620]:

#MAPPING/REPLACING REQUIRES NO NULL VALUES IN COLUMN LINKED TO set_index COLUMN
df_hut_coordinates4['neighbourhood_name_hut'] = df_hut_coordinates4['neighbourhood_code_hut'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['Neighbourhood_Name'])
df_hut_coordinates4['district_code_hut'] = df_hut_coordinates4['neighbourhood_code_hut'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Code'])
df_hut_coordinates4['district_name_hut'] = df_hut_coordinates4['neighbourhood_code_hut'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Name'])

In [621]:

#CHECK NULL VALUES
df_hut_coordinates4[(df_hut_coordinates4['neighbourhood_code_hut'].isnull()) | (df_hut_coordinates4['neighbourhood_code_hut']== None) 
                    | (df_hut_coordinates4['neighbourhood_code_hut']=='nan') | (df_hut_coordinates4['neighbourhood_code_hut']=='')]

Out[621]:

n_practice_hutdistrict_code_hutdistrict_name_hutneighbourhood_code_hutneighbourhood_name_hutstreet_type_hutstreet_hutstreet_number_1_hutstreet_letter_1_hutstreet_number_2_hutblock_hutentrance_hutstair_hutfloor_hutdoor_hutrtc_hutn_places_hutlongitude_hutlatitude_hutname_hut

0 rows × 21 columns

In [622]:

#CHECK NULL VALUES
df_hut_coordinates4[(df_hut_coordinates4['neighbourhood_name_hut'].isnull()) | (df_hut_coordinates4['neighbourhood_name_hut']== None) 
                    | (df_hut_coordinates4['neighbourhood_name_hut']=='nan') | (df_hut_coordinates4['neighbourhood_name_hut']=='')]

Out[622]:

n_practice_hutdistrict_code_hutdistrict_name_hutneighbourhood_code_hutneighbourhood_name_hutstreet_type_hutstreet_hutstreet_number_1_hutstreet_letter_1_hutstreet_number_2_hutblock_hutentrance_hutstair_hutfloor_hutdoor_hutrtc_hutn_places_hutlongitude_hutlatitude_hutname_hut

0 rows × 21 columns

In [623]:

df_hut_coordinates4[['district_code_hut','district_name_hut','neighbourhood_code_hut','neighbourhood_name_hut']].sort_values('neighbourhood_code_hut')

Out[623]:

district_code_hutdistrict_name_hutneighbourhood_code_hutneighbourhood_name_hut
22401Ciutat Vella01el Raval
722801Ciutat Vella01el Raval
722901Ciutat Vella01el Raval
723001Ciutat Vella01el Raval
723101Ciutat Vella01el Raval
439510Sant Martí73la Verneda i la Pau
439610Sant Martí73la Verneda i la Pau
439710Sant Martí73la Verneda i la Pau
439310Sant Martí73la Verneda i la Pau
439110Sant Martí73la Verneda i la Pau

9409 rows × 4 columns

In [624]:

df_hut_coordinates5 = df_hut_coordinates4.copy()

NORMALIZATION

In [625]:

df_hut_coordinates5.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9409 entries, 0 to 9408
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   n_practice_hut          9409 non-null   object 
 1   district_code_hut       9409 non-null   object 
 2   district_name_hut       9409 non-null   object 
 3   neighbourhood_code_hut  9409 non-null   object 
 4   neighbourhood_name_hut  9409 non-null   object 
 5   street_type_hut         9409 non-null   object 
 6   street_hut              9409 non-null   object 
 7   street_number_1_hut     9409 non-null   object 
 8   street_letter_1_hut     116 non-null    object 
 9   street_number_2_hut     747 non-null    object 
 10  street_letter_2_hut     3 non-null      object 
 11  block_hut               10 non-null     object 
 12  entrance_hut            3 non-null      object 
 13  stair_hut               689 non-null    object 
 14  floor_hut               9378 non-null   object 
 15  door_hut                8528 non-null   object 
 16  rtc_hut                 9361 non-null   object 
 17  n_places_hut            9396 non-null   float64
 18  longitude_hut           9409 non-null   object 
 19  latitude_hut            9409 non-null   object 
 20  name_hut                9409 non-null   object 
dtypes: float64(1), object(20)
memory usage: 1.5+ MB

In [626]:

#CHECK 'street_number_1'
df_hut_coordinates5[df_hut_coordinates5['street_number_1_hut'].str.isdecimal()==False].value_counts('street_number_1_hut') 

Out[626]:

Series([], dtype: int64)

In [627]:

#CHECK 'street_number_2'
df_hut_coordinates5[df_hut_coordinates5['street_number_2_hut'].str.isdecimal()==False].value_counts('street_number_2_hut')

Out[627]:

Series([], dtype: int64)

In [628]:

df_hut = df_hut_coordinates5.copy()

DATAFRAME: COORDINATES FOR HOTELS

SOURCE:

https://opendata-ajuntament.barcelona.cat/data/en/dataset/allotjaments-hotels/resource/9bccce1b-0b9d-4cc6-94a7-459cb99450de

In [629]:

R_ID_HOTEL = '9bccce1b-0b9d-4cc6-94a7-459cb99450de'
url_hotel = 'https://opendata-ajuntament.barcelona.cat/data/api/action/datastore_search_sql?sql=SELECT%20*%20from%20%22{}%22'.format(R_ID_HOTEL)
response_hotel = requests.get(url_hotel)
if response_hotel.ok:
    data_hotel = response_hotel.json()
    result_hotel = data_hotel.get('result')
    records_hotel = result_hotel.get('records')
else:
    print('Problem with: ', url_hotel)
df_hotel_coordinates = pd.DataFrame.from_dict(records_hotel)
df_hotel_coordinates.head(1)

Out[629]:

addresses_roadtype_nameaddresses_end_street_numberinstitution_namevalues_attribute_nameaddresses_road_namevalues_categoryaddresses_zip_codesecondary_filters_idvalues_valueaddresses_towngeo_epgs_25831_y_full_textmodifiedsecondary_filters_asia_idsecondary_filters_fullpathvalues_description_idaddresses_neighborhood_namevalues_outstandingvalues_attribute_id
0NoneCentraletaC RamblaTelèfons800254731072933010872BARCELONA4581844.702237997‘+02’:36,48 ‘-000480’:6 ‘-09’:31,43 ‘-17’:32,4…2022-09-17T02:41:22.07491565103001000004Planol BCN >> Allotjament >> Hotels >> Hotels …1el Barri GòticTrue20003

1 rows × 38 columns

In [630]:

df_hotel_coordinates1 = df_hotel_coordinates.copy()

EXTRACT RELEVANT DATA

In [631]:

#REMOVE UNNECESSARY COLUMNS
df_hotel_coordinates1.columns

Out[631]:

Index(['addresses_roadtype_name', 'addresses_end_street_number',
       'institution_name', 'values_attribute_name', 'addresses_road_name',
       'values_category', 'addresses_zip_code', 'secondary_filters_id',
       'values_value', 'addresses_town', 'geo_epgs_4326_y', 'geo_epgs_4326_x',
       'secondary_filters_name', 'secondary_filters_tree',
       'addresses_district_name', 'geo_epgs_25831_x',
       'addresses_start_street_number', 'register_id', 'institution_id',
       'addresses_main_address', 'addresses_district_id',
       'addresses_roadtype_id', 'addresses_type', 'addresses_neighborhood_id',
       'values_id', 'name', 'addresses_road_id', 'created', 'geo_epgs_25831_y',
       '_full_text', 'modified', 'secondary_filters_asia_id',
       'secondary_filters_fullpath', 'values_description', '_id',
       'addresses_neighborhood_name', 'values_outstanding',
       'values_attribute_id'],
      dtype='object')

In [632]:

#CHECK BEFORE ELIMINATING COLUMNS
df_hotel_coordinates1[['created','modified','institution_name','institution_id','_full_text',
          'values_id','values_attribute_name','values_category','values_value',
          'values_outstanding','values_attribute_id','values_description']].head(1)

Out[632]:

createdmodifiedinstitution_nameinstitution_id_full_textvalues_idvalues_attribute_namevalues_categoryvalues_valuevalues_outstandingvalues_attribute_idvalues_description
01996-09-17T00:00:002022-09-17T02:41:22.074915‘+02’:36,48 ‘-000480’:6 ‘-09’:31,43 ‘-17’:32,4…136360CentraletaTelèfons933010872True20003

In [633]:

#CHECK BEFORE ELIMINATING COLUMNS
df_hotel_coordinates1[['_id','secondary_filters_asia_id','secondary_filters_fullpath','secondary_filters_tree',
          'addresses_roadtype_name','addresses_main_address', 'addresses_type', 'addresses_road_id', 
          'addresses_roadtype_id','geo_epgs_25831_x', 'geo_epgs_25831_y']].head(1)

Out[633]:

_idsecondary_filters_asia_idsecondary_filters_fullpathsecondary_filters_treeaddresses_roadtype_nameaddresses_main_addressaddresses_typeaddresses_road_idaddresses_roadtype_idgeo_epgs_25831_xgeo_epgs_25831_y
0165103001000004Planol BCN >> Allotjament >> Hotels >> Hotels …651True34308430656.7440331814581844.702237997

In [634]:

#CHECK BEFORE ELIMINATING COLUMNS
df_hotel_coordinates1[['addresses_zip_code','secondary_filters_id',
          'addresses_town','register_id']].head(1)

Out[634]:

addresses_zip_codesecondary_filters_idaddresses_townregister_id
0800254731072BARCELONA75990025172

In [635]:

#DROP COLUMNS
df_hotel_coordinates1.drop(columns=['created','modified','institution_name','institution_id','_full_text',
                       'values_id','values_attribute_name','values_category','values_value',
                       'values_outstanding','values_attribute_id','values_description','_id',
                       'secondary_filters_asia_id','secondary_filters_fullpath',
                       'secondary_filters_tree','addresses_roadtype_name','addresses_main_address', 
                       'addresses_type', 'addresses_road_id', 'addresses_roadtype_id','geo_epgs_25831_x', 
                       'geo_epgs_25831_y','addresses_zip_code',
                       'secondary_filters_id','addresses_town','register_id'], inplace=True)
df_hotel_coordinates1.head(1)

Out[635]:

addresses_end_street_numberaddresses_road_namegeo_epgs_4326_ygeo_epgs_4326_xsecondary_filters_nameaddresses_district_nameaddresses_start_street_numberaddresses_district_idaddresses_neighborhood_idnameaddresses_neighborhood_name
0NoneC Rambla2.17063883139540341.38514182378773Hotels 1 estr.Ciutat Vella13812Hotel Toledano – HB-000480el Barri Gòtic

In [636]:

df_hotel_coordinates2 = df_hotel_coordinates1.copy()

In [637]:

#RENAME COLUMNS
df_hotel_coordinates2.columns = ['street_number_2_hotel','address_hotel','longitude_hotel','latitude_hotel','category_hotel','district_name_hotel','street_number_1_hotel',
                                 'district_code_hotel','neighbourhood_code_hotel','name_hotel','neighbourhood_name_hotel']
df_hotel_coordinates2.head(1)

Out[637]:

street_number_2_hoteladdress_hotellongitude_hotellatitude_hotelcategory_hoteldistrict_name_hotelstreet_number_1_hoteldistrict_code_hotelneighbourhood_code_hotelname_hotelneighbourhood_name_hotel
0NoneC Rambla2.17063883139540341.38514182378773Hotels 1 estr.Ciutat Vella13812Hotel Toledano – HB-000480el Barri Gòtic

In [638]:

#FUNCTION TO EXTRACT ID COLUMN
def rtc_split_hotel(arg):
    if "HB-" in arg:
        return arg.split("HB-", 1) # 1 : to split at the first found only
    else:
        return [arg, None] # None : to add a Null value when split character not found and so preserve the same column length

In [639]:

#EXTRACT ID COLUMN RTC FROM NAME COLUMN
df_hotel_coordinates2[['name_hotel','rtc_hotel']] = [rtc_split_hotel(x) for x in df_hotel_coordinates2['name_hotel']]
df_hotel_coordinates2['rtc_hotel'] = 'HB-' + df_hotel_coordinates2['rtc_hotel']
df_hotel_coordinates2['rtc_hotel'] = df_hotel_coordinates2['rtc_hotel'].str.strip().replace(' ','')
df_hotel_coordinates2['name_hotel'] = df_hotel_coordinates2['name_hotel'].str.replace('-',' ').str.strip()
df_hotel_coordinates2.head(1)

Out[639]:

street_number_2_hoteladdress_hotellongitude_hotellatitude_hotelcategory_hoteldistrict_name_hotelstreet_number_1_hoteldistrict_code_hotelneighbourhood_code_hotelname_hotelneighbourhood_name_hotelrtc_hotel
0NoneC Rambla2.17063883139540341.38514182378773Hotels 1 estr.Ciutat Vella13812Hotel Toledanoel Barri GòticHB-000480

In [640]:

df_hotel_coordinates3 = df_hotel_coordinates2.copy()
WHITESPACES

In [641]:

#REMOVING SPACES
# .replace(' ','', regex=True) - replace all spaces with nothing
# .str.strip() - replace 1 initial and 1 trailing space only
# .replace(r's+',' ', regex=True) - replace multiple spaces with one single space
# .replace(r'^s+|s+$','',regex=True) - replace all + spaces s starting ^ and trailing $
# .replace('nan','', regex=True) - replace pre-existing 'nan' strings into empty cells - not to be used for string columns potentially containing nan as subpart of string
# .replace('.0','',regex=True) - replace .0 with nothing - '' is required to assign '.' as a normal character and not as a special one
df_hotel_coordinates3['rtc_hotel'] = df_hotel_coordinates3['rtc_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hotel_coordinates3['address_hotel'] = df_hotel_coordinates3['address_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_hotel_coordinates3['street_number_1_hotel'] = df_hotel_coordinates3['street_number_1_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hotel_coordinates3['street_number_2_hotel'] = df_hotel_coordinates3['street_number_2_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hotel_coordinates3['district_code_hotel'] = df_hotel_coordinates3['district_code_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True)
df_hotel_coordinates3['neighbourhood_code_hotel'] = df_hotel_coordinates3['neighbourhood_code_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True)
df_hotel_coordinates3['neighbourhood_name_hotel'] = df_hotel_coordinates3['neighbourhood_name_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_hotel_coordinates3['longitude_hotel'] = df_hotel_coordinates3['longitude_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hotel_coordinates3['latitude_hotel'] = df_hotel_coordinates3['latitude_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hotel_coordinates3['name_hotel'] = df_hotel_coordinates3['name_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)

In [642]:

#DISTRICT AND NEIGHBOURHOOD CODES NEED TO BE IN STRING FORMAT AND REQUIRE AN ADDED '0' IN FRONT OF ALL NUMBERS BELOW 10
df_hotel_coordinates3[['district_code_hotel']] = df_hotel_coordinates3[['district_code_hotel']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
df_hotel_coordinates3[['neighbourhood_code_hotel']] = df_hotel_coordinates3[['neighbourhood_code_hotel']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])

In [643]:

#REPLACE CELL THAT IS ENTIRELY SPACE OR EMPTY with None
df_hotel_coordinates3 = df_hotel_coordinates3.applymap(lambda x: None if isinstance(x, str) and (x=='' or x.isspace()) else x)

In [644]:

df_hotel_coordinates4 = df_hotel_coordinates3.copy()

DUPLICATES

In [645]:

#QUICK CHECK FOR DUPLICATES
df_hotel_coordinates4.duplicated().value_counts()

Out[645]:

False    441
dtype: int64

In [646]:

#VERIFY ID COLUMN: 'rtc_hotel'
df_hotel_coordinates4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 441 entries, 0 to 440
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   street_number_2_hotel     13 non-null     object
 1   address_hotel             441 non-null    object
 2   longitude_hotel           441 non-null    object
 3   latitude_hotel            441 non-null    object
 4   category_hotel            441 non-null    object
 5   district_name_hotel       441 non-null    object
 6   street_number_1_hotel     439 non-null    object
 7   district_code_hotel       441 non-null    object
 8   neighbourhood_code_hotel  441 non-null    object
 9   name_hotel                441 non-null    object
 10  neighbourhood_name_hotel  441 non-null    object
 11  rtc_hotel                 440 non-null    object
dtypes: object(12)
memory usage: 41.5+ KB

In [647]:

#CHECK DUPLICATES ON ID COLUMN: 'rtc_hotel'
df_hotel_coordinates4[df_hotel_coordinates4.duplicated(subset='rtc_hotel', keep=False)]

Out[647]:

street_number_2_hoteladdress_hotellongitude_hotellatitude_hotelcategory_hoteldistrict_name_hotelstreet_number_1_hoteldistrict_code_hotelneighbourhood_code_hotelname_hotelneighbourhood_name_hotelrtc_hotel

MISSING VALUES

In [648]:

df_hotel_coordinates4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 441 entries, 0 to 440
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   street_number_2_hotel     13 non-null     object
 1   address_hotel             441 non-null    object
 2   longitude_hotel           441 non-null    object
 3   latitude_hotel            441 non-null    object
 4   category_hotel            441 non-null    object
 5   district_name_hotel       441 non-null    object
 6   street_number_1_hotel     439 non-null    object
 7   district_code_hotel       441 non-null    object
 8   neighbourhood_code_hotel  441 non-null    object
 9   name_hotel                441 non-null    object
 10  neighbourhood_name_hotel  441 non-null    object
 11  rtc_hotel                 440 non-null    object
dtypes: object(12)
memory usage: 41.5+ KB
RTC

In [649]:

#IDENTIFY NULL VALUES IN ID COLUMN: 'rtc_hotel'
df_hotel_coordinates4[(df_hotel_coordinates4['rtc_hotel'].isnull()) 
                      | (df_hotel_coordinates4['rtc_hotel']=='nan') 
                      |  (df_hotel_coordinates4['rtc_hotel']==None) 
                      |  (df_hotel_coordinates4['rtc_hotel']=='')]

Out[649]:

street_number_2_hoteladdress_hotellongitude_hotellatitude_hotelcategory_hoteldistrict_name_hotelstreet_number_1_hoteldistrict_code_hotelneighbourhood_code_hotelname_hotelneighbourhood_name_hotelrtc_hotel
14784Ronda de Sant Antoni2.16397494742350241.38376584459594Hotels 4 estr.Ciutat Vella840101Hotel Antiga Casa Buenavistael RavalNone
DISTRICT – NEIGHBOURHOOD

In [650]:

#IDENTIFY NULL VALUES
df_hotel_coordinates4[(df_hotel_coordinates4['neighbourhood_code_hotel'].isnull()) 
                      | (df_hotel_coordinates4['neighbourhood_code_hotel']=='nan') 
                      |  (df_hotel_coordinates4['neighbourhood_code_hotel']==None) 
                      |  (df_hotel_coordinates4['neighbourhood_code_hotel']=='')].shape[0]

Out[650]:

0

In [651]:

#IDENTIFY NULL VALUES
df_hotel_coordinates4[(df_hotel_coordinates4['neighbourhood_name_hotel'].isnull()) 
                      | (df_hotel_coordinates4['neighbourhood_name_hotel']=='nan') 
                      |  (df_hotel_coordinates4['neighbourhood_name_hotel']==None) 
                      |  (df_hotel_coordinates4['neighbourhood_name_hotel']=='')].shape[0]

Out[651]:

0

In [652]:

#CHECK HOW MANY NEIGHBOURHOOD CODES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_hotel_coordinates4[~df_hotel_coordinates4['neighbourhood_code_hotel'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])].shape[0]

Out[652]:

0

In [653]:

#CHECK HOW MANY NEIGHBOURHOOD NAMES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_hotel_coordinates4[~df_hotel_coordinates4['neighbourhood_name_hotel'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]

Out[653]:

0

In [654]:

#MAPPING/REPLACING REQUIRES NO NULL VALUES IN COLUMN LINKED TO set_index COLUMN
df_hotel_coordinates4['neighbourhood_name_hotel'] = df_hotel_coordinates4['neighbourhood_code_hotel'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['Neighbourhood_Name'])
df_hotel_coordinates4['district_code_hotel'] = df_hotel_coordinates4['neighbourhood_code_hotel'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Code'])
df_hotel_coordinates4['district_name_hotel'] = df_hotel_coordinates4['neighbourhood_code_hotel'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Name'])

In [655]:

#IDENTIFY NULL VALUES
df_hotel_coordinates4[(df_hotel_coordinates4['neighbourhood_code_hotel'].isnull()) 
                      | (df_hotel_coordinates4['neighbourhood_code_hotel']=='nan') 
                      |  (df_hotel_coordinates4['neighbourhood_code_hotel']==None) 
                      |  (df_hotel_coordinates4['neighbourhood_code_hotel']=='')].shape[0]

Out[655]:

0

In [656]:

#CHECK HOW MANY NEIGHBOURHOOD CODES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_hotel_coordinates4[~df_hotel_coordinates4['neighbourhood_code_hotel'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])].shape[0]

Out[656]:

0

In [657]:

df_hotel_coordinates4.head(1)

Out[657]:

street_number_2_hoteladdress_hotellongitude_hotellatitude_hotelcategory_hoteldistrict_name_hotelstreet_number_1_hoteldistrict_code_hotelneighbourhood_code_hotelname_hotelneighbourhood_name_hotelrtc_hotel
0NoneC Rambla2.17063883139540341.38514182378773Hotels 1 estr.Ciutat Vella1380102Hotel Toledanoel Barri GòticHB-000480

In [658]:

df_hotel_coordinates5 = df_hotel_coordinates4.copy()

NORMALIZATION

In [659]:

#CHECK 'street_number_1'
df_hotel_coordinates5[df_hotel_coordinates5['street_number_1_hotel'].str.isdecimal()==False].value_counts('street_number_1_hotel') 

Out[659]:

Series([], dtype: int64)

In [660]:

#CHECK 'street_number_2'
df_hotel_coordinates5[df_hotel_coordinates5['street_number_2_hotel'].str.isdecimal()==False].value_counts('street_number_2_hotel')

Out[660]:

Series([], dtype: int64)

In [661]:

df_hotel_coordinates5['category_hotel'].value_counts()

Out[661]:

Hotels 4 estr.    187
Hotels 3 estr.    121
Hotels 2 estr.     50
Hotels 5 estr.     44
Hotels 1 estr.     39
Name: category_hotel, dtype: int64

In [662]:

df_hotel = df_hotel_coordinates5.copy()

In [663]:

df_hotel.head(1)

Out[663]:

street_number_2_hoteladdress_hotellongitude_hotellatitude_hotelcategory_hoteldistrict_name_hotelstreet_number_1_hoteldistrict_code_hotelneighbourhood_code_hotelname_hotelneighbourhood_name_hotelrtc_hotel
0NoneC Rambla2.17063883139540341.38514182378773Hotels 1 estr.Ciutat Vella1380102Hotel Toledanoel Barri GòticHB-000480

DATAFRAME: COORDINATES FOR HOSTELS

SOURCE:

https://opendata-ajuntament.barcelona.cat/data/en/dataset/allotjaments-pensions

In [664]:

url_hostel = 'https://www.bcn.cat/tercerlloc/files/allotjament/opendatabcn_allotjament_pensions-js.json'
response_hostel = requests.get(url_hostel)
if response_hostel.ok:
    data_hostel = response_hostel.json()
else:
    print('Problem with: ', url_hostel)
df_hostel_coordinates = pd.DataFrame.from_dict(data_hostel)

In [665]:

df_hostel_coordinates1 = df_hostel_coordinates.copy()

EXTRACT RELEVANT DATA

In [666]:

df_hostel_coordinates1.columns

Out[666]:

Index(['register_id', 'prefix', 'suffix', 'name', 'created', 'modified',
       'status', 'status_name', 'core_type', 'core_type_name', 'body',
       'tickets_data', 'addresses', 'entity_types_data',
       'attribute_categories', 'values', 'from_relationships',
       'to_relationships', 'classifications_data', 'secondary_filters_data',
       'timetable', 'image_data', 'gallery_data', 'warnings', 'geo_epgs_25831',
       'geo_epgs_23031', 'geo_epgs_4326', 'is_section_of_data',
       'sections_data', 'start_date', 'end_date', 'estimated_dates',
       'languages_data', 'type', 'type_name', 'period', 'period_name',
       'event_status_name', 'event_status', 'ical'],
      dtype='object')

In [667]:

#CHECK BEFORE DROPPING
df_hostel_coordinates1[['register_id', 'prefix', 'suffix', 'created', 'modified', 'status', 
           'status_name', 'core_type', 'core_type_name', 'body', 'tickets_data', 
           'from_relationships', 'to_relationships', 'timetable', 'image_data']].head(1)

Out[667]:

register_idprefixsuffixcreatedmodifiedstatusstatus_namecore_typecore_type_namebodytickets_datafrom_relationshipsto_relationshipstimetableimage_data
01166132202NoneNone2001-06-15T00:00:00+02:002022-09-17T02:23:43.565716+02:00publishedPublicatplaceEquipamentNone[][][]NoneNone

In [668]:

#CHECK BEFORE DROPPING
df_hostel_coordinates1[['gallery_data', 'warnings', 'geo_epgs_25831', 'geo_epgs_23031', 
           'is_section_of_data', 'sections_data', 'start_date', 'end_date', 
           'estimated_dates', 'languages_data', 'type', 'type_name', 'period', 
           'period_name', 'event_status_name', 'event_status', 'ical']].head(1)

Out[668]:

gallery_datawarningsgeo_epgs_25831geo_epgs_23031is_section_of_datasections_datastart_dateend_dateestimated_dateslanguages_datatypetype_nameperiodperiod_nameevent_status_nameevent_statusical
0[][]{‘x’: 431059.3968618301, ‘y’: 4583099.462077796}{‘x’: 431153.93913180125, ‘y’: 4583304.042183909}None[]NoneNoneNoneNoneNoneNoneNoneNoneNoneNoneBEGIN:VCALENDARrnPRODID:ics.py – http://git….

In [669]:

#CHECK BEFORE DROPPING
df_hostel_coordinates1[['secondary_filters_data','entity_types_data','attribute_categories','values']].head(1)

Out[669]:

secondary_filters_dataentity_types_dataattribute_categoriesvalues
0[{‘id’: 57245924, ‘name’: ’03. Hotels, pension…[{‘id’: 102, ‘name’: ‘equipament’}, {‘id’: 100…[{‘id’: 2, ‘name’: ‘Informació d’interès’, ‘…[{‘id’: 33979, ‘value’: ‘hello@hostalin.com’, …

In [670]:

#DROP COLUMNS
df_hostel_coordinates1.drop(columns=['register_id', 'prefix', 'suffix', 'created', 'modified', 
                        'status', 'status_name', 'core_type', 'core_type_name', 'body', 
                        'tickets_data', 'from_relationships', 'to_relationships', 'timetable', 
                        'image_data', 'gallery_data', 'warnings', 'geo_epgs_25831',
                        'geo_epgs_23031', 'is_section_of_data', 'sections_data', 'start_date', 
                        'end_date', 'estimated_dates','languages_data', 'type', 'type_name', 
                        'period', 'period_name','event_status_name', 'event_status', 'ical',
                        'secondary_filters_data','entity_types_data','attribute_categories','values'], inplace=True)
df_hostel_coordinates1.head(1)

Out[670]:

nameaddressesclassifications_datageo_epgs_4326
0Hostal Hostalin Barcelona Diputació – HB-004497[{‘place’: None, ‘district_name’: ‘Eixample’, …[{‘id’: 1003005, ‘name’: ‘Pensions, hostals’, …{‘x’: 41.3964776648101, ‘y’: 2.175311353516649}

In [671]:

df_hostel_coordinates2 = df_hostel_coordinates1.copy()

In [672]:

#FUNCTION TO SPLIT COORDINATES
def split_coordinates(arg):
    if "," in arg:
        return arg.split(",",1) # 1 : to split at the first found only
    else:
        return (arg, None) # None : to add a Null value when split character not found and so preserve the same column length

In [673]:

#SPLIT AND RENAME COLUMNS
df_hostel_coordinates2['geo_epgs_4326'] = df_hostel_coordinates2['geo_epgs_4326'].astype(str)
df_hostel_coordinates2[['latitude_hostel','longitude_hostel']] = [split_coordinates(x) for x in df_hostel_coordinates2['geo_epgs_4326']]
df_hostel_coordinates2['latitude_hostel'] = df_hostel_coordinates2['latitude_hostel'].str.replace("{'x': ",'',regex=True)
df_hostel_coordinates2['longitude_hostel'] = df_hostel_coordinates2['longitude_hostel'].str.replace("'y': ",'',regex=True).replace('}','', regex=True)
df_hostel_coordinates2.drop(columns=['geo_epgs_4326'],inplace=True)
df_hostel_coordinates2.head(1)

Out[673]:

nameaddressesclassifications_datalatitude_hostellongitude_hostel
0Hostal Hostalin Barcelona Diputació – HB-004497[{‘place’: None, ‘district_name’: ‘Eixample’, …[{‘id’: 1003005, ‘name’: ‘Pensions, hostals’, …41.39647766481012.175311353516649

In [674]:

df_hostel_coordinates2.loc[0,'classifications_data']

Out[674]:

[{'id': 1003005,
  'name': 'Pensions, hostals',
  'full_path': 'Tipologia EQ >> Allotjament >> Pensions, hostals',
  'dependency_group': 3033964,
  'parent_id': 1003,
  'tree_id': 1,
  'asia_id': '0000102003005',
  'core_type': 'place',
  'level': 2},
 {'id': 108215,
  'name': 'Gay Friendly ',
  'full_path': 'Col·lectius EQ >> Gay Friendly ',
  'dependency_group': 3033964,
  'parent_id': 108,
  'tree_id': 108,
  'asia_id': '0010801215',
  'core_type': 'place',
  'level': 1}]

In [675]:

classifications_data  = []
for i in df_hostel_coordinates2['classifications_data']:
        c = list([x.get('name') for x in i])[0] #to get first item from dictionary with 'name' as key
        classifications_data.append(c)
df_hostel_coordinates2['category_hostel'] = classifications_data
df_hostel_coordinates2.drop(columns='classifications_data', inplace=True)
df_hostel_coordinates2.head(1)

Out[675]:

nameaddresseslatitude_hostellongitude_hostelcategory_hostel
0Hostal Hostalin Barcelona Diputació – HB-004497[{‘place’: None, ‘district_name’: ‘Eixample’, …41.39647766481012.175311353516649Pensions, hostals

In [676]:

df_hostel_coordinates2.loc[0,'addresses']

Out[676]:

[{'place': None,
  'district_name': 'Eixample',
  'district_id': '02',
  'neighborhood_name': "la Dreta de l'Eixample",
  'neighborhood_id': '07',
  'address_name': 'C Diputació',
  'address_id': '100800',
  'block_id': None,
  'start_street_number': 346,
  'end_street_number': None,
  'street_number_1': '346',
  'street_number_2': None,
  'stairs': None,
  'level': '1r',
  'door': '1a',
  'zip_code': '08013',
  'province': 'BARCELONA',
  'town': 'BARCELONA',
  'country': 'ESPANYA',
  'comments': None,
  'position': 0,
  'main_address': True,
  'road_name': None,
  'road_id': None,
  'roadtype_name': None,
  'roadtype_id': None,
  'location': {'type': 'GeometryCollection',
   'geometries': [{'type': 'Point',
     'coordinates': [431059.3968618301, 4583099.462077796]}]},
  'related_entity': None,
  'related_entity_data': None,
  'hide_address': False}]

In [677]:

district_name = []
district_id = []
neighborhood_name = []
neighborhood_id = []
address_name  = []
street_number_1 = []
street_number_2 = []
for i in df_hostel_coordinates2['addresses']:
        dn = list([x.get('district_name') for x in i])[0]
        dc = list([x.get('district_id') for x in i])[0]
        nn = list([x.get('neighborhood_name') for x in i])[0]
        nc = list([x.get('neighborhood_id') for x in i])[0]
        an = list([x.get('address_name') for x in i])[0]
        sn1 = list([x.get('street_number_1') for x in i])[0]
        sn2 = list([x.get('street_number_2') for x in i])[0]
        district_name.append(dn)
        district_id.append(dc)
        neighborhood_name.append(nn)
        neighborhood_id.append(nc)
        address_name.append(an)
        street_number_1.append(sn1)
        street_number_2.append(sn2)
df_hostel_coordinates2['district_code_hostel'] = district_id
df_hostel_coordinates2['district_name_hostel'] = district_name
df_hostel_coordinates2['neighbourhood_code_hostel'] = neighborhood_id
df_hostel_coordinates2['neighbourhood_name_hostel'] = neighborhood_name
df_hostel_coordinates2['address_hostel'] = address_name
df_hostel_coordinates2['street_number_1_hostel'] = street_number_1
df_hostel_coordinates2['street_number_2_hostel'] = street_number_2
df_hostel_coordinates2.drop(columns='addresses', inplace=True)
df_hostel_coordinates2.head(1)

Out[677]:

namelatitude_hostellongitude_hostelcategory_hosteldistrict_code_hosteldistrict_name_hostelneighbourhood_code_hostelneighbourhood_name_hosteladdress_hostelstreet_number_1_hostelstreet_number_2_hostel
0Hostal Hostalin Barcelona Diputació – HB-00449741.39647766481012.175311353516649Pensions, hostals02Eixample07la Dreta de l’EixampleC Diputació346None

In [678]:

df_hostel_coordinates3 = df_hostel_coordinates2.copy()

In [679]:

#FUNCTION TO EXTRACT ID COLUMN
def rtc_split_hostel(arg):
    if "HB-" in arg:
        return arg.split("HB-", 1) # 1 : to split at the first found only
    else:
        return [arg, None] # None : to add a Null value when split character not found and so preserve the same column length

In [680]:

df_hostel_coordinates3[['name','rtc_hostel']] = [rtc_split_hostel(x) for x in df_hostel_coordinates3['name']]
df_hostel_coordinates3['rtc_hostel'] = 'HB-' + df_hostel_coordinates3['rtc_hostel']
df_hostel_coordinates3['rtc_hostel'] = df_hostel_coordinates3['rtc_hostel'].str.strip().replace(' ','')
df_hostel_coordinates3['name_hostel'] = df_hostel_coordinates3['name'].str.replace('-',' ').str.strip()
df_hostel_coordinates3.drop(columns='name', inplace=True)
df_hostel_coordinates3.head(1)

Out[680]:

latitude_hostellongitude_hostelcategory_hosteldistrict_code_hosteldistrict_name_hostelneighbourhood_code_hostelneighbourhood_name_hosteladdress_hostelstreet_number_1_hostelstreet_number_2_hostelrtc_hostelname_hostel
041.39647766481012.175311353516649Pensions, hostals02Eixample07la Dreta de l’EixampleC Diputació346NoneHB-004497Hostal Hostalin Barcelona Diputació

In [681]:

df_hostel_coordinates4 = df_hostel_coordinates3.copy()
WHITESPACES

In [682]:

#REMOVING SPACES
# .replace(' ','', regex=True) - replace all spaces with nothing
# .str.strip() - replace 1 initial and 1 trailing space only
# .replace(r's+',' ', regex=True) - replace multiple spaces with one single space
# .replace(r'^s+|s+$','',regex=True) - replace all + spaces s starting ^ and trailing $
# .replace('nan','', regex=True) - replace pre-existing 'nan' strings into empty cells - not to be used for string columns potentially containing nan as subpart of string
# .replace('.0','',regex=True) - replace .0 with nothing - '' is required to assign '.' as a normal character and not as a special one
df_hostel_coordinates4['rtc_hostel'] = df_hostel_coordinates4['rtc_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hostel_coordinates4['address_hostel'] = df_hostel_coordinates4['address_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_hostel_coordinates4['street_number_1_hostel'] = df_hostel_coordinates4['street_number_1_hostel'].astype(str).replace(r'','',regex=True).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hostel_coordinates4['street_number_2_hostel'] = df_hostel_coordinates4['street_number_2_hostel'].astype(str).replace(r'','',regex=True).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hostel_coordinates4['district_code_hostel'] = df_hostel_coordinates4['district_code_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True)
df_hostel_coordinates4['neighbourhood_code_hostel'] = df_hostel_coordinates4['neighbourhood_code_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True)
df_hostel_coordinates4['neighbourhood_name_hostel'] = df_hostel_coordinates4['neighbourhood_name_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_hostel_coordinates4['longitude_hostel'] = df_hostel_coordinates4['longitude_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hostel_coordinates4['latitude_hostel'] = df_hostel_coordinates4['latitude_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_hostel_coordinates4['name_hostel'] = df_hostel_coordinates4['name_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)

In [683]:

#DISTRICT AND NEIGHBOURHOOD CODES NEED TO BE IN STRING FORMAT AND REQUIRE AN ADDED '0' IN FRONT OF ALL NUMBERS BELOW 10
df_hostel_coordinates4[['district_code_hostel']] = df_hostel_coordinates4[['district_code_hostel']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
df_hostel_coordinates4[['neighbourhood_code_hostel']] = df_hostel_coordinates4[['neighbourhood_code_hostel']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])

In [684]:

#REPLACE CELL THAT IS ENTIRELY SPACE OR EMPTY with None
df_hostel_coordinates4 = df_hostel_coordinates4.applymap(lambda x: None if isinstance(x, str) and (not x or x.isspace()) else x)

In [685]:

df_hostel_coordinates5 = df_hostel_coordinates4.copy()

DUPLICATES

In [686]:

df_hostel_coordinates5.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   latitude_hostel            244 non-null    object
 1   longitude_hostel           244 non-null    object
 2   category_hostel            244 non-null    object
 3   district_code_hostel       244 non-null    object
 4   district_name_hostel       244 non-null    object
 5   neighbourhood_code_hostel  244 non-null    object
 6   neighbourhood_name_hostel  244 non-null    object
 7   address_hostel             244 non-null    object
 8   street_number_1_hostel     244 non-null    object
 9   street_number_2_hostel     0 non-null      object
 10  rtc_hostel                 242 non-null    object
 11  name_hostel                244 non-null    object
dtypes: object(12)
memory usage: 23.0+ KB

In [687]:

df_hostel_coordinates5.duplicated().value_counts()

Out[687]:

False    243
True       1
dtype: int64

In [688]:

df_hostel_coordinates5[df_hostel_coordinates5.duplicated(keep=False)]

Out[688]:

latitude_hostellongitude_hostelcategory_hosteldistrict_code_hosteldistrict_name_hostelneighbourhood_code_hostelneighbourhood_name_hosteladdress_hostelstreet_number_1_hostelstreet_number_2_hostelrtc_hostelname_hostel
22641.395270249373762.170401163741668Pensions, hostals02Eixample07la Dreta de l’EixampleC Girona81NoneHB-004707Hostal Retrome 2
24241.395270249373762.170401163741668Pensions, hostals02Eixample07la Dreta de l’EixampleC Girona81NoneHB-004707Hostal Retrome 2

In [689]:

df_hostel_coordinates5 = df_hostel_coordinates5.drop_duplicates()

In [690]:

#CHECK DUPLICATES ON ID COLUMN
df_hostel_coordinates5[df_hostel_coordinates5.duplicated(subset='rtc_hostel', keep=False)]

Out[690]:

latitude_hostellongitude_hostelcategory_hosteldistrict_code_hosteldistrict_name_hostelneighbourhood_code_hostelneighbourhood_name_hosteladdress_hostelstreet_number_1_hostelstreet_number_2_hostelrtc_hostelname_hostel
22141.3938880044157142.171481971236943Pensions, hostals02Eixample07la Dreta de l’EixampleC Diputació327NoneNoneHostal Bed & Break
22741.379195783894792.174445287874625Pensions, hostals01Ciutat Vella01el RavalC Nou de la Rambla1NoneNoneHostal Mimi Las Ramblas

In [691]:

df_hostel_coordinates6 = df_hostel_coordinates5.copy()

MISSING VALUES

In [692]:

df_hostel_coordinates6.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 243 entries, 0 to 243
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   latitude_hostel            243 non-null    object
 1   longitude_hostel           243 non-null    object
 2   category_hostel            243 non-null    object
 3   district_code_hostel       243 non-null    object
 4   district_name_hostel       243 non-null    object
 5   neighbourhood_code_hostel  243 non-null    object
 6   neighbourhood_name_hostel  243 non-null    object
 7   address_hostel             243 non-null    object
 8   street_number_1_hostel     243 non-null    object
 9   street_number_2_hostel     0 non-null      object
 10  rtc_hostel                 241 non-null    object
 11  name_hostel                243 non-null    object
dtypes: object(12)
memory usage: 24.7+ KB
RTC

In [693]:

df_hostel_coordinates6[df_hostel_coordinates6['rtc_hostel'].isnull()]

Out[693]:

latitude_hostellongitude_hostelcategory_hosteldistrict_code_hosteldistrict_name_hostelneighbourhood_code_hostelneighbourhood_name_hosteladdress_hostelstreet_number_1_hostelstreet_number_2_hostelrtc_hostelname_hostel
22141.3938880044157142.171481971236943Pensions, hostals02Eixample07la Dreta de l’EixampleC Diputació327NoneNoneHostal Bed & Break
22741.379195783894792.174445287874625Pensions, hostals01Ciutat Vella01el RavalC Nou de la Rambla1NoneNoneHostal Mimi Las Ramblas
DISTRICT – NEIGHBOURHOOD

In [694]:

#IDENTIFY NULL VALUES
df_hostel_coordinates6[(df_hostel_coordinates6['neighbourhood_code_hostel'].isnull()) 
                       | (df_hostel_coordinates6['neighbourhood_code_hostel']=='nan') 
                       |  (df_hostel_coordinates6['neighbourhood_code_hostel']==None) 
                       |  (df_hostel_coordinates6['neighbourhood_code_hostel']=='')].shape[0]

Out[694]:

0

In [695]:

#IDENTIFY NULL VALUES
df_hostel_coordinates6[(df_hostel_coordinates6['neighbourhood_name_hostel'].isnull()) 
                       | (df_hostel_coordinates6['neighbourhood_name_hostel']=='nan') 
                       |  (df_hostel_coordinates6['neighbourhood_name_hostel']==None) 
                       |  (df_hostel_coordinates6['neighbourhood_name_hostel']=='')].shape[0]

Out[695]:

0

In [696]:

#CHECK HOW MANY NEIGHBOURHOOD NAMES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_hostel_coordinates6[~df_hostel_coordinates6['neighbourhood_name_hostel'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]

Out[696]:

61

In [697]:

#CHECK HOW MANY NEIGHBOURHOOD CODES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_hostel_coordinates6[~df_hostel_coordinates6['neighbourhood_code_hostel'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])].shape[0]

Out[697]:

0

In [698]:

#MAPPING/REPLACING REQUIRES NO NULL VALUES IN COLUMN LINKED TO set_index COLUMN
df_hostel_coordinates6['neighbourhood_name_hostel'] = df_hostel_coordinates6['neighbourhood_code_hostel'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['Neighbourhood_Name'])
df_hostel_coordinates6['district_code_hostel'] = df_hostel_coordinates6['neighbourhood_code_hostel'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Code'])
df_hostel_coordinates6['district_name_hostel'] = df_hostel_coordinates6['neighbourhood_code_hostel'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Name'])

In [699]:

#CHECK HOW MANY NEIGHBOURHOOD NAMES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_hostel_coordinates6[~df_hostel_coordinates6['neighbourhood_name_hostel'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]

Out[699]:

0

NORMALIZATION

In [700]:

df_hostel_coordinates7 = df_hostel_coordinates6.copy()

In [701]:

#CHECK 'street_number_1'
df_hostel_coordinates7[df_hostel_coordinates7['street_number_1_hostel'].str.isdecimal()==False].value_counts('street_number_1_hostel') 

Out[701]:

street_number_1_hostel
1*3        4
116*LB     1
11C        1
149*155    1
16*18      1
2*4        1
3*5        1
32*34      1
373*377    1
433*LB     1
56*58      1
77*79      1
8*10       1
83*85      1
95*97      1
98*100     1
dtype: int64

In [702]:

#CHECK 'street_number_2'
df_hostel_coordinates7[df_hostel_coordinates7['street_number_2_hostel'].str.isdecimal()==False].value_counts('street_number_2_hostel')

Out[702]:

Series([], dtype: int64)

In [703]:

def split_street_number_1_2 (df_target,column_address_1,column_address_2):
    df = df_target.loc[df_target[column_address_1].str.isdecimal()==False,column_address_1]
    df = df.str.split(pat='(d+)', expand=True)
    df_target.loc[df_target[column_address_1].str.isdecimal()==False, column_address_2] = df.iloc[:,3] #column_address_2 needs to precede column_address_1 
    df_target.loc[df_target[column_address_1].str.isdecimal()==False, column_address_1] = df.iloc[:,1] #the condition on which .loc is based is lost
    return print('Split Values:'), df

In [704]:

split_street_number_1_2(df_hostel_coordinates7,'street_number_1_hostel','street_number_2_hostel')
Split Values:

Out[704]:

(None,
     0    1    2     3     4
 14       1    *     3      
 25     433  *LB  None  None
 38       1    *     3      
 50      16    *    18      
 52       3    *     5      
 63       2    *     4      
 84     116  *LB  None  None
 99      98    *   100      
 110      8    *    10      
 124     11    C  None  None
 141     77    *    79      
 157     32    *    34      
 158    149    *   155      
 198     95    *    97      
 201     56    *    58      
 207      1    *     3      
 208      1    *     3      
 225     83    *    85      
 239    373    *   377      )

In [705]:

#CHECK AGAIN 'street_number_1'
df_hostel_coordinates7['street_number_1_hostel'] = df_hostel_coordinates7['street_number_1_hostel'].astype(str).str.strip()
df_hostel_coordinates7[df_hostel_coordinates7['street_number_1_hostel'].str.isdecimal()==False].value_counts('street_number_1_hostel') 

Out[705]:

Series([], dtype: int64)

In [706]:

#CHECK AGAIN 'street_number_2'
df_hostel_coordinates7['street_number_2_hostel'] = df_hostel_coordinates7['street_number_2_hostel'].astype(str).str.strip()
df_hostel_coordinates7[df_hostel_coordinates7['street_number_2_hostel'].str.isdecimal()==False].value_counts('street_number_2_hostel')

Out[706]:

street_number_2_hostel
None    227
dtype: int64

In [707]:

df_hostel = df_hostel_coordinates7.copy()

DATAFRAME: COORDINATES FOR OTHER ESTABLISHMENTS

SOURCE:

https://opendata-ajuntament.barcelona.cat/data/en/dataset/allotjaments-altres

In [708]:

url_oe = 'https://www.bcn.cat/tercerlloc/files/allotjament/opendatabcn_allotjament_altres-allotjaments-js.json'
response_oe = requests.get(url_oe)
if response_oe.ok:
    data_oe = response_oe.json()
else:
    print('Problem with: ', url_oe)
df_other_establishments = pd.DataFrame.from_dict(data_oe)

In [709]:

df_other_establishments1 = df_other_establishments.copy()

EXTRACT RELEVANT DATA

In [710]:

#DROP UNNECESSARY COLUMNS
df_other_establishments1.columns

Out[710]:

Index(['register_id', 'prefix', 'suffix', 'name', 'created', 'modified',
       'status', 'status_name', 'core_type', 'core_type_name', 'body',
       'tickets_data', 'addresses', 'entity_types_data',
       'attribute_categories', 'values', 'from_relationships',
       'to_relationships', 'classifications_data', 'secondary_filters_data',
       'timetable', 'image_data', 'gallery_data', 'warnings', 'geo_epgs_25831',
       'geo_epgs_23031', 'geo_epgs_4326', 'is_section_of_data',
       'sections_data', 'start_date', 'end_date', 'estimated_dates',
       'languages_data', 'type', 'type_name', 'period', 'period_name',
       'event_status_name', 'event_status', 'ical'],
      dtype='object')

In [711]:

#CHECK COLUMNS TO DROP
df_other_establishments1[['register_id','prefix', 'suffix', 'created', 'modified','status', 'status_name', 'core_type', 
       'core_type_name', 'body','tickets_data','entity_types_data', 'attribute_categories', 'values']].head(1)

Out[711]:

register_idprefixsuffixcreatedmodifiedstatusstatus_namecore_typecore_type_namebodytickets_dataentity_types_dataattribute_categoriesvalues
075990030639NoneNone1996-09-17T00:00:00+02:002022-09-17T02:42:49.222463+02:00publishedPublicatplaceEquipamentNone[][{‘id’: 102, ‘name’: ‘equipament’}, {‘id’: 100…[{‘id’: 2, ‘name’: ‘Informació d’interès’, ‘…[{‘id’: 34989, ‘value’: ‘calabria@city-hotels….

In [712]:

#CHECK COLUMNS TO DROP
df_other_establishments1[['from_relationships', 'to_relationships', 'timetable', 'image_data', 'gallery_data', 'warnings',
       'geo_epgs_25831', 'geo_epgs_23031', 'is_section_of_data', 'sections_data', 
       'start_date', 'end_date', 'estimated_dates']].head(1)

Out[712]:

from_relationshipsto_relationshipstimetableimage_datagallery_datawarningsgeo_epgs_25831geo_epgs_23031is_section_of_datasections_datastart_dateend_dateestimated_dates
0[][]NoneNone[][]{‘x’: 429195.2433352358, ‘y’: 4581395.340320568}{‘x’: 429289.7989668916, ‘y’: 4581599.905278732}None[]NoneNoneNone

In [713]:

#CHECK COLUMNS TO DROP
df_other_establishments1[[ 'secondary_filters_data','languages_data', 'type','type_name', 'period', 
       'period_name','event_status_name', 'event_status', 'ical']].head(1)

Out[713]:

secondary_filters_datalanguages_datatypetype_nameperiodperiod_nameevent_status_nameevent_statusical
0[{‘id’: 57245924, ‘name’: ’03. Hotels, pension…NoneNoneNoneNoneNoneNoneNoneBEGIN:VCALENDARrnPRODID:ics.py – http://git….

In [714]:

#DROP COLUMNS
df_other_establishments1.drop(columns=['register_id','prefix', 'suffix', 'created', 'modified', 'status', 'status_name', 
                    'core_type', 'core_type_name', 'body', 'tickets_data', 'from_relationships',
                    'to_relationships','timetable', 'image_data', 'gallery_data', 'warnings',
                    'geo_epgs_25831', 'geo_epgs_23031', 'is_section_of_data','sections_data', 
                    'start_date', 'end_date', 'estimated_dates', 'languages_data', 'type', 
                    'type_name', 'period', 'period_name', 'event_status_name', 'event_status',
                    'ical','values', 'entity_types_data','attribute_categories','secondary_filters_data'], inplace=True)
df_other_establishments1.head(1)

Out[714]:

nameaddressesclassifications_datageo_epgs_4326
0Apartament Turístic Atenea Calabria – ATB-000001[{‘place’: None, ‘district_name’: ‘Eixample’, …[{‘id’: 1003011, ‘name’: ‘Apartaments turísti…{‘x’: 41.38096727220521, ‘y’: 2.1532132712982266}

In [715]:

df_other_establishments2 = df_other_establishments1.copy()

In [716]:

#FUNCTION TO SPLIT THE COORDINATE COLUMNS INTO TWO SEPARATE COLUMNS
def split_coordinates(arg):
    if "," in arg:
        return arg.split(",",1) # 1 : to split at the first found only
    else:
        return (arg, None) # None : to add a Null value when split character not found and so preserve the same column length

In [717]:

df_other_establishments2['geo_epgs_4326'] = df_other_establishments2['geo_epgs_4326'].astype(str)
df_other_establishments2[['latitude_oe','longitude_oe']] = [split_coordinates(x) for x in df_other_establishments2['geo_epgs_4326']]
df_other_establishments2['latitude_oe'] = df_other_establishments2['latitude_oe'].str.replace("{'x': ",'',regex=True)
df_other_establishments2['longitude_oe'] = df_other_establishments2['longitude_oe'].str.replace("'y': ",'',regex=True).replace('}','', regex=True)
df_other_establishments2.drop(columns=['geo_epgs_4326'],inplace=True)
df_other_establishments2.head(1)

Out[717]:

nameaddressesclassifications_datalatitude_oelongitude_oe
0Apartament Turístic Atenea Calabria – ATB-000001[{‘place’: None, ‘district_name’: ‘Eixample’, …[{‘id’: 1003011, ‘name’: ‘Apartaments turísti…41.380967272205212.1532132712982266

In [718]:

#EXTRACT A CLASSIFICATION COLUMNS FROM THE CLASSIFICATION_DATA COLUMN - THE COLUMN NEEDS TO BE IN OBJECT TYPE FOR THE EXTRACTION TO WORK
df_other_establishments2.loc[0,'classifications_data']

Out[718]:

[{'id': 1003011,
  'name': 'Apartaments turÃxadstics',
  'full_path': 'Tipologia EQ >> Allotjament >> Apartaments turÃxadstics',
  'dependency_group': 3033964,
  'parent_id': 1003,
  'tree_id': 1,
  'asia_id': '0000102003011',
  'core_type': 'place',
  'level': 2},
 {'id': 28793722,
  'name': '3 estrelles',
  'full_path': 'Categories >> Estrelles >> 3 estrelles',
  'dependency_group': 3033964,
  'parent_id': 103001,
  'tree_id': 103,
  'asia_id': '0010302001003',
  'core_type': 'place',
  'level': 2},
 {'id': 105001,
  'name': 'Accessible per a persones amb discapacitat fÃxadsica',
  'full_path': 'Accessibilitat >> Accessible per a persones amb discapacitat fÃxadsica',
  'dependency_group': 3033964,
  'parent_id': 105,
  'tree_id': 105,
  'asia_id': '0010501001',
  'core_type': 'place',
  'level': 1},
 {'id': 72314191,
  'name': 'Hospedaje en aparta-hoteles',
  'full_path': 'Arbre Principal Barcelona Activa  >> Industria, comerc i serveis >> Comercio, rest.y hospedajes reparaciones >> Servicio de hospedaje >> Hospedaje en aparta-hoteles',
  'dependency_group': 2206253,
  'parent_id': 27751630,
  'tree_id': 129,
  'asia_id': '0012904000006008004',
  'core_type': 'event',
  'level': 4},
 {'id': 68601655,
  'name': 'Guardia y custodia vehiculos en parkings',
  'full_path': 'Arbre Principal Barcelona Activa  >> Industria, comerc i serveis >> Transporte y comunicaciones >> Actividades anexas a los transportes >> Actividades anexas transporte terrestre >> Guardia y custodia vehiculos en parkings',
  'dependency_group': 2206253,
  'parent_id': 73311188,
  'tree_id': 129,
  'asia_id': '0012905000007005001002',
  'core_type': 'event',
  'level': 5}]

In [719]:

classifications_data  = []
for i in df_other_establishments2['classifications_data']:
        c = list([x.get('name') for x in i])[0] #to get first item from dictionary with 'name' as key
        classifications_data.append(c)
df_other_establishments2['category_oe'] = classifications_data
df_other_establishments2.drop(columns=['classifications_data'], inplace=True)
df_other_establishments2.head(1)

Out[719]:

nameaddresseslatitude_oelongitude_oecategory_oe
0Apartament Turístic Atenea Calabria – ATB-000001[{‘place’: None, ‘district_name’: ‘Eixample’, …41.380967272205212.1532132712982266Apartaments turístics

In [720]:

df_other_establishments2.loc[0,'addresses']

Out[720]:

[{'place': None,
  'district_name': 'Eixample',
  'district_id': '02',
  'neighborhood_name': "la Nova Esquerra de l'Eixample",
  'neighborhood_id': '09',
  'address_name': 'C CalÃxa0bria',
  'address_id': '054509',
  'block_id': None,
  'start_street_number': 129,
  'end_street_number': None,
  'street_number_1': '129',
  'street_number_2': None,
  'stairs': None,
  'level': None,
  'door': None,
  'zip_code': '08015',
  'province': 'BARCELONA',
  'town': 'BARCELONA',
  'country': 'ESPANYA',
  'comments': None,
  'position': 0,
  'main_address': True,
  'road_name': None,
  'road_id': None,
  'roadtype_name': None,
  'roadtype_id': None,
  'location': {'type': 'GeometryCollection',
   'geometries': [{'type': 'Point',
     'coordinates': [429195.2433352358, 4581395.340320568]}]},
  'related_entity': None,
  'related_entity_data': None,
  'hide_address': False}]

In [721]:

district_name = []
district_id = []
neighborhood_name = []
neighborhood_id = []
address_name  = []
street_number_1 = []
street_number_2 = []
for i in df_other_establishments2['addresses']:
        dn = list([x.get('district_name') for x in i])[0]
        dc = list([x.get('district_id') for x in i])[0]
        nn = list([x.get('neighborhood_name') for x in i])[0]
        nc = list([x.get('neighborhood_id') for x in i])[0]
        an = list([x.get('address_name') for x in i])[0]
        sn1 = list([x.get('street_number_1') for x in i])[0]
        sn2 = list([x.get('street_number_2') for x in i])[0]
        district_name.append(dn)
        district_id.append(dc)
        neighborhood_name.append(nn)
        neighborhood_id.append(nc)
        address_name.append(an)
        street_number_1.append(sn1)
        street_number_2.append(sn2)
df_other_establishments2['district_code_oe'] = district_id
df_other_establishments2['district_name_oe'] = district_name
df_other_establishments2['neighbourhood_code_oe'] = neighborhood_id
df_other_establishments2['neighbourhood_name_oe'] = neighborhood_name
df_other_establishments2['address_name_oe'] = address_name
df_other_establishments2['street_number_1_oe'] = street_number_1
df_other_establishments2['street_number_2_oe'] = street_number_2
df_other_establishments2.drop(columns='addresses', inplace=True)
df_other_establishments2.head(1)

Out[721]:

namelatitude_oelongitude_oecategory_oedistrict_code_oedistrict_name_oeneighbourhood_code_oeneighbourhood_name_oeaddress_name_oestreet_number_1_oestreet_number_2_oe
0Apartament Turístic Atenea Calabria – ATB-00000141.380967272205212.1532132712982266Apartaments turístics02Eixample09la Nova Esquerra de l’EixampleC Calàbria129None

In [722]:

df_other_establishments2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   name                   206 non-null    object
 1   latitude_oe            206 non-null    object
 2   longitude_oe           206 non-null    object
 3   category_oe            206 non-null    object
 4   district_code_oe       206 non-null    object
 5   district_name_oe       206 non-null    object
 6   neighbourhood_code_oe  206 non-null    object
 7   neighbourhood_name_oe  206 non-null    object
 8   address_name_oe        206 non-null    object
 9   street_number_1_oe     206 non-null    object
 10  street_number_2_oe     0 non-null      object
dtypes: object(11)
memory usage: 17.8+ KB

In [723]:

df_other_establishments2['category_oe'].value_counts()

Out[723]:

Albergs                   121
Residències               61
Apartaments turístics     12
Col·legis majors          11
Empreses de serveis         1
Name: category_oe, dtype: int64

In [724]:

df_other_establishments3 = df_other_establishments2.copy()

In [725]:

#RTC ID CODE IS INCLUDED IN MOST NAMES, ADDED AT THE END FOLLOWING A "-" AS SEPARATING CHARACTER
df_other_establishments3['name']

Out[725]:

0      Apartament Turístic Atenea Calabria - ATB-000001
1                 Col.legi  Major Penyafort - Montserrat
2                                Col.legi Major Bonaigua
3            Residència Universitària Mare Anna Ravell
4      Centre d’Acollida del Baix Guinardó per a l...
                             ...                        
201                 Residència Salesiana Martí Codolar
202                 Residència d'Estudiants Mare Güell
203                          Residència Erasmus Gràcia
204         Residència d'Estudiants LIV Student Sarrià
205            Residència Universitària Josep Manyanet
Name: name, Length: 206, dtype: object

In [726]:

#3 RESIDENCIES, 1 COLLEGIS MAJOR AND EMPRESES DE SERVEIS APPEAR TO HAVE THE SEPARATING CHARACTER
df_other_establishments3[['name','category_oe']][(df_other_establishments3['name'].str.contains(' - ')) 
                                                    & (df_other_establishments3['category_oe'] != 'Albergs') 
                                                    & (df_other_establishments3['category_oe'] != 'Apartaments turístics')]

Out[726]:

namecategory_oe
1Col.legi Major Penyafort – MontserratCol·legis majors
91Residència d’Estudiants Vita Student – PoblenouResidències
138Residència d’Estudiants Vita Student – PedralbesResidències
179Residència per a Investigadors CSIC – General…Residències

In [727]:

#HOWEVER, LIKE THE OTHERS, IT IS NOT FOLLOWED BY AN RTC CODE
df_other_establishments3[['name','category_oe']][(df_other_establishments3['category_oe'] != 'Albergs') 
                                                    & (df_other_establishments3['category_oe'] != 'Apartaments turístics')]

Out[727]:

namecategory_oe
1Col.legi Major Penyafort – MontserratCol·legis majors
2Col.legi Major BonaiguaCol·legis majors
3Residència Universitària Mare Anna RavellResidències
4Centre d’Acollida del Baix Guinardó per a l…Residències
12Casa Sant Felip NeriResidències
201Residència Salesiana Martí CodolarResidències
202Residència d’Estudiants Mare GüellResidències
203Residència Erasmus GràciaResidències
204Residència d’Estudiants LIV Student SarriÃResidències
205Residència Universitària Josep ManyanetResidències

73 rows × 2 columns

In [728]:

#AMONG ALBERGS AND APARTMENTS TURISTICS - ONLY 1 APARTMENTS TURISTICS APPEARS TO BE MISSING THE SEPARATING CHARACTER
df_other_establishments3[['name','category_oe']][(~df_other_establishments3['name'].str.contains(' - ')) 
                                                    & ((df_other_establishments3['category_oe'] == 'Albergs') 
                                                    | (df_other_establishments3['category_oe'] == 'Apartaments turístics'))].value_counts('category_oe')

Out[728]:

category_oe
Apartaments turístics    1
dtype: int64

In [729]:

#IN THIS CASE, THE SEPARATING CHARACTER IS FOLLOWED BY AN RTC CODE
df_other_establishments3[['name','category_oe']][(~df_other_establishments3['name'].str.contains(' - ')) 
                                                    & ((df_other_establishments3['category_oe'] == 'Albergs') 
                                                    | (df_other_establishments3['category_oe'] == 'Apartaments turístics'))]

Out[729]:

namecategory_oe
53Apartament Turístic Midtown Apartments- ATB-0…Apartaments turístics

In [730]:

#ALBERGS AND APARTMENTS TURISTIC ALL HAVE A "-" SEPARATING CHARACTER, FOLLOWED BY AN RTC CODE - HOWEVER TO AVOID SPLITTING RECORDS IN OTHER CATEGORIES, IT IS BEST TO USE "- A" AS SPLITTING CHARACTER AND THEN REINSTATE THE A
df_other_establishments3[['name','category_oe']][(df_other_establishments3['name'].str.contains(' - ')) 
                                                    & ((df_other_establishments3['category_oe'] == 'Albergs') 
                                                    | (df_other_establishments3['category_oe'] == 'Apartaments turístics'))].sort_values('category_oe')

Out[730]:

namecategory_oe
79Alberg Fabrizzios Terrace Barcelona – AJ000615Albergs
114Primavera Hostel – AJ000579Albergs
113Alberg Campus del Mar – AJ000520Albergs
112Alberg Coroleu House – AJ000571Albergs
111Alberg Casa Kessler Barcelona B – AJ000582Albergs
16Apartament Turístic Tibidabo Apartments – ATB…Apartaments turístics
89Apartament Turístic Descartes – ATB-000044Apartaments turístics
54Apartament Turístic Hostemplo – ATB-000089Apartaments turístics
186Apartament Turístic DV – ATB-000083Apartaments turístics
0Apartament Turístic Atenea Calabria – ATB-000001Apartaments turístics

132 rows × 2 columns

In [731]:

df_other_establishments4 = df_other_establishments3.copy()

In [732]:

#FUNCTION TO EXTRACT ID COLUMN
def rtc_split_oe(arg):
    if "- A" in arg:
        return arg.split("- A", 1) # 1 : to split at the first found only
    else:
        return [arg, None] # None : to add a Null value when split character not found and so preserve the same column length

In [733]:

df_other_establishments4['name'] = df_other_establishments4['name'].astype(str)
df_other_establishments4[['name_oe','rtc_oe']] = [rtc_split_oe(x) for x in df_other_establishments4['name']]
df_other_establishments4['name_oe'] = df_other_establishments4['name_oe'].str.strip()
df_other_establishments4['rtc_oe'] = "A" + df_other_establishments4['rtc_oe'] #REINSTATE THE 'A'
df_other_establishments4['rtc_oe'] = df_other_establishments4['rtc_oe'].str.strip()
df_other_establishments4.drop(columns=['name'],inplace=True)
df_other_establishments4.head(1)

Out[733]:

latitude_oelongitude_oecategory_oedistrict_code_oedistrict_name_oeneighbourhood_code_oeneighbourhood_name_oeaddress_name_oestreet_number_1_oestreet_number_2_oename_oertc_oe
041.380967272205212.1532132712982266Apartaments turístics02Eixample09la Nova Esquerra de l’EixampleC Calàbria129NoneApartament Turístic Atenea CalabriaATB-000001

In [734]:

#COMPARING THE INITIAL RESULTS
df_other_establishments4['category_oe'].value_counts()

Out[734]:

Albergs                   121
Residències               61
Apartaments turístics     12
Col·legis majors          11
Empreses de serveis         1
Name: category_oe, dtype: int64

In [735]:

#NOW THE RTC IS NOT PRESENT IN THESE CATEGORIES
df_other_establishments4['category_oe'][df_other_establishments4['rtc_oe'].isnull()].value_counts()

Out[735]:

Residències           61
Col·legis majors      11
Empreses de serveis     1
Name: category_oe, dtype: int64

In [736]:

#AND IT IS PRESENT IN THESE CATEGORIES
df_other_establishments4['category_oe'][df_other_establishments4['rtc_oe'].notnull()].value_counts()

Out[736]:

Albergs                   121
Apartaments turístics     12
Name: category_oe, dtype: int64

THE FOCUS OF THIS ANALYSIS IS ON ESTABLISHMENTS FOR T0URISTS
THEREFORE ONLY ALBERGS AND APARTMENTS TURISTICS WILL BE RETAINED

In [737]:

df_other_establishments5 = df_other_establishments4.copy()

In [738]:

#DATAFRAME FOR "Apartaments turístics" - at
df_touristapartment_coordinates = df_other_establishments5[df_other_establishments5['category_oe']=='Apartaments turístics']
df_touristapartment_coordinates['category_oe'].value_counts()

Out[738]:

Apartaments turístics    12
Name: category_oe, dtype: int64

In [739]:

#DATAFRAME FOR "ALBERGS" - al
df_alberg_coordinates = df_other_establishments5[df_other_establishments5['category_oe']=='Albergs']
df_alberg_coordinates['category_oe'].value_counts()

Out[739]:

Albergs    121
Name: category_oe, dtype: int64

DATAFRAME: COORDINATES FOR “APARTAMENTS TURISTICS” – _touristapartment (FROM OTHER ESTABLISHMENTS)

SOURCE:

SECTION OTHER ESTABLISHMENT: df_at

In [740]:

df_at = df_touristapartment_coordinates.copy()

In [741]:

df_at.columns

Out[741]:

Index(['latitude_oe', 'longitude_oe', 'category_oe', 'district_code_oe',
       'district_name_oe', 'neighbourhood_code_oe', 'neighbourhood_name_oe',
       'address_name_oe', 'street_number_1_oe', 'street_number_2_oe',
       'name_oe', 'rtc_oe'],
      dtype='object')

In [742]:

#RENAME COLUMNS
df_at.columns = ['latitude_at', 'longitude_at', 'category_at', 'district_code_at',
       'district_name_at', 'neighbourhood_code_at', 'neighbourhood_name_at',
       'address_name_at', 'street_number_1_at', 'street_number_2_at', 'name_at', 'rtc_at']
df_at.head(1)

Out[742]:

latitude_atlongitude_atcategory_atdistrict_code_atdistrict_name_atneighbourhood_code_atneighbourhood_name_ataddress_name_atstreet_number_1_atstreet_number_2_atname_atrtc_at
041.380967272205212.1532132712982266Apartaments turístics02Eixample09la Nova Esquerra de l’EixampleC Calàbria129NoneApartament Turístic Atenea CalabriaATB-000001
WHITESPACES

In [743]:

#REMOVING SPACES
# .replace(' ','', regex=True) - replace all spaces with nothing
# .str.strip() - replace 1 initial and 1 trailing space only
# .replace(r's+',' ', regex=True) - replace multiple spaces with one single space
# .replace(r'^s+|s+$','',regex=True) - replace all + spaces s starting ^ and trailing $
# .replace('nan','', regex=True) - replace pre-existing 'nan' strings into empty cells - not to be used for string columns potentially containing nan as subpart of string
# .replace('.0','',regex=True) - replace .0 with nothing - '' is required to assign '.' as a normal character and not as a special one
df_at['rtc_at'] = df_at['rtc_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_at['category_at'] = df_at['category_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_at['address_name_at'] = df_at['address_name_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_at['street_number_1_at'] = df_at['street_number_1_at'].astype(str).replace(r'','',regex=True).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_at['street_number_2_at'] = df_at['street_number_2_at'].astype(str).replace(r'','',regex=True).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_at['district_code_at'] = df_at['district_code_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True)
df_at['district_name_at'] = df_at['district_name_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_at['neighbourhood_code_at'] = df_at['neighbourhood_code_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True)
df_at['neighbourhood_name_at'] = df_at['neighbourhood_name_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_at['longitude_at'] = df_at['longitude_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_at['latitude_at'] = df_at['latitude_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_at['name_at'] = df_at['name_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)

In [744]:

#DISTRICT AND NEIGHBOURHOOD CODES NEED TO BE IN STRING FORMAT AND REQUIRE AN ADDED '0' IN FRONT OF ALL NUMBERS BELOW 10
df_at[['district_code_at']] = df_at[['district_code_at']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
df_at[['neighbourhood_code_at']] = df_at[['neighbourhood_code_at']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])

In [745]:

#REPLACE CELL THAT IS ENTIRELY SPACE OR EMPTY with None
df_at = df_at.applymap(lambda x: None if isinstance(x, str) and (x=='' or x.isspace()) else x)

In [746]:

df_at.head(1)

Out[746]:

latitude_atlongitude_atcategory_atdistrict_code_atdistrict_name_atneighbourhood_code_atneighbourhood_name_ataddress_name_atstreet_number_1_atstreet_number_2_atname_atrtc_at
041.380967272205212.1532132712982266Apartaments turístics02Eixample09la Nova Esquerra de l’EixampleC Calà bria129NoneApartament Turístic Atenea CalabriaATB-000001

In [747]:

df_at1 = df_at.copy()

DUPLICATES

In [748]:

#PRELIMINARY CHECK
df_at1.duplicated().value_counts()

Out[748]:

False    12
dtype: int64

In [749]:

#CHECK ON ID COLUMN
df_at1[df_at1.duplicated(subset=['rtc_at'], keep=False)]

Out[749]:

latitude_atlongitude_atcategory_atdistrict_code_atdistrict_name_atneighbourhood_code_atneighbourhood_name_ataddress_name_atstreet_number_1_atstreet_number_2_atname_atrtc_at

MISSING VALUES

In [750]:

df_at1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 0 to 195
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   latitude_at            12 non-null     object
 1   longitude_at           12 non-null     object
 2   category_at            12 non-null     object
 3   district_code_at       12 non-null     object
 4   district_name_at       12 non-null     object
 5   neighbourhood_code_at  12 non-null     object
 6   neighbourhood_name_at  12 non-null     object
 7   address_name_at        12 non-null     object
 8   street_number_1_at     12 non-null     object
 9   street_number_2_at     0 non-null      object
 10  name_at                12 non-null     object
 11  rtc_at                 12 non-null     object
dtypes: object(12)
memory usage: 1.2+ KB

In [751]:

df_at1['category_at'].value_counts()

Out[751]:

Apartaments turístics    12
Name: category_at, dtype: int64
RTC

In [752]:

#IDENTIFY NULL VALUES
df_at1[(df_at1['rtc_at'].isnull()) 
       | (df_at1['rtc_at']=='nan') 
       | (df_at1['rtc_at']==None) 
       | (df_at1['rtc_at']=='')].shape[0]

Out[752]:

0
DISTRICT – NEIGHBOURHOOD

In [753]:

#IDENTIFY NULL VALUES
df_at1[(df_at1['neighbourhood_code_at'].isnull()) 
       | (df_at1['neighbourhood_code_at']=='nan') 
       |  (df_at1['neighbourhood_code_at']==None) 
       |  (df_at1['neighbourhood_code_at']=='')].shape[0]

Out[753]:

0

In [754]:

#IDENTIFY NULL VALUES
df_at1[(df_at1['neighbourhood_name_at'].isnull()) 
       | (df_at1['neighbourhood_name_at']=='nan') 
       |  (df_at1['neighbourhood_name_at']==None) 
       |  (df_at1['neighbourhood_name_at']=='')].shape[0]

Out[754]:

0

In [755]:

#CHECK IF NEIGHBOURHOOD NAMES ARE COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_at1[~df_at1['neighbourhood_name_at'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]

Out[755]:

5

In [756]:

#CHECK IF NEIGHBOURHOOD CODES ARE COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_at1[~df_at1['neighbourhood_code_at'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])].shape[0]

Out[756]:

0

In [757]:

#MAPPING/REPLACING REQUIRES NO NULL VALUES IN COLUMN LINKED TO set_index COLUMN
df_at1['neighbourhood_name_at'] = df_at1['neighbourhood_code_at'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['Neighbourhood_Name'])
df_at1['district_code_at'] = df_at1['neighbourhood_code_at'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Code'])
df_at1['district_name_at'] = df_at1['neighbourhood_code_at'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Name'])

In [758]:

#CHECK IF NEIGHBOURHOOD NAMES ARE COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_at1[~df_at1['neighbourhood_name_at'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]

Out[758]:

0

In [759]:

df_at2 = df_at1.copy()

NORMALIZATION

In [760]:

df_at2.columns

Out[760]:

Index(['latitude_at', 'longitude_at', 'category_at', 'district_code_at',
       'district_name_at', 'neighbourhood_code_at', 'neighbourhood_name_at',
       'address_name_at', 'street_number_1_at', 'street_number_2_at',
       'name_at', 'rtc_at'],
      dtype='object')

In [761]:

#CHECK 'street_number_1'
df_at2[df_at2['street_number_1_at'].str.isdecimal()==False].value_counts('street_number_1_at') 

Out[761]:

street_number_1_at
276*280    1
dtype: int64

In [762]:

#CHECK 'street_number_2'
df_at2[df_at2['street_number_2_at'].str.isdecimal()==False].value_counts('street_number_2_at')

Out[762]:

Series([], dtype: int64)

In [763]:

def split_street_number_1_2 (df_target,column_address_1,column_address_2):
    df = df_target.loc[df_target[column_address_1].str.isdecimal()==False,column_address_1]
    df = df.str.split(pat='(d+)', expand=True)
    df_target.loc[df_target[column_address_1].str.isdecimal()==False, column_address_2] = df.iloc[:,3] #column_address_2 needs to precede column_address_1 
    df_target.loc[df_target[column_address_1].str.isdecimal()==False, column_address_1] = df.iloc[:,1] #the condition on which .loc is based is lost
    return print('Split Values:'), df

In [764]:

split_street_number_1_2(df_at2,'street_number_1_at','street_number_2_at')
Split Values:

Out[764]:

(None,
    0    1  2    3 4
 54    276  *  280  )

In [765]:

#CHECK AGAIN 'street_number_1'
df_at2[df_at2['street_number_1_at'].str.isdecimal()==False].value_counts('street_number_1_at') 

Out[765]:

Series([], dtype: int64)

In [766]:

#CHECK AGAIN 'street_number_2'
df_at2[df_at2['street_number_2_at'].str.isdecimal()==False].value_counts('street_number_2_at')

Out[766]:

Series([], dtype: int64)

In [767]:

df_at2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 0 to 195
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   latitude_at            12 non-null     object
 1   longitude_at           12 non-null     object
 2   category_at            12 non-null     object
 3   district_code_at       12 non-null     object
 4   district_name_at       12 non-null     object
 5   neighbourhood_code_at  12 non-null     object
 6   neighbourhood_name_at  12 non-null     object
 7   address_name_at        12 non-null     object
 8   street_number_1_at     12 non-null     object
 9   street_number_2_at     1 non-null      object
 10  name_at                12 non-null     object
 11  rtc_at                 12 non-null     object
dtypes: object(12)
memory usage: 1.2+ KB

In [768]:

df_touristapartment = df_at2.copy()

DATAFRAME: COORDINATES FOR “ALBERGS” – alberg (FROM OTHER ESTABLISHMENTS)

SOURCE:

SECTION OTHER ESTABLISHMENT: df_albergs

In [769]:

df_al = df_alberg_coordinates.copy()

In [770]:

df_al.columns

Out[770]:

Index(['latitude_oe', 'longitude_oe', 'category_oe', 'district_code_oe',
       'district_name_oe', 'neighbourhood_code_oe', 'neighbourhood_name_oe',
       'address_name_oe', 'street_number_1_oe', 'street_number_2_oe',
       'name_oe', 'rtc_oe'],
      dtype='object')

In [771]:

#RENAME COLUMNS
df_al.columns = ['latitude_al', 'longitude_al', 'category_al', 'district_code_al',
       'district_name_al', 'neighbourhood_code_al', 'neighbourhood_name_al',
       'address_name_al', 'street_number_1_al', 'street_number_2_al', 'name_al', 'rtc_al']
df_al.head(1)

Out[771]:

latitude_allongitude_alcategory_aldistrict_code_aldistrict_name_alneighbourhood_code_alneighbourhood_name_aladdress_name_alstreet_number_1_alstreet_number_2_alname_alrtc_al
641.391161663950022.184353207057929Albergs10Sant Martí66el Parc i la Llacuna del PoblenouC Buenaventura Muñoz16NoneAlberg Arc HouseAJ000645
WHITESPACES

In [772]:

#REMOVING SPACES
# .replace(' ','', regex=True) - replace all spaces with nothing
# .str.strip() - replace 1 initial and 1 trailing space only
# .replace(r's+',' ', regex=True) - replace multiple spaces with one single space
# .replace(r'^s+|s+$','',regex=True) - replace all + spaces s starting ^ and trailing $
# .replace('nan','', regex=True) - replace pre-existing 'nan' strings into empty cells - not to be used for string columns potentially containing nan as subpart of string
# .replace('.0','',regex=True) - replace .0 with nothing - '' is required to assign '.' as a normal character and not as a special one
df_al['rtc_al'] = df_al['rtc_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_al['category_al'] = df_al['category_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_al['address_name_al'] = df_al['address_name_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_al['street_number_1_al'] = df_al['street_number_1_al'].astype(str).replace(r'','',regex=True).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_al['street_number_2_al'] = df_al['street_number_2_al'].astype(str).replace(r'','',regex=True).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_al['district_code_al'] = df_al['district_code_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True)
df_al['district_name_al'] = df_al['district_name_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_al['neighbourhood_code_al'] = df_al['neighbourhood_code_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True)
df_al['neighbourhood_name_al'] = df_al['neighbourhood_name_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
df_al['longitude_al'] = df_al['longitude_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_al['latitude_al'] = df_al['latitude_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True)
df_al['name_al'] = df_al['name_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)

In [773]:

#DISTRICT AND NEIGHBOURHOOD CODES NEED TO BE IN STRING FORMAT AND REQUIRE AN ADDED '0' IN FRONT OF ALL NUMBERS BELOW 10
df_al[['district_code_al']] = df_al[['district_code_al']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
df_al[['neighbourhood_code_al']] = df_al[['neighbourhood_code_al']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])

In [774]:

#REPLACE CELL THAT IS ENTIRELY SPACE OR EMPTY with None
df_al = df_al.applymap(lambda x: None if isinstance(x, str) and (x=='' or x.isspace()) else x)

In [775]:

df_al1 = df_al.copy()

DUPLICATES

In [776]:

#PRELIMINARY CHECK
df_al1.duplicated().value_counts()

Out[776]:

False    121
dtype: int64

In [777]:

#CHECK ON ID COLUMN
df_al1[df_al1.duplicated(subset=['rtc_al'], keep=False)]

Out[777]:

latitude_allongitude_alcategory_aldistrict_code_aldistrict_name_alneighbourhood_code_alneighbourhood_name_aladdress_name_alstreet_number_1_alstreet_number_2_alname_alrtc_al

MISSING VALUES

In [778]:

df_al1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 121 entries, 6 to 200
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   latitude_al            121 non-null    object
 1   longitude_al           121 non-null    object
 2   category_al            121 non-null    object
 3   district_code_al       121 non-null    object
 4   district_name_al       121 non-null    object
 5   neighbourhood_code_al  121 non-null    object
 6   neighbourhood_name_al  121 non-null    object
 7   address_name_al        121 non-null    object
 8   street_number_1_al     121 non-null    object
 9   street_number_2_al     0 non-null      object
 10  name_al                121 non-null    object
 11  rtc_al                 121 non-null    object
dtypes: object(12)
memory usage: 12.3+ KB
RTC

In [779]:

#IDENTIFY NULL VALUES
df_al1[(df_al1['rtc_al'].isnull()) 
       | (df_al1['rtc_al']=='nan') 
       | (df_al1['rtc_al']==None) 
       | (df_al1['rtc_al']=='')].shape[0]

Out[779]:

0
DISTRICT – NEIGHBOURHOOD

In [780]:

#IDENTIFY NULL VALUES
df_al1[(df_al1['neighbourhood_code_al'].isnull()) 
       | (df_al1['neighbourhood_code_al']=='nan') 
       | (df_al1['neighbourhood_code_al']==None) 
       | (df_al1['neighbourhood_code_al']=='')].shape[0]

Out[780]:

0

In [781]:

#IDENTIFY NULL VALUES
df_al1[(df_al1['neighbourhood_name_al'].isnull()) 
      | (df_al1['neighbourhood_name_al']=='nan') 
      | (df_al1['neighbourhood_name_al']==None) 
      | (df_al1['neighbourhood_name_al']=='')].shape[0]

Out[781]:

0

In [782]:

#CHECK HOW MANY RECORDS NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_al1[~df_al1['neighbourhood_name_al'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]

Out[782]:

17

In [783]:

#CHECK HOW MANY RECORDS NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_al1[~df_al1['neighbourhood_code_al'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])].shape[0]

Out[783]:

0

In [784]:

#MAPPING/REPLACING REQUIRES NO NULL VALUES IN COLUMN LINKED TO set_index COLUMN
df_al1['neighbourhood_name_al'] = df_al1['neighbourhood_code_al'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['Neighbourhood_Name'])
df_al1['district_code_al'] = df_al1['neighbourhood_code_al'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Code'])
df_al1['district_name_al'] = df_al1['neighbourhood_code_al'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Name'])

In [785]:

#CHECK HOW MANY RECORDS NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_al1[~df_al1['neighbourhood_name_al'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]

Out[785]:

0

In [786]:

#CHECK HOW MANY RECORDS NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE
df_al1[~df_al1['neighbourhood_code_al'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])].shape[0]

Out[786]:

0

NORMALIZATION

In [787]:

df_al2 = df_al1.copy()

In [788]:

#CHECK 'street_number_1'
df_al2[df_al2['street_number_1_al'].str.isdecimal()==False].value_counts('street_number_1_al') 

Out[788]:

street_number_1_al
51*LB      2
6*8        2
116*LB     1
8*10       1
70*74      1
58*60      1
56*58      1
55*57      1
52*54      1
48*52      1
149*151    1
45*47      1
41*51      1
402*404    1
38*42      1
33*LB      1
21*23      1
17*19      1
86*88      1
dtype: int64

In [789]:

#CHECK 'street_number_2'
df_al2[df_al2['street_number_2_al'].str.isdecimal()==False].value_counts('street_number_2_al')

Out[789]:

Series([], dtype: int64)

In [790]:

def split_street_number_1_2 (df_target,column_address_1,column_address_2):
    df = df_target.loc[df_target[column_address_1].str.isdecimal()==False,column_address_1]
    df = df.str.split(pat='(d+)', expand=True)
    df_target.loc[df_target[column_address_1].str.isdecimal()==False, column_address_2] = df.iloc[:,3] #column_address_2 needs to precede column_address_1 
    df_target.loc[df_target[column_address_1].str.isdecimal()==False, column_address_1] = df.iloc[:,1] #the condition on which .loc is based is lost
    return print('Split Values:'), df

In [791]:

split_street_number_1_2(df_al2,'street_number_1_al','street_number_2_al')
Split Values:

Out[791]:

(None,
     0    1    2     3     4
 7       56    *    58      
 11      58    *    60      
 18      17    *    19      
 19      45    *    47      
 28      55    *    57      
 39       6    *     8      
 43      21    *    23      
 64     402    *   404      
 67      48    *    52      
 94      51  *LB  None  None
 110    116  *LB  None  None
 112     33  *LB  None  None
 115     51  *LB  None  None
 130      6    *     8      
 140     70    *    74      
 141     41    *    51      
 159     38    *    42      
 160     52    *    54      
 183    149    *   151      
 185     86    *    88      
 192      8    *    10      )

In [792]:

#CHECK AGAIN 'street_number_1'
df_al2[df_al2['street_number_1_al'].str.isdecimal()==False].value_counts('street_number_1_al') 

Out[792]:

Series([], dtype: int64)

In [793]:

#CHECK AGAIN 'street_number_2'
df_al2[df_al2['street_number_2_al'].str.isdecimal()==False].value_counts('street_number_2_al')

Out[793]:

Series([], dtype: int64)

In [794]:

df_al2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 121 entries, 6 to 200
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   latitude_al            121 non-null    object
 1   longitude_al           121 non-null    object
 2   category_al            121 non-null    object
 3   district_code_al       121 non-null    object
 4   district_name_al       121 non-null    object
 5   neighbourhood_code_al  121 non-null    object
 6   neighbourhood_name_al  121 non-null    object
 7   address_name_al        121 non-null    object
 8   street_number_1_al     121 non-null    object
 9   street_number_2_al     17 non-null     object
 10  name_al                121 non-null    object
 11  rtc_al                 121 non-null    object
dtypes: object(12)
memory usage: 12.3+ KB

PROBLEM: THE RTC CODES APPEAR IN A DIFFERENT FORMAT

In [795]:

df_n_places[df_n_places['category']=='Albergs'].head(1)

Out[795]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
11302-2014-2576ALB-472NoneAlbergsAV DIAGONAL 436AVDIAGONAL436NoneNoneNoneNoneNone02Eixample07la Dreta de l’EixampleNoneNone24.0

1 rows × 23 columns

In [796]:

df_al2.head(1)

Out[796]:

latitude_allongitude_alcategory_aldistrict_code_aldistrict_name_alneighbourhood_code_alneighbourhood_name_aladdress_name_alstreet_number_1_alstreet_number_2_alname_alrtc_al
641.391161663950022.184353207057929Albergs10Sant Martí66el Parc i la Llacuna del PoblenouC Buenaventura Muñoz16NoneAlberg Arc HouseAJ000645

VERIFY ADDRESS MATCH

In [797]:

df_n_places_albergs_only = df_n_places[df_n_places['category']=='Albergs']
df_n_places_albergs_only.head(1)

Out[797]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
11302-2014-2576ALB-472NoneAlbergsAV DIAGONAL 436AVDIAGONAL436NoneNoneNoneNoneNone02Eixample07la Dreta de l’EixampleNoneNone24.0

1 rows × 23 columns

In [798]:

df_al_verify_address_match = df_al2.copy()

In [799]:

df_n_places_al_verify_address_match = df_n_places_albergs_only.copy()

In [800]:

#TO MERGE ON PARTIAL ADDRESS:
#STRIP DIGITS FROM
df_n_places_al_verify_address_match['address_verify'] = df_n_places_al_verify_address_match['address'].replace(r'd+','',regex=True).replace(r's+',' ',regex=True).str.strip()
#UPPER CASE THE 'address_name_al'
df_al_verify_address_match['address_verify_al'] = df_al_verify_address_match['address_name_al'].str.upper()

In [801]:

df_n_places_al_verify_address_match['address_verify']

Out[801]:

113             AV DIAGONAL
148             AV DIAGONAL
149             AV DIAGONAL
265               AV ICARIA
336            AV MERIDIANA
                ...        
10125      RDA SANT PERE PR
10160    RDA UNIVERSITAT EN
10330           VIA AUGUSTA
10331           VIA AUGUSTA
10338             VIA JULIA
Name: address_verify, Length: 125, dtype: object

In [802]:

df_verify_address_match = pd.merge(df_n_places_al_verify_address_match,df_al_verify_address_match, how='inner', 
                                   left_on=['address_verify','street_number_1','neighbourhood_code'],
                                   right_on=['address_verify_al','street_number_1_al','neighbourhood_code_al'])
df_verify_address_match[['rtc','rtc_al','street_number_1','street_number_1_al','neighbourhood_code','neighbourhood_code_al']]

Out[802]:

rtcrtc_alstreet_number_1street_number_1_alneighbourhood_codeneighbourhood_code_al
0ALB-472AJ0004724364360707
1ALB-562AJ0005625785782626
2ALB-562AJ0005625785782626
3ALB-491AJ00049197976565
4ALB-460AJ00046052523636
5ALB-529AJ00052912121111
6ALB-593AJ00059375750808
7ALB-539AJ000539330707
8ALB-608AJ00060865650707
9ALB-665AJ00066548481111
10ALB-496AJ00049652520808
11ALB-605AJ00060530301515
12ALB-639AJ0006393553550707
13ALB-15AJ00001556562323
14ALB-471AJ00047117170202
15ALB-517AJ000517556464
16ALB-670AJ0006701761760707
17ALB-512AJ000512881111
18ALB-537AJ00053717173131
19ALB-651AJ0006512372370606
20ALB-427AJ000427222222
21ALB-635AJ0006352902900707
22ALB-532AJ00053270701818
23ALB-598AJ00059838383131
24ALB-440AJ00044091910101
25ALB-559AJ000559551818
26ALB-486AJ00048623231111
27ALB-638AJ00063820201111
28ALB-535AJ00053535353131
29ALB-682AJ00068243431818
30ALB-667AJ00066720201818
31ALB-442AJ000442550404
32ALB-568AJ0005685635630808
33ALB-580AJ0005806286280707
34ALB-520AJ000520440303
35ALB-531AJ000531552020
36ALB-609AJ00060964640707
37ALB-611AJ00061151511010
38ALB-587AJ00058756560707
39ALB-631AJ00063165652626
40ALB-631AJ00063265652626
41ALB-632AJ00063165652626
42ALB-632AJ00063265652626

BY MERGING THE DATAFRAMES ON PARTIAL ADDRESS, THE RTC CODES APPEAR TO MATCH EXCEPT FOR THE STARTING LETTERS
THEREFORE:
MODIFY THE RTC CODES ON THE STARTING LETTERS AND VERIFY RTC MATCH:

In [803]:

df_n_places_al_verify_rtc_match = df_n_places[df_n_places['category']=='Albergs']
df_n_places_al_verify_rtc_match.head(1)

Out[803]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
11302-2014-2576ALB-472NoneAlbergsAV DIAGONAL 436AVDIAGONAL436NoneNoneNoneNoneNone02Eixample07la Dreta de l’EixampleNoneNone24.0

1 rows × 23 columns

In [804]:

df_al_verify_rtc_match = df_al2.copy()

In [805]:

df_al_verify_rtc_match['rtc_al_modified'] = df_al_verify_rtc_match['rtc_al'].astype(str).str.replace('AJ000','ALB-')
df_al_verify_rtc_match.head(1)

Out[805]:

latitude_allongitude_alcategory_aldistrict_code_aldistrict_name_alneighbourhood_code_alneighbourhood_name_aladdress_name_alstreet_number_1_alstreet_number_2_alname_alrtc_alrtc_al_modified
641.391161663950022.184353207057929Albergs10Sant Martí66el Parc i la Llacuna del PoblenouC Buenaventura Muñoz16NoneAlberg Arc HouseAJ000645ALB-645

In [806]:

df_verify_rtc_match = df_al_verify_rtc_match.merge(df_n_places_al_verify_rtc_match, 
                          how='inner', left_on=['rtc_al_modified'], right_on=['rtc'])

In [807]:

df_verify_rtc_match

Out[807]:

latitude_allongitude_alcategory_aldistrict_code_aldistrict_name_alneighbourhood_code_alneighbourhood_name_aladdress_name_alstreet_number_1_alstreet_number_2_alstairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
041.391161663950022.184353207057929Albergs10Sant Martí66el Parc i la Llacuna del PoblenouC Buenaventura Muñoz16NoneNone1210Sant Martí66el Parc i la Llacuna del PoblenouNoneNone18.0
141.392973401607992.125102845599096Albergs05Sarrià-Sant Gervasi23SarriàC Capità Arenas5658NoneNoneNone05Sarrià-Sant Gervasi23SarriàNoneNone219.0
241.379248593728962.137341737796588Albergs03Sants-Montjuïc18SantsC Vallespir34NoneNoneBJNone03Sants-Montjuïc18SantsNoneNone13.0
341.3733349520610362.1657931149717276Albergs03Sants-Montjuïc11el Poble-secC SalvÃ36NoneNoneNoneNone03Sants-Montjuïc11el Poble-secNoneNone60.0
441.392063711351322.1705001263052957Albergs02Eixample07la Dreta de l’EixampleC Roger de Llúria40NoneNone1102Eixample07la Dreta de l’EixampleNoneNone19.0
10941.4220637237985162.101812736653692Albergs05Sarrià-Sant Gervasi22Vallvidrera, el Tibidabo i les PlanesC Major del Rectoret2NoneNoneNoneNone05Sarrià-Sant Gervasi22Vallvidrera, el Tibidabo i les PlanesNoneNone247.0
11041.37359561332172.169147595451214Albergs03Sants-Montjuïc11el Poble-secC Lafont810NoneNoneNone03Sants-Montjuïc11el Poble-secNoneNone148.0
11141.3758887855384862.171010168799885Albergs01Ciutat Vella01el RavalC Nou de la Rambla91NoneNoneNoneNone01Ciutat Vella01el RavalNoneNone100.0
11241.381014339173562.1749336893605618Albergs01Ciutat Vella02el Barri GòticC Ferran17NoneNoneNoneNone01Ciutat Vella02el Barri GòticNoneNone151.0
11341.3946741239228542.1693004756249827Albergs02Eixample07la Dreta de l’EixampleC Bruc94NoneNoneEN202Eixample07la Dreta de l’EixampleNoneNone16.0

114 rows × 36 columns

THE MODIFIED RTC COLUMN IS ADDED TO df_al TO ENABLE MERGING

In [808]:

df_al2['rtc_al_modified'] = df_al2['rtc_al'].str.replace('AJ000','ALB-')

In [809]:

df_al2.head(1)

Out[809]:

latitude_allongitude_alcategory_aldistrict_code_aldistrict_name_alneighbourhood_code_alneighbourhood_name_aladdress_name_alstreet_number_1_alstreet_number_2_alname_alrtc_alrtc_al_modified
641.391161663950022.184353207057929Albergs10Sant Martí66el Parc i la Llacuna del PoblenouC Buenaventura Muñoz16NoneAlberg Arc HouseAJ000645ALB-645

In [810]:

df_alberg = df_al2.copy()

MERGE WITH PANDAS + SQL:

– pandasql

– duckdb

REQUIRED LIBRARIES

In [811]:

#FOR SQL ELABORATION  - 2 alternative PANDAS libraries:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
import duckdb

USEFUL RESOURCES

https://towardsdatascience.com/query-pandas-dataframe-with-sql-2bb7a509793d

https://hex.tech/blog/how-to-write-sql-in-pandas/

MERGE WITH PANDAS:

– pandas.merge()

MERGE 1 : df_n_places + df_hut : _m1

In [812]:

df_n_places_m1_0 = df_n_places.copy()
df_n_places_m1_0.head(1)

Out[812]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
001-90-A-128HB-003893NoneHotel 3 estrellesGRAVINA 5 7nanGRAVINA5None7NoneNoneNone01Ciutat Vella01el RavalNoneNone86.0

1 rows × 23 columns

In [813]:

df_hut_m1_0 = df_hut.copy()
df_hut_m1_0.head(1)

Out[813]:

n_practice_hutdistrict_code_hutdistrict_name_hutneighbourhood_code_hutneighbourhood_name_hutstreet_type_hutstreet_hutstreet_number_1_hutstreet_letter_1_hutstreet_number_2_hutblock_hutentrance_hutstair_hutfloor_hutdoor_hutrtc_hutn_places_hutlongitude_hutlatitude_hutname_hut
003-2010-043703Sants-Montjuïc16la BordetaCarrerCONSTITUCIO127None129NoneNoneNone42HUTB-0035027.02.13221459641.36719526Habitatges d’Ús Turístic

1 rows × 21 columns

In [814]:

df_n_places_m1_0 = df_n_places_m1_0.merge(df_hut_m1_0, how='inner', left_on=['n_practice'], right_on=['n_practice_hut'])
df_n_places_m1_0['category'].value_counts()

Out[814]:

Habitatges d'Ús Turístic    9409
Name: category, dtype: int64

In [815]:

df_n_places_m1_1 = df_n_places_m1_0.copy()

In [816]:

df_n_places_m1_1.columns

Out[816]:

Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type',
       'street', 'street_number_1', 'street_letter_1', 'street_number_2',
       'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door',
       'district_code', 'district_name', 'neighbourhood_code',
       'neighbourhood_name', 'longitude', 'latitude', 'n_places',
       'n_practice_hut', 'district_code_hut', 'district_name_hut',
       'neighbourhood_code_hut', 'neighbourhood_name_hut', 'street_type_hut',
       'street_hut', 'street_number_1_hut', 'street_letter_1_hut',
       'street_number_2_hut', 'street_letter_2_hut', 'block_hut',
       'entrance_hut', 'stair_hut', 'floor_hut', 'door_hut', 'rtc_hut',
       'n_places_hut', 'longitude_hut', 'latitude_hut', 'name_hut'],
      dtype='object')

REMAINING RECORDS

N_PRACTICE

In [817]:

#RECORDS NOT INCLUDED
df_hut_remaining = df_n_places_m1_1[((~df_n_places_m1_1['n_practice_hut'].isin(df_n_places_m1_1['n_practice'])) | (df_hut_m1_0['n_practice_hut'].isnull()))]
df_hut_remaining

Out[817]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2block_hutentrance_hutstair_hutfloor_hutdoor_hutrtc_hutn_places_hutlongitude_hutlatitude_hutname_hut

0 rows × 44 columns

In [818]:

#DROP ADDED COLUMN
df_n_places_m1_1.drop(columns=['n_practice_hut'], inplace = True)

COMPARISON

RTC

In [819]:

#CHECK DIFFERENCES
df_n_places_m1_1[['rtc','rtc_hut']][(df_n_places_m1_1['rtc']!= df_n_places_m1_1['rtc_hut']) 
                                    & (df_n_places_m1_1['rtc_hut'].notnull())]

Out[819]:

rtcrtc_hut

In [820]:

#DROP ADDED COLUMN
df_n_places_m1_1.drop(columns=['rtc_hut'], inplace = True)

N_PLACES

In [821]:

#CHECK DIFFERENCES
df_n_places_m1_1[['n_places','n_places_hut']][(df_n_places_m1_1['n_places']!= df_n_places_m1_1['n_places_hut']) 
                                              & (df_n_places_m1_1['n_places_hut'].notnull())]

Out[821]:

n_placesn_places_hut

In [822]:

#DROP ADDED COLUMN
df_n_places_m1_1.drop(columns=['n_places_hut'], inplace = True)

NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME

In [823]:

#CHECK DIFFERENCES 
df_n_places_m1_1[['district_name','district_name_hut']][(df_n_places_m1_1['district_name']!= df_n_places_m1_1['district_name_hut']) 
                                                        & (df_n_places_m1_1['district_name_hut'].notnull())]

Out[823]:

district_namedistrict_name_hut

In [824]:

#DROP ADDED COLUMN
df_n_places_m1_1.drop(columns=['district_name_hut'], inplace = True)

In [825]:

#CHECK DIFFERENCES 
df_n_places_m1_1[['district_code','district_code_hut']][(df_n_places_m1_1['district_code']!= df_n_places_m1_1['district_code_hut']) 
                                                        & (df_n_places_m1_1['district_code_hut'].notnull())]

Out[825]:

district_codedistrict_code_hut

In [826]:

#DROP ADDED COLUMN
df_n_places_m1_1.drop(columns=['district_code_hut'], inplace = True)

In [827]:

#CHECK DIFFERENCES 
df_n_places_m1_1[['neighbourhood_code','neighbourhood_code_hut']][(df_n_places_m1_1['neighbourhood_code']!= df_n_places_m1_1['neighbourhood_code_hut']) 
                                                                  & (df_n_places_m1_1['neighbourhood_code_hut'].notnull())]

Out[827]:

neighbourhood_codeneighbourhood_code_hut

In [828]:

#DROP ADDED COLUMN
df_n_places_m1_1.drop(columns=['neighbourhood_code_hut'], inplace = True)

In [829]:

#CHECK DIFFERENCES
df_n_places_m1_1[['neighbourhood_name','neighbourhood_name_hut']][(df_n_places_m1_1['neighbourhood_name']!= df_n_places_m1_1['neighbourhood_name_hut']) 
                                                                  & (df_n_places_m1_1['neighbourhood_name_hut'].notnull())]

Out[829]:

neighbourhood_nameneighbourhood_name_hut

In [830]:

#DROP ADDED COLUMN
df_n_places_m1_1.drop(columns=['neighbourhood_name_hut'], inplace = True)

COLUMN FILLING

In [831]:

df_n_places_m1_2 = df_n_places_m1_1.copy()

In [832]:

#longitude
df_n_places_m1_2.loc[df_n_places_m1_2['longitude_hut'].notnull(),'longitude'] = df_n_places_m1_2['longitude_hut']

In [833]:

#latitude
df_n_places_m1_2.loc[df_n_places_m1_2['latitude_hut'].notnull(),'latitude'] = df_n_places_m1_2['latitude_hut']

In [834]:

#name
df_n_places_m1_2.loc[df_n_places_m1_2['name_hut'].notnull(),'name'] = df_n_places_m1_2['name_hut']

In [835]:

df_n_places_m1_2.columns

Out[835]:

Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type',
       'street', 'street_number_1', 'street_letter_1', 'street_number_2',
       'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door',
       'district_code', 'district_name', 'neighbourhood_code',
       'neighbourhood_name', 'longitude', 'latitude', 'n_places',
       'street_type_hut', 'street_hut', 'street_number_1_hut',
       'street_letter_1_hut', 'street_number_2_hut', 'street_letter_2_hut',
       'block_hut', 'entrance_hut', 'stair_hut', 'floor_hut', 'door_hut',
       'longitude_hut', 'latitude_hut', 'name_hut'],
      dtype='object')

In [836]:

#DROP 
df_n_places_m1_2.drop(columns=['street_type_hut', 'street_hut',
       'street_number_1_hut', 'street_letter_1_hut', 'street_number_2_hut',
       'street_letter_2_hut', 'block_hut', 'entrance_hut', 'stair_hut',
       'floor_hut', 'door_hut', 'longitude_hut', 'latitude_hut', 'name_hut'], inplace=True)

In [837]:

df_n_places_m1_3 = df_n_places_m1_2.copy()

MISSING VALUES

DISTRICT CODE, DISTRICT NAME, NEIGHBOURHOOD CODE, NEIGHBOURHOOD NAME

In [838]:

#CHECK IF THERE NULL VALUES BETWEEN 'district_name','district_code','neighbourhood_name','neighbourhood_code' - longitude IS USED TO LIMIT SEARCH ON ADDED VALUES
df_n_places_m1_3[['n_practice','rtc','address','district_name','district_code','neighbourhood_name','neighbourhood_code']][(df_n_places_m1_3['longitude'].notnull()) 
                                                                                                                           & ((df_n_places_m1_3['district_name'].isnull()) 
                                                                                                                           | (df_n_places_m1_3['district_code'].isnull()) 
                                                                                                                           | (df_n_places_m1_3['neighbourhood_name'].isnull()) 
                                                                                                                           | (df_n_places_m1_3['neighbourhood_code'].isnull()))]

Out[838]:

n_practicertcaddressdistrict_namedistrict_codeneighbourhood_nameneighbourhood_code

N_PLACES

In [839]:

#CHECK WHICH CATEGORIES HAVE MISSING VALUES OF INTEREST - N PLACES
df_n_places_m1_3['category'][(df_n_places_m1_3['n_places'].isnull())
                             | (df_n_places_m1_3['n_places']==None)
                             | (df_n_places_m1_3['n_places']=='nan')
                             | (df_n_places_m1_3['n_places']=='')].value_counts()

Out[839]:

Habitatges d'Ús Turístic    13
Name: category, dtype: int64

In [840]:

#FOCUS ON Habitatges d'Ús Turístic - SEE GENERAL FEATURES OF THE CATEGORY WITH MISSING DATA ON THE VARIABLE OF INTEREST - N_PLACES
df_n_places_m1_3['n_places'][df_n_places_m1_3['category']=="Habitatges d'Ús Turístic"].describe()

Out[840]:

count    9396.000000
mean        6.038101
std         3.499751
min         1.000000
25%         4.000000
50%         5.000000
75%         7.000000
max        79.000000
Name: n_places, dtype: float64

In [841]:

#REPLACE THE MISSING VALUES ON N_PLACES WITH THE MEDIAN FOR THE SAME CATEGORY
df_n_places_m1_3.loc[df_n_places_m1_3['category']=="Habitatges d'Ús Turístic",'n_places'] = df_n_places_m1_3.loc[df_n_places_m1_3['category']=="Habitatges d'Ús Turístic",'n_places'].fillna(df_n_places_m1_3.groupby('category')['n_places'].transform('median'))

In [842]:

#CHECK FILLING IN OF MISSING VALUES OF INTEREST - N PLACES
df_n_places_m1_3['category'][(df_n_places_m1_3['n_places'].isnull())
                             | (df_n_places_m1_3['n_places']==None)
                             | (df_n_places_m1_3['n_places']=='nan')
                             | (df_n_places_m1_3['n_places']=='')].value_counts()

Out[842]:

Series([], Name: category, dtype: int64)

The remaining missing values belong to another category. No action is taken now as it might be possible to recover that information later from the table referring to that category.

In [843]:

df_n_places_m1_3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9409 entries, 0 to 9408
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   n_practice          9409 non-null   object 
 1   rtc                 9409 non-null   object 
 2   name                9409 non-null   object 
 3   category            9409 non-null   object 
 4   address             9409 non-null   object 
 5   street_type         9409 non-null   object 
 6   street              9409 non-null   object 
 7   street_number_1     9409 non-null   object 
 8   street_letter_1     116 non-null    object 
 9   street_number_2     747 non-null    object 
 10  street_letter_2     3 non-null      object 
 11  block               10 non-null     object 
 12  entrance            3 non-null      object 
 13  stair               689 non-null    object 
 14  floor               9378 non-null   object 
 15  door                8528 non-null   object 
 16  district_code       9409 non-null   object 
 17  district_name       9409 non-null   object 
 18  neighbourhood_code  9409 non-null   object 
 19  neighbourhood_name  9409 non-null   object 
 20  longitude           9409 non-null   object 
 21  latitude            9409 non-null   object 
 22  n_places            9409 non-null   float64
dtypes: float64(1), object(22)
memory usage: 1.7+ MB

df_hut_n_places_coordinates

In [844]:

df_hut_n_places_coordinates = df_n_places_m1_3.copy()

MERGE 2 : df_n_places + df_hotel : _m2

In [845]:

df_n_places_m2_0 = df_n_places.copy()
df_n_places_m2_0.head(1)

Out[845]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
001-90-A-128HB-003893NoneHotel 3 estrellesGRAVINA 5 7nanGRAVINA5None7NoneNoneNone01Ciutat Vella01el RavalNoneNone86.0

1 rows × 23 columns

In [846]:

df_hotel_m2_0 = df_hotel.copy()
df_hotel_m2_0.head(1)

Out[846]:

street_number_2_hoteladdress_hotellongitude_hotellatitude_hotelcategory_hoteldistrict_name_hotelstreet_number_1_hoteldistrict_code_hotelneighbourhood_code_hotelname_hotelneighbourhood_name_hotelrtc_hotel
0NoneC Rambla2.17063883139540341.38514182378773Hotels 1 estr.Ciutat Vella1380102Hotel Toledanoel Barri GòticHB-000480

In [847]:

df_hotel_m2_0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 441 entries, 0 to 440
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   street_number_2_hotel     13 non-null     object
 1   address_hotel             441 non-null    object
 2   longitude_hotel           441 non-null    object
 3   latitude_hotel            441 non-null    object
 4   category_hotel            441 non-null    object
 5   district_name_hotel       441 non-null    object
 6   street_number_1_hotel     439 non-null    object
 7   district_code_hotel       441 non-null    object
 8   neighbourhood_code_hotel  441 non-null    object
 9   name_hotel                441 non-null    object
 10  neighbourhood_name_hotel  441 non-null    object
 11  rtc_hotel                 440 non-null    object
dtypes: object(12)
memory usage: 41.5+ KB

In [848]:

df_n_places_m2_0 = df_n_places_m2_0.merge(df_hotel_m2_0, how='inner', left_on=['rtc'], right_on=['rtc_hotel'])

In [849]:

df_n_places_m2_0[df_n_places_m2_0['rtc_hotel'].notnull()].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 435 entries, 0 to 434
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   n_practice                435 non-null    object 
 1   rtc                       435 non-null    object 
 2   name                      0 non-null      object 
 3   category                  435 non-null    object 
 4   address                   435 non-null    object 
 5   street_type               435 non-null    object 
 6   street                    435 non-null    object 
 7   street_number_1           435 non-null    object 
 8   street_letter_1           1 non-null      object 
 9   street_number_2           88 non-null     object 
 10  street_letter_2           0 non-null      object 
 11  block                     0 non-null      object 
 12  entrance                  0 non-null      object 
 13  stair                     0 non-null      object 
 14  floor                     50 non-null     object 
 15  door                      12 non-null     object 
 16  district_code             435 non-null    object 
 17  district_name             435 non-null    object 
 18  neighbourhood_code        435 non-null    object 
 19  neighbourhood_name        435 non-null    object 
 20  longitude                 0 non-null      object 
 21  latitude                  0 non-null      object 
 22  n_places                  435 non-null    float64
 23  street_number_2_hotel     10 non-null     object 
 24  address_hotel             435 non-null    object 
 25  longitude_hotel           435 non-null    object 
 26  latitude_hotel            435 non-null    object 
 27  category_hotel            435 non-null    object 
 28  district_name_hotel       435 non-null    object 
 29  street_number_1_hotel     433 non-null    object 
 30  district_code_hotel       435 non-null    object 
 31  neighbourhood_code_hotel  435 non-null    object 
 32  name_hotel                435 non-null    object 
 33  neighbourhood_name_hotel  435 non-null    object 
 34  rtc_hotel                 435 non-null    object 
dtypes: float64(1), object(34)
memory usage: 122.3+ KB

In [850]:

df_n_places_m2_1 = df_n_places_m2_0.copy()

REMAINING RECORDS

RTC

In [851]:

#RECORDS NOT INCLUDED
df_hotel_remaining = df_hotel_m2_0[(~df_hotel_m2_0['rtc_hotel'].isin(df_n_places_m2_1['rtc_hotel'])) 
                                   | (df_hotel_m2_0['rtc_hotel'].isnull())]
df_hotel_remaining

Out[851]:

street_number_2_hoteladdress_hotellongitude_hotellatitude_hotelcategory_hoteldistrict_name_hotelstreet_number_1_hoteldistrict_code_hotelneighbourhood_code_hotelname_hotelneighbourhood_name_hotelrtc_hotel
4132Avinguda del Tibidabo2.134788944114941741.41358309122107Hotels 5 estr.Sarrià-Sant Gervasi320525Hotel Boutique Mirlo BarcelonaSant Gervasi – la BonanovaHB-004948
142NoneC Nou de la Rambla2.166478732584707741.371682397631936Hotels 3 estr.Sants-Montjuïc1740311Hotel Brumellel Poble-secHB-004690
14784Ronda de Sant Antoni2.16397494742350241.38376584459594Hotels 4 estr.Ciutat Vella840101Hotel Antiga Casa Buenavistael RavalNone
220NoneC Hospital2.16901484977157541.38001981371473Hotels 3 estr.Ciutat Vella1010101Hotel Raval Houseel RavalHB-001213
244NoneAv Diagonal2.108853806831909641.38142529046899Hotels 5 estr.Les Corts6610420Hotel Rey Juan Carlos Ila Maternitat i Sant RamonHB-003961 *Temporalment tancat
37913Carrer de Casp2.169997367549532541.38898611306411Hotels 5 estr.Eixample10207Hotel ME Barcelonala Dreta de l’EixampleHB-004955

In [852]:

#VERIFY RECORDS BEFORE COLUMN DROP
df_n_places_m2_1[(df_n_places_m2_1['rtc']!=df_n_places_m2_1['rtc_hotel']) 
                 & (df_n_places_m2_1['rtc_hotel'].notnull())]

Out[852]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2longitude_hotellatitude_hotelcategory_hoteldistrict_name_hotelstreet_number_1_hoteldistrict_code_hotelneighbourhood_code_hotelname_hotelneighbourhood_name_hotelrtc_hotel

0 rows × 35 columns

In [853]:

df_n_places_m2_1.drop(columns='rtc_hotel', inplace=True)

COMPARISON

DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME

In [854]:

df_n_places_m2_2 = df_n_places_m2_1.copy()

In [855]:

df_n_places_m2_2[['rtc','address','neighbourhood_code','neighbourhood_name','neighbourhood_code_hotel','neighbourhood_name_hotel']][(df_n_places_m2_2['neighbourhood_name']!= df_n_places_m2_2['neighbourhood_name_hotel']) 
                                                                                                                                    & (df_n_places_m2_2['neighbourhood_name_hotel'].notnull())]

Out[855]:

rtcaddressneighbourhood_codeneighbourhood_nameneighbourhood_code_hotelneighbourhood_name_hotel
389HB-002726PLA PALAU 1902el Barri Gòtic04Sant Pere, Santa Caterina i la Ribera

MANUAL VERIFICATION FOR RECORDS ABOVE:

  • THE RIGHT NEIGHBOURHOOD NAME IS IN neighbourhood_name_hotel FOR “9469”

SOURCE:
https://ajuntament.barcelona.cat/estadistica/catala/Territori/div84/convertidors/barris73.htm

In [856]:

#FILL IN MISSING VALUES
df_n_places_m2_2.loc[df_n_places_m2_2['rtc']=='HB-002726','neighbourhood_code'] = "04"
df_n_places_m2_2.loc[df_n_places_m2_2['rtc']=='HB-002726','neighbourhood_name'] = "Sant Pere, Santa Caterina i la Ribera"
df_n_places_m2_2[['rtc','address','neighbourhood_code','neighbourhood_name','neighbourhood_code_hotel','neighbourhood_name_hotel']][df_n_places_m2_2['rtc']=='HB-002726']

Out[856]:

rtcaddressneighbourhood_codeneighbourhood_nameneighbourhood_code_hotelneighbourhood_name_hotel
389HB-002726PLA PALAU 1904Sant Pere, Santa Caterina i la Ribera04Sant Pere, Santa Caterina i la Ribera

In [857]:

df_n_places_m2_2.drop(columns=['neighbourhood_code_hotel','neighbourhood_name_hotel'], inplace=True)

In [858]:

df_n_places_m2_3 = df_n_places_m2_2.copy()

In [859]:

df_n_places_m2_3[['address','district_code','district_name','district_code_hotel','district_name_hotel']][(df_n_places_m2_3['district_name']!= df_n_places_m2_3['district_name_hotel']) 
                                                                                                                                    & (df_n_places_m2_3['district_name_hotel'].notnull())]

Out[859]:

addressdistrict_codedistrict_namedistrict_code_hoteldistrict_name_hotel

In [860]:

df_n_places_m2_3.drop(columns=['district_code_hotel','district_name_hotel'], inplace=True)

COLUMN FILLING

In [861]:

df_n_places_m2_4 = df_n_places_m2_3.copy()

In [862]:

df_n_places_m2_4.columns

Out[862]:

Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type',
       'street', 'street_number_1', 'street_letter_1', 'street_number_2',
       'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door',
       'district_code', 'district_name', 'neighbourhood_code',
       'neighbourhood_name', 'longitude', 'latitude', 'n_places',
       'street_number_2_hotel', 'address_hotel', 'longitude_hotel',
       'latitude_hotel', 'category_hotel', 'street_number_1_hotel',
       'name_hotel'],
      dtype='object')

In [863]:

#longitude
df_n_places_m2_4.loc[df_n_places_m2_4['longitude_hotel'].notnull(),'longitude'] = df_n_places_m2_4['longitude_hotel']

In [864]:

#latitude
df_n_places_m2_4.loc[df_n_places_m2_4['latitude_hotel'].notnull(),'latitude'] = df_n_places_m2_4['latitude_hotel']

In [865]:

#name
df_n_places_m2_4.loc[df_n_places_m2_4['name_hotel'].notnull(),'name'] = df_n_places_m2_4['name_hotel']

In [866]:

#DROP 
df_n_places_m2_4.drop(columns=['street_number_2_hotel', 'address_hotel', 'longitude_hotel',
       'latitude_hotel', 'category_hotel', 'street_number_1_hotel',
       'name_hotel'], inplace=True)

In [867]:

df_n_places_m2_4.columns

Out[867]:

Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type',
       'street', 'street_number_1', 'street_letter_1', 'street_number_2',
       'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door',
       'district_code', 'district_name', 'neighbourhood_code',
       'neighbourhood_name', 'longitude', 'latitude', 'n_places'],
      dtype='object')

In [868]:

df_n_places_m2_4.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 435 entries, 0 to 434
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   n_practice          435 non-null    object 
 1   rtc                 435 non-null    object 
 2   name                435 non-null    object 
 3   category            435 non-null    object 
 4   address             435 non-null    object 
 5   street_type         435 non-null    object 
 6   street              435 non-null    object 
 7   street_number_1     435 non-null    object 
 8   street_letter_1     1 non-null      object 
 9   street_number_2     88 non-null     object 
 10  street_letter_2     0 non-null      object 
 11  block               0 non-null      object 
 12  entrance            0 non-null      object 
 13  stair               0 non-null      object 
 14  floor               50 non-null     object 
 15  door                12 non-null     object 
 16  district_code       435 non-null    object 
 17  district_name       435 non-null    object 
 18  neighbourhood_code  435 non-null    object 
 19  neighbourhood_name  435 non-null    object 
 20  longitude           435 non-null    object 
 21  latitude            435 non-null    object 
 22  n_places            435 non-null    float64
dtypes: float64(1), object(22)
memory usage: 81.6+ KB

df_hotel_n_places_coordinates

In [869]:

df_hotel_n_places_coordinates = df_n_places_m2_4.copy()

MERGE 3 : df_n_places + df_hut + df_hotel + df_hostel : _m3

In [870]:

df_n_places_m3_0 = df_n_places.copy()
df_n_places_m3_0.head(1)

Out[870]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
001-90-A-128HB-003893NoneHotel 3 estrellesGRAVINA 5 7nanGRAVINA5None7NoneNoneNone01Ciutat Vella01el RavalNoneNone86.0

1 rows × 23 columns

In [871]:

df_hostel_m3_0 = df_hostel.copy()
df_hostel_m3_0.head(1)

Out[871]:

latitude_hostellongitude_hostelcategory_hosteldistrict_code_hosteldistrict_name_hostelneighbourhood_code_hostelneighbourhood_name_hosteladdress_hostelstreet_number_1_hostelstreet_number_2_hostelrtc_hostelname_hostel
041.39647766481012.175311353516649Pensions, hostals02Eixample07la Dreta de l’EixampleC Diputació346NoneHB-004497Hostal Hostalin Barcelona Diputació

In [872]:

df_hostel_m3_0.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 243 entries, 0 to 243
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   latitude_hostel            243 non-null    object
 1   longitude_hostel           243 non-null    object
 2   category_hostel            243 non-null    object
 3   district_code_hostel       243 non-null    object
 4   district_name_hostel       243 non-null    object
 5   neighbourhood_code_hostel  243 non-null    object
 6   neighbourhood_name_hostel  243 non-null    object
 7   address_hostel             243 non-null    object
 8   street_number_1_hostel     243 non-null    object
 9   street_number_2_hostel     243 non-null    object
 10  rtc_hostel                 241 non-null    object
 11  name_hostel                243 non-null    object
dtypes: object(12)
memory usage: 24.7+ KB

In [873]:

df_n_places_m3_0 = df_n_places_m3_0.merge(df_hostel_m3_0, how='inner', left_on=['rtc'], right_on=['rtc_hostel'])

In [874]:

df_n_places_m3_0.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 233 entries, 0 to 232
Data columns (total 35 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   n_practice                 233 non-null    object 
 1   rtc                        233 non-null    object 
 2   name                       0 non-null      object 
 3   category                   233 non-null    object 
 4   address                    233 non-null    object 
 5   street_type                233 non-null    object 
 6   street                     233 non-null    object 
 7   street_number_1            233 non-null    object 
 8   street_letter_1            3 non-null      object 
 9   street_number_2            23 non-null     object 
 10  street_letter_2            1 non-null      object 
 11  block                      3 non-null      object 
 12  entrance                   0 non-null      object 
 13  stair                      0 non-null      object 
 14  floor                      155 non-null    object 
 15  door                       99 non-null     object 
 16  district_code              233 non-null    object 
 17  district_name              233 non-null    object 
 18  neighbourhood_code         233 non-null    object 
 19  neighbourhood_name         233 non-null    object 
 20  longitude                  0 non-null      object 
 21  latitude                   0 non-null      object 
 22  n_places                   231 non-null    float64
 23  latitude_hostel            233 non-null    object 
 24  longitude_hostel           233 non-null    object 
 25  category_hostel            233 non-null    object 
 26  district_code_hostel       233 non-null    object 
 27  district_name_hostel       233 non-null    object 
 28  neighbourhood_code_hostel  233 non-null    object 
 29  neighbourhood_name_hostel  233 non-null    object 
 30  address_hostel             233 non-null    object 
 31  street_number_1_hostel     233 non-null    object 
 32  street_number_2_hostel     233 non-null    object 
 33  rtc_hostel                 233 non-null    object 
 34  name_hostel                233 non-null    object 
dtypes: float64(1), object(34)
memory usage: 65.5+ KB

In [875]:

df_n_places_m3_1 = df_n_places_m3_0.copy()

REMAINING RECORDS

RTC

In [876]:

#RECORDS NOT INCLUDED
df_hostel_remaining = df_hostel_m3_0[(~df_hostel_m3_0['rtc_hostel'].isin(df_n_places_m3_1['rtc'])) 
                                     | (df_hostel_m3_0['rtc_hostel'].isnull())]
df_hostel_remaining

Out[876]:

latitude_hostellongitude_hostelcategory_hosteldistrict_code_hosteldistrict_name_hostelneighbourhood_code_hostelneighbourhood_name_hosteladdress_hostelstreet_number_1_hostelstreet_number_2_hostelrtc_hostelname_hostel
1041.39422670390292.151298840171046Pensions, hostals02Eixample08l’Antiga Esquerra de l’EixampleAv Diagonal433NoneHB-004721Hostal Principal B&BCN
4341.4094424666482152.1831572948146354Pensions, hostals10Sant Martí64el Camp de l’Arpa del ClotC Mallorca537NoneHB-003943Pensió Gimón
4541.3974917472360742.1655695484910837Pensions, hostals02Eixample07la Dreta de l’EixampleC Bruc150NoneHB-004701Ally’s Guest House III
6241.40192760420572.1573692953635994Pensions, hostals06Gràcia31la Vila de GràciaC Torrent de l’Olla95NoneHB-002608Pensió Alberdi
12841.384632759745212.1774675697341084Pensions, hostals01Ciutat Vella02el Barri GòticPl Ramon Berenguer el Gran2NoneHB-004753Hostal The Moods Catedral
13341.428259995705452.1799199184547917Pensions, hostals08Nou Barris44Vilapicina i la Torre LlobetaC Malgrat40NoneHB-004741Hostal Lm Rooms Bcn
17141.3884742837453252.1602212330703807Pensions, hostals02Eixample08l’Antiga Esquerra de l’EixampleC Aragó222NoneHB-004565Tripledos
19641.385249262199112.1699190393632026Pensions, hostals01Ciutat Vella01el RavalC Rambla133NoneHB-001137Pensió Barcelona City Ramblas
22041.374188323768172.165555511750057Pensions, hostals03Sants-Montjuïc11el Poble-secC Poeta Cabanyes18NoneHB-004766Hostal Oliveta
22141.3938880044157142.171481971236943Pensions, hostals02Eixample07la Dreta de l’EixampleC Diputació327NoneNoneHostal Bed & Break
22541.432563718024952.1585842656810272Pensions, hostals07Horta-Guinardó43HortaC Chapí8385HB-004758Hostal Barcelona Nice & Cozy
22741.379195783894792.174445287874625Pensions, hostals01Ciutat Vella01el RavalC Nou de la Rambla1NoneNoneHostal Mimi Las Ramblas

In [877]:

#VERIFY RECORDS BEFORE COLUMN DROP
df_n_places_m3_1[(df_n_places_m3_1['rtc']!=df_n_places_m3_1['rtc_hostel']) & (df_n_places_m3_1['rtc_hostel'].notnull())]

Out[877]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2category_hosteldistrict_code_hosteldistrict_name_hostelneighbourhood_code_hostelneighbourhood_name_hosteladdress_hostelstreet_number_1_hostelstreet_number_2_hostelrtc_hostelname_hostel

0 rows × 35 columns

In [878]:

df_n_places_m3_1.drop(columns='rtc_hostel', inplace=True)

COMPARISON

DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME

In [879]:

df_n_places_m3_2 = df_n_places_m3_1.copy()

DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME
SOURCE:
https://ajuntament.barcelona.cat/estadistica/catala/Territori/div84/convertidors/barris73.htm

In [880]:

df_n_places_m3_2[['rtc','address','neighbourhood_code','neighbourhood_name','neighbourhood_code_hostel','neighbourhood_name_hostel']][(df_n_places_m3_2['neighbourhood_name']!= df_n_places_m3_2['neighbourhood_name_hostel']) 
                                                                                                                                      & (df_n_places_m3_2['neighbourhood_name_hostel'].notnull())]

Out[880]:

rtcaddressneighbourhood_codeneighbourhood_nameneighbourhood_code_hostelneighbourhood_name_hostel

In [881]:

df_n_places_m3_2[['rtc','address','district_code','district_name','district_code_hostel','district_name_hostel']][(df_n_places_m3_2['district_name']!= df_n_places_m3_2['district_name_hostel']) 
                                                                                                                           & (df_n_places_m3_2['district_name_hostel'].notnull())]

Out[881]:

rtcaddressdistrict_codedistrict_namedistrict_code_hosteldistrict_name_hostel

In [882]:

df_n_places_m3_2.drop(columns=['neighbourhood_code_hostel','neighbourhood_name_hostel','district_code_hostel','district_name_hostel'], inplace=True)

In [883]:

df_n_places_m3_3 = df_n_places_m3_2.copy()

COLUMN FILLING

In [884]:

df_n_places_m3_4 = df_n_places_m3_3.copy()

In [885]:

df_n_places_m3_4.columns

Out[885]:

Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type',
       'street', 'street_number_1', 'street_letter_1', 'street_number_2',
       'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door',
       'district_code', 'district_name', 'neighbourhood_code',
       'neighbourhood_name', 'longitude', 'latitude', 'n_places',
       'latitude_hostel', 'longitude_hostel', 'category_hostel',
       'address_hostel', 'street_number_1_hostel', 'street_number_2_hostel',
       'name_hostel'],
      dtype='object')

In [886]:

#longitude
df_n_places_m3_4.loc[df_n_places_m3_4['longitude_hostel'].notnull(),'longitude'] = df_n_places_m3_4['longitude_hostel']

In [887]:

#latitude
df_n_places_m3_4.loc[df_n_places_m3_4['latitude_hostel'].notnull(),'latitude'] = df_n_places_m3_4['latitude_hostel']

In [888]:

#name
df_n_places_m3_4.loc[df_n_places_m3_4['name_hostel'].notnull(),'name'] = df_n_places_m3_4['name_hostel']

In [889]:

#DROP 
df_n_places_m3_4.drop(columns=['latitude_hostel', 'longitude_hostel', 'category_hostel',
       'address_hostel', 'street_number_1_hostel', 'street_number_2_hostel',
       'name_hostel'], inplace=True)

In [890]:

df_n_places_m3_4.columns

Out[890]:

Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type',
       'street', 'street_number_1', 'street_letter_1', 'street_number_2',
       'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door',
       'district_code', 'district_name', 'neighbourhood_code',
       'neighbourhood_name', 'longitude', 'latitude', 'n_places'],
      dtype='object')

In [891]:

df_n_places_m3_4.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 233 entries, 0 to 232
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   n_practice          233 non-null    object 
 1   rtc                 233 non-null    object 
 2   name                233 non-null    object 
 3   category            233 non-null    object 
 4   address             233 non-null    object 
 5   street_type         233 non-null    object 
 6   street              233 non-null    object 
 7   street_number_1     233 non-null    object 
 8   street_letter_1     3 non-null      object 
 9   street_number_2     23 non-null     object 
 10  street_letter_2     1 non-null      object 
 11  block               3 non-null      object 
 12  entrance            0 non-null      object 
 13  stair               0 non-null      object 
 14  floor               155 non-null    object 
 15  door                99 non-null     object 
 16  district_code       233 non-null    object 
 17  district_name       233 non-null    object 
 18  neighbourhood_code  233 non-null    object 
 19  neighbourhood_name  233 non-null    object 
 20  longitude           233 non-null    object 
 21  latitude            233 non-null    object 
 22  n_places            231 non-null    float64
dtypes: float64(1), object(22)
memory usage: 43.7+ KB

df_hostel_n_places_coordinates

In [892]:

df_hostel_n_places_coordinates = df_n_places_m3_4.copy()

MERGE 4 : df_n_places + df_hut + df_hotel + df_hostel + df_touristapartments : _m4

In [893]:

df_n_places_m4_0 = df_n_places.copy()
df_n_places_m4_0.head(1)

Out[893]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
001-90-A-128HB-003893NoneHotel 3 estrellesGRAVINA 5 7nanGRAVINA5None7NoneNoneNone01Ciutat Vella01el RavalNoneNone86.0

1 rows × 23 columns

In [894]:

df_at_m4_0 = df_touristapartment.copy()
df_at_m4_0.head(1)

Out[894]:

latitude_atlongitude_atcategory_atdistrict_code_atdistrict_name_atneighbourhood_code_atneighbourhood_name_ataddress_name_atstreet_number_1_atstreet_number_2_atname_atrtc_at
041.380967272205212.1532132712982266Apartaments turístics02Eixample09la Nova Esquerra de l’EixampleC Calà bria129NoneApartament Turístic Atenea CalabriaATB-000001

In [895]:

df_n_places_m4_0 = df_n_places_m4_0.merge(df_at_m4_0, how='inner', left_on=['rtc'], right_on=['rtc_at'])

In [896]:

df_n_places_m4_0.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 0 to 11
Data columns (total 35 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   n_practice             12 non-null     object 
 1   rtc                    12 non-null     object 
 2   name                   0 non-null      object 
 3   category               12 non-null     object 
 4   address                12 non-null     object 
 5   street_type            12 non-null     object 
 6   street                 12 non-null     object 
 7   street_number_1        12 non-null     object 
 8   street_letter_1        0 non-null      object 
 9   street_number_2        2 non-null      object 
 10  street_letter_2        0 non-null      object 
 11  block                  0 non-null      object 
 12  entrance               0 non-null      object 
 13  stair                  0 non-null      object 
 14  floor                  1 non-null      object 
 15  door                   0 non-null      object 
 16  district_code          12 non-null     object 
 17  district_name          12 non-null     object 
 18  neighbourhood_code     12 non-null     object 
 19  neighbourhood_name     12 non-null     object 
 20  longitude              0 non-null      object 
 21  latitude               0 non-null      object 
 22  n_places               12 non-null     float64
 23  latitude_at            12 non-null     object 
 24  longitude_at           12 non-null     object 
 25  category_at            12 non-null     object 
 26  district_code_at       12 non-null     object 
 27  district_name_at       12 non-null     object 
 28  neighbourhood_code_at  12 non-null     object 
 29  neighbourhood_name_at  12 non-null     object 
 30  address_name_at        12 non-null     object 
 31  street_number_1_at     12 non-null     object 
 32  street_number_2_at     1 non-null      object 
 33  name_at                12 non-null     object 
 34  rtc_at                 12 non-null     object 
dtypes: float64(1), object(34)
memory usage: 3.4+ KB

In [897]:

df_n_places_m4_1 = df_n_places_m4_0.copy()

REMAINING RECORDS

RTC

In [898]:

#RECORDS NOT INCLUDED
df_touristapartment_remaining = df_at_m4_0[(~df_at_m4_0['rtc_at'].isin(df_n_places_m4_1['rtc_at'])) 
                                           | (df_at_m4_0['rtc_at'].isnull())]
df_touristapartment_remaining

Out[898]:

latitude_atlongitude_atcategory_atdistrict_code_atdistrict_name_atneighbourhood_code_atneighbourhood_name_ataddress_name_atstreet_number_1_atstreet_number_2_atname_atrtc_at

In [899]:

#VERIFY RECORDS BEFORE COLUMN DROP
df_n_places_m4_1[(df_n_places_m4_1['rtc']!=df_n_places_m4_1['rtc_at']) & (df_n_places_m4_1['rtc_at'].notnull())]

Out[899]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2category_atdistrict_code_atdistrict_name_atneighbourhood_code_atneighbourhood_name_ataddress_name_atstreet_number_1_atstreet_number_2_atname_atrtc_at

0 rows × 35 columns

In [900]:

df_n_places_m4_1.drop(columns='rtc_at', inplace=True)

COMPARISON

DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME

In [901]:

df_n_places_m4_2 = df_n_places_m4_1.copy()

DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME
SOURCE:
https://ajuntament.barcelona.cat/estadistica/catala/Territori/div84/convertidors/barris73.htm

In [902]:

df_n_places_m4_2[['address','neighbourhood_code','neighbourhood_name','neighbourhood_code_at','neighbourhood_name_at']][(df_n_places_m4_2['neighbourhood_name']!= df_n_places_m4_2['neighbourhood_name_at']) 
                                                                                                                        & (df_n_places_m4_2['neighbourhood_name_at'].notnull())]

Out[902]:

addressneighbourhood_codeneighbourhood_nameneighbourhood_code_atneighbourhood_name_at

In [903]:

df_n_places_m4_2[['address','district_code','district_name','district_code_at','district_name_at']][(df_n_places_m4_2['district_name']!= df_n_places_m4_2['district_name_at']) 
                                                                                                             & (df_n_places_m4_2['district_name_at'].notnull())]

Out[903]:

addressdistrict_codedistrict_namedistrict_code_atdistrict_name_at

In [904]:

df_n_places_m4_2.drop(columns=['district_code_at','district_name_at','neighbourhood_code_at','neighbourhood_name_at'], inplace=True)

In [905]:

df_n_places_m4_3 = df_n_places_m4_2.copy()

COLUMN FILLING

In [906]:

df_n_places_m4_4 = df_n_places_m4_3.copy()

In [907]:

df_n_places_m4_4.columns

Out[907]:

Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type',
       'street', 'street_number_1', 'street_letter_1', 'street_number_2',
       'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door',
       'district_code', 'district_name', 'neighbourhood_code',
       'neighbourhood_name', 'longitude', 'latitude', 'n_places',
       'latitude_at', 'longitude_at', 'category_at', 'address_name_at',
       'street_number_1_at', 'street_number_2_at', 'name_at'],
      dtype='object')

In [908]:

#longitude
df_n_places_m4_4.loc[df_n_places_m4_4['longitude_at'].notnull(),'longitude'] = df_n_places_m4_4['longitude_at']

In [909]:

#latitude
df_n_places_m4_4.loc[df_n_places_m4_4['latitude_at'].notnull(),'latitude'] = df_n_places_m4_4['latitude_at']

In [910]:

#name
df_n_places_m4_4.loc[df_n_places_m4_4['name_at'].notnull(),'name'] = df_n_places_m4_4['name_at']

In [911]:

#DROP 
df_n_places_m4_4.drop(columns=['latitude_at', 'longitude_at', 'category_at', 'address_name_at',
       'street_number_1_at', 'street_number_2_at', 'name_at'], inplace=True)

In [912]:

df_n_places_m4_4.columns

Out[912]:

Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type',
       'street', 'street_number_1', 'street_letter_1', 'street_number_2',
       'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door',
       'district_code', 'district_name', 'neighbourhood_code',
       'neighbourhood_name', 'longitude', 'latitude', 'n_places'],
      dtype='object')

In [913]:

df_n_places_m4_4.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 0 to 11
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   n_practice          12 non-null     object 
 1   rtc                 12 non-null     object 
 2   name                12 non-null     object 
 3   category            12 non-null     object 
 4   address             12 non-null     object 
 5   street_type         12 non-null     object 
 6   street              12 non-null     object 
 7   street_number_1     12 non-null     object 
 8   street_letter_1     0 non-null      object 
 9   street_number_2     2 non-null      object 
 10  street_letter_2     0 non-null      object 
 11  block               0 non-null      object 
 12  entrance            0 non-null      object 
 13  stair               0 non-null      object 
 14  floor               1 non-null      object 
 15  door                0 non-null      object 
 16  district_code       12 non-null     object 
 17  district_name       12 non-null     object 
 18  neighbourhood_code  12 non-null     object 
 19  neighbourhood_name  12 non-null     object 
 20  longitude           12 non-null     object 
 21  latitude            12 non-null     object 
 22  n_places            12 non-null     float64
dtypes: float64(1), object(22)
memory usage: 2.2+ KB

df_touristapartment_n_places_coordinates

In [914]:

df_touristapartment_n_places_coordinates = df_n_places_m4_4.copy()

MERGE 5 : df_n_places + df_hut + df_hotel + df_hostel + df_touristapartments + df_albergs : _m5

In [915]:

df_n_places_m5_0 = df_n_places.copy()
df_n_places_m5_0.head(1)

Out[915]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
001-90-A-128HB-003893NoneHotel 3 estrellesGRAVINA 5 7nanGRAVINA5None7NoneNoneNone01Ciutat Vella01el RavalNoneNone86.0

1 rows × 23 columns

In [916]:

df_al_m5_0 = df_alberg.copy()
df_al_m5_0.head(1)

Out[916]:

latitude_allongitude_alcategory_aldistrict_code_aldistrict_name_alneighbourhood_code_alneighbourhood_name_aladdress_name_alstreet_number_1_alstreet_number_2_alname_alrtc_alrtc_al_modified
641.391161663950022.184353207057929Albergs10Sant Martí66el Parc i la Llacuna del PoblenouC Buenaventura Muñoz16NoneAlberg Arc HouseAJ000645ALB-645

In [917]:

df_n_places_m5_0 = df_n_places_m5_0.merge(df_al_m5_0, how='inner', left_on=['rtc'], right_on=['rtc_al_modified'])

In [918]:

df_n_places_m5_0.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 114 entries, 0 to 113
Data columns (total 36 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   n_practice             114 non-null    object 
 1   rtc                    114 non-null    object 
 2   name                   0 non-null      object 
 3   category               114 non-null    object 
 4   address                114 non-null    object 
 5   street_type            114 non-null    object 
 6   street                 114 non-null    object 
 7   street_number_1        114 non-null    object 
 8   street_letter_1        3 non-null      object 
 9   street_number_2        20 non-null     object 
 10  street_letter_2        0 non-null      object 
 11  block                  0 non-null      object 
 12  entrance               0 non-null      object 
 13  stair                  1 non-null      object 
 14  floor                  70 non-null     object 
 15  door                   33 non-null     object 
 16  district_code          114 non-null    object 
 17  district_name          114 non-null    object 
 18  neighbourhood_code     114 non-null    object 
 19  neighbourhood_name     114 non-null    object 
 20  longitude              0 non-null      object 
 21  latitude               0 non-null      object 
 22  n_places               114 non-null    float64
 23  latitude_al            114 non-null    object 
 24  longitude_al           114 non-null    object 
 25  category_al            114 non-null    object 
 26  district_code_al       114 non-null    object 
 27  district_name_al       114 non-null    object 
 28  neighbourhood_code_al  114 non-null    object 
 29  neighbourhood_name_al  114 non-null    object 
 30  address_name_al        114 non-null    object 
 31  street_number_1_al     114 non-null    object 
 32  street_number_2_al     15 non-null     object 
 33  name_al                114 non-null    object 
 34  rtc_al                 114 non-null    object 
 35  rtc_al_modified        114 non-null    object 
dtypes: float64(1), object(35)
memory usage: 33.0+ KB

In [919]:

df_n_places_m5_1 = df_n_places_m5_0.copy()

REMAINING RECORDS

RTC

In [920]:

#RECORDS NOT INCLUDED
df_albergs_remaining = df_al_m5_0[(~df_al_m5_0['rtc_al_modified'].isin(df_n_places_m5_1['rtc_al_modified'])) 
                                  | (df_al_m5_0['rtc_al_modified'].isnull())]
df_albergs_remaining

Out[920]:

latitude_allongitude_alcategory_aldistrict_code_aldistrict_name_alneighbourhood_code_alneighbourhood_name_aladdress_name_alstreet_number_1_alstreet_number_2_alname_alrtc_alrtc_al_modified
4241.391400983206382.1578428777063454Albergs02Eixample08l’Antiga Esquerra de l’EixampleC Enric Granados52NoneAfter HostelAJ000543ALB-543
6041.399796179594052.118829605855649Albergs05Sarrià-Sant Gervasi23SarriàC Duquessa d’Orleans56NoneAlberg Studio HostelAJ000015ALB-015
9741.38051857652982.1707195328025053Albergs01Ciutat Vella01el RavalC Hospital63NoneAlberg Center RamblesAJ000398ALB-398
10141.388632420569852.144502668709701Albergs02Eixample09la Nova Esquerra de l’EixampleC Londres20NoneFree Hostels BarcelonaAJ000614ALB-614
11641.3915457083610362.1615865918643533Albergs02Eixample07la Dreta de l’EixampleC València233NoneTierra Azul HostelAJ000557ALB-557
14141.4161894968321752.1468663071393794Albergs06Gràcia28Vallcarca i els PenitentsPg Mare de Déu del Coll4151Casa MarsansAlberg Mare de Déu de Montserrat – AJ000084Alberg Mare de Déu de Montserrat – ALB-084
17741.381451642379752.175475468533943Albergs01Ciutat Vella02el Barri GòticC Ferran31NoneAlberg FernandoAJ000419ALB-419
18341.388118060970952.13384878071623Albergs04Les Corts19les CortsC Numà ncia149151Alberg Pere TarrésAJ000070ALB-070

In [921]:

#VERIFY RECORDS BEFORE COLUMN DROP
df_n_places_m5_1[(df_n_places_m5_1['rtc']!=df_n_places_m5_1['rtc_al_modified']) & (df_n_places_m5_1['rtc_al_modified'].notnull())]

Out[921]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2district_code_aldistrict_name_alneighbourhood_code_alneighbourhood_name_aladdress_name_alstreet_number_1_alstreet_number_2_alname_alrtc_alrtc_al_modified

0 rows × 36 columns

In [922]:

df_n_places_m5_1.drop(columns=['rtc_al','rtc_al_modified'], inplace=True)

COMPARISON

DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME

In [923]:

df_n_places_m5_2 = df_n_places_m5_1.copy()

DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME
SOURCE:
https://ajuntament.barcelona.cat/estadistica/catala/Territori/div84/convertidors/barris73.htm

In [924]:

df_n_places_m5_2[['address','neighbourhood_code','neighbourhood_name','neighbourhood_code_al','neighbourhood_name_al']][(df_n_places_m5_2['neighbourhood_name']!= df_n_places_m5_2['neighbourhood_name_al']) 
                                                                                                                        & (df_n_places_m5_2['neighbourhood_name_al'].notnull())]

Out[924]:

addressneighbourhood_codeneighbourhood_nameneighbourhood_code_alneighbourhood_name_al

In [925]:

df_n_places_m5_2[['address','district_code','district_name','district_code_al','district_name_al']][(df_n_places_m5_2['district_name']!= df_n_places_m5_2['district_name_al']) 
                                                                                                             & (df_n_places_m5_2['district_name_al'].notnull())]

Out[925]:

addressdistrict_codedistrict_namedistrict_code_aldistrict_name_al

In [926]:

df_n_places_m5_2.drop(columns=['district_code_al','district_name_al','neighbourhood_code_al','neighbourhood_name_al'], inplace=True)

In [927]:

df_n_places_m5_3 = df_n_places_m5_2.copy()

COLUMN FILLING

In [928]:

df_n_places_m5_3.columns

Out[928]:

Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type',
       'street', 'street_number_1', 'street_letter_1', 'street_number_2',
       'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door',
       'district_code', 'district_name', 'neighbourhood_code',
       'neighbourhood_name', 'longitude', 'latitude', 'n_places',
       'latitude_al', 'longitude_al', 'category_al', 'address_name_al',
       'street_number_1_al', 'street_number_2_al', 'name_al'],
      dtype='object')

In [929]:

#longitude
df_n_places_m5_3.loc[df_n_places_m5_3['longitude_al'].notnull(),'longitude'] = df_n_places_m5_3['longitude_al']

In [930]:

#latitude
df_n_places_m5_3.loc[df_n_places_m5_3['latitude_al'].notnull(),'latitude'] = df_n_places_m5_3['latitude_al']

In [931]:

#name
df_n_places_m5_3.loc[df_n_places_m5_3['name_al'].notnull(),'name'] = df_n_places_m5_3['name_al']

In [932]:

#DROP 
df_n_places_m5_3.drop(columns=['latitude_al', 'longitude_al', 'category_al', 'address_name_al',
       'street_number_1_al', 'street_number_2_al', 'name_al'], inplace=True)

In [933]:

df_n_places_m5_3.columns

Out[933]:

Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type',
       'street', 'street_number_1', 'street_letter_1', 'street_number_2',
       'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door',
       'district_code', 'district_name', 'neighbourhood_code',
       'neighbourhood_name', 'longitude', 'latitude', 'n_places'],
      dtype='object')

In [934]:

df_n_places_m5_3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 114 entries, 0 to 113
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   n_practice          114 non-null    object 
 1   rtc                 114 non-null    object 
 2   name                114 non-null    object 
 3   category            114 non-null    object 
 4   address             114 non-null    object 
 5   street_type         114 non-null    object 
 6   street              114 non-null    object 
 7   street_number_1     114 non-null    object 
 8   street_letter_1     3 non-null      object 
 9   street_number_2     20 non-null     object 
 10  street_letter_2     0 non-null      object 
 11  block               0 non-null      object 
 12  entrance            0 non-null      object 
 13  stair               1 non-null      object 
 14  floor               70 non-null     object 
 15  door                33 non-null     object 
 16  district_code       114 non-null    object 
 17  district_name       114 non-null    object 
 18  neighbourhood_code  114 non-null    object 
 19  neighbourhood_name  114 non-null    object 
 20  longitude           114 non-null    object 
 21  latitude            114 non-null    object 
 22  n_places            114 non-null    float64
dtypes: float64(1), object(22)
memory usage: 21.4+ KB

df_albergs_n_places_coordinates

In [935]:

df_albergs_n_places_coordinates = df_n_places_m5_3.copy()

FINAL DATAFRAME: TOURIST ESTABLISHMENTS COORDINATES AND N_PLACES

CONCATENATE DATAFRAMES

In [936]:

df_hut_n_places_coordinates.shape[0]

Out[936]:

9409

In [937]:

df_hotel_n_places_coordinates.shape[0]

Out[937]:

435

In [938]:

df_hostel_n_places_coordinates.shape[0]

Out[938]:

233

In [939]:

df_touristapartment_n_places_coordinates.shape[0]

Out[939]:

12

In [940]:

df_albergs_n_places_coordinates.shape[0]

Out[940]:

114

In [941]:

dataframe_list_concat = [df_hut_n_places_coordinates,
                         df_hotel_n_places_coordinates,
                         df_hostel_n_places_coordinates,
                         df_touristapartment_n_places_coordinates,
                         df_albergs_n_places_coordinates]
df_final_cleaning = pd.DataFrame()
for i in dataframe_list_concat:
    df_final_cleaning = pd.concat((df_final_cleaning,i), ignore_index=True)

In [942]:

df_final_cleaning1 = df_final_cleaning.copy()
df_final_cleaning1.shape[0]

Out[942]:

10203

DUPLICATES

In [943]:

df_final_cleaning1.duplicated(keep=False).value_counts()

Out[943]:

False    10203
dtype: int64

In [944]:

#CHECK ON ID COLUMNS
df_final_cleaning1[df_final_cleaning1.duplicated(subset=['n_practice','rtc'], keep=False)].sort_values('name')

Out[944]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places

0 rows × 23 columns

In [945]:

#CHECK ON SINGLE ID COLUMN
df_final_cleaning1[df_final_cleaning1.duplicated(subset=['n_practice'], keep=False)].sort_values('name')

Out[945]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
1012306-2010-0423ALB-556Alberg Generator BarcelonaAlbergsC CORSEGA 373 375CCORSEGA373None375NoneNoneNone06Gràcia31la Vila de Gràcia2.16263412979599241.39918207998624646.0
1018706-2017-0212ALB-565Casa Gracia Barcelona HostelAlbergsPG GRACIA 116PGGRACIA116NoneNoneNoneNoneNone06Gràcia31la Vila de Gràcia2.15912725434730641.397281369714406446.0
1005406-2017-0212HB-004682Hostal Casa Grà ciaHotel 1 estrellaPG GRACIA 116PGGRACIA116NoneNoneNoneNoneNone06Gràcia31la Vila de Gràcia2.15912725434730641.39728136971440623.0
992106-2010-0423HB-004525Hostal Generator BarcelonaHotel 1 estrellaC CORSEGA 373 375CCORSEGA373None375NoneNoneNone06Gràcia31la Vila de Gràcia2.162691438405667541.39922620097491581.0
964210-2001-0694HB-004358Hotel Melia Barcelona SkyHotel 4 estrelles superiorC PERE IV 272CPERE IV272NoneNoneNoneNoneNone10Sant Martí68el Poblenou2.20063479775435641.406288190178934430.0
974202-2015-0048HB-004629Hotel TOC Hostel BarcelonaHotel 1 estrellaG.V. CORTS CATALANES 580 BJG.V.CORTS CATALANES580NoneNoneNoneBJNone02Eixample10Sant Antoni2.16250454967639141.3847827057068142.0
964110-2001-0694HB-004532Hotel The Level at Melia Barcelona SkyHotel 5 estrellesC PERE IV 272CPERE IV272NoneNoneNoneNoneNone10Sant Martí68el Poblenou2.201019518710454641.40624733873393486.0
1018402-2015-0048ALB-625Toc Hostel BarcelonaAlbergsG.V. CORTS CATALANES 580 BJG.V.CORTS CATALANES580NoneNoneNoneBJNone02Eixample10Sant Antoni2.16250454967639141.38478270570681216.0

8 rows × 23 columns

The records above appear to be related to cases where establishments have sections belonging to different categories within the same establishment – therefore, these are not dropped

In [946]:

#CHECK ON SINGLE ID COLUMN - WITH 'PENDENT' (PENDING) VALUES EXCLUDED
df_final_cleaning1[(df_final_cleaning1.duplicated(subset=['rtc'], keep=False)) & (df_final_cleaning1['rtc']!= 'Pendent')].sort_values('name')

Out[946]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
993303-1999-0004HB-000957Pensió IniestaPensióC FONTRODONA 1 1 1CFONTRODONA1NoneNoneNone1103Sants-Montjuïc11el Poble-sec2.16783186592865141.374727303453646NaN
993403-1998-0472HB-000957Pensió IniestaPensióC FONTRODONA 1 2 3CFONTRODONA1NoneNoneNone2303Sants-Montjuïc11el Poble-sec2.16783186592865141.374727303453646NaN
993503-2002-0222HB-000957Pensió IniestaPensióC FONTRODONA 1 EN 3CFONTRODONA1NoneNoneNoneEN303Sants-Montjuïc11el Poble-sec2.16783186592865141.37472730345364623.0
1009005-2013-0284ALB-562Wow Hostel BarcelonaAlbergsAV DIAGONAL 578 3AVDIAGONAL578NoneNoneNone3None05Sarrià-Sant Gervasi26Sant Gervasi – Galvany2.14821320021261541.3939397417392519.0
1009105-2016-0268ALB-562Wow Hostel BarcelonaAlbergsAV DIAGONAL 578 5AVDIAGONAL578NoneNoneNone5None05Sarrià-Sant Gervasi26Sant Gervasi – Galvany2.14821320021261541.3939397417392519.0

5 rows × 23 columns

The records above refer to different floors within the same establishments.
However, only in one case the variable of interest – n_places – is indicated for all floors.
In the other cases, the varible of interest – n_places – is indicated only in one of the floors.

Records are not modified.

MISSING VALUES

In [947]:

df_final_cleaning2 = df_final_cleaning1.copy()

In [948]:

df_final_cleaning2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10203 entries, 0 to 10202
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   n_practice          10203 non-null  object 
 1   rtc                 10203 non-null  object 
 2   name                10203 non-null  object 
 3   category            10203 non-null  object 
 4   address             10203 non-null  object 
 5   street_type         10203 non-null  object 
 6   street              10203 non-null  object 
 7   street_number_1     10203 non-null  object 
 8   street_letter_1     123 non-null    object 
 9   street_number_2     880 non-null    object 
 10  street_letter_2     4 non-null      object 
 11  block               13 non-null     object 
 12  entrance            3 non-null      object 
 13  stair               690 non-null    object 
 14  floor               9654 non-null   object 
 15  door                8672 non-null   object 
 16  district_code       10203 non-null  object 
 17  district_name       10203 non-null  object 
 18  neighbourhood_code  10203 non-null  object 
 19  neighbourhood_name  10203 non-null  object 
 20  longitude           10203 non-null  object 
 21  latitude            10203 non-null  object 
 22  n_places            10201 non-null  float64
dtypes: float64(1), object(22)
memory usage: 1.8+ MB

N_PLACES

In [949]:

df_final_cleaning2[(df_final_cleaning2['n_places'].isnull()) 
                   | (df_final_cleaning2['n_places']=='') 
                   | (df_final_cleaning2['n_places']=='nan')
                   | (df_final_cleaning2['n_places']==None)]

Out[949]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
993303-1999-0004HB-000957Pensió IniestaPensióC FONTRODONA 1 1 1CFONTRODONA1NoneNoneNone1103Sants-Montjuïc11el Poble-sec2.16783186592865141.374727303453646NaN
993403-1998-0472HB-000957Pensió IniestaPensióC FONTRODONA 1 2 3CFONTRODONA1NoneNoneNone2303Sants-Montjuïc11el Poble-sec2.16783186592865141.374727303453646NaN

2 rows × 23 columns

LONGITUDE

In [950]:

df_final_cleaning2[(df_final_cleaning2['longitude'].isnull()) 
                   | (df_final_cleaning2['longitude']=='') 
                   | (df_final_cleaning2['longitude']=='nan')
                   | (df_final_cleaning2['longitude']==None)]

Out[950]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places

0 rows × 23 columns

LATITUDE

In [951]:

df_final_cleaning2[(df_final_cleaning2['latitude'].isnull()) 
                   | (df_final_cleaning2['latitude']=='') 
                   | (df_final_cleaning2['latitude']=='nan')
                   | (df_final_cleaning2['latitude']==None)]

Out[951]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places

0 rows × 23 columns

DISTRICT – NEIGHBOURHOOD

In [952]:

df_final_cleaning2[(df_final_cleaning2['neighbourhood_code'].isnull()) 
                   | (df_final_cleaning2['neighbourhood_code']=='') 
                   | (df_final_cleaning2['neighbourhood_code']=='nan')
                   | (df_final_cleaning2['neighbourhood_code']==None)]

Out[952]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places

0 rows × 23 columns

In [953]:

df_final_cleaning2[(df_final_cleaning2['district_code'].isnull()) 
                   | (df_final_cleaning2['district_code']=='') 
                   | (df_final_cleaning2['district_code']=='nan')
                   | (df_final_cleaning2['district_code']==None)]

Out[953]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places

0 rows × 23 columns

In [954]:

df_final_cleaning2.head()

Out[954]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
007-2013-0168HUTB-007570Habitatges d’Ús TurísticHabitatges d’Ús TurísticAV CAN BARO 22 3 1AVCAN BARO22NoneNoneNone3107Horta-Guinardó34Can Baró2.16071988641.414059953.0
107-2014-0121HUTB-009724Habitatges d’Ús TurísticHabitatges d’Ús TurísticAV CAN BARO 3 1 2AVCAN BARO3NoneNoneNone1207Horta-Guinardó34Can Baró2.16000264841.413494844.0
207-2014-0161HUTB-010707Habitatges d’Ús TurísticHabitatges d’Ús TurísticAV CAN BARO 3 1 3AVCAN BARO3NoneNoneNone1307Horta-Guinardó34Can Baró2.16000264841.413494844.0
307-2014-0120HUTB-009725Habitatges d’Ús TurísticHabitatges d’Ús TurísticAV CAN BARO 3 1 4AVCAN BARO3NoneNoneNone1407Horta-Guinardó34Can Baró2.16000264841.413494844.0
407-2012-0231HUTB-002942Habitatges d’Ús TurísticHabitatges d’Ús TurísticAV CAN BARO 3 PR 2AVCAN BARO3NoneNoneNonePR207Horta-Guinardó34Can Baró2.16000264841.413494842.0

5 rows × 23 columns

In [955]:

df_final_cleaning3 = df_final_cleaning2.copy()

NORMALIZATION

In [956]:

df_final_cleaning3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10203 entries, 0 to 10202
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   n_practice          10203 non-null  object 
 1   rtc                 10203 non-null  object 
 2   name                10203 non-null  object 
 3   category            10203 non-null  object 
 4   address             10203 non-null  object 
 5   street_type         10203 non-null  object 
 6   street              10203 non-null  object 
 7   street_number_1     10203 non-null  object 
 8   street_letter_1     123 non-null    object 
 9   street_number_2     880 non-null    object 
 10  street_letter_2     4 non-null      object 
 11  block               13 non-null     object 
 12  entrance            3 non-null      object 
 13  stair               690 non-null    object 
 14  floor               9654 non-null   object 
 15  door                8672 non-null   object 
 16  district_code       10203 non-null  object 
 17  district_name       10203 non-null  object 
 18  neighbourhood_code  10203 non-null  object 
 19  neighbourhood_name  10203 non-null  object 
 20  longitude           10203 non-null  object 
 21  latitude            10203 non-null  object 
 22  n_places            10201 non-null  float64
dtypes: float64(1), object(22)
memory usage: 1.8+ MB

In [957]:

df_final = df_final_cleaning3.copy()
df_final.head()

Out[957]:

n_practicertcnamecategoryaddressstreet_typestreetstreet_number_1street_letter_1street_number_2stairfloordoordistrict_codedistrict_nameneighbourhood_codeneighbourhood_namelongitudelatituden_places
007-2013-0168HUTB-007570Habitatges d’Ús TurísticHabitatges d’Ús TurísticAV CAN BARO 22 3 1AVCAN BARO22NoneNoneNone3107Horta-Guinardó34Can Baró2.16071988641.414059953.0
107-2014-0121HUTB-009724Habitatges d’Ús TurísticHabitatges d’Ús TurísticAV CAN BARO 3 1 2AVCAN BARO3NoneNoneNone1207Horta-Guinardó34Can Baró2.16000264841.413494844.0
207-2014-0161HUTB-010707Habitatges d’Ús TurísticHabitatges d’Ús TurísticAV CAN BARO 3 1 3AVCAN BARO3NoneNoneNone1307Horta-Guinardó34Can Baró2.16000264841.413494844.0
307-2014-0120HUTB-009725Habitatges d’Ús TurísticHabitatges d’Ús TurísticAV CAN BARO 3 1 4AVCAN BARO3NoneNoneNone1407Horta-Guinardó34Can Baró2.16000264841.413494844.0
407-2012-0231HUTB-002942Habitatges d’Ús TurísticHabitatges d’Ús TurísticAV CAN BARO 3 PR 2AVCAN BARO3NoneNoneNonePR207Horta-Guinardó34Can Baró2.16000264841.413494842.0

5 rows × 23 columns

DATA EXPORT TO EXCEL FILE

In [958]:

#NAME EXCEL EXPORT FILE
year = 2022
prefix = 'TOURIST_LODGINGS_PD_ONLY'
excel_file_name = 'T_{}_YEAR_{}.xlsx'.format(prefix,year)
excel_sheet_name = '{}'.format(prefix)
excel_index_label = '{}_INDEX'.format(prefix)
df_final.to_excel(excel_file_name, sheet_name= excel_sheet_name, index_label=excel_index_label)