Python + SQL
Pandas, Requests, Beautiful Soup, Google Big Query, DuckDB
SQL – BIGQUERY – JUPYTER LAB – Barcelona Number of Foreign Tourists by Neighbourhood per Day
SQL – BIGQUERY¶
Number of National and Foreign TOURISTS per DAY, Lodging Category and Neighbourhood in 2022
REQUIRED LIBRARIES
In [103]:
import pandas as pd from pandas.io import gbq import duckdb
In [2]:
project_id=''
INFORMATION_SCHEMA.SCHEMATA¶
In [3]:
df_information_schema_schemata = gbq.read_gbq(query='SELECT schema_name FROM region-us.INFORMATION_SCHEMA.SCHEMATA;', project_id=project_id) df_information_schema_schemata
Out[3]:
schema_name | |
---|---|
0 | TOURISM |
1 | REAL_ESTATE |
2 | DICTIONARY |
3 | DIVERSITY |
INFORMATION_SCHEMA.TABLES – TOURISM¶
In [4]:
df_information_schema_tables_tourism = gbq.read_gbq(query='SELECT table_name FROM TOURISM.INFORMATION_SCHEMA.TABLES;') df_information_schema_tables_tourism
Out[4]:
table_name | |
---|---|
0 | TL |
1 | TT |
2 | FT |
In [5]:
df_information_schema_tables_tourism_tl = gbq.read_gbq(query='SELECT table_schema,table_name,column_name,data_type FROM TOURISM.INFORMATION_SCHEMA.COLUMNS WHERE table_name = "TL";') df_information_schema_tables_tourism_tl
Out[5]:
table_schema | table_name | column_name | data_type | |
---|---|---|---|---|
0 | TOURISM | TL | TOURIST_LODGINGS_PD_ONLY_INDEX | INT64 |
1 | TOURISM | TL | n_practice | STRING |
2 | TOURISM | TL | rtc | STRING |
3 | TOURISM | TL | name | STRING |
4 | TOURISM | TL | category | STRING |
5 | TOURISM | TL | address | STRING |
6 | TOURISM | TL | street_type | STRING |
7 | TOURISM | TL | street | STRING |
8 | TOURISM | TL | street_number_1 | INT64 |
9 | TOURISM | TL | street_letter_1 | STRING |
10 | TOURISM | TL | street_number_2 | INT64 |
11 | TOURISM | TL | street_letter_2 | STRING |
12 | TOURISM | TL | block | STRING |
13 | TOURISM | TL | entrance | STRING |
14 | TOURISM | TL | stair | STRING |
15 | TOURISM | TL | floor | STRING |
16 | TOURISM | TL | door | STRING |
17 | TOURISM | TL | district_code | INT64 |
18 | TOURISM | TL | district_name | STRING |
19 | TOURISM | TL | neighbourhood_code | INT64 |
20 | TOURISM | TL | neighbourhood_name | STRING |
21 | TOURISM | TL | longitude | FLOAT64 |
22 | TOURISM | TL | latitude | FLOAT64 |
23 | TOURISM | TL | n_places | INT64 |
In [6]:
df_information_schema_tables_tourism_tt = gbq.read_gbq(query='SELECT table_schema,table_name,column_name,data_type FROM TOURISM.INFORMATION_SCHEMA.COLUMNS WHERE table_name = "TT";') df_information_schema_tables_tourism_tt
Out[6]:
table_schema | table_name | column_name | data_type | |
---|---|---|---|---|
0 | TOURISM | TT | year | INT64 |
1 | TOURISM | TT | month | INT64 |
2 | TOURISM | TT | lodging_type | STRING |
3 | TOURISM | TT | n_tourists | INT64 |
4 | TOURISM | TT | overnight_stays | INT64 |
5 | TOURISM | TT | average_lenght_stay | FLOAT64 |
In [7]:
df_information_schema_tables_tourism_ft = gbq.read_gbq(query='SELECT table_schema,table_name,column_name,data_type FROM TOURISM.INFORMATION_SCHEMA.COLUMNS WHERE table_name = "FT";') df_information_schema_tables_tourism_ft
Out[7]:
table_schema | table_name | column_name | data_type | |
---|---|---|---|---|
0 | TOURISM | FT | FT_INDEX | INT64 |
1 | TOURISM | FT | FT_YEAR | INT64 |
2 | TOURISM | FT | FT_MONTH | INT64 |
3 | TOURISM | FT | FT_COUNTRY | STRING |
4 | TOURISM | FT | FT_N_TOURISTS | INT64 |
SOURCE DATAFRAMES¶
In [8]:
df_tourism_tl = gbq.read_gbq(query='SELECT * FROM TOURISM.TL;') df_tourism_tl.head(1)
Out[8]:
TOURIST_LODGINGS_PD_ONLY_INDEX | n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9409 | 01-90-A-128 | HB-003893 | Hotel Reding Croma | Hotel 3 estrelles | GRAVINA 5 7 | nan | GRAVINA | 5 | None | … | None | None | None | 1 | Ciutat Vella | 1 | el Raval | 2.165669 | 41.385328 | 86 |
1 rows × 24 columns
In [9]:
df_tl = df_tourism_tl.copy()
In [10]:
df_tourism_tt = gbq.read_gbq(query='SELECT * FROM TOURISM.TT;') df_tourism_tt.head(1)
Out[10]:
year | month | lodging_type | n_tourists | overnight_stays | average_lenght_stay | |
---|---|---|---|---|---|---|
0 | 2022 | 1 | hotel | 301474 | 745224 | 2.471935 |
In [11]:
df_tt = df_tourism_tt.copy()
In [12]:
df_tourism_ft = gbq.read_gbq(query='SELECT * FROM TOURISM.FT;') df_tourism_ft.head(1)
Out[12]:
FT_INDEX | FT_YEAR | FT_MONTH | FT_COUNTRY | FT_N_TOURISTS | |
---|---|---|---|---|---|
0 | 0 | 2022 | 1 | France | 39625 |
In [13]:
df_ft = df_tourism_ft.copy()
SQL¶
ESTIMATING THE NUMBER OF NATIONAL AND FOREIGN TOURISTS PER DAY IN 2022
SOURCE:
https://www.observatoriturisme.barcelona/en/destination-barcelona-tourism-activity-latest-data
In [107]:
q = """ SELECT year as 'year_x',month as 'month_x',SUM(n_tourists) as 'TOTAL_N_TOURISTS_PER_MONTH' FROM df_tt GROUP BY month, year ;""" df_total_n_tourists_per_month_2022 = duckdb.query(q).df() df_total_n_tourists_per_month_2022
Out[107]:
year_x | month_x | TOTAL_N_TOURISTS_PER_MONTH | |
---|---|---|---|
0 | 2022 | 1 | 427243.0 |
1 | 2022 | 2 | 559050.0 |
2 | 2022 | 3 | 694636.0 |
3 | 2022 | 4 | 858276.0 |
4 | 2022 | 5 | 907574.0 |
5 | 2022 | 6 | 910784.0 |
6 | 2022 | 7 | 1025155.0 |
7 | 2022 | 8 | 982400.0 |
8 | 2022 | 9 | 926717.0 |
9 | 2022 | 10 | 939317.0 |
10 | 2022 | 11 | 773737.0 |
11 | 2022 | 12 | 738847.0 |
In [108]:
q = """ SELECT FT_YEAR,FT_MONTH,SUM(FT_N_TOURISTS) as 'TOTAL_N_FOREIGN_TOURISTS_PER_MONTH' FROM df_ft WHERE FT_YEAR=2022 GROUP BY FT_YEAR, FT_MONTH ORDER BY FT_YEAR, FT_MONTH ;""" df_n_foreign_tourists_per_month_2022 = duckdb.query(q).df() df_n_foreign_tourists_per_month_2022
Out[108]:
FT_YEAR | FT_MONTH | TOTAL_N_FOREIGN_TOURISTS_PER_MONTH | |
---|---|---|---|
0 | 2022 | 1 | 212111.0 |
1 | 2022 | 2 | 239221.0 |
2 | 2022 | 3 | 349409.0 |
3 | 2022 | 4 | 485751.0 |
4 | 2022 | 5 | 608404.0 |
5 | 2022 | 6 | 588006.0 |
6 | 2022 | 7 | 595319.0 |
7 | 2022 | 8 | 553449.0 |
8 | 2022 | 9 | 575061.0 |
9 | 2022 | 10 | 628197.0 |
10 | 2022 | 11 | 520641.0 |
11 | 2022 | 12 | 359615.0 |
In [110]:
q = """ SELECT year,month,lodging_type,n_tourists AS 'N_TOURISTS_PER_LODGING_TYPE_PER_MONTH',overnight_stays,average_lenght_stay,TOTAL_N_TOURISTS_PER_MONTH, n_tourists/TOTAL_N_TOURISTS_PER_MONTH*100 as 'LODGING_TYPE_SHARE' FROM df_tt LEFT JOIN df_total_n_tourists_per_month_2022 ON month=month_x ;""" df_lodging_share_total_n_tourists_2022 = duckdb.query(q).df() df_lodging_share_total_n_tourists_2022
Out[110]:
year | month | lodging_type | N_TOURISTS_PER_LODGING_TYPE_PER_MONTH | overnight_stays | average_lenght_stay | TOTAL_N_TOURISTS_PER_MONTH | LODGING_TYPE_SHARE | |
---|---|---|---|---|---|---|---|---|
0 | 2022 | 1 | hotel | 301474 | 745224 | 2.471935 | 427243.0 | 70.562654 |
1 | 2022 | 2 | hotel | 423648 | 1044943 | 2.466536 | 559050.0 | 75.779984 |
2 | 2022 | 3 | hotel | 541594 | 1428371 | 2.637346 | 694636.0 | 77.968029 |
3 | 2022 | 4 | hotel | 663354 | 1794084 | 2.704565 | 858276.0 | 77.289124 |
4 | 2022 | 5 | hotel | 696487 | 1902416 | 2.731445 | 907574.0 | 76.741621 |
5 | 2022 | 6 | hotel | 701128 | 1945168 | 2.774341 | 910784.0 | 76.980711 |
6 | 2022 | 7 | hotel | 752240 | 2091958 | 2.780971 | 1025155.0 | 73.378172 |
7 | 2022 | 8 | hotel | 720944 | 2114854 | 2.933451 | 982400.0 | 73.385993 |
8 | 2022 | 9 | hotel | 710386 | 1851581 | 2.606444 | 926717.0 | 76.656196 |
9 | 2022 | 10 | hotel | 736153 | 1947345 | 2.645299 | 939317.0 | 78.371093 |
10 | 2022 | 11 | hotel | 590686 | 1511602 | 2.559062 | 773737.0 | 76.341961 |
11 | 2022 | 12 | hotel | 543525 | 1356166 | 2.495131 | 738847.0 | 73.563945 |
12 | 2022 | 1 | homes_for_tourist_use | 125769 | 618569 | 4.918295 | 427243.0 | 29.437346 |
13 | 2022 | 2 | homes_for_tourist_use | 135402 | 614543 | 4.538655 | 559050.0 | 24.220016 |
14 | 2022 | 3 | homes_for_tourist_use | 153042 | 625055 | 4.084206 | 694636.0 | 22.031971 |
15 | 2022 | 4 | homes_for_tourist_use | 194922 | 805706 | 4.133479 | 858276.0 | 22.710876 |
16 | 2022 | 5 | homes_for_tourist_use | 211087 | 791349 | 3.748923 | 907574.0 | 23.258379 |
17 | 2022 | 6 | homes_for_tourist_use | 209656 | 858558 | 4.095080 | 910784.0 | 23.019289 |
18 | 2022 | 7 | homes_for_tourist_use | 272915 | 1038668 | 3.805830 | 1025155.0 | 26.621828 |
19 | 2022 | 8 | homes_for_tourist_use | 261456 | 1136612 | 4.347240 | 982400.0 | 26.614007 |
20 | 2022 | 9 | homes_for_tourist_use | 216331 | 886061 | 4.095858 | 926717.0 | 23.343804 |
21 | 2022 | 10 | homes_for_tourist_use | 203164 | 900796 | 4.433837 | 939317.0 | 21.628907 |
22 | 2022 | 11 | homes_for_tourist_use | 183051 | 819159 | 4.475032 | 773737.0 | 23.658039 |
23 | 2022 | 12 | homes_for_tourist_use | 195322 | 966802 | 4.949785 | 738847.0 | 26.436055 |
In [111]:
q = """ SELECT year,month,lodging_type,N_TOURISTS_PER_LODGING_TYPE_PER_MONTH,overnight_stays,average_lenght_stay,LODGING_TYPE_SHARE,TOTAL_N_TOURISTS_PER_MONTH, TOTAL_N_TOURISTS_PER_MONTH-TOTAL_N_FOREIGN_TOURISTS_PER_MONTH AS 'TOTAL_N_NATIONAL_TOURISTS_PER_MONTH', TOTAL_N_FOREIGN_TOURISTS_PER_MONTH, (LODGING_TYPE_SHARE/100*TOTAL_N_NATIONAL_TOURISTS_PER_MONTH) AS 'N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_MONTH', (LODGING_TYPE_SHARE/100*TOTAL_N_FOREIGN_TOURISTS_PER_MONTH) AS 'N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_MONTH' FROM df_lodging_share_total_n_tourists_2022, df_n_foreign_tourists_per_month_2022 WHERE month=FT_MONTH AND year=FT_YEAR ;""" df_lodging_share_total_n_tourists_n_foreign_tourists_per_month_2022 = duckdb.query(q).df() df_lodging_share_total_n_tourists_n_foreign_tourists_per_month_2022
Out[111]:
year | month | lodging_type | N_TOURISTS_PER_LODGING_TYPE_PER_MONTH | overnight_stays | average_lenght_stay | LODGING_TYPE_SHARE | TOTAL_N_TOURISTS_PER_MONTH | TOTAL_N_NATIONAL_TOURISTS_PER_MONTH | TOTAL_N_FOREIGN_TOURISTS_PER_MONTH | N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_MONTH | N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_MONTH | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022 | 1 | hotel | 301474 | 745224 | 2.471935 | 70.562654 | 427243.0 | 215132.0 | 212111.0 | 151802.848889 | 149671.151111 |
1 | 2022 | 2 | hotel | 423648 | 1044943 | 2.466536 | 75.779984 | 559050.0 | 319829.0 | 239221.0 | 242366.364712 | 181281.635288 |
2 | 2022 | 3 | hotel | 541594 | 1428371 | 2.637346 | 77.968029 | 694636.0 | 345227.0 | 349409.0 | 269166.688507 | 272427.311493 |
3 | 2022 | 4 | hotel | 663354 | 1794084 | 2.704565 | 77.289124 | 858276.0 | 372525.0 | 485751.0 | 287921.308355 | 375432.691645 |
4 | 2022 | 5 | hotel | 696487 | 1902416 | 2.731445 | 76.741621 | 907574.0 | 299170.0 | 608404.0 | 229587.907752 | 466899.092248 |
5 | 2022 | 6 | hotel | 701128 | 1945168 | 2.774341 | 76.980711 | 910784.0 | 322778.0 | 588006.0 | 248476.799751 | 452651.200249 |
6 | 2022 | 7 | hotel | 752240 | 2091958 | 2.780971 | 73.378172 | 1025155.0 | 429836.0 | 595319.0 | 315405.799747 | 436834.200253 |
7 | 2022 | 8 | hotel | 720944 | 2114854 | 2.933451 | 73.385993 | 982400.0 | 428951.0 | 553449.0 | 314789.952915 | 406154.047085 |
8 | 2022 | 9 | hotel | 710386 | 1851581 | 2.606444 | 76.656196 | 926717.0 | 351656.0 | 575061.0 | 269566.112649 | 440819.887351 |
9 | 2022 | 10 | hotel | 736153 | 1947345 | 2.645299 | 78.371093 | 939317.0 | 311120.0 | 628197.0 | 243828.144663 | 492324.855337 |
10 | 2022 | 11 | hotel | 590686 | 1511602 | 2.559062 | 76.341961 | 773737.0 | 253096.0 | 520641.0 | 193218.450011 | 397467.549989 |
11 | 2022 | 12 | hotel | 543525 | 1356166 | 2.495131 | 73.563945 | 738847.0 | 379232.0 | 359615.0 | 278978.019536 | 264546.980464 |
12 | 2022 | 1 | homes_for_tourist_use | 125769 | 618569 | 4.918295 | 29.437346 | 427243.0 | 215132.0 | 212111.0 | 63329.151111 | 62439.848889 |
13 | 2022 | 2 | homes_for_tourist_use | 135402 | 614543 | 4.538655 | 24.220016 | 559050.0 | 319829.0 | 239221.0 | 77462.635288 | 57939.364712 |
14 | 2022 | 3 | homes_for_tourist_use | 153042 | 625055 | 4.084206 | 22.031971 | 694636.0 | 345227.0 | 349409.0 | 76060.311493 | 76981.688507 |
15 | 2022 | 4 | homes_for_tourist_use | 194922 | 805706 | 4.133479 | 22.710876 | 858276.0 | 372525.0 | 485751.0 | 84603.691645 | 110318.308355 |
16 | 2022 | 5 | homes_for_tourist_use | 211087 | 791349 | 3.748923 | 23.258379 | 907574.0 | 299170.0 | 608404.0 | 69582.092248 | 141504.907752 |
17 | 2022 | 6 | homes_for_tourist_use | 209656 | 858558 | 4.095080 | 23.019289 | 910784.0 | 322778.0 | 588006.0 | 74301.200249 | 135354.799751 |
18 | 2022 | 7 | homes_for_tourist_use | 272915 | 1038668 | 3.805830 | 26.621828 | 1025155.0 | 429836.0 | 595319.0 | 114430.200253 | 158484.799747 |
19 | 2022 | 8 | homes_for_tourist_use | 261456 | 1136612 | 4.347240 | 26.614007 | 982400.0 | 428951.0 | 553449.0 | 114161.047085 | 147294.952915 |
20 | 2022 | 9 | homes_for_tourist_use | 216331 | 886061 | 4.095858 | 23.343804 | 926717.0 | 351656.0 | 575061.0 | 82089.887351 | 134241.112649 |
21 | 2022 | 10 | homes_for_tourist_use | 203164 | 900796 | 4.433837 | 21.628907 | 939317.0 | 311120.0 | 628197.0 | 67291.855337 | 135872.144663 |
22 | 2022 | 11 | homes_for_tourist_use | 183051 | 819159 | 4.475032 | 23.658039 | 773737.0 | 253096.0 | 520641.0 | 59877.549989 | 123173.450011 |
23 | 2022 | 12 | homes_for_tourist_use | 195322 | 966802 | 4.949785 | 26.436055 | 738847.0 | 379232.0 | 359615.0 | 100253.980464 | 95068.019536 |
In [112]:
q = """ SELECT *, (N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_MONTH*average_lenght_stay/30) AS 'N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_DAY', (N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_MONTH*average_lenght_stay/30) AS 'N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_DAY' FROM df_lodging_share_total_n_tourists_n_foreign_tourists_per_month_2022 ;""" df_lodging_share_total_n_tourists_n_foreign_tourists_per_month_per_day_2022 = duckdb.query(q).df() df_lodging_share_total_n_tourists_n_foreign_tourists_per_month_per_day_2022
Out[112]:
year | month | lodging_type | N_TOURISTS_PER_LODGING_TYPE_PER_MONTH | overnight_stays | average_lenght_stay | LODGING_TYPE_SHARE | TOTAL_N_TOURISTS_PER_MONTH | TOTAL_N_NATIONAL_TOURISTS_PER_MONTH | TOTAL_N_FOREIGN_TOURISTS_PER_MONTH | N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_MONTH | N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_MONTH | N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_DAY | N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_DAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022 | 1 | hotel | 301474 | 745224 | 2.471935 | 70.562654 | 427243.0 | 215132.0 | 212111.0 | 151802.848889 | 149671.151111 | 12508.223626 | 12332.576379 |
1 | 2022 | 2 | hotel | 423648 | 1044943 | 2.466536 | 75.779984 | 559050.0 | 319829.0 | 239221.0 | 242366.364712 | 181281.635288 | 19926.844634 | 14904.588703 |
2 | 2022 | 3 | hotel | 541594 | 1428371 | 2.637346 | 77.968029 | 694636.0 | 345227.0 | 349409.0 | 269166.688507 | 272427.311493 | 23662.860131 | 23949.506544 |
3 | 2022 | 4 | hotel | 663354 | 1794084 | 2.704565 | 77.289124 | 858276.0 | 372525.0 | 485751.0 | 287921.308355 | 375432.691645 | 25956.729615 | 33846.070377 |
4 | 2022 | 5 | hotel | 696487 | 1902416 | 2.731445 | 76.741621 | 907574.0 | 299170.0 | 608404.0 | 229587.907752 | 466899.092248 | 20903.558817 | 42510.307846 |
5 | 2022 | 6 | hotel | 701128 | 1945168 | 2.774341 | 76.980711 | 910784.0 | 322778.0 | 588006.0 | 248476.799751 | 452651.200249 | 22978.643923 | 41860.289420 |
6 | 2022 | 7 | hotel | 752240 | 2091958 | 2.780971 | 73.378172 | 1025155.0 | 429836.0 | 595319.0 | 315405.799747 | 436834.200253 | 29237.817980 | 40494.115342 |
7 | 2022 | 8 | hotel | 720944 | 2114854 | 2.933451 | 73.385993 | 982400.0 | 428951.0 | 553449.0 | 314789.952915 | 406154.047085 | 30780.698229 | 39714.435108 |
8 | 2022 | 9 | hotel | 710386 | 1851581 | 2.606444 | 76.656196 | 926717.0 | 351656.0 | 575061.0 | 269566.112649 | 440819.887351 | 23420.295088 | 38299.071575 |
9 | 2022 | 10 | hotel | 736153 | 1947345 | 2.645299 | 78.371093 | 939317.0 | 311120.0 | 628197.0 | 243828.144663 | 492324.855337 | 21499.947176 | 43411.552829 |
10 | 2022 | 11 | hotel | 590686 | 1511602 | 2.559062 | 76.341961 | 773737.0 | 253096.0 | 520641.0 | 193218.450011 | 397467.549989 | 16481.932054 | 33904.801287 |
11 | 2022 | 12 | hotel | 543525 | 1356166 | 2.495131 | 73.563945 | 738847.0 | 379232.0 | 359615.0 | 278978.019536 | 264546.980464 | 23202.888851 | 22002.644487 |
12 | 2022 | 1 | homes_for_tourist_use | 125769 | 618569 | 4.918295 | 29.437346 | 427243.0 | 215132.0 | 212111.0 | 63329.151111 | 62439.848889 | 10382.380839 | 10236.585827 |
13 | 2022 | 2 | homes_for_tourist_use | 135402 | 614543 | 4.538655 | 24.220016 | 559050.0 | 319829.0 | 239221.0 | 77462.635288 | 57939.364712 | 11719.206581 | 8765.560088 |
14 | 2022 | 3 | homes_for_tourist_use | 153042 | 625055 | 4.084206 | 22.031971 | 694636.0 | 345227.0 | 349409.0 | 76060.311493 | 76981.688507 | 10354.865114 | 10480.301554 |
15 | 2022 | 4 | homes_for_tourist_use | 194922 | 805706 | 4.133479 | 22.710876 | 858276.0 | 372525.0 | 485751.0 | 84603.691645 | 110318.308355 | 11656.919518 | 15199.947151 |
16 | 2022 | 5 | homes_for_tourist_use | 211087 | 791349 | 3.748923 | 23.258379 | 907574.0 | 299170.0 | 608404.0 | 69582.092248 | 141504.907752 | 8695.264531 | 17683.035471 |
17 | 2022 | 6 | homes_for_tourist_use | 209656 | 858558 | 4.095080 | 23.019289 | 910784.0 | 322778.0 | 588006.0 | 74301.200249 | 135354.799751 | 10142.310878 | 18476.289122 |
18 | 2022 | 7 | homes_for_tourist_use | 272915 | 1038668 | 3.805830 | 26.621828 | 1025155.0 | 429836.0 | 595319.0 | 114430.200253 | 158484.799747 | 14516.728314 | 20105.538353 |
19 | 2022 | 8 | homes_for_tourist_use | 261456 | 1136612 | 4.347240 | 26.614007 | 982400.0 | 428951.0 | 553449.0 | 114161.047085 | 147294.952915 | 16542.849281 | 21344.217386 |
20 | 2022 | 9 | homes_for_tourist_use | 216331 | 886061 | 4.095858 | 23.343804 | 926717.0 | 351656.0 | 575061.0 | 82089.887351 | 134241.112649 | 11207.616674 | 18327.749996 |
21 | 2022 | 10 | homes_for_tourist_use | 203164 | 900796 | 4.433837 | 21.628907 | 939317.0 | 311120.0 | 628197.0 | 67291.855337 | 135872.144663 | 9945.369934 | 20081.163397 |
22 | 2022 | 11 | homes_for_tourist_use | 183051 | 819159 | 4.475032 | 23.658039 | 773737.0 | 253096.0 | 520641.0 | 59877.549989 | 123173.450011 | 8931.797509 | 18373.502493 |
23 | 2022 | 12 | homes_for_tourist_use | 195322 | 966802 | 4.949785 | 26.436055 | 738847.0 | 379232.0 | 359615.0 | 100253.980464 | 95068.019536 | 16541.189900 | 15685.543430 |
NUMBER OF PLACES PER CATEGORY AND NEIGHBOURHOOD
This table agregates the number of places per lodging category for each neighbourhood.
Data refers to the tourist lodgings active at the end of 2022.
Data on those tourist lodgings is here treated as constant throughout 2022.
In [113]:
q = """ SELECT district_code, district_name, neighbourhood_code, neighbourhood_name, category, SUM(n_places) AS 'NEIGHBOURHOOD_N_PLACES' FROM df_tl GROUP BY 3,4,1,2,5 ORDER BY 3,5 ;""" df_n_places_per_neighbourhood_category_2022 = duckdb.query(q).df() df_n_places_per_neighbourhood_category_2022
Out[113]:
district_code | district_name | neighbourhood_code | neighbourhood_name | category | NEIGHBOURHOOD_N_PLACES | |
---|---|---|---|---|---|---|
0 | 1 | Ciutat Vella | 1 | el Raval | Albergs | 277.0 |
1 | 1 | Ciutat Vella | 1 | el Raval | Apartaments Turístics | 170.0 |
2 | 1 | Ciutat Vella | 1 | el Raval | Habitatges d’Ús Turístic | 1234.0 |
3 | 1 | Ciutat Vella | 1 | el Raval | Hotel 1 estrella | 416.0 |
4 | 1 | Ciutat Vella | 1 | el Raval | Hotel 2 estrelles | 430.0 |
… | … | … | … | … | … | … |
264 | 10 | Sant Martí | 70 | el Besòs i el Maresme | Hotel 4 estrelles superior | 524.0 |
265 | 10 | Sant Martí | 71 | Provençals del Poblenou | Habitatges d’Ús Turístic | 180.0 |
266 | 10 | Sant Martí | 71 | Provençals del Poblenou | Hotel 4 estrelles | 356.0 |
267 | 10 | Sant Martí | 72 | Sant Martí de Provençals | Habitatges d’Ús Turístic | 82.0 |
268 | 10 | Sant Martí | 73 | la Verneda i la Pau | Habitatges d’Ús Turístic | 49.0 |
269 rows × 6 columns
In [114]:
q = """ SELECT *, CASE WHEN CONTAINS(category, 'Turístic') THEN 'homes_for_tourist_use' ELSE 'hotel' END AS "LODGING_TYPE" FROM df_n_places_per_neighbourhood_category_2022 ;""" df_n_places_per_neighbourhood_category_lodging_type_2022 = duckdb.query(q).df() df_n_places_per_neighbourhood_category_lodging_type_2022
Out[114]:
district_code | district_name | neighbourhood_code | neighbourhood_name | category | NEIGHBOURHOOD_N_PLACES | LODGING_TYPE | |
---|---|---|---|---|---|---|---|
0 | 1 | Ciutat Vella | 1 | el Raval | Albergs | 277.0 | hotel |
1 | 1 | Ciutat Vella | 1 | el Raval | Apartaments Turístics | 170.0 | homes_for_tourist_use |
2 | 1 | Ciutat Vella | 1 | el Raval | Habitatges d’Ús Turístic | 1234.0 | homes_for_tourist_use |
3 | 1 | Ciutat Vella | 1 | el Raval | Hotel 1 estrella | 416.0 | hotel |
4 | 1 | Ciutat Vella | 1 | el Raval | Hotel 2 estrelles | 430.0 | hotel |
… | … | … | … | … | … | … | … |
264 | 10 | Sant Martí | 70 | el Besòs i el Maresme | Hotel 4 estrelles superior | 524.0 | hotel |
265 | 10 | Sant Martí | 71 | Provençals del Poblenou | Habitatges d’Ús Turístic | 180.0 | homes_for_tourist_use |
266 | 10 | Sant Martí | 71 | Provençals del Poblenou | Hotel 4 estrelles | 356.0 | hotel |
267 | 10 | Sant Martí | 72 | Sant Martí de Provençals | Habitatges d’Ús Turístic | 82.0 | homes_for_tourist_use |
268 | 10 | Sant Martí | 73 | la Verneda i la Pau | Habitatges d’Ús Turístic | 49.0 | homes_for_tourist_use |
269 rows × 7 columns
In [115]:
q = """ SELECT LODGING_TYPE as 'LODGING_TYPE_X',SUM(NEIGHBOURHOOD_N_PLACES) AS "TOTAL_N_PLACES_PER_LODGING_TYPE" FROM df_n_places_per_neighbourhood_category_lodging_type_2022 GROUP BY LODGING_TYPE ;""" df_total_n_places_per_lodging_type_2022 = duckdb.query(q).df() df_total_n_places_per_lodging_type_2022
Out[115]:
LODGING_TYPE_X | TOTAL_N_PLACES_PER_LODGING_TYPE | |
---|---|---|
0 | hotel | 87041.0 |
1 | homes_for_tourist_use | 57564.0 |
In [116]:
q = """ SELECT year,month,lodging_type AS 'lodging_type_x', N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_DAY, N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_DAY,TOTAL_N_PLACES_PER_LODGING_TYPE, N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_DAY/TOTAL_N_PLACES_PER_LODGING_TYPE*100 AS 'PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES', N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_DAY/TOTAL_N_PLACES_PER_LODGING_TYPE*100 AS 'PERCENTAGE_FOREIGN_TOURISTS_PER_N_PLACES' FROM df_lodging_share_total_n_tourists_n_foreign_tourists_per_month_per_day_2022, df_total_n_places_per_lodging_type_2022 WHERE LODGING_TYPE=LODGING_TYPE_X ;""" df_n_national_n_foreign_tourists_per_n_places = duckdb.query(q).df() df_n_national_n_foreign_tourists_per_n_places
Out[116]:
year | month | lodging_type_x | N_NATIONAL_TOURISTS_PER_LODGING_TYPE_PER_DAY | N_FOREIGN_TOURISTS_PER_LODGING_TYPE_PER_DAY | TOTAL_N_PLACES_PER_LODGING_TYPE | PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES | PERCENTAGE_FOREIGN_TOURISTS_PER_N_PLACES | |
---|---|---|---|---|---|---|---|---|
0 | 2022 | 1 | hotel | 12508.223626 | 12332.576379 | 87041.0 | 14.370496 | 14.168698 |
1 | 2022 | 2 | hotel | 19926.844634 | 14904.588703 | 87041.0 | 22.893630 | 17.123641 |
2 | 2022 | 3 | hotel | 23662.860131 | 23949.506544 | 87041.0 | 27.185878 | 27.515202 |
3 | 2022 | 4 | hotel | 25956.729615 | 33846.070377 | 87041.0 | 29.821268 | 38.885204 |
4 | 2022 | 5 | hotel | 20903.558817 | 42510.307846 | 87041.0 | 24.015761 | 48.839407 |
5 | 2022 | 6 | hotel | 22978.643923 | 41860.289420 | 87041.0 | 26.399793 | 48.092611 |
6 | 2022 | 7 | hotel | 29237.817980 | 40494.115342 | 87041.0 | 33.590857 | 46.523036 |
7 | 2022 | 8 | hotel | 30780.698229 | 39714.435108 | 87041.0 | 35.363447 | 45.627273 |
8 | 2022 | 9 | hotel | 23420.295088 | 38299.071575 | 87041.0 | 26.907199 | 44.001185 |
9 | 2022 | 10 | hotel | 21499.947176 | 43411.552829 | 87041.0 | 24.700942 | 49.874832 |
10 | 2022 | 11 | hotel | 16481.932054 | 33904.801287 | 87041.0 | 18.935826 | 38.952679 |
11 | 2022 | 12 | hotel | 23202.888851 | 22002.644487 | 87041.0 | 26.657424 | 25.278483 |
12 | 2022 | 1 | homes_for_tourist_use | 10382.380839 | 10236.585827 | 57564.0 | 18.036239 | 17.782965 |
13 | 2022 | 2 | homes_for_tourist_use | 11719.206581 | 8765.560088 | 57564.0 | 20.358569 | 15.227503 |
14 | 2022 | 3 | homes_for_tourist_use | 10354.865114 | 10480.301554 | 57564.0 | 17.988439 | 18.206347 |
15 | 2022 | 4 | homes_for_tourist_use | 11656.919518 | 15199.947151 | 57564.0 | 20.250364 | 26.405300 |
16 | 2022 | 5 | homes_for_tourist_use | 8695.264531 | 17683.035471 | 57564.0 | 15.105386 | 30.718914 |
17 | 2022 | 6 | homes_for_tourist_use | 10142.310878 | 18476.289122 | 57564.0 | 17.619191 | 32.096951 |
18 | 2022 | 7 | homes_for_tourist_use | 14516.728314 | 20105.538353 | 57564.0 | 25.218415 | 34.927278 |
19 | 2022 | 8 | homes_for_tourist_use | 16542.849281 | 21344.217386 | 57564.0 | 28.738186 | 37.079107 |
20 | 2022 | 9 | homes_for_tourist_use | 11207.616674 | 18327.749996 | 57564.0 | 19.469836 | 31.838910 |
21 | 2022 | 10 | homes_for_tourist_use | 9945.369934 | 20081.163397 | 57564.0 | 17.277065 | 34.884934 |
22 | 2022 | 11 | homes_for_tourist_use | 8931.797509 | 18373.502493 | 57564.0 | 15.516291 | 31.918391 |
23 | 2022 | 12 | homes_for_tourist_use | 16541.189900 | 15685.543430 | 57564.0 | 28.735303 | 27.248877 |
CONCATENATED TABLE: NUMBER OF NATIONAL AND FOREIGN TOURISTS PER DAY, CATEGORY AND NEIGHBOURHOOD IN 2022
In [123]:
q = """ SELECT year, month, district_code, district_name, neighbourhood_code, neighbourhood_name, category, LODGING_TYPE, NEIGHBOURHOOD_N_PLACES, PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES+PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES AS 'PERCENTAGE_BOOKED_PLACES', 100-(PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES+PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES) AS 'PERCENTAGE_FREE_PLACES', PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES AS 'PERCENTAGE_TOTAL_PLACES_BOOKED_BY_NATIONAL_TOURISTS', PERCENTAGE_FOREIGN_TOURISTS_PER_N_PLACES AS 'PERCENTAGE_TOTAL_PLACES_BOOKED_BY_FOREIGN_TOURISTS', (PERCENTAGE_NATIONAL_TOURISTS_PER_N_PLACES/100*NEIGHBOURHOOD_N_PLACES) AS 'N_NATIONAL_TOURISTS_PER_DAY', (PERCENTAGE_FOREIGN_TOURISTS_PER_N_PLACES/100*NEIGHBOURHOOD_N_PLACES) AS 'N_FOREIGN_TOURISTS_PER_DAY', N_NATIONAL_TOURISTS_PER_DAY+N_FOREIGN_TOURISTS_PER_DAY AS 'TOTAL_N_TOURISTS_PER_DAY' FROM df_n_places_per_neighbourhood_category_lodging_type_2022, df_n_national_n_foreign_tourists_per_n_places WHERE lodging_type_x=LODGING_TYPE ;""" df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 = duckdb.query(q).df() df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022
Out[123]:
year | month | district_code | district_name | neighbourhood_code | neighbourhood_name | category | LODGING_TYPE | NEIGHBOURHOOD_N_PLACES | PERCENTAGE_BOOKED_PLACES | PERCENTAGE_FREE_PLACES | PERCENTAGE_TOTAL_PLACES_BOOKED_BY_NATIONAL_TOURISTS | PERCENTAGE_TOTAL_PLACES_BOOKED_BY_FOREIGN_TOURISTS | N_NATIONAL_TOURISTS_PER_DAY | N_FOREIGN_TOURISTS_PER_DAY | TOTAL_N_TOURISTS_PER_DAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022 | 12 | 1 | Ciutat Vella | 1 | el Raval | Albergs | hotel | 277.0 | 53.314849 | 46.685151 | 26.657424 | 25.278483 | 73.841066 | 70.021398 | 143.862464 |
1 | 2022 | 12 | 1 | Ciutat Vella | 1 | el Raval | Apartaments Turístics | homes_for_tourist_use | 170.0 | 57.470606 | 42.529394 | 28.735303 | 27.248877 | 48.850015 | 46.323091 | 95.173106 |
2 | 2022 | 12 | 1 | Ciutat Vella | 1 | el Raval | Habitatges d’Ús Turístic | homes_for_tourist_use | 1234.0 | 57.470606 | 42.529394 | 28.735303 | 27.248877 | 354.593641 | 336.251139 | 690.844780 |
3 | 2022 | 12 | 1 | Ciutat Vella | 1 | el Raval | Hotel 1 estrella | hotel | 416.0 | 53.314849 | 46.685151 | 26.657424 | 25.278483 | 110.894886 | 105.158490 | 216.053376 |
4 | 2022 | 12 | 1 | Ciutat Vella | 1 | el Raval | Hotel 2 estrelles | hotel | 430.0 | 53.314849 | 46.685151 | 26.657424 | 25.278483 | 114.626925 | 108.697477 | 223.324403 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
3223 | 2022 | 1 | 10 | Sant Martí | 70 | el Besòs i el Maresme | Hotel 4 estrelles superior | hotel | 524.0 | 28.740992 | 71.259008 | 14.370496 | 14.168698 | 75.301400 | 74.243977 | 149.545377 |
3224 | 2022 | 1 | 10 | Sant Martí | 71 | Provençals del Poblenou | Habitatges d’Ús Turístic | homes_for_tourist_use | 180.0 | 36.072479 | 63.927521 | 18.036239 | 17.782965 | 32.465231 | 32.009337 | 64.474567 |
3225 | 2022 | 1 | 10 | Sant Martí | 71 | Provençals del Poblenou | Hotel 4 estrelles | hotel | 356.0 | 28.740992 | 71.259008 | 14.370496 | 14.168698 | 51.158967 | 50.440565 | 101.599531 |
3226 | 2022 | 1 | 10 | Sant Martí | 72 | Sant Martí de Provençals | Habitatges d’Ús Turístic | homes_for_tourist_use | 82.0 | 36.072479 | 63.927521 | 18.036239 | 17.782965 | 14.789716 | 14.582031 | 29.371747 |
3227 | 2022 | 1 | 10 | Sant Martí | 73 | la Verneda i la Pau | Habitatges d’Ús Turístic | homes_for_tourist_use | 49.0 | 36.072479 | 63.927521 | 18.036239 | 17.782965 | 8.837757 | 8.713653 | 17.551410 |
3228 rows × 16 columns
VERIFICATION¶
data tested on month 11
In [124]:
q = """ SELECT (SELECT SUM(N_FOREIGN_TOURISTS_PER_DAY) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE LODGING_TYPE='hotel' AND month=11)*30/2.559062 + (SELECT SUM(N_FOREIGN_TOURISTS_PER_DAY) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE LODGING_TYPE='homes_for_tourist_use' AND month=11)*30/4.475032, SUM(NEIGHBOURHOOD_N_PLACES) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE month=11 ;""" duckdb.query(q).df()
Out[124]:
((((SELECT sum(“N_FOREIGN_TOURISTS_PER_DAY”) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE ((“LODGING_TYPE” = ‘hotel’) AND (“month” = 11))) * 30) / 2.559062) + (((SELECT sum(“N_FOREIGN_TOURISTS_PER_DAY”) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE ((“LODGING_TYPE” = ‘homes_for_tourist_use’) AND (“month” = 11))) * 30) / 4.475032)) | sum(“NEIGHBOURHOOD_N_PLACES”) | |
---|---|---|
0 | 520640.96227 | 144605.0 |
In [125]:
q = """ SELECT SUM(FT_N_TOURISTS) FROM df_tourism_ft WHERE FT_MONTH=11 AND FT_YEAR=2022 ;""" duckdb.query(q).df()
Out[125]:
sum(“FT_N_TOURISTS”) | |
---|---|
0 | 520641.0 |
In [126]:
q = """ SELECT SUM(n_places) FROM df_tourism_tl ;""" duckdb.query(q).df()
Out[126]:
sum(n_places) | |
---|---|
0 | 144605.0 |
In [127]:
q = """ SELECT (SELECT SUM(TOTAL_N_TOURISTS_PER_DAY) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE LODGING_TYPE='hotel' AND month=11)*30/2.559062 + (SELECT SUM(TOTAL_N_TOURISTS_PER_DAY) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE LODGING_TYPE='homes_for_tourist_use' AND month=11)*30/4.475032 FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE month=11 LIMIT 1 ;""" duckdb.query(q).df()
Out[127]:
((((SELECT sum(“TOTAL_N_TOURISTS_PER_DAY”) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE ((“LODGING_TYPE” = ‘hotel’) AND (“month” = 11))) * 30) / 2.559062) + (((SELECT sum(“TOTAL_N_TOURISTS_PER_DAY”) FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 WHERE ((“LODGING_TYPE” = ‘homes_for_tourist_use’) AND (“month” = 11))) * 30) / 4.475032)) | |
---|---|
0 | 773736.943928 |
In [129]:
q = """ SELECT (SELECT SUM(N_TOURISTS_PER_LODGING_TYPE_PER_MONTH) FROM df_lodging_share_total_n_tourists_2022 WHERE LODGING_TYPE='hotel' AND month=11) + (SELECT SUM(N_TOURISTS_PER_LODGING_TYPE_PER_MONTH) FROM df_lodging_share_total_n_tourists_2022 WHERE LODGING_TYPE='homes_for_tourist_use' AND month=11) FROM df_lodging_share_total_n_tourists_2022 WHERE month=11 LIMIT 1 ;""" duckdb.query(q).df()
Out[129]:
((SELECT sum(“N_TOURISTS_PER_LODGING_TYPE_PER_MONTH”) FROM df_lodging_share_total_n_tourists_2022 WHERE ((“LODGING_TYPE” = ‘hotel’) AND (“month” = 11))) + (SELECT sum(“N_TOURISTS_PER_LODGING_TYPE_PER_MONTH”) FROM df_lodging_share_total_n_tourists_2022 WHERE ((“LODGING_TYPE” = ‘homes_for_tourist_use’) AND (“month” = 11)))) | |
---|---|
0 | 773737.0 |
EXCEL EXPORT FILE¶
NORMALIZATION
In [130]:
#ROUNDING VALUES q = """ SELECT year, month, district_code, district_name, neighbourhood_code, neighbourhood_name, category, LODGING_TYPE, NEIGHBOURHOOD_N_PLACES, PERCENTAGE_BOOKED_PLACES,PERCENTAGE_FREE_PLACES,PERCENTAGE_TOTAL_PLACES_BOOKED_BY_NATIONAL_TOURISTS,PERCENTAGE_TOTAL_PLACES_BOOKED_BY_FOREIGN_TOURISTS, ROUND(N_NATIONAL_TOURISTS_PER_DAY) AS 'N_NATIONAL_TOURISTS_PER_DAY', ROUND(N_FOREIGN_TOURISTS_PER_DAY) AS 'N_FOREIGN_TOURISTS_PER_DAY', ROUND(TOTAL_N_TOURISTS_PER_DAY) AS 'TOTAL_N_TOURISTS_PER_DAY' FROM df_n_national_n_foreign_tourists_per_day_category_neighbourhood_2022 ;""" df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022 = duckdb.query(q).df() df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022
Out[130]:
year | month | district_code | district_name | neighbourhood_code | neighbourhood_name | category | LODGING_TYPE | NEIGHBOURHOOD_N_PLACES | PERCENTAGE_BOOKED_PLACES | PERCENTAGE_FREE_PLACES | PERCENTAGE_TOTAL_PLACES_BOOKED_BY_NATIONAL_TOURISTS | PERCENTAGE_TOTAL_PLACES_BOOKED_BY_FOREIGN_TOURISTS | N_NATIONAL_TOURISTS_PER_DAY | N_FOREIGN_TOURISTS_PER_DAY | TOTAL_N_TOURISTS_PER_DAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022 | 12 | 1 | Ciutat Vella | 1 | el Raval | Albergs | hotel | 277.0 | 53.314849 | 46.685151 | 26.657424 | 25.278483 | 74.0 | 70.0 | 144.0 |
1 | 2022 | 12 | 1 | Ciutat Vella | 1 | el Raval | Apartaments Turístics | homes_for_tourist_use | 170.0 | 57.470606 | 42.529394 | 28.735303 | 27.248877 | 49.0 | 46.0 | 95.0 |
2 | 2022 | 12 | 1 | Ciutat Vella | 1 | el Raval | Habitatges d’Ús Turístic | homes_for_tourist_use | 1234.0 | 57.470606 | 42.529394 | 28.735303 | 27.248877 | 355.0 | 336.0 | 691.0 |
3 | 2022 | 12 | 1 | Ciutat Vella | 1 | el Raval | Hotel 1 estrella | hotel | 416.0 | 53.314849 | 46.685151 | 26.657424 | 25.278483 | 111.0 | 105.0 | 216.0 |
4 | 2022 | 12 | 1 | Ciutat Vella | 1 | el Raval | Hotel 2 estrelles | hotel | 430.0 | 53.314849 | 46.685151 | 26.657424 | 25.278483 | 115.0 | 109.0 | 223.0 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
3223 | 2022 | 1 | 10 | Sant Martí | 70 | el Besòs i el Maresme | Hotel 4 estrelles superior | hotel | 524.0 | 28.740992 | 71.259008 | 14.370496 | 14.168698 | 75.0 | 74.0 | 150.0 |
3224 | 2022 | 1 | 10 | Sant Martí | 71 | Provençals del Poblenou | Habitatges d’Ús Turístic | homes_for_tourist_use | 180.0 | 36.072479 | 63.927521 | 18.036239 | 17.782965 | 32.0 | 32.0 | 64.0 |
3225 | 2022 | 1 | 10 | Sant Martí | 71 | Provençals del Poblenou | Hotel 4 estrelles | hotel | 356.0 | 28.740992 | 71.259008 | 14.370496 | 14.168698 | 51.0 | 50.0 | 102.0 |
3226 | 2022 | 1 | 10 | Sant Martí | 72 | Sant Martí de Provençals | Habitatges d’Ús Turístic | homes_for_tourist_use | 82.0 | 36.072479 | 63.927521 | 18.036239 | 17.782965 | 15.0 | 15.0 | 29.0 |
3227 | 2022 | 1 | 10 | Sant Martí | 73 | la Verneda i la Pau | Habitatges d’Ús Turístic | homes_for_tourist_use | 49.0 | 36.072479 | 63.927521 | 18.036239 | 17.782965 | 9.0 | 9.0 | 18.0 |
3228 rows × 16 columns
In [131]:
#distric_code and neighbourhood_code need a '0' in front of all numbers below 10 to combine with spacial data df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022['district_code'] = df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022['district_code'].astype(str).str.strip() df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022['neighbourhood_code'] = df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022['neighbourhood_code'].astype(str).str.strip() df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022[['district_code']] = df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022[['district_code']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x]) df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022[['neighbourhood_code']] = df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022[['neighbourhood_code']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
In [132]:
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022
Out[132]:
year | month | district_code | district_name | neighbourhood_code | neighbourhood_name | category | LODGING_TYPE | NEIGHBOURHOOD_N_PLACES | PERCENTAGE_BOOKED_PLACES | PERCENTAGE_FREE_PLACES | PERCENTAGE_TOTAL_PLACES_BOOKED_BY_NATIONAL_TOURISTS | PERCENTAGE_TOTAL_PLACES_BOOKED_BY_FOREIGN_TOURISTS | N_NATIONAL_TOURISTS_PER_DAY | N_FOREIGN_TOURISTS_PER_DAY | TOTAL_N_TOURISTS_PER_DAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022 | 12 | 01 | Ciutat Vella | 01 | el Raval | Albergs | hotel | 277.0 | 53.314849 | 46.685151 | 26.657424 | 25.278483 | 74.0 | 70.0 | 144.0 |
1 | 2022 | 12 | 01 | Ciutat Vella | 01 | el Raval | Apartaments Turístics | homes_for_tourist_use | 170.0 | 57.470606 | 42.529394 | 28.735303 | 27.248877 | 49.0 | 46.0 | 95.0 |
2 | 2022 | 12 | 01 | Ciutat Vella | 01 | el Raval | Habitatges d’Ús Turístic | homes_for_tourist_use | 1234.0 | 57.470606 | 42.529394 | 28.735303 | 27.248877 | 355.0 | 336.0 | 691.0 |
3 | 2022 | 12 | 01 | Ciutat Vella | 01 | el Raval | Hotel 1 estrella | hotel | 416.0 | 53.314849 | 46.685151 | 26.657424 | 25.278483 | 111.0 | 105.0 | 216.0 |
4 | 2022 | 12 | 01 | Ciutat Vella | 01 | el Raval | Hotel 2 estrelles | hotel | 430.0 | 53.314849 | 46.685151 | 26.657424 | 25.278483 | 115.0 | 109.0 | 223.0 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
3223 | 2022 | 1 | 10 | Sant Martí | 70 | el Besòs i el Maresme | Hotel 4 estrelles superior | hotel | 524.0 | 28.740992 | 71.259008 | 14.370496 | 14.168698 | 75.0 | 74.0 | 150.0 |
3224 | 2022 | 1 | 10 | Sant Martí | 71 | Provençals del Poblenou | Habitatges d’Ús Turístic | homes_for_tourist_use | 180.0 | 36.072479 | 63.927521 | 18.036239 | 17.782965 | 32.0 | 32.0 | 64.0 |
3225 | 2022 | 1 | 10 | Sant Martí | 71 | Provençals del Poblenou | Hotel 4 estrelles | hotel | 356.0 | 28.740992 | 71.259008 | 14.370496 | 14.168698 | 51.0 | 50.0 | 102.0 |
3226 | 2022 | 1 | 10 | Sant Martí | 72 | Sant Martí de Provençals | Habitatges d’Ús Turístic | homes_for_tourist_use | 82.0 | 36.072479 | 63.927521 | 18.036239 | 17.782965 | 15.0 | 15.0 | 29.0 |
3227 | 2022 | 1 | 10 | Sant Martí | 73 | la Verneda i la Pau | Habitatges d’Ús Turístic | homes_for_tourist_use | 49.0 | 36.072479 | 63.927521 | 18.036239 | 17.782965 | 9.0 | 9.0 | 18.0 |
3228 rows × 16 columns
In [133]:
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022.to_excel('T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PER_DAY_2022.xlsx', sheet_name='sql_touristsXday')
DATABASE UPLOAD¶
In [134]:
df_n_national_n_foreign_tourists_per_day_category_neighbourhood_r_2022.to_gbq( destination_table="TOURISM.T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PER_DAY_2022", project_id=project_id)
In [135]:
gbq.read_gbq(query='SELECT table_name FROM TOURISM.INFORMATION_SCHEMA.TABLES;')
Out[135]:
table_name | |
---|---|
0 | TL |
1 | TT |
2 | FT |
3 | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… |
In [137]:
gbq.read_gbq(query='SELECT table_schema,table_name,column_name,data_type FROM TOURISM.INFORMATION_SCHEMA.COLUMNS WHERE table_name = "T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PER_DAY_2022";')
Out[137]:
table_schema | table_name | column_name | data_type | |
---|---|---|---|---|
0 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | year | INT64 |
1 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | month | INT64 |
2 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | district_code | STRING |
3 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | district_name | STRING |
4 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | neighbourhood_code | STRING |
5 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | neighbourhood_name | STRING |
6 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | category | STRING |
7 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | LODGING_TYPE | STRING |
8 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | NEIGHBOURHOOD_N_PLACES | FLOAT64 |
9 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | PERCENTAGE_BOOKED_PLACES | FLOAT64 |
10 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | PERCENTAGE_FREE_PLACES | FLOAT64 |
11 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | PERCENTAGE_TOTAL_PLACES_BOOKED_BY_NATIONAL_TOU… | FLOAT64 |
12 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | PERCENTAGE_TOTAL_PLACES_BOOKED_BY_FOREIGN_TOUR… | FLOAT64 |
13 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | N_NATIONAL_TOURISTS_PER_DAY | FLOAT64 |
14 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | N_FOREIGN_TOURISTS_PER_DAY | FLOAT64 |
15 | TOURISM | T_SQL_BIGQUERY_N_TOURISTS_PER_NEIGHBOURHOOD_PE… | TOTAL_N_TOURISTS_PER_DAY | FLOAT64 |
JUPYTER LAB – Barcelona Tourist Lodgings
TROUBLESHOOTING VERSION¶
BARCELONA TOURISM¶
TOURIST LODGINGS – CAPACITY AND DISTRIBUTION BY NEIGHBOURHOOD¶
REQUIRED LIBRARIES¶
In [536]:
#FOR DATA EXTRACTION AND CLEANING: import pandas as pd import requests
DATA EXTRACTION AND CLEANING¶
DATAFRAME: DISTRICT_NEIGHBOURHOOD_TABLE¶
This file contains the codes and names of districts and neighbourhoods used in other projects.
The file is already cleaned and used here as a conversion table to make data compatible for visualizations with spatial files.
In [537]:
df_district_neighbourhood_table = pd.read_excel('F:DPortfolio ProjectsBARCELONA TOURISMT_NEIGHBOURHOODS.xlsx',converters={'District_Code':str,'Neighbourhood_Code':str}) df_district_neighbourhood_table
Out[537]:
District_Code | District_Name | Neighbourhood_Code | Neighbourhood_Name | |
---|---|---|---|---|
0 | 01 | Ciutat Vella | 01 | el Raval |
1 | 01 | Ciutat Vella | 02 | el Barri Gòtic |
2 | 01 | Ciutat Vella | 03 | la Barceloneta |
3 | 01 | Ciutat Vella | 04 | Sant Pere, Santa Caterina i la Ribera |
4 | 02 | Eixample | 05 | el Fort Pienc |
… | … | … | … | … |
68 | 10 | Sant Martí | 69 | Diagonal Mar i el Front Marítim del Poblenou |
69 | 10 | Sant Martí | 70 | el Besòs i el Maresme |
70 | 10 | Sant Martí | 71 | Provençals del Poblenou |
71 | 10 | Sant Martí | 72 | Sant Martí de Provençals |
72 | 10 | Sant Martí | 73 | la Verneda i la Pau |
73 rows × 4 columns
DATAFRAME: NUMBER OF PLACES BY ESTABLISHMENT¶
SOURCE:
https://ajuntament.barcelona.cat/ecologiaurbana/ca/tramits/activitats/cens
http://w121.bcn.cat/APPS/censactivitats/cceatDef.do?reqCode=search
In [538]:
df_np = pd.read_excel('extraccio.xlsx') df_np.head(1)
Out[538]:
Núm. Expedient | RTC | Descripció categoria | Emplaçament | Tipus carrer | Carrer | Primer número | Primera lletra | Segon número | Segona lletra | Bloc | Portal | Escala | Pis | Porta | Barri | Districte | Zona | Descripció zona | Núm. places allotjament turístic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01-90-A-128 | HB-003893 | Hotel 3 estrelles | GRAVINA 5 7 | NaN | GRAVINA | 5 | NaN | 7.0 | NaN | NaN | NaN | NaN | NaN | NaN | el Raval | CIUTAT VELLA | ZE-1 | ZONA DE DECREIXEMENT NATURAL | 86.0 |
In [539]:
df_np1 = df_np.copy()
EXTRACT RELEVANT DATA¶
In [540]:
#DROP THE ZONE AND ZONE_DESCRIPTION COLUMNS df_np1.drop(columns = ['Zona','Descripció zona'], inplace = True)
In [541]:
#RENAME COLUMNS IN ENGLISH df_np1.columns = ['n_practice','rtc','category','address','street_type', 'street','street_number_1','street_letter_1','street_number_2', 'street_letter_2','block', 'entrance','stair','floor','door','neighbourhood_name','district_name','n_places'] df_np1.head(1)
Out[541]:
n_practice | rtc | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | street_letter_2 | block | entrance | stair | floor | door | neighbourhood_name | district_name | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01-90-A-128 | HB-003893 | Hotel 3 estrelles | GRAVINA 5 7 | NaN | GRAVINA | 5 | NaN | 7.0 | NaN | NaN | NaN | NaN | NaN | NaN | el Raval | CIUTAT VELLA | 86.0 |
In [542]:
#ADD EMPTY COLUMNS FOR RELEVANT DATA TO ADD FROM OTHER TABLES: df_np1['neighbourhood_code'] = None df_np1['district_code'] = None df_np1['longitude'] = None df_np1['latitude'] = None df_np1['name'] = None
In [543]:
#REARRANGE COLUMNS IN THE PREFERRED ORDER df_np1.columns
Out[543]:
Index(['n_practice', 'rtc', 'category', 'address', 'street_type', 'street', 'street_number_1', 'street_letter_1', 'street_number_2', 'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door', 'neighbourhood_name', 'district_name', 'n_places', 'neighbourhood_code', 'district_code', 'longitude', 'latitude', 'name'], dtype='object')
In [544]:
#REARRANGE COLUMNS IN THE PREFERRED ORDER df_np1 = df_np1[['n_practice', 'rtc', 'name', 'category', 'address', 'street_type', 'street', 'street_number_1', 'street_letter_1', 'street_number_2', 'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door', 'district_code','district_name', 'neighbourhood_code','neighbourhood_name', 'longitude', 'latitude', 'n_places']] df_np1.head(1)
Out[544]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01-90-A-128 | HB-003893 | None | Hotel 3 estrelles | GRAVINA 5 7 | NaN | GRAVINA | 5 | NaN | 7.0 | … | NaN | NaN | NaN | None | CIUTAT VELLA | None | el Raval | None | None | 86.0 |
1 rows × 23 columns
In [545]:
#CHECK THE CATEGORY ATTRIBUTE df_np1['category'].value_counts()
Out[545]:
Habitatges d'Ús Turístic 9409 Pensió 286 Hotel 4 estrelles 144 Albergs 126 Hotel 3 estrelles 119 Residències estudiants en sòl de zona 64 Hotel 1 estrella 47 Hotel 2 estrelles 42 Hotel 4 estrelles superior 27 Hotel 5 estrelles 24 Hotel gran luxe 20 Hotel-Apart 4 estrelles 12 Apartaments Turístics 12 Hotel-Apart 3 estrelles 10 Hotel-Apart 2 estrelles 6 Hotel-Apart 1 estrella 2 Hotel-Apart 4 estrelles superior 1 Name: category, dtype: int64
In [546]:
#SINCE THE STUDY FOCUSES ON ESTABLISHMENT FOR TOURIST, THE CATEGORY 'Residències estudiants en sòl de zona' (STUDENT RESIDENCIES) IS DROPPED df_np1 = df_np1[df_np1['category']!='Residències estudiants en sòl de zona'] df_np1[df_np1['category']=='Residències estudiants en sòl de zona'].head(1)
Out[546]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places |
---|
0 rows × 23 columns
In [547]:
df_np2 = df_np1.copy()
In [548]:
df_np2.head(5)
Out[548]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01-90-A-128 | HB-003893 | None | Hotel 3 estrelles | GRAVINA 5 7 | NaN | GRAVINA | 5 | NaN | 7.0 | … | NaN | NaN | NaN | None | CIUTAT VELLA | None | el Raval | None | None | 86.0 |
1 | 01-87-A-372248 | HB-003827 | None | Pensió | TALLERS 6 8 | NaN | TALLERS | 6 | NaN | 8.0 | … | NaN | NaN | NaN | None | CIUTAT VELLA | None | el Raval | None | None | 11.0 |
2 | 00-2002-0134 | HB-004190 | None | Hotel 4 estrelles | AV BOGATELL 64 66 | AV | BOGATELL | 64 | NaN | 66.0 | … | NaN | NaN | NaN | None | SANT MARTI | None | la Vila Olímpica del Poblenou | None | None | 440.0 |
3 | 07-2013-0168 | HUTB-007570 | None | Habitatges d’Ús Turístic | AV CAN BARO 22 3 1 | AV | CAN BARO | 22 | NaN | NaN | … | NaN | 3 | 1 | None | HORTA-GUINARDÓ | None | Can Baró | None | None | 3.0 |
4 | 07-2014-0121 | HUTB-009724 | None | Habitatges d’Ús Turístic | AV CAN BARO 3 1 2 | AV | CAN BARO | 3 | NaN | NaN | … | NaN | 1 | 2 | None | HORTA-GUINARDÓ | None | Can Baró | None | None | 4.0 |
5 rows × 23 columns
WHITESPACES¶
In [549]:
#REMOVING SPACES # .replace(' ','', regex=True) - replace all spaces with nothing # .str.strip() - replace 1 initial and 1 trailing space only # .replace(r's+',' ', regex=True) - replace multiple spaces with one single space # .replace(r'^s+|s+$','',regex=True) - replace all + spaces s starting ^ and trailing $ # .replace('nan','', regex=True) - replace pre-existing 'nan' strings into empty cells - not to be used for string columns potentially containing nan as subpart of string # .replace('.0','',regex=True) - replace .0 with nothing - '' is required to assign '.' as a normal character and not as a special one df_np2['n_practice'] = df_np2['n_practice'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_np2['rtc'] = df_np2['rtc'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_np2['category'] = df_np2['category'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_np2['address'] = df_np2['address'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_np2['street_type'] = df_np2['street_type'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_np2['street'] = df_np2['street'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_np2['street_number_1'] = df_np2['street_number_1'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_np2['street_letter_1'] = df_np2['street_letter_1'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_np2['street_number_2'] = df_np2['street_number_2'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_np2['street_letter_2'] = df_np2['street_letter_2'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_np2['block'] = df_np2['block'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_np2['entrance'] = df_np2['entrance'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_np2['stair'] = df_np2['stair'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_np2['floor'] = df_np2['floor'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_np2['door'] = df_np2['door'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_np2['district_name'] = df_np2['district_name'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_np2['neighbourhood_name'] = df_np2['neighbourhood_name'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_np2['n_places'] = df_np2['n_places'].astype(float)
In [550]:
#REPLACE CELL THAT IS ENTIRELY SPACE OR EMPTY with None df_np2 = df_np2.applymap(lambda x: None if isinstance(x, str) and (x=='' or x.isspace()) else x)
In [551]:
df_np3 = df_np2.copy()
DUPLICATES¶
In [552]:
df_np3.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 10287 entries, 0 to 10350 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 10287 non-null object 1 rtc 10287 non-null object 2 name 0 non-null object 3 category 10287 non-null object 4 address 10287 non-null object 5 street_type 10287 non-null object 6 street 10287 non-null object 7 street_number_1 10287 non-null object 8 street_letter_1 125 non-null object 9 street_number_2 886 non-null object 10 street_letter_2 4 non-null object 11 block 14 non-null object 12 entrance 3 non-null object 13 stair 690 non-null object 14 floor 9691 non-null object 15 door 8697 non-null object 16 district_code 0 non-null object 17 district_name 10287 non-null object 18 neighbourhood_code 0 non-null object 19 neighbourhood_name 10287 non-null object 20 longitude 0 non-null object 21 latitude 0 non-null object 22 n_places 10270 non-null float64 dtypes: float64(1), object(22) memory usage: 1.9+ MB
In [553]:
#PRELIMINARY CHECK FOR DUPLICATES df_np3.duplicated().value_counts()
Out[553]:
False 10287 dtype: int64
ADDRESS¶
In [554]:
#CHECK FOR DUPLICATES BY EXCLUDING: N_PRACTICE, RTC AND N_PLACES - TO SEE IF THERE ARE DUPLICATES ONLY IN TERMS OF CATEGORY AND ADDRESS df_np3[df_np3.duplicated(subset=['category','address','street_type', 'street','street_number_1','street_letter_1','street_number_2', 'street_letter_2','block', 'entrance','stair','floor','door','neighbourhood_name','district_name'], keep=False)].sort_values('rtc')
Out[554]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4502 | 05-2009-0237 | ALB-427 | None | Albergs | C MAJOR DEL RECTORET 2 | C | MAJOR DEL RECTORET | 2 | None | None | … | None | None | None | None | SARRIA-SANT GERVASI | None | Vallvidrera, el Tibidabo i les Planes | None | None | 247.0 |
4503 | 05-2004-0005 | ALB-427 | None | Albergs | C MAJOR DEL RECTORET 2 | C | MAJOR DEL RECTORET | 2 | None | None | … | None | None | None | None | SARRIA-SANT GERVASI | None | Vallvidrera, el Tibidabo i les Planes | None | None | NaN |
2 rows × 23 columns
In [555]:
#FROM THE PREVIOUS CHECK, THERE APPEAR TO BE 1 CASE WITH 2 DISTINCT N_PRACTICE RECORDS WHERE THE VARIABLE OF INTEREST - N_PLACES - IS ONLY PRESENT IN ONE OF THE DUPLICATED VALUES. #TO SELECT THE DUPLICATED VALUES WITH NO VARIABLE OF INTEREST: df_np3[df_np3.duplicated(subset=['category','address','street_type', 'street','street_number_1','street_letter_1','street_number_2', 'street_letter_2','block', 'entrance','stair','floor','door','neighbourhood_name','district_name'], keep=False) & (df_np2['n_places'].isnull() | (df_np2['n_places']==''))]
Out[555]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4503 | 05-2004-0005 | ALB-427 | None | Albergs | C MAJOR DEL RECTORET 2 | C | MAJOR DEL RECTORET | 2 | None | None | … | None | None | None | None | SARRIA-SANT GERVASI | None | Vallvidrera, el Tibidabo i les Planes | None | None | NaN |
1 rows × 23 columns
In [556]:
#DROP THE DUPLICATED CASES WITH NO VARIBLE OF INTEREST df_np3.drop(df_np3[(df_np3.duplicated(subset=['category','address','street_type', 'street','street_number_1','street_letter_1','street_number_2', 'street_letter_2','block', 'entrance','stair','floor','door','neighbourhood_name','district_name'], keep=False)) & (df_np3['n_places'].isnull() | (df_np3['n_places']==''))].index, inplace=True)
In [557]:
#VERIFY THAT THE RIGHT RECORDS ARE GONE df_np3[df_np3['n_practice']=='05-2004-0005']
Out[557]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places |
---|
0 rows × 23 columns
In [558]:
df_np4 = df_np3.copy()
N_PRACTICE¶
In [559]:
#CHECK IF N_PRACTICE CONTAINS NULL VALUES df_np4[(df_np4['n_practice'].isnull()) | (df_np4['n_practice']== None) | (df_np4['n_practice']=='nan') | (df_np4['n_practice']=='')]
Out[559]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places |
---|
0 rows × 23 columns
In [560]:
#CHECK IF N_PRACTICE CONTAINS DUPLICATED VALUES df_np4['n_practice'].value_counts()
Out[560]:
10-2001-0694 2 06-2017-0212 2 02-2015-0048 2 06-2010-0423 2 01-90-A-128 1 .. 06-2013-0508 1 06-2012-0606 1 06-2012-0353 1 06-2014-0361 1 01-NT-0075 1 Name: n_practice, Length: 10282, dtype: int64
In [561]:
#FIND N_PRACTICE DUPLICATED VALUES df_np4[df_np4.duplicated(subset=['n_practice'], keep=False)].sort_values('n_practice')
Out[561]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8737 | 02-2015-0048 | ALB-625 | None | Albergs | G.V. CORTS CATALANES 580 BJ | G.V. | CORTS CATALANES | 580 | None | None | … | None | BJ | None | None | L’EIXAMPLE | None | Sant Antoni | None | None | 216.0 |
8738 | 02-2015-0048 | HB-004629 | None | Hotel 1 estrella | G.V. CORTS CATALANES 580 BJ | G.V. | CORTS CATALANES | 580 | None | None | … | None | BJ | None | None | L’EIXAMPLE | None | Sant Antoni | None | None | 42.0 |
2609 | 06-2010-0423 | ALB-556 | None | Albergs | C CORSEGA 373 375 | C | CORSEGA | 373 | None | 375 | … | None | None | None | None | GRACIA | None | la Vila de Gràcia | None | None | 646.0 |
2610 | 06-2010-0423 | HB-004525 | None | Hotel 1 estrella | C CORSEGA 373 375 | C | CORSEGA | 373 | None | 375 | … | None | None | None | None | GRACIA | None | la Vila de Gràcia | None | None | 81.0 |
9038 | 06-2017-0212 | ALB-565 | None | Albergs | PG GRACIA 116 | PG | GRACIA | 116 | None | None | … | None | None | None | None | GRACIA | None | la Vila de Gràcia | None | None | 446.0 |
9039 | 06-2017-0212 | HB-004682 | None | Hotel 1 estrella | PG GRACIA 116 | PG | GRACIA | 116 | None | None | … | None | None | None | None | GRACIA | None | la Vila de Gràcia | None | None | 23.0 |
5667 | 10-2001-0694 | HB-004532 | None | Hotel 5 estrelles | C PERE IV 272 | C | PERE IV | 272 | None | None | … | None | None | None | None | SANT MARTI | None | el Poblenou | None | None | 86.0 |
5668 | 10-2001-0694 | HB-004358 | None | Hotel 4 estrelles superior | C PERE IV 272 | C | PERE IV | 272 | None | None | … | None | None | None | None | SANT MARTI | None | el Poblenou | None | None | 430.0 |
8 rows × 23 columns
The records above appear to be related to cases where establishments have sections belonging to different categories within the same establishment – therefore, these are not dropped
In [562]:
df_np5 = df_np4.copy()
RTC¶
In [563]:
#CHECK IF RTC CONTAINS NULL VALUES df_np5[(df_np5['rtc'].isnull()) | (df_np5['rtc']== None) | (df_np5['rtc']=='nan') | (df_np5['rtc']=='')]
Out[563]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places |
---|
0 rows × 23 columns
In [564]:
#CHECK IF RTC CONTAINS DUPLICATED VALUES df_np5['rtc'].value_counts().head(6)
Out[564]:
Pendent 68 HB-000957 3 HB-001951 2 HB-004669 2 ALB-562 2 HB-003893 1 Name: rtc, dtype: int64
In [565]:
#RTC DUPLICATED VALUES WITH "Pendent" (PENDING) VALUES - CATEGORY TYPES df_np5[df_np5['rtc'] == 'Pendent'].value_counts('category')
Out[565]:
category Habitatges d'Ús Turístic 48 Pensió 7 Hotel 3 estrelles 4 Hotel 4 estrelles 3 Albergs 2 Hotel 5 estrelles 2 Hotel 1 estrella 1 Hotel 2 estrelles 1 dtype: int64
In [566]:
#RTC DUPLICATED VALUES - EXCLUDING "Pendent" (PENDING) VALUES df_np5[(df_np5.duplicated(subset=['rtc'], keep=False)) & (df_np5['rtc'] != 'Pendent')].sort_values('rtc')
Out[566]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
148 | 05-2013-0284 | ALB-562 | None | Albergs | AV DIAGONAL 578 3 | AV | DIAGONAL | 578 | None | None | … | None | 3 | None | None | SARRIA-SANT GERVASI | None | Sant Gervasi – Galvany | None | None | 19.0 |
149 | 05-2016-0268 | ALB-562 | None | Albergs | AV DIAGONAL 578 5 | AV | DIAGONAL | 578 | None | None | … | None | 5 | None | None | SARRIA-SANT GERVASI | None | Sant Gervasi – Galvany | None | None | 19.0 |
3429 | 03-1999-0004 | HB-000957 | None | Pensió | C FONTRODONA 1 1 1 | C | FONTRODONA | 1 | None | None | … | None | 1 | 1 | None | SANTS-MONTJUÏC | None | el Poble Sec | None | None | NaN |
3430 | 03-1998-0472 | HB-000957 | None | Pensió | C FONTRODONA 1 2 3 | C | FONTRODONA | 1 | None | None | … | None | 2 | 3 | None | SANTS-MONTJUÏC | None | el Poble Sec | None | None | NaN |
3431 | 03-2002-0222 | HB-000957 | None | Pensió | C FONTRODONA 1 EN 3 | C | FONTRODONA | 1 | None | None | … | None | EN | 3 | None | SANTS-MONTJUÏC | None | el Poble Sec | None | None | 23.0 |
7528 | 01-89-A-200 | HB-001951 | None | Pensió | C TALLERS 82 1º 1ª | C | TALLERS | 82 | None | None | … | None | 1º | 1ª | None | CIUTAT VELLA | None | el Raval | None | None | 25.0 |
7529 | 01-91-A-116 | HB-001951 | None | Pensió | C TALLERS 82 2º 1ª | C | TALLERS | 82 | None | None | … | None | 2º | 1ª | None | CIUTAT VELLA | None | el Raval | None | None | NaN |
6485 | 02-2016-1310 | HB-004669 | None | Hotel 4 estrelles | C ROGER DE LLURIA 17 | C | ROGER DE LLURIA | 17 | None | None | … | None | None | None | None | L’EIXAMPLE | None | la Dreta de l’Eixample | None | None | 68.0 |
8740 | 02-2016-1367 | HB-004669 | None | Pensió | G.V. CORTS CATALANES 584 3 2 | G.V. | CORTS CATALANES | 584 | None | None | … | None | 3 | 2 | None | L’EIXAMPLE | None | Sant Antoni | None | None | 14.0 |
9 rows × 23 columns
The first 7 records refer to different floors within the same establishments.
However, only in one case the variable of interest – n_places – is indicated for all floors.
In the other cases, the varible of interest – n_places – is indicated only in one of the floors.
Records are not modified as it might be possible to correct them by crossreferencing them with other tables later.
The last 2 records might be a case where the RTC code was wrongly inputed as the same RTC code refers to 2 completely different establishments.
By crossreferencing with later tables, a match was found only for n_practice 02-2016-1310. Therefore, the rtc value for the other record n_practice 02-2016-1367 is modified with the suffix ERROR.
In [567]:
df_np5.loc[df_np5['n_practice']=='02-2016-1367', 'rtc'] = 'ERROR-HB-004669' df_np5[df_np5['n_practice']=='02-2016-1367']
Out[567]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8740 | 02-2016-1367 | ERROR-HB-004669 | None | Pensió | G.V. CORTS CATALANES 584 3 2 | G.V. | CORTS CATALANES | 584 | None | None | … | None | 3 | 2 | None | L’EIXAMPLE | None | Sant Antoni | None | None | 14.0 |
1 rows × 23 columns
In [568]:
df_np5[df_np5['n_practice']=='02-2016-1310']
Out[568]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6485 | 02-2016-1310 | HB-004669 | None | Hotel 4 estrelles | C ROGER DE LLURIA 17 | C | ROGER DE LLURIA | 17 | None | None | … | None | None | None | None | L’EIXAMPLE | None | la Dreta de l’Eixample | None | None | 68.0 |
1 rows × 23 columns
In [569]:
df_np6 = df_np5.copy()
MISSING VALUES¶
In [570]:
#QUICK CHECK FOR MISSING VALUES df_np6.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 10286 entries, 0 to 10350 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 10286 non-null object 1 rtc 10286 non-null object 2 name 0 non-null object 3 category 10286 non-null object 4 address 10286 non-null object 5 street_type 10286 non-null object 6 street 10286 non-null object 7 street_number_1 10286 non-null object 8 street_letter_1 125 non-null object 9 street_number_2 886 non-null object 10 street_letter_2 4 non-null object 11 block 14 non-null object 12 entrance 3 non-null object 13 stair 690 non-null object 14 floor 9691 non-null object 15 door 8697 non-null object 16 district_code 0 non-null object 17 district_name 10286 non-null object 18 neighbourhood_code 0 non-null object 19 neighbourhood_name 10286 non-null object 20 longitude 0 non-null object 21 latitude 0 non-null object 22 n_places 10270 non-null float64 dtypes: float64(1), object(22) memory usage: 1.9+ MB
N_PLACES¶
In [571]:
#CHECK WHICH CATEGORIES HAVE MISSING VALUES OF INTEREST - N PLACES df_np6[(df_np6['n_places'].isnull()) | (df_np6['n_places']== None) | (df_np6['n_places']=='nan') | (df_np6['n_places']=='')].value_counts('category')
Out[571]:
category Habitatges d'Ús Turístic 13 Pensió 3 dtype: int64
NO MODIFICATIONS MADE FOR NOW HOW AS IT MIGHT BE POSSIBLE TO RECOVER SOME OF THE DATA FROM THE OTHER TABLES
DISTRICT – NEIGHBOURHOOD¶
Comparing table with district and neighbourhood table
In [572]:
df_np6[(df_np6['district_name'].isnull()) | (df_np6['district_name']== None) | (df_np6['district_name']=='nan') | (df_np6['district_name']=='')].shape[0]
Out[572]:
9
In [573]:
df_np6[(df_np6['neighbourhood_name'].isnull()) | (df_np6['neighbourhood_name']== None) | (df_np6['neighbourhood_name']=='nan') | (df_np6['neighbourhood_name']=='')].shape[0]
Out[573]:
9
In [574]:
df_np6['district_name'].isin(df_district_neighbourhood_table['District_Name']).value_counts()
Out[574]:
False 10286 Name: district_name, dtype: int64
In [575]:
df_np6['neighbourhood_name'].isin(df_district_neighbourhood_table['Neighbourhood_Name']).value_counts()
Out[575]:
True 9759 False 527 Name: neighbourhood_name, dtype: int64
In [576]:
df_np6[['district_name','neighbourhood_name']][(~df_np6['neighbourhood_name'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])) & (df_np6['neighbourhood_name'].notnull())].sort_values('neighbourhood_name')
Out[576]:
district_name | neighbourhood_name | |
---|---|---|
5279 | SANTS-MONTJUÏC | el Poble Sec |
6080 | SANTS-MONTJUÏC | el Poble Sec |
6079 | SANTS-MONTJUÏC | el Poble Sec |
6078 | SANTS-MONTJUÏC | el Poble Sec |
6034 | SANTS-MONTJUÏC | el Poble Sec |
… | … | … |
6312 | nan | nan |
4032 | nan | nan |
7050 | nan | nan |
4276 | nan | nan |
318 | nan | nan |
527 rows × 2 columns
In [577]:
df_district_neighbourhood_table[df_district_neighbourhood_table['Neighbourhood_Name'].str.contains('Poble')]
Out[577]:
District_Code | District_Name | Neighbourhood_Code | Neighbourhood_Name | |
---|---|---|---|---|
10 | 03 | Sants-Montjuïc | 11 | el Poble-sec |
65 | 10 | Sant Martí | 66 | el Parc i la Llacuna del Poblenou |
66 | 10 | Sant Martí | 67 | la Vila Olímpica del Poblenou |
67 | 10 | Sant Martí | 68 | el Poblenou |
68 | 10 | Sant Martí | 69 | Diagonal Mar i el Front Marítim del Poblenou |
70 | 10 | Sant Martí | 71 | Provençals del Poblenou |
In [578]:
df_np6.loc[df_np6['neighbourhood_name']=='el Poble Sec','neighbourhood_name'] = 'el Poble-sec'
In [579]:
df_np6[['district_name','neighbourhood_name']][(~df_np6['neighbourhood_name'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])) & (df_np6['neighbourhood_name'].notnull())].sort_values('neighbourhood_name')
Out[579]:
district_name | neighbourhood_name | |
---|---|---|
318 | nan | nan |
4032 | nan | nan |
4276 | nan | nan |
6310 | nan | nan |
6311 | nan | nan |
6312 | nan | nan |
7050 | nan | nan |
7617 | nan | nan |
7686 | nan | nan |
In [580]:
df_np6[(df_np6['neighbourhood_name'].isnull()) | (df_np6['neighbourhood_name']=='') | (df_np6['neighbourhood_name']==None)| (df_np6['neighbourhood_name']=='nan') | (df_np6['district_name'].isnull()) | (df_np6['district_name']=='') | (df_np6['district_name']==None)| (df_np6['district_name']=='nan')]
Out[580]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
318 | 10-2014-0475 | HUTB-011359 | None | Habitatges d’Ús Turístic | AV MERIDIANA 109 03 1 | AV | MERIDIANA | 109 | None | None | … | None | 03 | 1 | None | nan | None | nan | None | None | 4.0 |
4032 | 02-2013-0207 | HUTB-005185 | None | Habitatges d’Ús Turístic | C INDUSTRIA 175 02 2 | C | INDUSTRIA | 175 | None | None | … | None | 02 | 2 | None | nan | None | nan | None | None | 4.0 |
4276 | 06-2014-0281 | HUTB-009991 | None | Habitatges d’Ús Turístic | C JOSEP TORRES 26 04 2 | C | JOSEP TORRES | 26 | None | None | … | None | 04 | 2 | None | nan | None | nan | None | None | 6.0 |
6310 | 06-2013-0206 | HUTB-005620 | None | Habitatges d’Ús Turístic | C RIERA DE SANT MIQUEL 49 01 3 | C | RIERA DE SANT MIQUEL | 49 | None | None | … | None | 01 | 3 | None | nan | None | nan | None | None | 6.0 |
6311 | 06-2013-0207 | HUTB-005624 | None | Habitatges d’Ús Turístic | C RIERA DE SANT MIQUEL 49 02 1 | C | RIERA DE SANT MIQUEL | 49 | None | None | … | None | 02 | 1 | None | nan | None | nan | None | None | 6.0 |
6312 | 06-2013-0208 | HUTB-005622 | None | Habitatges d’Ús Turístic | C RIERA DE SANT MIQUEL 49 02 2 | C | RIERA DE SANT MIQUEL | 49 | None | None | … | None | 02 | 2 | None | nan | None | nan | None | None | 6.0 |
7050 | 06-2012-0286 | HUTB-002334 | None | Habitatges d’Ús Turístic | C SANT SALVADOR 20 02 1 | C | SANT SALVADOR | 20 | None | None | … | None | 02 | 1 | None | nan | None | nan | None | None | 6.0 |
7617 | 04-2014-0216 | HUTB-012140 | None | Habitatges d’Ús Turístic | C TAQUIGRAF SERRA 1 02 5 | C | TAQUIGRAF SERRA | 1 | None | None | … | None | 02 | 5 | None | nan | None | nan | None | None | 3.0 |
7686 | 02-2022-0524 | HUTB-064279 | None | Habitatges d’Ús Turístic | C TARRAGONA 84 B 5 1 | C | TARRAGONA | 84 | None | None | … | B | 5 | 1 | None | nan | None | nan | None | None | 1.0 |
9 rows × 23 columns
MANUAL INPUTATION AS MISSING RECORDS ARE ONLY 7
AV MERIDIANA 109 – neighbourhood_name: el Clot neighbourhood_code: 65
C INDUSTRIA 175 – neighbourhood_name: la Sagrada Família neighbourhood_code: 06
C JOSEP TORRES 26 – neighbourhood_name: la Vila de Gràcia neighbourhood_code: 31
C RIERA DE SANT MIQUEL 49 – neighbourhood_name: la Vila de Gràcia neighbourhood_code: 31
C SANT SALVADOR 20 – neighbourhood_name: la Vila de Gràcia neighbourhood_code: 31
C TAQUIGRAF SERRA 1 – neighbourhood_name: les Corts neighbourhood_code: 19
C TARRAGONA 84 – neighbourhood_name: la Nova Esquerra de l’Eixample neighbourhood_code: 09
SOURCE:
https://ajuntament.barcelona.cat/estadistica/catala/Territori/div84/convertidors/barris73.htm
In [581]:
df_np7 = df_np6.copy()
In [582]:
#FILL IN MISSING VALUES df_np7.loc[df_np7['n_practice']=='10-2014-0475','neighbourhood_code'] = "65" df_np7.loc[df_np7['n_practice']=='10-2014-0475','neighbourhood_name'] = "el Clot" df_np7.loc[df_np7['n_practice']=='02-2013-0207','neighbourhood_code'] = "06" df_np7.loc[df_np7['n_practice']=='02-2013-0207','neighbourhood_name'] = "la Sagrada Família" df_np7.loc[df_np7['n_practice']=='06-2014-0281','neighbourhood_code'] = "31" df_np7.loc[df_np7['n_practice']=='06-2014-0281','neighbourhood_name'] = "la Vila de Gràcia" df_np7.loc[df_np7['n_practice']=='06-2013-0206','neighbourhood_code'] = "31" df_np7.loc[df_np7['n_practice']=='06-2013-0206','neighbourhood_name'] = "la Vila de Gràcia" df_np7.loc[df_np7['n_practice']=='06-2013-0207','neighbourhood_code'] = "31" df_np7.loc[df_np7['n_practice']=='06-2013-0207','neighbourhood_name'] = "la Vila de Gràcia" df_np7.loc[df_np7['n_practice']=='06-2013-0208','neighbourhood_code'] = "31" df_np7.loc[df_np7['n_practice']=='06-2013-0208','neighbourhood_name'] = "la Vila de Gràcia" df_np7.loc[df_np7['n_practice']=='06-2012-0286','neighbourhood_code'] = '31' df_np7.loc[df_np7['n_practice']=='06-2012-0286','neighbourhood_name'] = 'la Vila de Gràcia' df_np7.loc[df_np7['n_practice']=='04-2014-0216','neighbourhood_code'] = '19' df_np7.loc[df_np7['n_practice']=='04-2014-0216','neighbourhood_name'] = 'les Corts' df_np7.loc[df_np7['n_practice']=='02-2022-0524','neighbourhood_code'] = "09" df_np7.loc[df_np7['n_practice']=='02-2022-0524','neighbourhood_name'] = "la Nova Esquerra de l'Eixample"
In [583]:
#MAPPING/REPLACING REQUIRES NO NULL VALUES IN COLUMN LINKED TO set_index COLUMN df_np7['neighbourhood_code'] = df_np7['neighbourhood_name'].replace(df_district_neighbourhood_table.set_index('Neighbourhood_Name')['Neighbourhood_Code']) df_np7['district_code'] = df_np7['neighbourhood_name'].replace(df_district_neighbourhood_table.set_index('Neighbourhood_Name')['District_Code']) df_np7['district_name'] = df_np7['neighbourhood_name'].replace(df_district_neighbourhood_table.set_index('Neighbourhood_Name')['District_Name'])
In [584]:
#CHECK df_np7[(df_np7['neighbourhood_name'].isnull()) | (df_np7['neighbourhood_name']=='nan') | (df_np7['neighbourhood_name']==None) | (df_np7['neighbourhood_name']=='')]
Out[584]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places |
---|
0 rows × 23 columns
It might be possible to find and fill in the missing data later on when crossreferencing with other tables. Therefore, records kept incomplete for now.
In [585]:
df_np8 = df_np7.copy()
NORMALIZATION¶
In [586]:
df_np8.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 10286 entries, 0 to 10350 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 10286 non-null object 1 rtc 10286 non-null object 2 name 0 non-null object 3 category 10286 non-null object 4 address 10286 non-null object 5 street_type 10286 non-null object 6 street 10286 non-null object 7 street_number_1 10286 non-null object 8 street_letter_1 125 non-null object 9 street_number_2 886 non-null object 10 street_letter_2 4 non-null object 11 block 14 non-null object 12 entrance 3 non-null object 13 stair 690 non-null object 14 floor 9691 non-null object 15 door 8697 non-null object 16 district_code 10286 non-null object 17 district_name 10286 non-null object 18 neighbourhood_code 10286 non-null object 19 neighbourhood_name 10286 non-null object 20 longitude 0 non-null object 21 latitude 0 non-null object 22 n_places 10270 non-null float64 dtypes: float64(1), object(22) memory usage: 1.9+ MB
In [587]:
df_np8['category'].value_counts()
Out[587]:
Habitatges d'Ús Turístic 9409 Pensió 286 Hotel 4 estrelles 144 Albergs 125 Hotel 3 estrelles 119 Hotel 1 estrella 47 Hotel 2 estrelles 42 Hotel 4 estrelles superior 27 Hotel 5 estrelles 24 Hotel gran luxe 20 Hotel-Apart 4 estrelles 12 Apartaments Turístics 12 Hotel-Apart 3 estrelles 10 Hotel-Apart 2 estrelles 6 Hotel-Apart 1 estrella 2 Hotel-Apart 4 estrelles superior 1 Name: category, dtype: int64
In [588]:
df_n_places = df_np8.copy()
DATAFRAME: COORDINATES FOR TOURIST ACCOMODATIONS IN PRIVATE HOUSES – hut – E.G. AIRBNB¶
SOURCE:
The csv file provided needs cleaning. The problem is created by the neighbourhood: “Sant Pere, Santa Caterina i la Ribera”. The comma within the neighbourhood name split the corresponding column into 2: “Sant Pere” and “Santa Caterina i la Ribera”. Data for that neighbourhood is therefore moved one column on the right, making the file appear to have one more column than it should. This prevents Pandas to allocate data within the apprpriate columns.
A way to clean the data is to:
- save the file as an excel file
- split values into columns by using comma as a delimiter
- frame the data as a table
- sort all results by including only those referring to “Sant Pere, Santa Caterina i la Ribera”
- find and replace “Sant Pere” with “Sant Pere, Santa Caterina i la Ribera”
- cut and paste all data on the right of the column “Santa Caterina i la Ribera” on the same column to eliminate it and reposition the data in the correct place
In [589]:
df_hut_coordinates = pd.read_excel('2022_Hut_comunicacio_cleaned.xlsx') df_hut_coordinates.head(1)
Out[589]:
N_EXPEDIENT | CODI_DISTRICTE | DISTRICTE | CODI_BARRI | BARRI | TIPUS_CARRER | CARRER | TIPUS_NUM | NUM1 | LLETRA1 | … | LLETRA2 | BLOC | PORTAL | ESCALA | PIS | PORTA | NUMERO_REGISTRE_GENERALITAT | NUMERO_PLACES | LONGITUD_X | LATITUD_Y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 03-2010-0437 | 3 | SANTS-MONTJUÏC | 16.0 | la Bordeta | Carrer | CONSTITUCIO | 1 | 127 | NaN | … | NaN | NaN | NaN | NaN | 4 | 2 | HUTB-003502 | 7.0 | 2.132215 | 41.367195 |
1 rows × 21 columns
In [590]:
df_hut_coordinates1 = df_hut_coordinates.copy()
EXTRACT RELEVANT DATA¶
In [591]:
#DROP COLUMNS df_hut_coordinates1.drop(columns=['TIPUS_NUM'], inplace=True)
In [592]:
#RENAME COLUMNS IN ENGLISH df_hut_coordinates1.columns = ['n_practice_hut','district_code_hut','district_name_hut','neighbourhood_code_hut', 'neighbourhood_name_hut','street_type_hut','street_hut', 'street_number_1_hut','street_letter_1_hut','street_number_2_hut','street_letter_2_hut', 'block_hut','entrance_hut','stair_hut','floor_hut','door_hut','rtc_hut','n_places_hut', 'longitude_hut','latitude_hut'] df_hut_coordinates1.head(1)
Out[592]:
n_practice_hut | district_code_hut | district_name_hut | neighbourhood_code_hut | neighbourhood_name_hut | street_type_hut | street_hut | street_number_1_hut | street_letter_1_hut | street_number_2_hut | street_letter_2_hut | block_hut | entrance_hut | stair_hut | floor_hut | door_hut | rtc_hut | n_places_hut | longitude_hut | latitude_hut | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 03-2010-0437 | 3 | SANTS-MONTJUÏC | 16.0 | la Bordeta | Carrer | CONSTITUCIO | 127 | NaN | 129.0 | NaN | NaN | NaN | NaN | 4 | 2 | HUTB-003502 | 7.0 | 2.132215 | 41.367195 |
In [593]:
#ADD COLUMN NAME WITH CATEGORY REPEATED AS THERE ARE NO NAMES FOR HUTS df_hut_coordinates1['name_hut'] = "Habitatges d'Ús Turístic" df_hut_coordinates1.head(1)
Out[593]:
n_practice_hut | district_code_hut | district_name_hut | neighbourhood_code_hut | neighbourhood_name_hut | street_type_hut | street_hut | street_number_1_hut | street_letter_1_hut | street_number_2_hut | … | block_hut | entrance_hut | stair_hut | floor_hut | door_hut | rtc_hut | n_places_hut | longitude_hut | latitude_hut | name_hut | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 03-2010-0437 | 3 | SANTS-MONTJUÏC | 16.0 | la Bordeta | Carrer | CONSTITUCIO | 127 | NaN | 129.0 | … | NaN | NaN | NaN | 4 | 2 | HUTB-003502 | 7.0 | 2.132215 | 41.367195 | Habitatges d’Ús Turístic |
1 rows × 21 columns
WHITESPACES¶
In [594]:
#REMOVING SPACES # .replace(' ','', regex=True) - replace all spaces with nothing # .str.strip() - replace 1 initial and 1 trailing space only # .replace(r's+',' ', regex=True) - replace multiple spaces with one single space # .replace(r'^s+|s+$','',regex=True) - replace all + spaces s starting ^ and trailing $ # .replace('nan','', regex=True) - replace pre-existing 'nan' strings into empty cells - not to be used for string columns potentially containing nan as subpart of string # .replace('.0','',regex=True) - replace .0 with nothing - '' is required to assign '.' as a normal character and not as a special one df_hut_coordinates1['n_practice_hut'] = df_hut_coordinates1['n_practice_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['rtc_hut'] = df_hut_coordinates1['rtc_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['street_type_hut'] = df_hut_coordinates1['street_type_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_hut_coordinates1['street_hut'] = df_hut_coordinates1['street_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_hut_coordinates1['street_number_1_hut'] = df_hut_coordinates1['street_number_1_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['street_letter_1_hut'] = df_hut_coordinates1['street_letter_1_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['street_number_2_hut'] = df_hut_coordinates1['street_number_2_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['street_letter_2_hut'] = df_hut_coordinates1['street_letter_2_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['block_hut'] = df_hut_coordinates1['block_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['entrance_hut'] = df_hut_coordinates1['entrance_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['stair_hut'] = df_hut_coordinates1['stair_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['floor_hut'] = df_hut_coordinates1['floor_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['door_hut'] = df_hut_coordinates1['door_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['district_code_hut'] = df_hut_coordinates1['district_code_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['neighbourhood_code_hut'] = df_hut_coordinates1['neighbourhood_code_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['district_name_hut'] = df_hut_coordinates1['district_name_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_hut_coordinates1['neighbourhood_name_hut'] = df_hut_coordinates1['neighbourhood_name_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_hut_coordinates1['longitude_hut'] = df_hut_coordinates1['longitude_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['latitude_hut'] = df_hut_coordinates1['latitude_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hut_coordinates1['n_places_hut'] = df_hut_coordinates1['n_places_hut'].astype(float) df_hut_coordinates1['name_hut'] = df_hut_coordinates1['name_hut'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
In [595]:
#DISTRICT AND NEIGHBOURHOOD CODES NEED TO BE IN STRING FORMAT AND REQUIRE AN ADDED '0' IN FRONT OF ALL NUMBERS BELOW 10 df_hut_coordinates1[['district_code_hut']] = df_hut_coordinates1[['district_code_hut']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x]) df_hut_coordinates1[['neighbourhood_code_hut']] = df_hut_coordinates1[['neighbourhood_code_hut']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
In [596]:
#REPLACE CELL THAT IS ENTIRELY SPACE OR EMPTY with None df_hut_coordinates1 = df_hut_coordinates1.applymap(lambda x: None if isinstance(x, str) and (x=='' or x.isspace()) else x)
In [597]:
df_hut_coordinates2 = df_hut_coordinates1.copy()
DUPLICATES¶
In [598]:
#PRELIMINARY CHECK FOR DUPLICATES df_hut_coordinates2.duplicated().value_counts()
Out[598]:
False 9409 dtype: int64
In [599]:
df_hut_coordinates2.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9409 entries, 0 to 9408 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice_hut 9409 non-null object 1 district_code_hut 9409 non-null object 2 district_name_hut 9409 non-null object 3 neighbourhood_code_hut 9405 non-null object 4 neighbourhood_name_hut 9409 non-null object 5 street_type_hut 9409 non-null object 6 street_hut 9409 non-null object 7 street_number_1_hut 9409 non-null object 8 street_letter_1_hut 116 non-null object 9 street_number_2_hut 747 non-null object 10 street_letter_2_hut 3 non-null object 11 block_hut 10 non-null object 12 entrance_hut 3 non-null object 13 stair_hut 689 non-null object 14 floor_hut 9378 non-null object 15 door_hut 8528 non-null object 16 rtc_hut 9361 non-null object 17 n_places_hut 9396 non-null float64 18 longitude_hut 9409 non-null object 19 latitude_hut 9409 non-null object 20 name_hut 9409 non-null object dtypes: float64(1), object(20) memory usage: 1.5+ MB
In [600]:
df_hut_coordinates2.columns
Out[600]:
Index(['n_practice_hut', 'district_code_hut', 'district_name_hut', 'neighbourhood_code_hut', 'neighbourhood_name_hut', 'street_type_hut', 'street_hut', 'street_number_1_hut', 'street_letter_1_hut', 'street_number_2_hut', 'street_letter_2_hut', 'block_hut', 'entrance_hut', 'stair_hut', 'floor_hut', 'door_hut', 'rtc_hut', 'n_places_hut', 'longitude_hut', 'latitude_hut', 'name_hut'], dtype='object')
In [601]:
#CHECK DUPLICATES BASED ON ADDRESS ONLY - EXCLUDING ID COLUMNS - 'n_practice_hut', 'rtc_hut': #TO SEE IF THERE ARE DUPLICATES ONLY IN TERMS OF ADDRESS df_hut_coordinates2[df_hut_coordinates2.duplicated(subset=['district_code_hut', 'district_name_hut', 'neighbourhood_code_hut', 'neighbourhood_name_hut', 'street_type_hut', 'street_hut', 'street_number_1_hut', 'street_letter_1_hut', 'street_number_2_hut', 'street_letter_2_hut','block_hut', 'entrance_hut', 'stair_hut', 'floor_hut', 'door_hut','n_places_hut','longitude_hut', 'latitude_hut'], keep=False)].shape[0]
Out[601]:
0
In [602]:
#CHECK FOR DUPLICATES BY FOCUSING ON ID COLUMN: 'rtc_hut' - EXCLUDING NULL VALUES df_hut_coordinates1[(df_hut_coordinates1.duplicated(subset=['rtc_hut'], keep=False)) & df_hut_coordinates1['rtc_hut'].notnull()].shape[0]
Out[602]:
0
In [603]:
#CHECK FOR DUPLICATES BY FOCUSING ON ID COLUMN: 'n_practice_hut' df_hut_coordinates1[df_hut_coordinates1.duplicated(subset=['n_practice_hut'], keep=False)].shape[0]
Out[603]:
0
In [604]:
df_hut_coordinates3 = df_hut_coordinates2.copy()
MISSING VALUES¶
In [605]:
#CHECK FOR MISSING VALUES df_hut_coordinates3.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9409 entries, 0 to 9408 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice_hut 9409 non-null object 1 district_code_hut 9409 non-null object 2 district_name_hut 9409 non-null object 3 neighbourhood_code_hut 9405 non-null object 4 neighbourhood_name_hut 9409 non-null object 5 street_type_hut 9409 non-null object 6 street_hut 9409 non-null object 7 street_number_1_hut 9409 non-null object 8 street_letter_1_hut 116 non-null object 9 street_number_2_hut 747 non-null object 10 street_letter_2_hut 3 non-null object 11 block_hut 10 non-null object 12 entrance_hut 3 non-null object 13 stair_hut 689 non-null object 14 floor_hut 9378 non-null object 15 door_hut 8528 non-null object 16 rtc_hut 9361 non-null object 17 n_places_hut 9396 non-null float64 18 longitude_hut 9409 non-null object 19 latitude_hut 9409 non-null object 20 name_hut 9409 non-null object dtypes: float64(1), object(20) memory usage: 1.5+ MB
N_PLACES¶
In [606]:
#CHECK MISSING VALUES OF INTEREST - N PLACES df_hut_coordinates2[(df_hut_coordinates2['n_places_hut'].isnull()) | (df_hut_coordinates2['n_places_hut']=='') | (df_hut_coordinates2['n_places_hut']=='nan') | ((df_hut_coordinates2['n_places_hut']=='None'))].shape[0]
Out[606]:
13
FILLING INTO MISSING VALUES WILL BE OPERATED AFTER MERGING TABLES AS IT MIGHT BE POSSIBLE TO RETRIEVE DATA FROM THE OTHER TABLE
In [607]:
df_hut_coordinates3 = df_hut_coordinates2.copy()
DISTRICT – NEIGHBOURHOOD¶
Comparing table with district and neighbourhood table
In [608]:
#CHECK NULL VALUES df_hut_coordinates3[(df_hut_coordinates3['neighbourhood_code_hut'].isnull()) | (df_hut_coordinates3['neighbourhood_code_hut']== None) | (df_hut_coordinates3['neighbourhood_code_hut']=='') | (df_hut_coordinates3['neighbourhood_code_hut']=='')]
Out[608]:
n_practice_hut | district_code_hut | district_name_hut | neighbourhood_code_hut | neighbourhood_name_hut | street_type_hut | street_hut | street_number_1_hut | street_letter_1_hut | street_number_2_hut | … | block_hut | entrance_hut | stair_hut | floor_hut | door_hut | rtc_hut | n_places_hut | longitude_hut | latitude_hut | name_hut | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7556 | 10-2013-0467 | 10 | SANT MARTI | None | el Poblenou | nan | PIQUER | 29 | None | 37 | … | None | None | None | None | 37 | HUTB-007634 | 11.0 | 2.20676702 | 41.40025623 | Habitatges d’Ús Turístic |
9401 | 06-2012-0286 | 06 | GRACIA | None | nan | nan | SANT SALVADOR | 20 | None | None | … | None | None | None | 2 | 1 | HUTB-002334 | 6.0 | 2.152357339 | 41.40509017 | Habitatges d’Ús Turístic |
9406 | 10-2014-0475 | 10 | SANT MARTI | None | nan | nan | MERIDIANA | 109 | None | None | … | None | None | None | 3 | 1 | HUTB-011359 | 4.0 | 2.185461927 | 41.40601944 | Habitatges d’Ús Turístic |
9408 | 04-2014-0216 | 04 | LES CORTS | None | nan | nan | TAQUIGRAF SERRA | 1 | None | None | … | None | None | None | 2 | 5 | HUTB-012140 | 3.0 | 2.137418705 | 41.38397137 | Habitatges d’Ús Turístic |
4 rows × 21 columns
In [609]:
#CHECK NULL VALUES df_hut_coordinates3[(df_hut_coordinates3['neighbourhood_name_hut'].isnull()) | (df_hut_coordinates3['neighbourhood_name_hut']== None) | (df_hut_coordinates3['neighbourhood_name_hut']=='nan') | (df_hut_coordinates3['neighbourhood_name_hut']=='')].shape[0]
Out[609]:
9
In [610]:
#CHECK IF NEIGHBOURHOOD CODES ARE IN NEIGHBOURHOOD TABLE df_hut_coordinates3['neighbourhood_code_hut'].isin(df_district_neighbourhood_table['Neighbourhood_Code']).value_counts()
Out[610]:
True 9405 False 4 Name: neighbourhood_code_hut, dtype: int64
In [611]:
#CHECK IF NEIGHBOURHOOD NAMES ARE IN NEIGHBOURHOOD TABLE df_hut_coordinates3['neighbourhood_name_hut'].isin(df_district_neighbourhood_table['Neighbourhood_Name']).value_counts()
Out[611]:
True 8910 False 499 Name: neighbourhood_name_hut, dtype: int64
In [612]:
#CHECK WHICH NEIGHBOURHOOD CODES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_hut_coordinates3[(~df_hut_coordinates3['neighbourhood_code_hut'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])) & df_hut_coordinates3['neighbourhood_code_hut'].notnull()].value_counts('neighbourhood_code_hut')
Out[612]:
Series([], dtype: int64)
In [613]:
#CHECK WHICH NEIGHBOURHOOD CODES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_hut_coordinates3[df_hut_coordinates3['neighbourhood_code_hut'].isnull()]
Out[613]:
n_practice_hut | district_code_hut | district_name_hut | neighbourhood_code_hut | neighbourhood_name_hut | street_type_hut | street_hut | street_number_1_hut | street_letter_1_hut | street_number_2_hut | … | block_hut | entrance_hut | stair_hut | floor_hut | door_hut | rtc_hut | n_places_hut | longitude_hut | latitude_hut | name_hut | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7556 | 10-2013-0467 | 10 | SANT MARTI | None | el Poblenou | nan | PIQUER | 29 | None | 37 | … | None | None | None | None | 37 | HUTB-007634 | 11.0 | 2.20676702 | 41.40025623 | Habitatges d’Ús Turístic |
9401 | 06-2012-0286 | 06 | GRACIA | None | nan | nan | SANT SALVADOR | 20 | None | None | … | None | None | None | 2 | 1 | HUTB-002334 | 6.0 | 2.152357339 | 41.40509017 | Habitatges d’Ús Turístic |
9406 | 10-2014-0475 | 10 | SANT MARTI | None | nan | nan | MERIDIANA | 109 | None | None | … | None | None | None | 3 | 1 | HUTB-011359 | 4.0 | 2.185461927 | 41.40601944 | Habitatges d’Ús Turístic |
9408 | 04-2014-0216 | 04 | LES CORTS | None | nan | nan | TAQUIGRAF SERRA | 1 | None | None | … | None | None | None | 2 | 5 | HUTB-012140 | 3.0 | 2.137418705 | 41.38397137 | Habitatges d’Ús Turístic |
4 rows × 21 columns
In [614]:
#FIND MATCHING RECORDS IN NEIGHBOURHOOD TABLE df_district_neighbourhood_table[df_district_neighbourhood_table['Neighbourhood_Name'].str.contains('Poblenou')]
Out[614]:
District_Code | District_Name | Neighbourhood_Code | Neighbourhood_Name | |
---|---|---|---|---|
65 | 10 | Sant Martí | 66 | el Parc i la Llacuna del Poblenou |
66 | 10 | Sant Martí | 67 | la Vila Olímpica del Poblenou |
67 | 10 | Sant Martí | 68 | el Poblenou |
68 | 10 | Sant Martí | 69 | Diagonal Mar i el Front Marítim del Poblenou |
70 | 10 | Sant Martí | 71 | Provençals del Poblenou |
In [615]:
df_hut_coordinates4 = df_hut_coordinates3.copy()
In [616]:
#REPLACE VALUES df_hut_coordinates4.loc[df_hut_coordinates4['neighbourhood_name_hut']=='el Poblenou','neighbourhood_code_hut'] = '68'
In [617]:
#CHECK HOW MANY NEIGHBOURHOOD CODES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_hut_coordinates4[(~df_hut_coordinates4['neighbourhood_code_hut'].isin(df_district_neighbourhood_table['Neighbourhood_Code']))]
Out[617]:
n_practice_hut | district_code_hut | district_name_hut | neighbourhood_code_hut | neighbourhood_name_hut | street_type_hut | street_hut | street_number_1_hut | street_letter_1_hut | street_number_2_hut | … | block_hut | entrance_hut | stair_hut | floor_hut | door_hut | rtc_hut | n_places_hut | longitude_hut | latitude_hut | name_hut | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9401 | 06-2012-0286 | 06 | GRACIA | None | nan | nan | SANT SALVADOR | 20 | None | None | … | None | None | None | 2 | 1 | HUTB-002334 | 6.0 | 2.152357339 | 41.40509017 | Habitatges d’Ús Turístic |
9406 | 10-2014-0475 | 10 | SANT MARTI | None | nan | nan | MERIDIANA | 109 | None | None | … | None | None | None | 3 | 1 | HUTB-011359 | 4.0 | 2.185461927 | 41.40601944 | Habitatges d’Ús Turístic |
9408 | 04-2014-0216 | 04 | LES CORTS | None | nan | nan | TAQUIGRAF SERRA | 1 | None | None | … | None | None | None | 2 | 5 | HUTB-012140 | 3.0 | 2.137418705 | 41.38397137 | Habitatges d’Ús Turístic |
3 rows × 21 columns
MANUAL INPUTATION AS MISSING RECORDS ARE ONLY 3
AV MERIDIANA 109 – neighbourhood_name: el Clot neighbourhood_code: 65
C SANT SALVADOR 20 – neighbourhood_name: la Vila de Gràcia neighbourhood_code: 31
C TAQUIGRAF SERRA 1 – neighbourhood_name: les Corts neighbourhood_code: 19
SOURCE:
https://ajuntament.barcelona.cat/estadistica/catala/Territori/div84/convertidors/barris73.htm
In [618]:
#FILL IN MISSING VALUES df_hut_coordinates4.loc[df_hut_coordinates4['n_practice_hut']=='10-2014-0475','neighbourhood_code_hut'] = '65' df_hut_coordinates4.loc[df_hut_coordinates4['n_practice_hut']=='10-2014-0475','neighbourhood_name_hut'] = 'el Clot' df_hut_coordinates4.loc[df_hut_coordinates4['n_practice_hut']=='06-2012-0286','neighbourhood_code_hut'] = '31' df_hut_coordinates4.loc[df_hut_coordinates4['n_practice_hut']=='06-2012-0286','neighbourhood_name_hut'] = 'la Vila de Gràcia' df_hut_coordinates4.loc[df_hut_coordinates4['n_practice_hut']=='04-2014-0216','neighbourhood_code_hut'] = '19' df_hut_coordinates4.loc[df_hut_coordinates4['n_practice_hut']=='04-2014-0216','neighbourhood_name_hut'] = 'les Corts'
In [619]:
#CHECK HOW MANY NEIGHBOURHOOD CODES ARE NOT IN THE NEIGHBOURHOOD TABLE df_hut_coordinates4[(~df_hut_coordinates4['neighbourhood_code_hut'].isin(df_district_neighbourhood_table['Neighbourhood_Code']))]
Out[619]:
n_practice_hut | district_code_hut | district_name_hut | neighbourhood_code_hut | neighbourhood_name_hut | street_type_hut | street_hut | street_number_1_hut | street_letter_1_hut | street_number_2_hut | … | block_hut | entrance_hut | stair_hut | floor_hut | door_hut | rtc_hut | n_places_hut | longitude_hut | latitude_hut | name_hut |
---|
0 rows × 21 columns
In [620]:
#MAPPING/REPLACING REQUIRES NO NULL VALUES IN COLUMN LINKED TO set_index COLUMN df_hut_coordinates4['neighbourhood_name_hut'] = df_hut_coordinates4['neighbourhood_code_hut'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['Neighbourhood_Name']) df_hut_coordinates4['district_code_hut'] = df_hut_coordinates4['neighbourhood_code_hut'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Code']) df_hut_coordinates4['district_name_hut'] = df_hut_coordinates4['neighbourhood_code_hut'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Name'])
In [621]:
#CHECK NULL VALUES df_hut_coordinates4[(df_hut_coordinates4['neighbourhood_code_hut'].isnull()) | (df_hut_coordinates4['neighbourhood_code_hut']== None) | (df_hut_coordinates4['neighbourhood_code_hut']=='nan') | (df_hut_coordinates4['neighbourhood_code_hut']=='')]
Out[621]:
n_practice_hut | district_code_hut | district_name_hut | neighbourhood_code_hut | neighbourhood_name_hut | street_type_hut | street_hut | street_number_1_hut | street_letter_1_hut | street_number_2_hut | … | block_hut | entrance_hut | stair_hut | floor_hut | door_hut | rtc_hut | n_places_hut | longitude_hut | latitude_hut | name_hut |
---|
0 rows × 21 columns
In [622]:
#CHECK NULL VALUES df_hut_coordinates4[(df_hut_coordinates4['neighbourhood_name_hut'].isnull()) | (df_hut_coordinates4['neighbourhood_name_hut']== None) | (df_hut_coordinates4['neighbourhood_name_hut']=='nan') | (df_hut_coordinates4['neighbourhood_name_hut']=='')]
Out[622]:
n_practice_hut | district_code_hut | district_name_hut | neighbourhood_code_hut | neighbourhood_name_hut | street_type_hut | street_hut | street_number_1_hut | street_letter_1_hut | street_number_2_hut | … | block_hut | entrance_hut | stair_hut | floor_hut | door_hut | rtc_hut | n_places_hut | longitude_hut | latitude_hut | name_hut |
---|
0 rows × 21 columns
In [623]:
df_hut_coordinates4[['district_code_hut','district_name_hut','neighbourhood_code_hut','neighbourhood_name_hut']].sort_values('neighbourhood_code_hut')
Out[623]:
district_code_hut | district_name_hut | neighbourhood_code_hut | neighbourhood_name_hut | |
---|---|---|---|---|
224 | 01 | Ciutat Vella | 01 | el Raval |
7228 | 01 | Ciutat Vella | 01 | el Raval |
7229 | 01 | Ciutat Vella | 01 | el Raval |
7230 | 01 | Ciutat Vella | 01 | el Raval |
7231 | 01 | Ciutat Vella | 01 | el Raval |
… | … | … | … | … |
4395 | 10 | Sant Martí | 73 | la Verneda i la Pau |
4396 | 10 | Sant Martí | 73 | la Verneda i la Pau |
4397 | 10 | Sant Martí | 73 | la Verneda i la Pau |
4393 | 10 | Sant Martí | 73 | la Verneda i la Pau |
4391 | 10 | Sant Martí | 73 | la Verneda i la Pau |
9409 rows × 4 columns
In [624]:
df_hut_coordinates5 = df_hut_coordinates4.copy()
NORMALIZATION¶
In [625]:
df_hut_coordinates5.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9409 entries, 0 to 9408 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice_hut 9409 non-null object 1 district_code_hut 9409 non-null object 2 district_name_hut 9409 non-null object 3 neighbourhood_code_hut 9409 non-null object 4 neighbourhood_name_hut 9409 non-null object 5 street_type_hut 9409 non-null object 6 street_hut 9409 non-null object 7 street_number_1_hut 9409 non-null object 8 street_letter_1_hut 116 non-null object 9 street_number_2_hut 747 non-null object 10 street_letter_2_hut 3 non-null object 11 block_hut 10 non-null object 12 entrance_hut 3 non-null object 13 stair_hut 689 non-null object 14 floor_hut 9378 non-null object 15 door_hut 8528 non-null object 16 rtc_hut 9361 non-null object 17 n_places_hut 9396 non-null float64 18 longitude_hut 9409 non-null object 19 latitude_hut 9409 non-null object 20 name_hut 9409 non-null object dtypes: float64(1), object(20) memory usage: 1.5+ MB
In [626]:
#CHECK 'street_number_1' df_hut_coordinates5[df_hut_coordinates5['street_number_1_hut'].str.isdecimal()==False].value_counts('street_number_1_hut')
Out[626]:
Series([], dtype: int64)
In [627]:
#CHECK 'street_number_2' df_hut_coordinates5[df_hut_coordinates5['street_number_2_hut'].str.isdecimal()==False].value_counts('street_number_2_hut')
Out[627]:
Series([], dtype: int64)
In [628]:
df_hut = df_hut_coordinates5.copy()
DATAFRAME: COORDINATES FOR HOTELS¶
SOURCE:
In [629]:
R_ID_HOTEL = '9bccce1b-0b9d-4cc6-94a7-459cb99450de' url_hotel = 'https://opendata-ajuntament.barcelona.cat/data/api/action/datastore_search_sql?sql=SELECT%20*%20from%20%22{}%22'.format(R_ID_HOTEL) response_hotel = requests.get(url_hotel) if response_hotel.ok: data_hotel = response_hotel.json() result_hotel = data_hotel.get('result') records_hotel = result_hotel.get('records') else: print('Problem with: ', url_hotel) df_hotel_coordinates = pd.DataFrame.from_dict(records_hotel) df_hotel_coordinates.head(1)
Out[629]:
addresses_roadtype_name | addresses_end_street_number | institution_name | values_attribute_name | addresses_road_name | values_category | addresses_zip_code | secondary_filters_id | values_value | addresses_town | … | geo_epgs_25831_y | _full_text | modified | secondary_filters_asia_id | secondary_filters_fullpath | values_description | _id | addresses_neighborhood_name | values_outstanding | values_attribute_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | Centraleta | C Rambla | Telèfons | 8002 | 54731072 | 933010872 | BARCELONA | … | 4581844.702237997 | ‘+02’:36,48 ‘-000480’:6 ‘-09’:31,43 ‘-17’:32,4… | 2022-09-17T02:41:22.074915 | 65103001000004 | Planol BCN >> Allotjament >> Hotels >> Hotels … | 1 | el Barri Gòtic | True | 20003 |
1 rows × 38 columns
In [630]:
df_hotel_coordinates1 = df_hotel_coordinates.copy()
EXTRACT RELEVANT DATA¶
In [631]:
#REMOVE UNNECESSARY COLUMNS df_hotel_coordinates1.columns
Out[631]:
Index(['addresses_roadtype_name', 'addresses_end_street_number', 'institution_name', 'values_attribute_name', 'addresses_road_name', 'values_category', 'addresses_zip_code', 'secondary_filters_id', 'values_value', 'addresses_town', 'geo_epgs_4326_y', 'geo_epgs_4326_x', 'secondary_filters_name', 'secondary_filters_tree', 'addresses_district_name', 'geo_epgs_25831_x', 'addresses_start_street_number', 'register_id', 'institution_id', 'addresses_main_address', 'addresses_district_id', 'addresses_roadtype_id', 'addresses_type', 'addresses_neighborhood_id', 'values_id', 'name', 'addresses_road_id', 'created', 'geo_epgs_25831_y', '_full_text', 'modified', 'secondary_filters_asia_id', 'secondary_filters_fullpath', 'values_description', '_id', 'addresses_neighborhood_name', 'values_outstanding', 'values_attribute_id'], dtype='object')
In [632]:
#CHECK BEFORE ELIMINATING COLUMNS df_hotel_coordinates1[['created','modified','institution_name','institution_id','_full_text', 'values_id','values_attribute_name','values_category','values_value', 'values_outstanding','values_attribute_id','values_description']].head(1)
Out[632]:
created | modified | institution_name | institution_id | _full_text | values_id | values_attribute_name | values_category | values_value | values_outstanding | values_attribute_id | values_description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1996-09-17T00:00:00 | 2022-09-17T02:41:22.074915 | ‘+02’:36,48 ‘-000480’:6 ‘-09’:31,43 ‘-17’:32,4… | 136360 | Centraleta | Telèfons | 933010872 | True | 20003 |
In [633]:
#CHECK BEFORE ELIMINATING COLUMNS df_hotel_coordinates1[['_id','secondary_filters_asia_id','secondary_filters_fullpath','secondary_filters_tree', 'addresses_roadtype_name','addresses_main_address', 'addresses_type', 'addresses_road_id', 'addresses_roadtype_id','geo_epgs_25831_x', 'geo_epgs_25831_y']].head(1)
Out[633]:
_id | secondary_filters_asia_id | secondary_filters_fullpath | secondary_filters_tree | addresses_roadtype_name | addresses_main_address | addresses_type | addresses_road_id | addresses_roadtype_id | geo_epgs_25831_x | geo_epgs_25831_y | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 65103001000004 | Planol BCN >> Allotjament >> Hotels >> Hotels … | 651 | True | 34308 | 430656.744033181 | 4581844.702237997 |
In [634]:
#CHECK BEFORE ELIMINATING COLUMNS df_hotel_coordinates1[['addresses_zip_code','secondary_filters_id', 'addresses_town','register_id']].head(1)
Out[634]:
addresses_zip_code | secondary_filters_id | addresses_town | register_id | |
---|---|---|---|---|
0 | 8002 | 54731072 | BARCELONA | 75990025172 |
In [635]:
#DROP COLUMNS df_hotel_coordinates1.drop(columns=['created','modified','institution_name','institution_id','_full_text', 'values_id','values_attribute_name','values_category','values_value', 'values_outstanding','values_attribute_id','values_description','_id', 'secondary_filters_asia_id','secondary_filters_fullpath', 'secondary_filters_tree','addresses_roadtype_name','addresses_main_address', 'addresses_type', 'addresses_road_id', 'addresses_roadtype_id','geo_epgs_25831_x', 'geo_epgs_25831_y','addresses_zip_code', 'secondary_filters_id','addresses_town','register_id'], inplace=True) df_hotel_coordinates1.head(1)
Out[635]:
addresses_end_street_number | addresses_road_name | geo_epgs_4326_y | geo_epgs_4326_x | secondary_filters_name | addresses_district_name | addresses_start_street_number | addresses_district_id | addresses_neighborhood_id | name | addresses_neighborhood_name | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | C Rambla | 2.170638831395403 | 41.38514182378773 | Hotels 1 estr. | Ciutat Vella | 138 | 1 | 2 | Hotel Toledano – HB-000480 | el Barri Gòtic |
In [636]:
df_hotel_coordinates2 = df_hotel_coordinates1.copy()
In [637]:
#RENAME COLUMNS df_hotel_coordinates2.columns = ['street_number_2_hotel','address_hotel','longitude_hotel','latitude_hotel','category_hotel','district_name_hotel','street_number_1_hotel', 'district_code_hotel','neighbourhood_code_hotel','name_hotel','neighbourhood_name_hotel'] df_hotel_coordinates2.head(1)
Out[637]:
street_number_2_hotel | address_hotel | longitude_hotel | latitude_hotel | category_hotel | district_name_hotel | street_number_1_hotel | district_code_hotel | neighbourhood_code_hotel | name_hotel | neighbourhood_name_hotel | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | C Rambla | 2.170638831395403 | 41.38514182378773 | Hotels 1 estr. | Ciutat Vella | 138 | 1 | 2 | Hotel Toledano – HB-000480 | el Barri Gòtic |
In [638]:
#FUNCTION TO EXTRACT ID COLUMN def rtc_split_hotel(arg): if "HB-" in arg: return arg.split("HB-", 1) # 1 : to split at the first found only else: return [arg, None] # None : to add a Null value when split character not found and so preserve the same column length
In [639]:
#EXTRACT ID COLUMN RTC FROM NAME COLUMN df_hotel_coordinates2[['name_hotel','rtc_hotel']] = [rtc_split_hotel(x) for x in df_hotel_coordinates2['name_hotel']] df_hotel_coordinates2['rtc_hotel'] = 'HB-' + df_hotel_coordinates2['rtc_hotel'] df_hotel_coordinates2['rtc_hotel'] = df_hotel_coordinates2['rtc_hotel'].str.strip().replace(' ','') df_hotel_coordinates2['name_hotel'] = df_hotel_coordinates2['name_hotel'].str.replace('-',' ').str.strip() df_hotel_coordinates2.head(1)
Out[639]:
street_number_2_hotel | address_hotel | longitude_hotel | latitude_hotel | category_hotel | district_name_hotel | street_number_1_hotel | district_code_hotel | neighbourhood_code_hotel | name_hotel | neighbourhood_name_hotel | rtc_hotel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | C Rambla | 2.170638831395403 | 41.38514182378773 | Hotels 1 estr. | Ciutat Vella | 138 | 1 | 2 | Hotel Toledano | el Barri Gòtic | HB-000480 |
In [640]:
df_hotel_coordinates3 = df_hotel_coordinates2.copy()
WHITESPACES¶
In [641]:
#REMOVING SPACES # .replace(' ','', regex=True) - replace all spaces with nothing # .str.strip() - replace 1 initial and 1 trailing space only # .replace(r's+',' ', regex=True) - replace multiple spaces with one single space # .replace(r'^s+|s+$','',regex=True) - replace all + spaces s starting ^ and trailing $ # .replace('nan','', regex=True) - replace pre-existing 'nan' strings into empty cells - not to be used for string columns potentially containing nan as subpart of string # .replace('.0','',regex=True) - replace .0 with nothing - '' is required to assign '.' as a normal character and not as a special one df_hotel_coordinates3['rtc_hotel'] = df_hotel_coordinates3['rtc_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hotel_coordinates3['address_hotel'] = df_hotel_coordinates3['address_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_hotel_coordinates3['street_number_1_hotel'] = df_hotel_coordinates3['street_number_1_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hotel_coordinates3['street_number_2_hotel'] = df_hotel_coordinates3['street_number_2_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hotel_coordinates3['district_code_hotel'] = df_hotel_coordinates3['district_code_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True) df_hotel_coordinates3['neighbourhood_code_hotel'] = df_hotel_coordinates3['neighbourhood_code_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True) df_hotel_coordinates3['neighbourhood_name_hotel'] = df_hotel_coordinates3['neighbourhood_name_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_hotel_coordinates3['longitude_hotel'] = df_hotel_coordinates3['longitude_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hotel_coordinates3['latitude_hotel'] = df_hotel_coordinates3['latitude_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hotel_coordinates3['name_hotel'] = df_hotel_coordinates3['name_hotel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
In [642]:
#DISTRICT AND NEIGHBOURHOOD CODES NEED TO BE IN STRING FORMAT AND REQUIRE AN ADDED '0' IN FRONT OF ALL NUMBERS BELOW 10 df_hotel_coordinates3[['district_code_hotel']] = df_hotel_coordinates3[['district_code_hotel']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x]) df_hotel_coordinates3[['neighbourhood_code_hotel']] = df_hotel_coordinates3[['neighbourhood_code_hotel']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
In [643]:
#REPLACE CELL THAT IS ENTIRELY SPACE OR EMPTY with None df_hotel_coordinates3 = df_hotel_coordinates3.applymap(lambda x: None if isinstance(x, str) and (x=='' or x.isspace()) else x)
In [644]:
df_hotel_coordinates4 = df_hotel_coordinates3.copy()
DUPLICATES¶
In [645]:
#QUICK CHECK FOR DUPLICATES df_hotel_coordinates4.duplicated().value_counts()
Out[645]:
False 441 dtype: int64
In [646]:
#VERIFY ID COLUMN: 'rtc_hotel' df_hotel_coordinates4.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 441 entries, 0 to 440 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 street_number_2_hotel 13 non-null object 1 address_hotel 441 non-null object 2 longitude_hotel 441 non-null object 3 latitude_hotel 441 non-null object 4 category_hotel 441 non-null object 5 district_name_hotel 441 non-null object 6 street_number_1_hotel 439 non-null object 7 district_code_hotel 441 non-null object 8 neighbourhood_code_hotel 441 non-null object 9 name_hotel 441 non-null object 10 neighbourhood_name_hotel 441 non-null object 11 rtc_hotel 440 non-null object dtypes: object(12) memory usage: 41.5+ KB
In [647]:
#CHECK DUPLICATES ON ID COLUMN: 'rtc_hotel' df_hotel_coordinates4[df_hotel_coordinates4.duplicated(subset='rtc_hotel', keep=False)]
Out[647]:
street_number_2_hotel | address_hotel | longitude_hotel | latitude_hotel | category_hotel | district_name_hotel | street_number_1_hotel | district_code_hotel | neighbourhood_code_hotel | name_hotel | neighbourhood_name_hotel | rtc_hotel |
---|
MISSING VALUES¶
In [648]:
df_hotel_coordinates4.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 441 entries, 0 to 440 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 street_number_2_hotel 13 non-null object 1 address_hotel 441 non-null object 2 longitude_hotel 441 non-null object 3 latitude_hotel 441 non-null object 4 category_hotel 441 non-null object 5 district_name_hotel 441 non-null object 6 street_number_1_hotel 439 non-null object 7 district_code_hotel 441 non-null object 8 neighbourhood_code_hotel 441 non-null object 9 name_hotel 441 non-null object 10 neighbourhood_name_hotel 441 non-null object 11 rtc_hotel 440 non-null object dtypes: object(12) memory usage: 41.5+ KB
RTC¶
In [649]:
#IDENTIFY NULL VALUES IN ID COLUMN: 'rtc_hotel' df_hotel_coordinates4[(df_hotel_coordinates4['rtc_hotel'].isnull()) | (df_hotel_coordinates4['rtc_hotel']=='nan') | (df_hotel_coordinates4['rtc_hotel']==None) | (df_hotel_coordinates4['rtc_hotel']=='')]
Out[649]:
street_number_2_hotel | address_hotel | longitude_hotel | latitude_hotel | category_hotel | district_name_hotel | street_number_1_hotel | district_code_hotel | neighbourhood_code_hotel | name_hotel | neighbourhood_name_hotel | rtc_hotel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
147 | 84 | Ronda de Sant Antoni | 2.163974947423502 | 41.38376584459594 | Hotels 4 estr. | Ciutat Vella | 84 | 01 | 01 | Hotel Antiga Casa Buenavista | el Raval | None |
DISTRICT – NEIGHBOURHOOD¶
In [650]:
#IDENTIFY NULL VALUES df_hotel_coordinates4[(df_hotel_coordinates4['neighbourhood_code_hotel'].isnull()) | (df_hotel_coordinates4['neighbourhood_code_hotel']=='nan') | (df_hotel_coordinates4['neighbourhood_code_hotel']==None) | (df_hotel_coordinates4['neighbourhood_code_hotel']=='')].shape[0]
Out[650]:
0
In [651]:
#IDENTIFY NULL VALUES df_hotel_coordinates4[(df_hotel_coordinates4['neighbourhood_name_hotel'].isnull()) | (df_hotel_coordinates4['neighbourhood_name_hotel']=='nan') | (df_hotel_coordinates4['neighbourhood_name_hotel']==None) | (df_hotel_coordinates4['neighbourhood_name_hotel']=='')].shape[0]
Out[651]:
0
In [652]:
#CHECK HOW MANY NEIGHBOURHOOD CODES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_hotel_coordinates4[~df_hotel_coordinates4['neighbourhood_code_hotel'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])].shape[0]
Out[652]:
0
In [653]:
#CHECK HOW MANY NEIGHBOURHOOD NAMES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_hotel_coordinates4[~df_hotel_coordinates4['neighbourhood_name_hotel'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]
Out[653]:
0
In [654]:
#MAPPING/REPLACING REQUIRES NO NULL VALUES IN COLUMN LINKED TO set_index COLUMN df_hotel_coordinates4['neighbourhood_name_hotel'] = df_hotel_coordinates4['neighbourhood_code_hotel'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['Neighbourhood_Name']) df_hotel_coordinates4['district_code_hotel'] = df_hotel_coordinates4['neighbourhood_code_hotel'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Code']) df_hotel_coordinates4['district_name_hotel'] = df_hotel_coordinates4['neighbourhood_code_hotel'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Name'])
In [655]:
#IDENTIFY NULL VALUES df_hotel_coordinates4[(df_hotel_coordinates4['neighbourhood_code_hotel'].isnull()) | (df_hotel_coordinates4['neighbourhood_code_hotel']=='nan') | (df_hotel_coordinates4['neighbourhood_code_hotel']==None) | (df_hotel_coordinates4['neighbourhood_code_hotel']=='')].shape[0]
Out[655]:
0
In [656]:
#CHECK HOW MANY NEIGHBOURHOOD CODES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_hotel_coordinates4[~df_hotel_coordinates4['neighbourhood_code_hotel'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])].shape[0]
Out[656]:
0
In [657]:
df_hotel_coordinates4.head(1)
Out[657]:
street_number_2_hotel | address_hotel | longitude_hotel | latitude_hotel | category_hotel | district_name_hotel | street_number_1_hotel | district_code_hotel | neighbourhood_code_hotel | name_hotel | neighbourhood_name_hotel | rtc_hotel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | C Rambla | 2.170638831395403 | 41.38514182378773 | Hotels 1 estr. | Ciutat Vella | 138 | 01 | 02 | Hotel Toledano | el Barri Gòtic | HB-000480 |
In [658]:
df_hotel_coordinates5 = df_hotel_coordinates4.copy()
NORMALIZATION¶
In [659]:
#CHECK 'street_number_1' df_hotel_coordinates5[df_hotel_coordinates5['street_number_1_hotel'].str.isdecimal()==False].value_counts('street_number_1_hotel')
Out[659]:
Series([], dtype: int64)
In [660]:
#CHECK 'street_number_2' df_hotel_coordinates5[df_hotel_coordinates5['street_number_2_hotel'].str.isdecimal()==False].value_counts('street_number_2_hotel')
Out[660]:
Series([], dtype: int64)
In [661]:
df_hotel_coordinates5['category_hotel'].value_counts()
Out[661]:
Hotels 4 estr. 187 Hotels 3 estr. 121 Hotels 2 estr. 50 Hotels 5 estr. 44 Hotels 1 estr. 39 Name: category_hotel, dtype: int64
In [662]:
df_hotel = df_hotel_coordinates5.copy()
In [663]:
df_hotel.head(1)
Out[663]:
street_number_2_hotel | address_hotel | longitude_hotel | latitude_hotel | category_hotel | district_name_hotel | street_number_1_hotel | district_code_hotel | neighbourhood_code_hotel | name_hotel | neighbourhood_name_hotel | rtc_hotel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | C Rambla | 2.170638831395403 | 41.38514182378773 | Hotels 1 estr. | Ciutat Vella | 138 | 01 | 02 | Hotel Toledano | el Barri Gòtic | HB-000480 |
DATAFRAME: COORDINATES FOR HOSTELS¶
SOURCE:
https://opendata-ajuntament.barcelona.cat/data/en/dataset/allotjaments-pensions
In [664]:
url_hostel = 'https://www.bcn.cat/tercerlloc/files/allotjament/opendatabcn_allotjament_pensions-js.json' response_hostel = requests.get(url_hostel) if response_hostel.ok: data_hostel = response_hostel.json() else: print('Problem with: ', url_hostel) df_hostel_coordinates = pd.DataFrame.from_dict(data_hostel)
In [665]:
df_hostel_coordinates1 = df_hostel_coordinates.copy()
EXTRACT RELEVANT DATA¶
In [666]:
df_hostel_coordinates1.columns
Out[666]:
Index(['register_id', 'prefix', 'suffix', 'name', 'created', 'modified', 'status', 'status_name', 'core_type', 'core_type_name', 'body', 'tickets_data', 'addresses', 'entity_types_data', 'attribute_categories', 'values', 'from_relationships', 'to_relationships', 'classifications_data', 'secondary_filters_data', 'timetable', 'image_data', 'gallery_data', 'warnings', 'geo_epgs_25831', 'geo_epgs_23031', 'geo_epgs_4326', 'is_section_of_data', 'sections_data', 'start_date', 'end_date', 'estimated_dates', 'languages_data', 'type', 'type_name', 'period', 'period_name', 'event_status_name', 'event_status', 'ical'], dtype='object')
In [667]:
#CHECK BEFORE DROPPING df_hostel_coordinates1[['register_id', 'prefix', 'suffix', 'created', 'modified', 'status', 'status_name', 'core_type', 'core_type_name', 'body', 'tickets_data', 'from_relationships', 'to_relationships', 'timetable', 'image_data']].head(1)
Out[667]:
register_id | prefix | suffix | created | modified | status | status_name | core_type | core_type_name | body | tickets_data | from_relationships | to_relationships | timetable | image_data | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1166132202 | None | None | 2001-06-15T00:00:00+02:00 | 2022-09-17T02:23:43.565716+02:00 | published | Publicat | place | Equipament | None | [] | [] | [] | None | None |
In [668]:
#CHECK BEFORE DROPPING df_hostel_coordinates1[['gallery_data', 'warnings', 'geo_epgs_25831', 'geo_epgs_23031', 'is_section_of_data', 'sections_data', 'start_date', 'end_date', 'estimated_dates', 'languages_data', 'type', 'type_name', 'period', 'period_name', 'event_status_name', 'event_status', 'ical']].head(1)
Out[668]:
gallery_data | warnings | geo_epgs_25831 | geo_epgs_23031 | is_section_of_data | sections_data | start_date | end_date | estimated_dates | languages_data | type | type_name | period | period_name | event_status_name | event_status | ical | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | [] | [] | {‘x’: 431059.3968618301, ‘y’: 4583099.462077796} | {‘x’: 431153.93913180125, ‘y’: 4583304.042183909} | None | [] | None | None | None | None | None | None | None | None | None | None | BEGIN:VCALENDARrnPRODID:ics.py – http://git…. |
In [669]:
#CHECK BEFORE DROPPING df_hostel_coordinates1[['secondary_filters_data','entity_types_data','attribute_categories','values']].head(1)
Out[669]:
secondary_filters_data | entity_types_data | attribute_categories | values | |
---|---|---|---|---|
0 | [{‘id’: 57245924, ‘name’: ’03. Hotels, pension… | [{‘id’: 102, ‘name’: ‘equipament’}, {‘id’: 100… | [{‘id’: 2, ‘name’: ‘Informació d’interès’, ‘… | [{‘id’: 33979, ‘value’: ‘hello@hostalin.com’, … |
In [670]:
#DROP COLUMNS df_hostel_coordinates1.drop(columns=['register_id', 'prefix', 'suffix', 'created', 'modified', 'status', 'status_name', 'core_type', 'core_type_name', 'body', 'tickets_data', 'from_relationships', 'to_relationships', 'timetable', 'image_data', 'gallery_data', 'warnings', 'geo_epgs_25831', 'geo_epgs_23031', 'is_section_of_data', 'sections_data', 'start_date', 'end_date', 'estimated_dates','languages_data', 'type', 'type_name', 'period', 'period_name','event_status_name', 'event_status', 'ical', 'secondary_filters_data','entity_types_data','attribute_categories','values'], inplace=True) df_hostel_coordinates1.head(1)
Out[670]:
name | addresses | classifications_data | geo_epgs_4326 | |
---|---|---|---|---|
0 | Hostal Hostalin Barcelona Diputació – HB-004497 | [{‘place’: None, ‘district_name’: ‘Eixample’, … | [{‘id’: 1003005, ‘name’: ‘Pensions, hostals’, … | {‘x’: 41.3964776648101, ‘y’: 2.175311353516649} |
In [671]:
df_hostel_coordinates2 = df_hostel_coordinates1.copy()
In [672]:
#FUNCTION TO SPLIT COORDINATES def split_coordinates(arg): if "," in arg: return arg.split(",",1) # 1 : to split at the first found only else: return (arg, None) # None : to add a Null value when split character not found and so preserve the same column length
In [673]:
#SPLIT AND RENAME COLUMNS df_hostel_coordinates2['geo_epgs_4326'] = df_hostel_coordinates2['geo_epgs_4326'].astype(str) df_hostel_coordinates2[['latitude_hostel','longitude_hostel']] = [split_coordinates(x) for x in df_hostel_coordinates2['geo_epgs_4326']] df_hostel_coordinates2['latitude_hostel'] = df_hostel_coordinates2['latitude_hostel'].str.replace("{'x': ",'',regex=True) df_hostel_coordinates2['longitude_hostel'] = df_hostel_coordinates2['longitude_hostel'].str.replace("'y': ",'',regex=True).replace('}','', regex=True) df_hostel_coordinates2.drop(columns=['geo_epgs_4326'],inplace=True) df_hostel_coordinates2.head(1)
Out[673]:
name | addresses | classifications_data | latitude_hostel | longitude_hostel | |
---|---|---|---|---|---|
0 | Hostal Hostalin Barcelona Diputació – HB-004497 | [{‘place’: None, ‘district_name’: ‘Eixample’, … | [{‘id’: 1003005, ‘name’: ‘Pensions, hostals’, … | 41.3964776648101 | 2.175311353516649 |
In [674]:
df_hostel_coordinates2.loc[0,'classifications_data']
Out[674]:
[{'id': 1003005, 'name': 'Pensions, hostals', 'full_path': 'Tipologia EQ >> Allotjament >> Pensions, hostals', 'dependency_group': 3033964, 'parent_id': 1003, 'tree_id': 1, 'asia_id': '0000102003005', 'core_type': 'place', 'level': 2}, {'id': 108215, 'name': 'Gay Friendly ', 'full_path': 'Col·lectius EQ >> Gay Friendly ', 'dependency_group': 3033964, 'parent_id': 108, 'tree_id': 108, 'asia_id': '0010801215', 'core_type': 'place', 'level': 1}]
In [675]:
classifications_data = [] for i in df_hostel_coordinates2['classifications_data']: c = list([x.get('name') for x in i])[0] #to get first item from dictionary with 'name' as key classifications_data.append(c) df_hostel_coordinates2['category_hostel'] = classifications_data df_hostel_coordinates2.drop(columns='classifications_data', inplace=True) df_hostel_coordinates2.head(1)
Out[675]:
name | addresses | latitude_hostel | longitude_hostel | category_hostel | |
---|---|---|---|---|---|
0 | Hostal Hostalin Barcelona Diputació – HB-004497 | [{‘place’: None, ‘district_name’: ‘Eixample’, … | 41.3964776648101 | 2.175311353516649 | Pensions, hostals |
In [676]:
df_hostel_coordinates2.loc[0,'addresses']
Out[676]:
[{'place': None, 'district_name': 'Eixample', 'district_id': '02', 'neighborhood_name': "la Dreta de l'Eixample", 'neighborhood_id': '07', 'address_name': 'C Diputació', 'address_id': '100800', 'block_id': None, 'start_street_number': 346, 'end_street_number': None, 'street_number_1': '346', 'street_number_2': None, 'stairs': None, 'level': '1r', 'door': '1a', 'zip_code': '08013', 'province': 'BARCELONA', 'town': 'BARCELONA', 'country': 'ESPANYA', 'comments': None, 'position': 0, 'main_address': True, 'road_name': None, 'road_id': None, 'roadtype_name': None, 'roadtype_id': None, 'location': {'type': 'GeometryCollection', 'geometries': [{'type': 'Point', 'coordinates': [431059.3968618301, 4583099.462077796]}]}, 'related_entity': None, 'related_entity_data': None, 'hide_address': False}]
In [677]:
district_name = [] district_id = [] neighborhood_name = [] neighborhood_id = [] address_name = [] street_number_1 = [] street_number_2 = [] for i in df_hostel_coordinates2['addresses']: dn = list([x.get('district_name') for x in i])[0] dc = list([x.get('district_id') for x in i])[0] nn = list([x.get('neighborhood_name') for x in i])[0] nc = list([x.get('neighborhood_id') for x in i])[0] an = list([x.get('address_name') for x in i])[0] sn1 = list([x.get('street_number_1') for x in i])[0] sn2 = list([x.get('street_number_2') for x in i])[0] district_name.append(dn) district_id.append(dc) neighborhood_name.append(nn) neighborhood_id.append(nc) address_name.append(an) street_number_1.append(sn1) street_number_2.append(sn2) df_hostel_coordinates2['district_code_hostel'] = district_id df_hostel_coordinates2['district_name_hostel'] = district_name df_hostel_coordinates2['neighbourhood_code_hostel'] = neighborhood_id df_hostel_coordinates2['neighbourhood_name_hostel'] = neighborhood_name df_hostel_coordinates2['address_hostel'] = address_name df_hostel_coordinates2['street_number_1_hostel'] = street_number_1 df_hostel_coordinates2['street_number_2_hostel'] = street_number_2 df_hostel_coordinates2.drop(columns='addresses', inplace=True) df_hostel_coordinates2.head(1)
Out[677]:
name | latitude_hostel | longitude_hostel | category_hostel | district_code_hostel | district_name_hostel | neighbourhood_code_hostel | neighbourhood_name_hostel | address_hostel | street_number_1_hostel | street_number_2_hostel | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Hostal Hostalin Barcelona Diputació – HB-004497 | 41.3964776648101 | 2.175311353516649 | Pensions, hostals | 02 | Eixample | 07 | la Dreta de l’Eixample | C Diputació | 346 | None |
In [678]:
df_hostel_coordinates3 = df_hostel_coordinates2.copy()
In [679]:
#FUNCTION TO EXTRACT ID COLUMN def rtc_split_hostel(arg): if "HB-" in arg: return arg.split("HB-", 1) # 1 : to split at the first found only else: return [arg, None] # None : to add a Null value when split character not found and so preserve the same column length
In [680]:
df_hostel_coordinates3[['name','rtc_hostel']] = [rtc_split_hostel(x) for x in df_hostel_coordinates3['name']] df_hostel_coordinates3['rtc_hostel'] = 'HB-' + df_hostel_coordinates3['rtc_hostel'] df_hostel_coordinates3['rtc_hostel'] = df_hostel_coordinates3['rtc_hostel'].str.strip().replace(' ','') df_hostel_coordinates3['name_hostel'] = df_hostel_coordinates3['name'].str.replace('-',' ').str.strip() df_hostel_coordinates3.drop(columns='name', inplace=True) df_hostel_coordinates3.head(1)
Out[680]:
latitude_hostel | longitude_hostel | category_hostel | district_code_hostel | district_name_hostel | neighbourhood_code_hostel | neighbourhood_name_hostel | address_hostel | street_number_1_hostel | street_number_2_hostel | rtc_hostel | name_hostel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41.3964776648101 | 2.175311353516649 | Pensions, hostals | 02 | Eixample | 07 | la Dreta de l’Eixample | C Diputació | 346 | None | HB-004497 | Hostal Hostalin Barcelona Diputació |
In [681]:
df_hostel_coordinates4 = df_hostel_coordinates3.copy()
WHITESPACES¶
In [682]:
#REMOVING SPACES # .replace(' ','', regex=True) - replace all spaces with nothing # .str.strip() - replace 1 initial and 1 trailing space only # .replace(r's+',' ', regex=True) - replace multiple spaces with one single space # .replace(r'^s+|s+$','',regex=True) - replace all + spaces s starting ^ and trailing $ # .replace('nan','', regex=True) - replace pre-existing 'nan' strings into empty cells - not to be used for string columns potentially containing nan as subpart of string # .replace('.0','',regex=True) - replace .0 with nothing - '' is required to assign '.' as a normal character and not as a special one df_hostel_coordinates4['rtc_hostel'] = df_hostel_coordinates4['rtc_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hostel_coordinates4['address_hostel'] = df_hostel_coordinates4['address_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_hostel_coordinates4['street_number_1_hostel'] = df_hostel_coordinates4['street_number_1_hostel'].astype(str).replace(r' ','',regex=True).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hostel_coordinates4['street_number_2_hostel'] = df_hostel_coordinates4['street_number_2_hostel'].astype(str).replace(r' ','',regex=True).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hostel_coordinates4['district_code_hostel'] = df_hostel_coordinates4['district_code_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True) df_hostel_coordinates4['neighbourhood_code_hostel'] = df_hostel_coordinates4['neighbourhood_code_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True) df_hostel_coordinates4['neighbourhood_name_hostel'] = df_hostel_coordinates4['neighbourhood_name_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_hostel_coordinates4['longitude_hostel'] = df_hostel_coordinates4['longitude_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hostel_coordinates4['latitude_hostel'] = df_hostel_coordinates4['latitude_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_hostel_coordinates4['name_hostel'] = df_hostel_coordinates4['name_hostel'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
In [683]:
#DISTRICT AND NEIGHBOURHOOD CODES NEED TO BE IN STRING FORMAT AND REQUIRE AN ADDED '0' IN FRONT OF ALL NUMBERS BELOW 10 df_hostel_coordinates4[['district_code_hostel']] = df_hostel_coordinates4[['district_code_hostel']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x]) df_hostel_coordinates4[['neighbourhood_code_hostel']] = df_hostel_coordinates4[['neighbourhood_code_hostel']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
In [684]:
#REPLACE CELL THAT IS ENTIRELY SPACE OR EMPTY with None df_hostel_coordinates4 = df_hostel_coordinates4.applymap(lambda x: None if isinstance(x, str) and (not x or x.isspace()) else x)
In [685]:
df_hostel_coordinates5 = df_hostel_coordinates4.copy()
DUPLICATES¶
In [686]:
df_hostel_coordinates5.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 244 entries, 0 to 243 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 latitude_hostel 244 non-null object 1 longitude_hostel 244 non-null object 2 category_hostel 244 non-null object 3 district_code_hostel 244 non-null object 4 district_name_hostel 244 non-null object 5 neighbourhood_code_hostel 244 non-null object 6 neighbourhood_name_hostel 244 non-null object 7 address_hostel 244 non-null object 8 street_number_1_hostel 244 non-null object 9 street_number_2_hostel 0 non-null object 10 rtc_hostel 242 non-null object 11 name_hostel 244 non-null object dtypes: object(12) memory usage: 23.0+ KB
In [687]:
df_hostel_coordinates5.duplicated().value_counts()
Out[687]:
False 243 True 1 dtype: int64
In [688]:
df_hostel_coordinates5[df_hostel_coordinates5.duplicated(keep=False)]
Out[688]:
latitude_hostel | longitude_hostel | category_hostel | district_code_hostel | district_name_hostel | neighbourhood_code_hostel | neighbourhood_name_hostel | address_hostel | street_number_1_hostel | street_number_2_hostel | rtc_hostel | name_hostel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
226 | 41.39527024937376 | 2.170401163741668 | Pensions, hostals | 02 | Eixample | 07 | la Dreta de l’Eixample | C Girona | 81 | None | HB-004707 | Hostal Retrome 2 |
242 | 41.39527024937376 | 2.170401163741668 | Pensions, hostals | 02 | Eixample | 07 | la Dreta de l’Eixample | C Girona | 81 | None | HB-004707 | Hostal Retrome 2 |
In [689]:
df_hostel_coordinates5 = df_hostel_coordinates5.drop_duplicates()
In [690]:
#CHECK DUPLICATES ON ID COLUMN df_hostel_coordinates5[df_hostel_coordinates5.duplicated(subset='rtc_hostel', keep=False)]
Out[690]:
latitude_hostel | longitude_hostel | category_hostel | district_code_hostel | district_name_hostel | neighbourhood_code_hostel | neighbourhood_name_hostel | address_hostel | street_number_1_hostel | street_number_2_hostel | rtc_hostel | name_hostel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
221 | 41.393888004415714 | 2.171481971236943 | Pensions, hostals | 02 | Eixample | 07 | la Dreta de l’Eixample | C Diputació | 327 | None | None | Hostal Bed & Break |
227 | 41.37919578389479 | 2.174445287874625 | Pensions, hostals | 01 | Ciutat Vella | 01 | el Raval | C Nou de la Rambla | 1 | None | None | Hostal Mimi Las Ramblas |
In [691]:
df_hostel_coordinates6 = df_hostel_coordinates5.copy()
MISSING VALUES¶
In [692]:
df_hostel_coordinates6.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 243 entries, 0 to 243 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 latitude_hostel 243 non-null object 1 longitude_hostel 243 non-null object 2 category_hostel 243 non-null object 3 district_code_hostel 243 non-null object 4 district_name_hostel 243 non-null object 5 neighbourhood_code_hostel 243 non-null object 6 neighbourhood_name_hostel 243 non-null object 7 address_hostel 243 non-null object 8 street_number_1_hostel 243 non-null object 9 street_number_2_hostel 0 non-null object 10 rtc_hostel 241 non-null object 11 name_hostel 243 non-null object dtypes: object(12) memory usage: 24.7+ KB
RTC¶
In [693]:
df_hostel_coordinates6[df_hostel_coordinates6['rtc_hostel'].isnull()]
Out[693]:
latitude_hostel | longitude_hostel | category_hostel | district_code_hostel | district_name_hostel | neighbourhood_code_hostel | neighbourhood_name_hostel | address_hostel | street_number_1_hostel | street_number_2_hostel | rtc_hostel | name_hostel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
221 | 41.393888004415714 | 2.171481971236943 | Pensions, hostals | 02 | Eixample | 07 | la Dreta de l’Eixample | C Diputació | 327 | None | None | Hostal Bed & Break |
227 | 41.37919578389479 | 2.174445287874625 | Pensions, hostals | 01 | Ciutat Vella | 01 | el Raval | C Nou de la Rambla | 1 | None | None | Hostal Mimi Las Ramblas |
DISTRICT – NEIGHBOURHOOD¶
In [694]:
#IDENTIFY NULL VALUES df_hostel_coordinates6[(df_hostel_coordinates6['neighbourhood_code_hostel'].isnull()) | (df_hostel_coordinates6['neighbourhood_code_hostel']=='nan') | (df_hostel_coordinates6['neighbourhood_code_hostel']==None) | (df_hostel_coordinates6['neighbourhood_code_hostel']=='')].shape[0]
Out[694]:
0
In [695]:
#IDENTIFY NULL VALUES df_hostel_coordinates6[(df_hostel_coordinates6['neighbourhood_name_hostel'].isnull()) | (df_hostel_coordinates6['neighbourhood_name_hostel']=='nan') | (df_hostel_coordinates6['neighbourhood_name_hostel']==None) | (df_hostel_coordinates6['neighbourhood_name_hostel']=='')].shape[0]
Out[695]:
0
In [696]:
#CHECK HOW MANY NEIGHBOURHOOD NAMES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_hostel_coordinates6[~df_hostel_coordinates6['neighbourhood_name_hostel'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]
Out[696]:
61
In [697]:
#CHECK HOW MANY NEIGHBOURHOOD CODES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_hostel_coordinates6[~df_hostel_coordinates6['neighbourhood_code_hostel'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])].shape[0]
Out[697]:
0
In [698]:
#MAPPING/REPLACING REQUIRES NO NULL VALUES IN COLUMN LINKED TO set_index COLUMN df_hostel_coordinates6['neighbourhood_name_hostel'] = df_hostel_coordinates6['neighbourhood_code_hostel'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['Neighbourhood_Name']) df_hostel_coordinates6['district_code_hostel'] = df_hostel_coordinates6['neighbourhood_code_hostel'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Code']) df_hostel_coordinates6['district_name_hostel'] = df_hostel_coordinates6['neighbourhood_code_hostel'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Name'])
In [699]:
#CHECK HOW MANY NEIGHBOURHOOD NAMES ARE NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_hostel_coordinates6[~df_hostel_coordinates6['neighbourhood_name_hostel'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]
Out[699]:
0
NORMALIZATION¶
In [700]:
df_hostel_coordinates7 = df_hostel_coordinates6.copy()
In [701]:
#CHECK 'street_number_1' df_hostel_coordinates7[df_hostel_coordinates7['street_number_1_hostel'].str.isdecimal()==False].value_counts('street_number_1_hostel')
Out[701]:
street_number_1_hostel 1*3 4 116*LB 1 11C 1 149*155 1 16*18 1 2*4 1 3*5 1 32*34 1 373*377 1 433*LB 1 56*58 1 77*79 1 8*10 1 83*85 1 95*97 1 98*100 1 dtype: int64
In [702]:
#CHECK 'street_number_2' df_hostel_coordinates7[df_hostel_coordinates7['street_number_2_hostel'].str.isdecimal()==False].value_counts('street_number_2_hostel')
Out[702]:
Series([], dtype: int64)
In [703]:
def split_street_number_1_2 (df_target,column_address_1,column_address_2): df = df_target.loc[df_target[column_address_1].str.isdecimal()==False,column_address_1] df = df.str.split(pat='(d+)', expand=True) df_target.loc[df_target[column_address_1].str.isdecimal()==False, column_address_2] = df.iloc[:,3] #column_address_2 needs to precede column_address_1 df_target.loc[df_target[column_address_1].str.isdecimal()==False, column_address_1] = df.iloc[:,1] #the condition on which .loc is based is lost return print('Split Values:'), df
In [704]:
split_street_number_1_2(df_hostel_coordinates7,'street_number_1_hostel','street_number_2_hostel')
Split Values:
Out[704]:
(None, 0 1 2 3 4 14 1 * 3 25 433 *LB None None 38 1 * 3 50 16 * 18 52 3 * 5 63 2 * 4 84 116 *LB None None 99 98 * 100 110 8 * 10 124 11 C None None 141 77 * 79 157 32 * 34 158 149 * 155 198 95 * 97 201 56 * 58 207 1 * 3 208 1 * 3 225 83 * 85 239 373 * 377 )
In [705]:
#CHECK AGAIN 'street_number_1' df_hostel_coordinates7['street_number_1_hostel'] = df_hostel_coordinates7['street_number_1_hostel'].astype(str).str.strip() df_hostel_coordinates7[df_hostel_coordinates7['street_number_1_hostel'].str.isdecimal()==False].value_counts('street_number_1_hostel')
Out[705]:
Series([], dtype: int64)
In [706]:
#CHECK AGAIN 'street_number_2' df_hostel_coordinates7['street_number_2_hostel'] = df_hostel_coordinates7['street_number_2_hostel'].astype(str).str.strip() df_hostel_coordinates7[df_hostel_coordinates7['street_number_2_hostel'].str.isdecimal()==False].value_counts('street_number_2_hostel')
Out[706]:
street_number_2_hostel None 227 dtype: int64
In [707]:
df_hostel = df_hostel_coordinates7.copy()
DATAFRAME: COORDINATES FOR OTHER ESTABLISHMENTS¶
SOURCE:
https://opendata-ajuntament.barcelona.cat/data/en/dataset/allotjaments-altres
In [708]:
url_oe = 'https://www.bcn.cat/tercerlloc/files/allotjament/opendatabcn_allotjament_altres-allotjaments-js.json' response_oe = requests.get(url_oe) if response_oe.ok: data_oe = response_oe.json() else: print('Problem with: ', url_oe) df_other_establishments = pd.DataFrame.from_dict(data_oe)
In [709]:
df_other_establishments1 = df_other_establishments.copy()
EXTRACT RELEVANT DATA¶
In [710]:
#DROP UNNECESSARY COLUMNS df_other_establishments1.columns
Out[710]:
Index(['register_id', 'prefix', 'suffix', 'name', 'created', 'modified', 'status', 'status_name', 'core_type', 'core_type_name', 'body', 'tickets_data', 'addresses', 'entity_types_data', 'attribute_categories', 'values', 'from_relationships', 'to_relationships', 'classifications_data', 'secondary_filters_data', 'timetable', 'image_data', 'gallery_data', 'warnings', 'geo_epgs_25831', 'geo_epgs_23031', 'geo_epgs_4326', 'is_section_of_data', 'sections_data', 'start_date', 'end_date', 'estimated_dates', 'languages_data', 'type', 'type_name', 'period', 'period_name', 'event_status_name', 'event_status', 'ical'], dtype='object')
In [711]:
#CHECK COLUMNS TO DROP df_other_establishments1[['register_id','prefix', 'suffix', 'created', 'modified','status', 'status_name', 'core_type', 'core_type_name', 'body','tickets_data','entity_types_data', 'attribute_categories', 'values']].head(1)
Out[711]:
register_id | prefix | suffix | created | modified | status | status_name | core_type | core_type_name | body | tickets_data | entity_types_data | attribute_categories | values | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 75990030639 | None | None | 1996-09-17T00:00:00+02:00 | 2022-09-17T02:42:49.222463+02:00 | published | Publicat | place | Equipament | None | [] | [{‘id’: 102, ‘name’: ‘equipament’}, {‘id’: 100… | [{‘id’: 2, ‘name’: ‘Informació d’interès’, ‘… | [{‘id’: 34989, ‘value’: ‘calabria@city-hotels…. |
In [712]:
#CHECK COLUMNS TO DROP df_other_establishments1[['from_relationships', 'to_relationships', 'timetable', 'image_data', 'gallery_data', 'warnings', 'geo_epgs_25831', 'geo_epgs_23031', 'is_section_of_data', 'sections_data', 'start_date', 'end_date', 'estimated_dates']].head(1)
Out[712]:
from_relationships | to_relationships | timetable | image_data | gallery_data | warnings | geo_epgs_25831 | geo_epgs_23031 | is_section_of_data | sections_data | start_date | end_date | estimated_dates | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | [] | [] | None | None | [] | [] | {‘x’: 429195.2433352358, ‘y’: 4581395.340320568} | {‘x’: 429289.7989668916, ‘y’: 4581599.905278732} | None | [] | None | None | None |
In [713]:
#CHECK COLUMNS TO DROP df_other_establishments1[[ 'secondary_filters_data','languages_data', 'type','type_name', 'period', 'period_name','event_status_name', 'event_status', 'ical']].head(1)
Out[713]:
secondary_filters_data | languages_data | type | type_name | period | period_name | event_status_name | event_status | ical | |
---|---|---|---|---|---|---|---|---|---|
0 | [{‘id’: 57245924, ‘name’: ’03. Hotels, pension… | None | None | None | None | None | None | None | BEGIN:VCALENDARrnPRODID:ics.py – http://git…. |
In [714]:
#DROP COLUMNS df_other_establishments1.drop(columns=['register_id','prefix', 'suffix', 'created', 'modified', 'status', 'status_name', 'core_type', 'core_type_name', 'body', 'tickets_data', 'from_relationships', 'to_relationships','timetable', 'image_data', 'gallery_data', 'warnings', 'geo_epgs_25831', 'geo_epgs_23031', 'is_section_of_data','sections_data', 'start_date', 'end_date', 'estimated_dates', 'languages_data', 'type', 'type_name', 'period', 'period_name', 'event_status_name', 'event_status', 'ical','values', 'entity_types_data','attribute_categories','secondary_filters_data'], inplace=True) df_other_establishments1.head(1)
Out[714]:
name | addresses | classifications_data | geo_epgs_4326 | |
---|---|---|---|---|
0 | Apartament TurÃstic Atenea Calabria – ATB-000001 | [{‘place’: None, ‘district_name’: ‘Eixample’, … | [{‘id’: 1003011, ‘name’: ‘Apartaments turÃsti… | {‘x’: 41.38096727220521, ‘y’: 2.1532132712982266} |
In [715]:
df_other_establishments2 = df_other_establishments1.copy()
In [716]:
#FUNCTION TO SPLIT THE COORDINATE COLUMNS INTO TWO SEPARATE COLUMNS def split_coordinates(arg): if "," in arg: return arg.split(",",1) # 1 : to split at the first found only else: return (arg, None) # None : to add a Null value when split character not found and so preserve the same column length
In [717]:
df_other_establishments2['geo_epgs_4326'] = df_other_establishments2['geo_epgs_4326'].astype(str) df_other_establishments2[['latitude_oe','longitude_oe']] = [split_coordinates(x) for x in df_other_establishments2['geo_epgs_4326']] df_other_establishments2['latitude_oe'] = df_other_establishments2['latitude_oe'].str.replace("{'x': ",'',regex=True) df_other_establishments2['longitude_oe'] = df_other_establishments2['longitude_oe'].str.replace("'y': ",'',regex=True).replace('}','', regex=True) df_other_establishments2.drop(columns=['geo_epgs_4326'],inplace=True) df_other_establishments2.head(1)
Out[717]:
name | addresses | classifications_data | latitude_oe | longitude_oe | |
---|---|---|---|---|---|
0 | Apartament TurÃstic Atenea Calabria – ATB-000001 | [{‘place’: None, ‘district_name’: ‘Eixample’, … | [{‘id’: 1003011, ‘name’: ‘Apartaments turÃsti… | 41.38096727220521 | 2.1532132712982266 |
In [718]:
#EXTRACT A CLASSIFICATION COLUMNS FROM THE CLASSIFICATION_DATA COLUMN - THE COLUMN NEEDS TO BE IN OBJECT TYPE FOR THE EXTRACTION TO WORK df_other_establishments2.loc[0,'classifications_data']
Out[718]:
[{'id': 1003011, 'name': 'Apartaments turÃxadstics', 'full_path': 'Tipologia EQ >> Allotjament >> Apartaments turÃxadstics', 'dependency_group': 3033964, 'parent_id': 1003, 'tree_id': 1, 'asia_id': '0000102003011', 'core_type': 'place', 'level': 2}, {'id': 28793722, 'name': '3 estrelles', 'full_path': 'Categories >> Estrelles >> 3 estrelles', 'dependency_group': 3033964, 'parent_id': 103001, 'tree_id': 103, 'asia_id': '0010302001003', 'core_type': 'place', 'level': 2}, {'id': 105001, 'name': 'Accessible per a persones amb discapacitat fÃxadsica', 'full_path': 'Accessibilitat >> Accessible per a persones amb discapacitat fÃxadsica', 'dependency_group': 3033964, 'parent_id': 105, 'tree_id': 105, 'asia_id': '0010501001', 'core_type': 'place', 'level': 1}, {'id': 72314191, 'name': 'Hospedaje en aparta-hoteles', 'full_path': 'Arbre Principal Barcelona Activa >> Industria, comerc i serveis >> Comercio, rest.y hospedajes reparaciones >> Servicio de hospedaje >> Hospedaje en aparta-hoteles', 'dependency_group': 2206253, 'parent_id': 27751630, 'tree_id': 129, 'asia_id': '0012904000006008004', 'core_type': 'event', 'level': 4}, {'id': 68601655, 'name': 'Guardia y custodia vehiculos en parkings', 'full_path': 'Arbre Principal Barcelona Activa >> Industria, comerc i serveis >> Transporte y comunicaciones >> Actividades anexas a los transportes >> Actividades anexas transporte terrestre >> Guardia y custodia vehiculos en parkings', 'dependency_group': 2206253, 'parent_id': 73311188, 'tree_id': 129, 'asia_id': '0012905000007005001002', 'core_type': 'event', 'level': 5}]
In [719]:
classifications_data = [] for i in df_other_establishments2['classifications_data']: c = list([x.get('name') for x in i])[0] #to get first item from dictionary with 'name' as key classifications_data.append(c) df_other_establishments2['category_oe'] = classifications_data df_other_establishments2.drop(columns=['classifications_data'], inplace=True) df_other_establishments2.head(1)
Out[719]:
name | addresses | latitude_oe | longitude_oe | category_oe | |
---|---|---|---|---|---|
0 | Apartament TurÃstic Atenea Calabria – ATB-000001 | [{‘place’: None, ‘district_name’: ‘Eixample’, … | 41.38096727220521 | 2.1532132712982266 | Apartaments turÃstics |
In [720]:
df_other_establishments2.loc[0,'addresses']
Out[720]:
[{'place': None, 'district_name': 'Eixample', 'district_id': '02', 'neighborhood_name': "la Nova Esquerra de l'Eixample", 'neighborhood_id': '09', 'address_name': 'C CalÃxa0bria', 'address_id': '054509', 'block_id': None, 'start_street_number': 129, 'end_street_number': None, 'street_number_1': '129', 'street_number_2': None, 'stairs': None, 'level': None, 'door': None, 'zip_code': '08015', 'province': 'BARCELONA', 'town': 'BARCELONA', 'country': 'ESPANYA', 'comments': None, 'position': 0, 'main_address': True, 'road_name': None, 'road_id': None, 'roadtype_name': None, 'roadtype_id': None, 'location': {'type': 'GeometryCollection', 'geometries': [{'type': 'Point', 'coordinates': [429195.2433352358, 4581395.340320568]}]}, 'related_entity': None, 'related_entity_data': None, 'hide_address': False}]
In [721]:
district_name = [] district_id = [] neighborhood_name = [] neighborhood_id = [] address_name = [] street_number_1 = [] street_number_2 = [] for i in df_other_establishments2['addresses']: dn = list([x.get('district_name') for x in i])[0] dc = list([x.get('district_id') for x in i])[0] nn = list([x.get('neighborhood_name') for x in i])[0] nc = list([x.get('neighborhood_id') for x in i])[0] an = list([x.get('address_name') for x in i])[0] sn1 = list([x.get('street_number_1') for x in i])[0] sn2 = list([x.get('street_number_2') for x in i])[0] district_name.append(dn) district_id.append(dc) neighborhood_name.append(nn) neighborhood_id.append(nc) address_name.append(an) street_number_1.append(sn1) street_number_2.append(sn2) df_other_establishments2['district_code_oe'] = district_id df_other_establishments2['district_name_oe'] = district_name df_other_establishments2['neighbourhood_code_oe'] = neighborhood_id df_other_establishments2['neighbourhood_name_oe'] = neighborhood_name df_other_establishments2['address_name_oe'] = address_name df_other_establishments2['street_number_1_oe'] = street_number_1 df_other_establishments2['street_number_2_oe'] = street_number_2 df_other_establishments2.drop(columns='addresses', inplace=True) df_other_establishments2.head(1)
Out[721]:
name | latitude_oe | longitude_oe | category_oe | district_code_oe | district_name_oe | neighbourhood_code_oe | neighbourhood_name_oe | address_name_oe | street_number_1_oe | street_number_2_oe | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Apartament TurÃstic Atenea Calabria – ATB-000001 | 41.38096727220521 | 2.1532132712982266 | Apartaments turÃstics | 02 | Eixample | 09 | la Nova Esquerra de l’Eixample | C Calà bria | 129 | None |
In [722]:
df_other_establishments2.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 206 entries, 0 to 205 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 206 non-null object 1 latitude_oe 206 non-null object 2 longitude_oe 206 non-null object 3 category_oe 206 non-null object 4 district_code_oe 206 non-null object 5 district_name_oe 206 non-null object 6 neighbourhood_code_oe 206 non-null object 7 neighbourhood_name_oe 206 non-null object 8 address_name_oe 206 non-null object 9 street_number_1_oe 206 non-null object 10 street_number_2_oe 0 non-null object dtypes: object(11) memory usage: 17.8+ KB
In [723]:
df_other_establishments2['category_oe'].value_counts()
Out[723]:
Albergs 121 Residències 61 Apartaments turÃstics 12 Col·legis majors 11 Empreses de serveis 1 Name: category_oe, dtype: int64
In [724]:
df_other_establishments3 = df_other_establishments2.copy()
In [725]:
#RTC ID CODE IS INCLUDED IN MOST NAMES, ADDED AT THE END FOLLOWING A "-" AS SEPARATING CHARACTER df_other_establishments3['name']
Out[725]:
0 Apartament TurÃstic Atenea Calabria - ATB-000001 1 Col.legi Major Penyafort - Montserrat 2 Col.legi Major Bonaigua 3 Residència Università ria Mare Anna Ravell 4 Centre dâAcollida del Baix Guinardó per a l... ... 201 Residència Salesiana Martà Codolar 202 Residència d'Estudiants Mare Güell 203 Residència Erasmus Grà cia 204 Residència d'Estudiants LIV Student Sarrià 205 Residència Università ria Josep Manyanet Name: name, Length: 206, dtype: object
In [726]:
#3 RESIDENCIES, 1 COLLEGIS MAJOR AND EMPRESES DE SERVEIS APPEAR TO HAVE THE SEPARATING CHARACTER df_other_establishments3[['name','category_oe']][(df_other_establishments3['name'].str.contains(' - ')) & (df_other_establishments3['category_oe'] != 'Albergs') & (df_other_establishments3['category_oe'] != 'Apartaments turÃstics')]
Out[726]:
name | category_oe | |
---|---|---|
1 | Col.legi Major Penyafort – Montserrat | Col·legis majors |
91 | Residència d’Estudiants Vita Student – Poblenou | Residències |
138 | Residència d’Estudiants Vita Student – Pedralbes | Residències |
179 | Residència per a Investigadors CSIC – General… | Residències |
In [727]:
#HOWEVER, LIKE THE OTHERS, IT IS NOT FOLLOWED BY AN RTC CODE df_other_establishments3[['name','category_oe']][(df_other_establishments3['category_oe'] != 'Albergs') & (df_other_establishments3['category_oe'] != 'Apartaments turÃstics')]
Out[727]:
name | category_oe | |
---|---|---|
1 | Col.legi Major Penyafort – Montserrat | Col·legis majors |
2 | Col.legi Major Bonaigua | Col·legis majors |
3 | Residència Università ria Mare Anna Ravell | Residències |
4 | Centre dâAcollida del Baix Guinardó per a l… | Residències |
12 | Casa Sant Felip Neri | Residències |
… | … | … |
201 | Residència Salesiana Martà Codolar | Residències |
202 | Residència d’Estudiants Mare Güell | Residències |
203 | Residència Erasmus Grà cia | Residències |
204 | Residència d’Estudiants LIV Student Sarrià | Residències |
205 | Residència Università ria Josep Manyanet | Residències |
73 rows × 2 columns
In [728]:
#AMONG ALBERGS AND APARTMENTS TURISTICS - ONLY 1 APARTMENTS TURISTICS APPEARS TO BE MISSING THE SEPARATING CHARACTER df_other_establishments3[['name','category_oe']][(~df_other_establishments3['name'].str.contains(' - ')) & ((df_other_establishments3['category_oe'] == 'Albergs') | (df_other_establishments3['category_oe'] == 'Apartaments turÃstics'))].value_counts('category_oe')
Out[728]:
category_oe Apartaments turÃstics 1 dtype: int64
In [729]:
#IN THIS CASE, THE SEPARATING CHARACTER IS FOLLOWED BY AN RTC CODE df_other_establishments3[['name','category_oe']][(~df_other_establishments3['name'].str.contains(' - ')) & ((df_other_establishments3['category_oe'] == 'Albergs') | (df_other_establishments3['category_oe'] == 'Apartaments turÃstics'))]
Out[729]:
name | category_oe | |
---|---|---|
53 | Apartament TurÃstic Midtown Apartments- ATB-0… | Apartaments turÃstics |
In [730]:
#ALBERGS AND APARTMENTS TURISTIC ALL HAVE A "-" SEPARATING CHARACTER, FOLLOWED BY AN RTC CODE - HOWEVER TO AVOID SPLITTING RECORDS IN OTHER CATEGORIES, IT IS BEST TO USE "- A" AS SPLITTING CHARACTER AND THEN REINSTATE THE A df_other_establishments3[['name','category_oe']][(df_other_establishments3['name'].str.contains(' - ')) & ((df_other_establishments3['category_oe'] == 'Albergs') | (df_other_establishments3['category_oe'] == 'Apartaments turÃstics'))].sort_values('category_oe')
Out[730]:
name | category_oe | |
---|---|---|
79 | Alberg Fabrizzios Terrace Barcelona – AJ000615 | Albergs |
114 | Primavera Hostel – AJ000579 | Albergs |
113 | Alberg Campus del Mar – AJ000520 | Albergs |
112 | Alberg Coroleu House – AJ000571 | Albergs |
111 | Alberg Casa Kessler Barcelona B – AJ000582 | Albergs |
… | … | … |
16 | Apartament TurÃstic Tibidabo Apartments – ATB… | Apartaments turÃstics |
89 | Apartament TurÃstic Descartes – ATB-000044 | Apartaments turÃstics |
54 | Apartament TurÃstic Hostemplo – ATB-000089 | Apartaments turÃstics |
186 | Apartament TurÃstic DV – ATB-000083 | Apartaments turÃstics |
0 | Apartament TurÃstic Atenea Calabria – ATB-000001 | Apartaments turÃstics |
132 rows × 2 columns
In [731]:
df_other_establishments4 = df_other_establishments3.copy()
In [732]:
#FUNCTION TO EXTRACT ID COLUMN def rtc_split_oe(arg): if "- A" in arg: return arg.split("- A", 1) # 1 : to split at the first found only else: return [arg, None] # None : to add a Null value when split character not found and so preserve the same column length
In [733]:
df_other_establishments4['name'] = df_other_establishments4['name'].astype(str) df_other_establishments4[['name_oe','rtc_oe']] = [rtc_split_oe(x) for x in df_other_establishments4['name']] df_other_establishments4['name_oe'] = df_other_establishments4['name_oe'].str.strip() df_other_establishments4['rtc_oe'] = "A" + df_other_establishments4['rtc_oe'] #REINSTATE THE 'A' df_other_establishments4['rtc_oe'] = df_other_establishments4['rtc_oe'].str.strip() df_other_establishments4.drop(columns=['name'],inplace=True) df_other_establishments4.head(1)
Out[733]:
latitude_oe | longitude_oe | category_oe | district_code_oe | district_name_oe | neighbourhood_code_oe | neighbourhood_name_oe | address_name_oe | street_number_1_oe | street_number_2_oe | name_oe | rtc_oe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41.38096727220521 | 2.1532132712982266 | Apartaments turÃstics | 02 | Eixample | 09 | la Nova Esquerra de l’Eixample | C Calà bria | 129 | None | Apartament TurÃstic Atenea Calabria | ATB-000001 |
In [734]:
#COMPARING THE INITIAL RESULTS df_other_establishments4['category_oe'].value_counts()
Out[734]:
Albergs 121 Residències 61 Apartaments turÃstics 12 Col·legis majors 11 Empreses de serveis 1 Name: category_oe, dtype: int64
In [735]:
#NOW THE RTC IS NOT PRESENT IN THESE CATEGORIES df_other_establishments4['category_oe'][df_other_establishments4['rtc_oe'].isnull()].value_counts()
Out[735]:
Residències 61 Col·legis majors 11 Empreses de serveis 1 Name: category_oe, dtype: int64
In [736]:
#AND IT IS PRESENT IN THESE CATEGORIES df_other_establishments4['category_oe'][df_other_establishments4['rtc_oe'].notnull()].value_counts()
Out[736]:
Albergs 121 Apartaments turÃstics 12 Name: category_oe, dtype: int64
THE FOCUS OF THIS ANALYSIS IS ON ESTABLISHMENTS FOR T0URISTS
THEREFORE ONLY ALBERGS AND APARTMENTS TURISTICS WILL BE RETAINED
In [737]:
df_other_establishments5 = df_other_establishments4.copy()
In [738]:
#DATAFRAME FOR "Apartaments turÃstics" - at df_touristapartment_coordinates = df_other_establishments5[df_other_establishments5['category_oe']=='Apartaments turÃstics'] df_touristapartment_coordinates['category_oe'].value_counts()
Out[738]:
Apartaments turÃstics 12 Name: category_oe, dtype: int64
In [739]:
#DATAFRAME FOR "ALBERGS" - al df_alberg_coordinates = df_other_establishments5[df_other_establishments5['category_oe']=='Albergs'] df_alberg_coordinates['category_oe'].value_counts()
Out[739]:
Albergs 121 Name: category_oe, dtype: int64
DATAFRAME: COORDINATES FOR “APARTAMENTS TURISTICS” – _touristapartment (FROM OTHER ESTABLISHMENTS)¶
SOURCE:
SECTION OTHER ESTABLISHMENT: df_at
In [740]:
df_at = df_touristapartment_coordinates.copy()
In [741]:
df_at.columns
Out[741]:
Index(['latitude_oe', 'longitude_oe', 'category_oe', 'district_code_oe', 'district_name_oe', 'neighbourhood_code_oe', 'neighbourhood_name_oe', 'address_name_oe', 'street_number_1_oe', 'street_number_2_oe', 'name_oe', 'rtc_oe'], dtype='object')
In [742]:
#RENAME COLUMNS df_at.columns = ['latitude_at', 'longitude_at', 'category_at', 'district_code_at', 'district_name_at', 'neighbourhood_code_at', 'neighbourhood_name_at', 'address_name_at', 'street_number_1_at', 'street_number_2_at', 'name_at', 'rtc_at'] df_at.head(1)
Out[742]:
latitude_at | longitude_at | category_at | district_code_at | district_name_at | neighbourhood_code_at | neighbourhood_name_at | address_name_at | street_number_1_at | street_number_2_at | name_at | rtc_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41.38096727220521 | 2.1532132712982266 | Apartaments turÃstics | 02 | Eixample | 09 | la Nova Esquerra de l’Eixample | C Calà bria | 129 | None | Apartament TurÃstic Atenea Calabria | ATB-000001 |
WHITESPACES¶
In [743]:
#REMOVING SPACES # .replace(' ','', regex=True) - replace all spaces with nothing # .str.strip() - replace 1 initial and 1 trailing space only # .replace(r's+',' ', regex=True) - replace multiple spaces with one single space # .replace(r'^s+|s+$','',regex=True) - replace all + spaces s starting ^ and trailing $ # .replace('nan','', regex=True) - replace pre-existing 'nan' strings into empty cells - not to be used for string columns potentially containing nan as subpart of string # .replace('.0','',regex=True) - replace .0 with nothing - '' is required to assign '.' as a normal character and not as a special one df_at['rtc_at'] = df_at['rtc_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_at['category_at'] = df_at['category_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_at['address_name_at'] = df_at['address_name_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_at['street_number_1_at'] = df_at['street_number_1_at'].astype(str).replace(r' ','',regex=True).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_at['street_number_2_at'] = df_at['street_number_2_at'].astype(str).replace(r' ','',regex=True).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_at['district_code_at'] = df_at['district_code_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True) df_at['district_name_at'] = df_at['district_name_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_at['neighbourhood_code_at'] = df_at['neighbourhood_code_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True) df_at['neighbourhood_name_at'] = df_at['neighbourhood_name_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_at['longitude_at'] = df_at['longitude_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_at['latitude_at'] = df_at['latitude_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_at['name_at'] = df_at['name_at'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
In [744]:
#DISTRICT AND NEIGHBOURHOOD CODES NEED TO BE IN STRING FORMAT AND REQUIRE AN ADDED '0' IN FRONT OF ALL NUMBERS BELOW 10 df_at[['district_code_at']] = df_at[['district_code_at']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x]) df_at[['neighbourhood_code_at']] = df_at[['neighbourhood_code_at']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
In [745]:
#REPLACE CELL THAT IS ENTIRELY SPACE OR EMPTY with None df_at = df_at.applymap(lambda x: None if isinstance(x, str) and (x=='' or x.isspace()) else x)
In [746]:
df_at.head(1)
Out[746]:
latitude_at | longitude_at | category_at | district_code_at | district_name_at | neighbourhood_code_at | neighbourhood_name_at | address_name_at | street_number_1_at | street_number_2_at | name_at | rtc_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41.38096727220521 | 2.1532132712982266 | Apartaments turÃstics | 02 | Eixample | 09 | la Nova Esquerra de l’Eixample | C Calà bria | 129 | None | Apartament TurÃstic Atenea Calabria | ATB-000001 |
In [747]:
df_at1 = df_at.copy()
DUPLICATES¶
In [748]:
#PRELIMINARY CHECK df_at1.duplicated().value_counts()
Out[748]:
False 12 dtype: int64
In [749]:
#CHECK ON ID COLUMN df_at1[df_at1.duplicated(subset=['rtc_at'], keep=False)]
Out[749]:
latitude_at | longitude_at | category_at | district_code_at | district_name_at | neighbourhood_code_at | neighbourhood_name_at | address_name_at | street_number_1_at | street_number_2_at | name_at | rtc_at |
---|
MISSING VALUES¶
In [750]:
df_at1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 12 entries, 0 to 195 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 latitude_at 12 non-null object 1 longitude_at 12 non-null object 2 category_at 12 non-null object 3 district_code_at 12 non-null object 4 district_name_at 12 non-null object 5 neighbourhood_code_at 12 non-null object 6 neighbourhood_name_at 12 non-null object 7 address_name_at 12 non-null object 8 street_number_1_at 12 non-null object 9 street_number_2_at 0 non-null object 10 name_at 12 non-null object 11 rtc_at 12 non-null object dtypes: object(12) memory usage: 1.2+ KB
In [751]:
df_at1['category_at'].value_counts()
Out[751]:
Apartaments turÃstics 12 Name: category_at, dtype: int64
RTC¶
In [752]:
#IDENTIFY NULL VALUES df_at1[(df_at1['rtc_at'].isnull()) | (df_at1['rtc_at']=='nan') | (df_at1['rtc_at']==None) | (df_at1['rtc_at']=='')].shape[0]
Out[752]:
0
DISTRICT – NEIGHBOURHOOD¶
In [753]:
#IDENTIFY NULL VALUES df_at1[(df_at1['neighbourhood_code_at'].isnull()) | (df_at1['neighbourhood_code_at']=='nan') | (df_at1['neighbourhood_code_at']==None) | (df_at1['neighbourhood_code_at']=='')].shape[0]
Out[753]:
0
In [754]:
#IDENTIFY NULL VALUES df_at1[(df_at1['neighbourhood_name_at'].isnull()) | (df_at1['neighbourhood_name_at']=='nan') | (df_at1['neighbourhood_name_at']==None) | (df_at1['neighbourhood_name_at']=='')].shape[0]
Out[754]:
0
In [755]:
#CHECK IF NEIGHBOURHOOD NAMES ARE COMPATIBLE WITH NEIGHBOURHOOD TABLE df_at1[~df_at1['neighbourhood_name_at'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]
Out[755]:
5
In [756]:
#CHECK IF NEIGHBOURHOOD CODES ARE COMPATIBLE WITH NEIGHBOURHOOD TABLE df_at1[~df_at1['neighbourhood_code_at'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])].shape[0]
Out[756]:
0
In [757]:
#MAPPING/REPLACING REQUIRES NO NULL VALUES IN COLUMN LINKED TO set_index COLUMN df_at1['neighbourhood_name_at'] = df_at1['neighbourhood_code_at'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['Neighbourhood_Name']) df_at1['district_code_at'] = df_at1['neighbourhood_code_at'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Code']) df_at1['district_name_at'] = df_at1['neighbourhood_code_at'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Name'])
In [758]:
#CHECK IF NEIGHBOURHOOD NAMES ARE COMPATIBLE WITH NEIGHBOURHOOD TABLE df_at1[~df_at1['neighbourhood_name_at'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]
Out[758]:
0
In [759]:
df_at2 = df_at1.copy()
NORMALIZATION¶
In [760]:
df_at2.columns
Out[760]:
Index(['latitude_at', 'longitude_at', 'category_at', 'district_code_at', 'district_name_at', 'neighbourhood_code_at', 'neighbourhood_name_at', 'address_name_at', 'street_number_1_at', 'street_number_2_at', 'name_at', 'rtc_at'], dtype='object')
In [761]:
#CHECK 'street_number_1' df_at2[df_at2['street_number_1_at'].str.isdecimal()==False].value_counts('street_number_1_at')
Out[761]:
street_number_1_at 276*280 1 dtype: int64
In [762]:
#CHECK 'street_number_2' df_at2[df_at2['street_number_2_at'].str.isdecimal()==False].value_counts('street_number_2_at')
Out[762]:
Series([], dtype: int64)
In [763]:
def split_street_number_1_2 (df_target,column_address_1,column_address_2): df = df_target.loc[df_target[column_address_1].str.isdecimal()==False,column_address_1] df = df.str.split(pat='(d+)', expand=True) df_target.loc[df_target[column_address_1].str.isdecimal()==False, column_address_2] = df.iloc[:,3] #column_address_2 needs to precede column_address_1 df_target.loc[df_target[column_address_1].str.isdecimal()==False, column_address_1] = df.iloc[:,1] #the condition on which .loc is based is lost return print('Split Values:'), df
In [764]:
split_street_number_1_2(df_at2,'street_number_1_at','street_number_2_at')
Split Values:
Out[764]:
(None, 0 1 2 3 4 54 276 * 280 )
In [765]:
#CHECK AGAIN 'street_number_1' df_at2[df_at2['street_number_1_at'].str.isdecimal()==False].value_counts('street_number_1_at')
Out[765]:
Series([], dtype: int64)
In [766]:
#CHECK AGAIN 'street_number_2' df_at2[df_at2['street_number_2_at'].str.isdecimal()==False].value_counts('street_number_2_at')
Out[766]:
Series([], dtype: int64)
In [767]:
df_at2.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 12 entries, 0 to 195 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 latitude_at 12 non-null object 1 longitude_at 12 non-null object 2 category_at 12 non-null object 3 district_code_at 12 non-null object 4 district_name_at 12 non-null object 5 neighbourhood_code_at 12 non-null object 6 neighbourhood_name_at 12 non-null object 7 address_name_at 12 non-null object 8 street_number_1_at 12 non-null object 9 street_number_2_at 1 non-null object 10 name_at 12 non-null object 11 rtc_at 12 non-null object dtypes: object(12) memory usage: 1.2+ KB
In [768]:
df_touristapartment = df_at2.copy()
DATAFRAME: COORDINATES FOR “ALBERGS” – alberg (FROM OTHER ESTABLISHMENTS)¶
SOURCE:
SECTION OTHER ESTABLISHMENT: df_albergs
In [769]:
df_al = df_alberg_coordinates.copy()
In [770]:
df_al.columns
Out[770]:
Index(['latitude_oe', 'longitude_oe', 'category_oe', 'district_code_oe', 'district_name_oe', 'neighbourhood_code_oe', 'neighbourhood_name_oe', 'address_name_oe', 'street_number_1_oe', 'street_number_2_oe', 'name_oe', 'rtc_oe'], dtype='object')
In [771]:
#RENAME COLUMNS df_al.columns = ['latitude_al', 'longitude_al', 'category_al', 'district_code_al', 'district_name_al', 'neighbourhood_code_al', 'neighbourhood_name_al', 'address_name_al', 'street_number_1_al', 'street_number_2_al', 'name_al', 'rtc_al'] df_al.head(1)
Out[771]:
latitude_al | longitude_al | category_al | district_code_al | district_name_al | neighbourhood_code_al | neighbourhood_name_al | address_name_al | street_number_1_al | street_number_2_al | name_al | rtc_al | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 41.39116166395002 | 2.184353207057929 | Albergs | 10 | Sant Martà | 66 | el Parc i la Llacuna del Poblenou | C Buenaventura Muñoz | 16 | None | Alberg Arc House | AJ000645 |
WHITESPACES¶
In [772]:
#REMOVING SPACES # .replace(' ','', regex=True) - replace all spaces with nothing # .str.strip() - replace 1 initial and 1 trailing space only # .replace(r's+',' ', regex=True) - replace multiple spaces with one single space # .replace(r'^s+|s+$','',regex=True) - replace all + spaces s starting ^ and trailing $ # .replace('nan','', regex=True) - replace pre-existing 'nan' strings into empty cells - not to be used for string columns potentially containing nan as subpart of string # .replace('.0','',regex=True) - replace .0 with nothing - '' is required to assign '.' as a normal character and not as a special one df_al['rtc_al'] = df_al['rtc_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_al['category_al'] = df_al['category_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_al['address_name_al'] = df_al['address_name_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_al['street_number_1_al'] = df_al['street_number_1_al'].astype(str).replace(r' ','',regex=True).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_al['street_number_2_al'] = df_al['street_number_2_al'].astype(str).replace(r' ','',regex=True).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_al['district_code_al'] = df_al['district_code_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True) df_al['district_name_al'] = df_al['district_name_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_al['neighbourhood_code_al'] = df_al['neighbourhood_code_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('.0','',regex=True) df_al['neighbourhood_name_al'] = df_al['neighbourhood_name_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True) df_al['longitude_al'] = df_al['longitude_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_al['latitude_al'] = df_al['latitude_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True).replace('nan','', regex=True).replace('None','', regex=True) df_al['name_al'] = df_al['name_al'].astype(str).replace(r's+',' ',regex=True).replace(r'^s+|s+$','',regex=True)
In [773]:
#DISTRICT AND NEIGHBOURHOOD CODES NEED TO BE IN STRING FORMAT AND REQUIRE AN ADDED '0' IN FRONT OF ALL NUMBERS BELOW 10 df_al[['district_code_al']] = df_al[['district_code_al']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x]) df_al[['neighbourhood_code_al']] = df_al[['neighbourhood_code_al']].apply(lambda x: ['0{}'.format(y) if len(y) == 1 else y for y in x])
In [774]:
#REPLACE CELL THAT IS ENTIRELY SPACE OR EMPTY with None df_al = df_al.applymap(lambda x: None if isinstance(x, str) and (x=='' or x.isspace()) else x)
In [775]:
df_al1 = df_al.copy()
DUPLICATES¶
In [776]:
#PRELIMINARY CHECK df_al1.duplicated().value_counts()
Out[776]:
False 121 dtype: int64
In [777]:
#CHECK ON ID COLUMN df_al1[df_al1.duplicated(subset=['rtc_al'], keep=False)]
Out[777]:
latitude_al | longitude_al | category_al | district_code_al | district_name_al | neighbourhood_code_al | neighbourhood_name_al | address_name_al | street_number_1_al | street_number_2_al | name_al | rtc_al |
---|
MISSING VALUES¶
In [778]:
df_al1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 121 entries, 6 to 200 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 latitude_al 121 non-null object 1 longitude_al 121 non-null object 2 category_al 121 non-null object 3 district_code_al 121 non-null object 4 district_name_al 121 non-null object 5 neighbourhood_code_al 121 non-null object 6 neighbourhood_name_al 121 non-null object 7 address_name_al 121 non-null object 8 street_number_1_al 121 non-null object 9 street_number_2_al 0 non-null object 10 name_al 121 non-null object 11 rtc_al 121 non-null object dtypes: object(12) memory usage: 12.3+ KB
RTC¶
In [779]:
#IDENTIFY NULL VALUES df_al1[(df_al1['rtc_al'].isnull()) | (df_al1['rtc_al']=='nan') | (df_al1['rtc_al']==None) | (df_al1['rtc_al']=='')].shape[0]
Out[779]:
0
DISTRICT – NEIGHBOURHOOD¶
In [780]:
#IDENTIFY NULL VALUES df_al1[(df_al1['neighbourhood_code_al'].isnull()) | (df_al1['neighbourhood_code_al']=='nan') | (df_al1['neighbourhood_code_al']==None) | (df_al1['neighbourhood_code_al']=='')].shape[0]
Out[780]:
0
In [781]:
#IDENTIFY NULL VALUES df_al1[(df_al1['neighbourhood_name_al'].isnull()) | (df_al1['neighbourhood_name_al']=='nan') | (df_al1['neighbourhood_name_al']==None) | (df_al1['neighbourhood_name_al']=='')].shape[0]
Out[781]:
0
In [782]:
#CHECK HOW MANY RECORDS NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_al1[~df_al1['neighbourhood_name_al'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]
Out[782]:
17
In [783]:
#CHECK HOW MANY RECORDS NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_al1[~df_al1['neighbourhood_code_al'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])].shape[0]
Out[783]:
0
In [784]:
#MAPPING/REPLACING REQUIRES NO NULL VALUES IN COLUMN LINKED TO set_index COLUMN df_al1['neighbourhood_name_al'] = df_al1['neighbourhood_code_al'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['Neighbourhood_Name']) df_al1['district_code_al'] = df_al1['neighbourhood_code_al'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Code']) df_al1['district_name_al'] = df_al1['neighbourhood_code_al'].map(df_district_neighbourhood_table.set_index('Neighbourhood_Code')['District_Name'])
In [785]:
#CHECK HOW MANY RECORDS NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_al1[~df_al1['neighbourhood_name_al'].isin(df_district_neighbourhood_table['Neighbourhood_Name'])].shape[0]
Out[785]:
0
In [786]:
#CHECK HOW MANY RECORDS NOT COMPATIBLE WITH NEIGHBOURHOOD TABLE df_al1[~df_al1['neighbourhood_code_al'].isin(df_district_neighbourhood_table['Neighbourhood_Code'])].shape[0]
Out[786]:
0
NORMALIZATION¶
In [787]:
df_al2 = df_al1.copy()
In [788]:
#CHECK 'street_number_1' df_al2[df_al2['street_number_1_al'].str.isdecimal()==False].value_counts('street_number_1_al')
Out[788]:
street_number_1_al 51*LB 2 6*8 2 116*LB 1 8*10 1 70*74 1 58*60 1 56*58 1 55*57 1 52*54 1 48*52 1 149*151 1 45*47 1 41*51 1 402*404 1 38*42 1 33*LB 1 21*23 1 17*19 1 86*88 1 dtype: int64
In [789]:
#CHECK 'street_number_2' df_al2[df_al2['street_number_2_al'].str.isdecimal()==False].value_counts('street_number_2_al')
Out[789]:
Series([], dtype: int64)
In [790]:
def split_street_number_1_2 (df_target,column_address_1,column_address_2): df = df_target.loc[df_target[column_address_1].str.isdecimal()==False,column_address_1] df = df.str.split(pat='(d+)', expand=True) df_target.loc[df_target[column_address_1].str.isdecimal()==False, column_address_2] = df.iloc[:,3] #column_address_2 needs to precede column_address_1 df_target.loc[df_target[column_address_1].str.isdecimal()==False, column_address_1] = df.iloc[:,1] #the condition on which .loc is based is lost return print('Split Values:'), df
In [791]:
split_street_number_1_2(df_al2,'street_number_1_al','street_number_2_al')
Split Values:
Out[791]:
(None, 0 1 2 3 4 7 56 * 58 11 58 * 60 18 17 * 19 19 45 * 47 28 55 * 57 39 6 * 8 43 21 * 23 64 402 * 404 67 48 * 52 94 51 *LB None None 110 116 *LB None None 112 33 *LB None None 115 51 *LB None None 130 6 * 8 140 70 * 74 141 41 * 51 159 38 * 42 160 52 * 54 183 149 * 151 185 86 * 88 192 8 * 10 )
In [792]:
#CHECK AGAIN 'street_number_1' df_al2[df_al2['street_number_1_al'].str.isdecimal()==False].value_counts('street_number_1_al')
Out[792]:
Series([], dtype: int64)
In [793]:
#CHECK AGAIN 'street_number_2' df_al2[df_al2['street_number_2_al'].str.isdecimal()==False].value_counts('street_number_2_al')
Out[793]:
Series([], dtype: int64)
In [794]:
df_al2.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 121 entries, 6 to 200 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 latitude_al 121 non-null object 1 longitude_al 121 non-null object 2 category_al 121 non-null object 3 district_code_al 121 non-null object 4 district_name_al 121 non-null object 5 neighbourhood_code_al 121 non-null object 6 neighbourhood_name_al 121 non-null object 7 address_name_al 121 non-null object 8 street_number_1_al 121 non-null object 9 street_number_2_al 17 non-null object 10 name_al 121 non-null object 11 rtc_al 121 non-null object dtypes: object(12) memory usage: 12.3+ KB
PROBLEM: THE RTC CODES APPEAR IN A DIFFERENT FORMAT
In [795]:
df_n_places[df_n_places['category']=='Albergs'].head(1)
Out[795]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
113 | 02-2014-2576 | ALB-472 | None | Albergs | AV DIAGONAL 436 | AV | DIAGONAL | 436 | None | None | … | None | None | None | 02 | Eixample | 07 | la Dreta de l’Eixample | None | None | 24.0 |
1 rows × 23 columns
In [796]:
df_al2.head(1)
Out[796]:
latitude_al | longitude_al | category_al | district_code_al | district_name_al | neighbourhood_code_al | neighbourhood_name_al | address_name_al | street_number_1_al | street_number_2_al | name_al | rtc_al | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 41.39116166395002 | 2.184353207057929 | Albergs | 10 | Sant Martí | 66 | el Parc i la Llacuna del Poblenou | C Buenaventura Muñoz | 16 | None | Alberg Arc House | AJ000645 |
VERIFY ADDRESS MATCH
In [797]:
df_n_places_albergs_only = df_n_places[df_n_places['category']=='Albergs'] df_n_places_albergs_only.head(1)
Out[797]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
113 | 02-2014-2576 | ALB-472 | None | Albergs | AV DIAGONAL 436 | AV | DIAGONAL | 436 | None | None | … | None | None | None | 02 | Eixample | 07 | la Dreta de l’Eixample | None | None | 24.0 |
1 rows × 23 columns
In [798]:
df_al_verify_address_match = df_al2.copy()
In [799]:
df_n_places_al_verify_address_match = df_n_places_albergs_only.copy()
In [800]:
#TO MERGE ON PARTIAL ADDRESS: #STRIP DIGITS FROM df_n_places_al_verify_address_match['address_verify'] = df_n_places_al_verify_address_match['address'].replace(r'd+','',regex=True).replace(r's+',' ',regex=True).str.strip() #UPPER CASE THE 'address_name_al' df_al_verify_address_match['address_verify_al'] = df_al_verify_address_match['address_name_al'].str.upper()
In [801]:
df_n_places_al_verify_address_match['address_verify']
Out[801]:
113 AV DIAGONAL 148 AV DIAGONAL 149 AV DIAGONAL 265 AV ICARIA 336 AV MERIDIANA ... 10125 RDA SANT PERE PR 10160 RDA UNIVERSITAT EN 10330 VIA AUGUSTA 10331 VIA AUGUSTA 10338 VIA JULIA Name: address_verify, Length: 125, dtype: object
In [802]:
df_verify_address_match = pd.merge(df_n_places_al_verify_address_match,df_al_verify_address_match, how='inner', left_on=['address_verify','street_number_1','neighbourhood_code'], right_on=['address_verify_al','street_number_1_al','neighbourhood_code_al']) df_verify_address_match[['rtc','rtc_al','street_number_1','street_number_1_al','neighbourhood_code','neighbourhood_code_al']]
Out[802]:
rtc | rtc_al | street_number_1 | street_number_1_al | neighbourhood_code | neighbourhood_code_al | |
---|---|---|---|---|---|---|
0 | ALB-472 | AJ000472 | 436 | 436 | 07 | 07 |
1 | ALB-562 | AJ000562 | 578 | 578 | 26 | 26 |
2 | ALB-562 | AJ000562 | 578 | 578 | 26 | 26 |
3 | ALB-491 | AJ000491 | 97 | 97 | 65 | 65 |
4 | ALB-460 | AJ000460 | 52 | 52 | 36 | 36 |
5 | ALB-529 | AJ000529 | 12 | 12 | 11 | 11 |
6 | ALB-593 | AJ000593 | 75 | 75 | 08 | 08 |
7 | ALB-539 | AJ000539 | 3 | 3 | 07 | 07 |
8 | ALB-608 | AJ000608 | 65 | 65 | 07 | 07 |
9 | ALB-665 | AJ000665 | 48 | 48 | 11 | 11 |
10 | ALB-496 | AJ000496 | 52 | 52 | 08 | 08 |
11 | ALB-605 | AJ000605 | 30 | 30 | 15 | 15 |
12 | ALB-639 | AJ000639 | 355 | 355 | 07 | 07 |
13 | ALB-15 | AJ000015 | 56 | 56 | 23 | 23 |
14 | ALB-471 | AJ000471 | 17 | 17 | 02 | 02 |
15 | ALB-517 | AJ000517 | 5 | 5 | 64 | 64 |
16 | ALB-670 | AJ000670 | 176 | 176 | 07 | 07 |
17 | ALB-512 | AJ000512 | 8 | 8 | 11 | 11 |
18 | ALB-537 | AJ000537 | 17 | 17 | 31 | 31 |
19 | ALB-651 | AJ000651 | 237 | 237 | 06 | 06 |
20 | ALB-427 | AJ000427 | 2 | 2 | 22 | 22 |
21 | ALB-635 | AJ000635 | 290 | 290 | 07 | 07 |
22 | ALB-532 | AJ000532 | 70 | 70 | 18 | 18 |
23 | ALB-598 | AJ000598 | 38 | 38 | 31 | 31 |
24 | ALB-440 | AJ000440 | 91 | 91 | 01 | 01 |
25 | ALB-559 | AJ000559 | 5 | 5 | 18 | 18 |
26 | ALB-486 | AJ000486 | 23 | 23 | 11 | 11 |
27 | ALB-638 | AJ000638 | 20 | 20 | 11 | 11 |
28 | ALB-535 | AJ000535 | 35 | 35 | 31 | 31 |
29 | ALB-682 | AJ000682 | 43 | 43 | 18 | 18 |
30 | ALB-667 | AJ000667 | 20 | 20 | 18 | 18 |
31 | ALB-442 | AJ000442 | 5 | 5 | 04 | 04 |
32 | ALB-568 | AJ000568 | 563 | 563 | 08 | 08 |
33 | ALB-580 | AJ000580 | 628 | 628 | 07 | 07 |
34 | ALB-520 | AJ000520 | 4 | 4 | 03 | 03 |
35 | ALB-531 | AJ000531 | 5 | 5 | 20 | 20 |
36 | ALB-609 | AJ000609 | 64 | 64 | 07 | 07 |
37 | ALB-611 | AJ000611 | 51 | 51 | 10 | 10 |
38 | ALB-587 | AJ000587 | 56 | 56 | 07 | 07 |
39 | ALB-631 | AJ000631 | 65 | 65 | 26 | 26 |
40 | ALB-631 | AJ000632 | 65 | 65 | 26 | 26 |
41 | ALB-632 | AJ000631 | 65 | 65 | 26 | 26 |
42 | ALB-632 | AJ000632 | 65 | 65 | 26 | 26 |
BY MERGING THE DATAFRAMES ON PARTIAL ADDRESS, THE RTC CODES APPEAR TO MATCH EXCEPT FOR THE STARTING LETTERS
THEREFORE:
MODIFY THE RTC CODES ON THE STARTING LETTERS AND VERIFY RTC MATCH:
In [803]:
df_n_places_al_verify_rtc_match = df_n_places[df_n_places['category']=='Albergs'] df_n_places_al_verify_rtc_match.head(1)
Out[803]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
113 | 02-2014-2576 | ALB-472 | None | Albergs | AV DIAGONAL 436 | AV | DIAGONAL | 436 | None | None | … | None | None | None | 02 | Eixample | 07 | la Dreta de l’Eixample | None | None | 24.0 |
1 rows × 23 columns
In [804]:
df_al_verify_rtc_match = df_al2.copy()
In [805]:
df_al_verify_rtc_match['rtc_al_modified'] = df_al_verify_rtc_match['rtc_al'].astype(str).str.replace('AJ000','ALB-') df_al_verify_rtc_match.head(1)
Out[805]:
latitude_al | longitude_al | category_al | district_code_al | district_name_al | neighbourhood_code_al | neighbourhood_name_al | address_name_al | street_number_1_al | street_number_2_al | name_al | rtc_al | rtc_al_modified | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 41.39116166395002 | 2.184353207057929 | Albergs | 10 | Sant Martí | 66 | el Parc i la Llacuna del Poblenou | C Buenaventura Muñoz | 16 | None | Alberg Arc House | AJ000645 | ALB-645 |
In [806]:
df_verify_rtc_match = df_al_verify_rtc_match.merge(df_n_places_al_verify_rtc_match, how='inner', left_on=['rtc_al_modified'], right_on=['rtc'])
In [807]:
df_verify_rtc_match
Out[807]:
latitude_al | longitude_al | category_al | district_code_al | district_name_al | neighbourhood_code_al | neighbourhood_name_al | address_name_al | street_number_1_al | street_number_2_al | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41.39116166395002 | 2.184353207057929 | Albergs | 10 | Sant Martí | 66 | el Parc i la Llacuna del Poblenou | C Buenaventura Muñoz | 16 | None | … | None | 1 | 2 | 10 | Sant Martí | 66 | el Parc i la Llacuna del Poblenou | None | None | 18.0 |
1 | 41.39297340160799 | 2.125102845599096 | Albergs | 05 | Sarrià-Sant Gervasi | 23 | Sarrià | C Capità Arenas | 56 | 58 | … | None | None | None | 05 | Sarrià-Sant Gervasi | 23 | Sarrià | None | None | 219.0 |
2 | 41.37924859372896 | 2.137341737796588 | Albergs | 03 | Sants-Montjuïc | 18 | Sants | C Vallespir | 34 | None | … | None | BJ | None | 03 | Sants-Montjuïc | 18 | Sants | None | None | 13.0 |
3 | 41.373334952061036 | 2.1657931149717276 | Albergs | 03 | Sants-Montjuïc | 11 | el Poble-sec | C Salvà | 36 | None | … | None | None | None | 03 | Sants-Montjuïc | 11 | el Poble-sec | None | None | 60.0 |
4 | 41.39206371135132 | 2.1705001263052957 | Albergs | 02 | Eixample | 07 | la Dreta de l’Eixample | C Roger de Llúria | 40 | None | … | None | 1 | 1 | 02 | Eixample | 07 | la Dreta de l’Eixample | None | None | 19.0 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
109 | 41.422063723798516 | 2.101812736653692 | Albergs | 05 | Sarrià-Sant Gervasi | 22 | Vallvidrera, el Tibidabo i les Planes | C Major del Rectoret | 2 | None | … | None | None | None | 05 | Sarrià-Sant Gervasi | 22 | Vallvidrera, el Tibidabo i les Planes | None | None | 247.0 |
110 | 41.3735956133217 | 2.169147595451214 | Albergs | 03 | Sants-Montjuïc | 11 | el Poble-sec | C Lafont | 8 | 10 | … | None | None | None | 03 | Sants-Montjuïc | 11 | el Poble-sec | None | None | 148.0 |
111 | 41.375888785538486 | 2.171010168799885 | Albergs | 01 | Ciutat Vella | 01 | el Raval | C Nou de la Rambla | 91 | None | … | None | None | None | 01 | Ciutat Vella | 01 | el Raval | None | None | 100.0 |
112 | 41.38101433917356 | 2.1749336893605618 | Albergs | 01 | Ciutat Vella | 02 | el Barri Gòtic | C Ferran | 17 | None | … | None | None | None | 01 | Ciutat Vella | 02 | el Barri Gòtic | None | None | 151.0 |
113 | 41.394674123922854 | 2.1693004756249827 | Albergs | 02 | Eixample | 07 | la Dreta de l’Eixample | C Bruc | 94 | None | … | None | EN | 2 | 02 | Eixample | 07 | la Dreta de l’Eixample | None | None | 16.0 |
114 rows × 36 columns
THE MODIFIED RTC COLUMN IS ADDED TO df_al TO ENABLE MERGING
In [808]:
df_al2['rtc_al_modified'] = df_al2['rtc_al'].str.replace('AJ000','ALB-')
In [809]:
df_al2.head(1)
Out[809]:
latitude_al | longitude_al | category_al | district_code_al | district_name_al | neighbourhood_code_al | neighbourhood_name_al | address_name_al | street_number_1_al | street_number_2_al | name_al | rtc_al | rtc_al_modified | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 41.39116166395002 | 2.184353207057929 | Albergs | 10 | Sant Martí | 66 | el Parc i la Llacuna del Poblenou | C Buenaventura Muñoz | 16 | None | Alberg Arc House | AJ000645 | ALB-645 |
In [810]:
df_alberg = df_al2.copy()
MERGE WITH PANDAS + SQL:¶
– pandasql¶
– duckdb¶
REQUIRED LIBRARIES¶
In [811]:
#FOR SQL ELABORATION - 2 alternative PANDAS libraries: from pandasql import sqldf pysqldf = lambda q: sqldf(q, globals()) import duckdb
USEFUL RESOURCES
https://towardsdatascience.com/query-pandas-dataframe-with-sql-2bb7a509793d
https://hex.tech/blog/how-to-write-sql-in-pandas/
MERGE WITH PANDAS:¶
– pandas.merge()¶
MERGE 1 : df_n_places + df_hut : _m1¶
In [812]:
df_n_places_m1_0 = df_n_places.copy() df_n_places_m1_0.head(1)
Out[812]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01-90-A-128 | HB-003893 | None | Hotel 3 estrelles | GRAVINA 5 7 | nan | GRAVINA | 5 | None | 7 | … | None | None | None | 01 | Ciutat Vella | 01 | el Raval | None | None | 86.0 |
1 rows × 23 columns
In [813]:
df_hut_m1_0 = df_hut.copy() df_hut_m1_0.head(1)
Out[813]:
n_practice_hut | district_code_hut | district_name_hut | neighbourhood_code_hut | neighbourhood_name_hut | street_type_hut | street_hut | street_number_1_hut | street_letter_1_hut | street_number_2_hut | … | block_hut | entrance_hut | stair_hut | floor_hut | door_hut | rtc_hut | n_places_hut | longitude_hut | latitude_hut | name_hut | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 03-2010-0437 | 03 | Sants-Montjuïc | 16 | la Bordeta | Carrer | CONSTITUCIO | 127 | None | 129 | … | None | None | None | 4 | 2 | HUTB-003502 | 7.0 | 2.132214596 | 41.36719526 | Habitatges d’Ús Turístic |
1 rows × 21 columns
In [814]:
df_n_places_m1_0 = df_n_places_m1_0.merge(df_hut_m1_0, how='inner', left_on=['n_practice'], right_on=['n_practice_hut']) df_n_places_m1_0['category'].value_counts()
Out[814]:
Habitatges d'Ús Turístic 9409 Name: category, dtype: int64
In [815]:
df_n_places_m1_1 = df_n_places_m1_0.copy()
In [816]:
df_n_places_m1_1.columns
Out[816]:
Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type', 'street', 'street_number_1', 'street_letter_1', 'street_number_2', 'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door', 'district_code', 'district_name', 'neighbourhood_code', 'neighbourhood_name', 'longitude', 'latitude', 'n_places', 'n_practice_hut', 'district_code_hut', 'district_name_hut', 'neighbourhood_code_hut', 'neighbourhood_name_hut', 'street_type_hut', 'street_hut', 'street_number_1_hut', 'street_letter_1_hut', 'street_number_2_hut', 'street_letter_2_hut', 'block_hut', 'entrance_hut', 'stair_hut', 'floor_hut', 'door_hut', 'rtc_hut', 'n_places_hut', 'longitude_hut', 'latitude_hut', 'name_hut'], dtype='object')
REMAINING RECORDS¶
N_PRACTICE¶
In [817]:
#RECORDS NOT INCLUDED df_hut_remaining = df_n_places_m1_1[((~df_n_places_m1_1['n_practice_hut'].isin(df_n_places_m1_1['n_practice'])) | (df_hut_m1_0['n_practice_hut'].isnull()))] df_hut_remaining
Out[817]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | block_hut | entrance_hut | stair_hut | floor_hut | door_hut | rtc_hut | n_places_hut | longitude_hut | latitude_hut | name_hut |
---|
0 rows × 44 columns
In [818]:
#DROP ADDED COLUMN df_n_places_m1_1.drop(columns=['n_practice_hut'], inplace = True)
COMPARISON¶
RTC¶
In [819]:
#CHECK DIFFERENCES df_n_places_m1_1[['rtc','rtc_hut']][(df_n_places_m1_1['rtc']!= df_n_places_m1_1['rtc_hut']) & (df_n_places_m1_1['rtc_hut'].notnull())]
Out[819]:
rtc | rtc_hut |
---|
In [820]:
#DROP ADDED COLUMN df_n_places_m1_1.drop(columns=['rtc_hut'], inplace = True)
N_PLACES¶
In [821]:
#CHECK DIFFERENCES df_n_places_m1_1[['n_places','n_places_hut']][(df_n_places_m1_1['n_places']!= df_n_places_m1_1['n_places_hut']) & (df_n_places_m1_1['n_places_hut'].notnull())]
Out[821]:
n_places | n_places_hut |
---|
In [822]:
#DROP ADDED COLUMN df_n_places_m1_1.drop(columns=['n_places_hut'], inplace = True)
NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME¶
In [823]:
#CHECK DIFFERENCES df_n_places_m1_1[['district_name','district_name_hut']][(df_n_places_m1_1['district_name']!= df_n_places_m1_1['district_name_hut']) & (df_n_places_m1_1['district_name_hut'].notnull())]
Out[823]:
district_name | district_name_hut |
---|
In [824]:
#DROP ADDED COLUMN df_n_places_m1_1.drop(columns=['district_name_hut'], inplace = True)
In [825]:
#CHECK DIFFERENCES df_n_places_m1_1[['district_code','district_code_hut']][(df_n_places_m1_1['district_code']!= df_n_places_m1_1['district_code_hut']) & (df_n_places_m1_1['district_code_hut'].notnull())]
Out[825]:
district_code | district_code_hut |
---|
In [826]:
#DROP ADDED COLUMN df_n_places_m1_1.drop(columns=['district_code_hut'], inplace = True)
In [827]:
#CHECK DIFFERENCES df_n_places_m1_1[['neighbourhood_code','neighbourhood_code_hut']][(df_n_places_m1_1['neighbourhood_code']!= df_n_places_m1_1['neighbourhood_code_hut']) & (df_n_places_m1_1['neighbourhood_code_hut'].notnull())]
Out[827]:
neighbourhood_code | neighbourhood_code_hut |
---|
In [828]:
#DROP ADDED COLUMN df_n_places_m1_1.drop(columns=['neighbourhood_code_hut'], inplace = True)
In [829]:
#CHECK DIFFERENCES df_n_places_m1_1[['neighbourhood_name','neighbourhood_name_hut']][(df_n_places_m1_1['neighbourhood_name']!= df_n_places_m1_1['neighbourhood_name_hut']) & (df_n_places_m1_1['neighbourhood_name_hut'].notnull())]
Out[829]:
neighbourhood_name | neighbourhood_name_hut |
---|
In [830]:
#DROP ADDED COLUMN df_n_places_m1_1.drop(columns=['neighbourhood_name_hut'], inplace = True)
COLUMN FILLING¶
In [831]:
df_n_places_m1_2 = df_n_places_m1_1.copy()
In [832]:
#longitude df_n_places_m1_2.loc[df_n_places_m1_2['longitude_hut'].notnull(),'longitude'] = df_n_places_m1_2['longitude_hut']
In [833]:
#latitude df_n_places_m1_2.loc[df_n_places_m1_2['latitude_hut'].notnull(),'latitude'] = df_n_places_m1_2['latitude_hut']
In [834]:
#name df_n_places_m1_2.loc[df_n_places_m1_2['name_hut'].notnull(),'name'] = df_n_places_m1_2['name_hut']
In [835]:
df_n_places_m1_2.columns
Out[835]:
Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type', 'street', 'street_number_1', 'street_letter_1', 'street_number_2', 'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door', 'district_code', 'district_name', 'neighbourhood_code', 'neighbourhood_name', 'longitude', 'latitude', 'n_places', 'street_type_hut', 'street_hut', 'street_number_1_hut', 'street_letter_1_hut', 'street_number_2_hut', 'street_letter_2_hut', 'block_hut', 'entrance_hut', 'stair_hut', 'floor_hut', 'door_hut', 'longitude_hut', 'latitude_hut', 'name_hut'], dtype='object')
In [836]:
#DROP df_n_places_m1_2.drop(columns=['street_type_hut', 'street_hut', 'street_number_1_hut', 'street_letter_1_hut', 'street_number_2_hut', 'street_letter_2_hut', 'block_hut', 'entrance_hut', 'stair_hut', 'floor_hut', 'door_hut', 'longitude_hut', 'latitude_hut', 'name_hut'], inplace=True)
In [837]:
df_n_places_m1_3 = df_n_places_m1_2.copy()
MISSING VALUES¶
DISTRICT CODE, DISTRICT NAME, NEIGHBOURHOOD CODE, NEIGHBOURHOOD NAME¶
In [838]:
#CHECK IF THERE NULL VALUES BETWEEN 'district_name','district_code','neighbourhood_name','neighbourhood_code' - longitude IS USED TO LIMIT SEARCH ON ADDED VALUES df_n_places_m1_3[['n_practice','rtc','address','district_name','district_code','neighbourhood_name','neighbourhood_code']][(df_n_places_m1_3['longitude'].notnull()) & ((df_n_places_m1_3['district_name'].isnull()) | (df_n_places_m1_3['district_code'].isnull()) | (df_n_places_m1_3['neighbourhood_name'].isnull()) | (df_n_places_m1_3['neighbourhood_code'].isnull()))]
Out[838]:
n_practice | rtc | address | district_name | district_code | neighbourhood_name | neighbourhood_code |
---|
N_PLACES¶
In [839]:
#CHECK WHICH CATEGORIES HAVE MISSING VALUES OF INTEREST - N PLACES df_n_places_m1_3['category'][(df_n_places_m1_3['n_places'].isnull()) | (df_n_places_m1_3['n_places']==None) | (df_n_places_m1_3['n_places']=='nan') | (df_n_places_m1_3['n_places']=='')].value_counts()
Out[839]:
Habitatges d'Ús Turístic 13 Name: category, dtype: int64
In [840]:
#FOCUS ON Habitatges d'Ús Turístic - SEE GENERAL FEATURES OF THE CATEGORY WITH MISSING DATA ON THE VARIABLE OF INTEREST - N_PLACES df_n_places_m1_3['n_places'][df_n_places_m1_3['category']=="Habitatges d'Ús Turístic"].describe()
Out[840]:
count 9396.000000 mean 6.038101 std 3.499751 min 1.000000 25% 4.000000 50% 5.000000 75% 7.000000 max 79.000000 Name: n_places, dtype: float64
In [841]:
#REPLACE THE MISSING VALUES ON N_PLACES WITH THE MEDIAN FOR THE SAME CATEGORY df_n_places_m1_3.loc[df_n_places_m1_3['category']=="Habitatges d'Ús Turístic",'n_places'] = df_n_places_m1_3.loc[df_n_places_m1_3['category']=="Habitatges d'Ús Turístic",'n_places'].fillna(df_n_places_m1_3.groupby('category')['n_places'].transform('median'))
In [842]:
#CHECK FILLING IN OF MISSING VALUES OF INTEREST - N PLACES df_n_places_m1_3['category'][(df_n_places_m1_3['n_places'].isnull()) | (df_n_places_m1_3['n_places']==None) | (df_n_places_m1_3['n_places']=='nan') | (df_n_places_m1_3['n_places']=='')].value_counts()
Out[842]:
Series([], Name: category, dtype: int64)
The remaining missing values belong to another category. No action is taken now as it might be possible to recover that information later from the table referring to that category.
In [843]:
df_n_places_m1_3.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 9409 entries, 0 to 9408 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 9409 non-null object 1 rtc 9409 non-null object 2 name 9409 non-null object 3 category 9409 non-null object 4 address 9409 non-null object 5 street_type 9409 non-null object 6 street 9409 non-null object 7 street_number_1 9409 non-null object 8 street_letter_1 116 non-null object 9 street_number_2 747 non-null object 10 street_letter_2 3 non-null object 11 block 10 non-null object 12 entrance 3 non-null object 13 stair 689 non-null object 14 floor 9378 non-null object 15 door 8528 non-null object 16 district_code 9409 non-null object 17 district_name 9409 non-null object 18 neighbourhood_code 9409 non-null object 19 neighbourhood_name 9409 non-null object 20 longitude 9409 non-null object 21 latitude 9409 non-null object 22 n_places 9409 non-null float64 dtypes: float64(1), object(22) memory usage: 1.7+ MB
df_hut_n_places_coordinates¶
In [844]:
df_hut_n_places_coordinates = df_n_places_m1_3.copy()
MERGE 2 : df_n_places + df_hotel : _m2¶
In [845]:
df_n_places_m2_0 = df_n_places.copy() df_n_places_m2_0.head(1)
Out[845]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01-90-A-128 | HB-003893 | None | Hotel 3 estrelles | GRAVINA 5 7 | nan | GRAVINA | 5 | None | 7 | … | None | None | None | 01 | Ciutat Vella | 01 | el Raval | None | None | 86.0 |
1 rows × 23 columns
In [846]:
df_hotel_m2_0 = df_hotel.copy() df_hotel_m2_0.head(1)
Out[846]:
street_number_2_hotel | address_hotel | longitude_hotel | latitude_hotel | category_hotel | district_name_hotel | street_number_1_hotel | district_code_hotel | neighbourhood_code_hotel | name_hotel | neighbourhood_name_hotel | rtc_hotel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | C Rambla | 2.170638831395403 | 41.38514182378773 | Hotels 1 estr. | Ciutat Vella | 138 | 01 | 02 | Hotel Toledano | el Barri Gòtic | HB-000480 |
In [847]:
df_hotel_m2_0.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 441 entries, 0 to 440 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 street_number_2_hotel 13 non-null object 1 address_hotel 441 non-null object 2 longitude_hotel 441 non-null object 3 latitude_hotel 441 non-null object 4 category_hotel 441 non-null object 5 district_name_hotel 441 non-null object 6 street_number_1_hotel 439 non-null object 7 district_code_hotel 441 non-null object 8 neighbourhood_code_hotel 441 non-null object 9 name_hotel 441 non-null object 10 neighbourhood_name_hotel 441 non-null object 11 rtc_hotel 440 non-null object dtypes: object(12) memory usage: 41.5+ KB
In [848]:
df_n_places_m2_0 = df_n_places_m2_0.merge(df_hotel_m2_0, how='inner', left_on=['rtc'], right_on=['rtc_hotel'])
In [849]:
df_n_places_m2_0[df_n_places_m2_0['rtc_hotel'].notnull()].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 435 entries, 0 to 434 Data columns (total 35 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 435 non-null object 1 rtc 435 non-null object 2 name 0 non-null object 3 category 435 non-null object 4 address 435 non-null object 5 street_type 435 non-null object 6 street 435 non-null object 7 street_number_1 435 non-null object 8 street_letter_1 1 non-null object 9 street_number_2 88 non-null object 10 street_letter_2 0 non-null object 11 block 0 non-null object 12 entrance 0 non-null object 13 stair 0 non-null object 14 floor 50 non-null object 15 door 12 non-null object 16 district_code 435 non-null object 17 district_name 435 non-null object 18 neighbourhood_code 435 non-null object 19 neighbourhood_name 435 non-null object 20 longitude 0 non-null object 21 latitude 0 non-null object 22 n_places 435 non-null float64 23 street_number_2_hotel 10 non-null object 24 address_hotel 435 non-null object 25 longitude_hotel 435 non-null object 26 latitude_hotel 435 non-null object 27 category_hotel 435 non-null object 28 district_name_hotel 435 non-null object 29 street_number_1_hotel 433 non-null object 30 district_code_hotel 435 non-null object 31 neighbourhood_code_hotel 435 non-null object 32 name_hotel 435 non-null object 33 neighbourhood_name_hotel 435 non-null object 34 rtc_hotel 435 non-null object dtypes: float64(1), object(34) memory usage: 122.3+ KB
In [850]:
df_n_places_m2_1 = df_n_places_m2_0.copy()
REMAINING RECORDS¶
RTC¶
In [851]:
#RECORDS NOT INCLUDED df_hotel_remaining = df_hotel_m2_0[(~df_hotel_m2_0['rtc_hotel'].isin(df_n_places_m2_1['rtc_hotel'])) | (df_hotel_m2_0['rtc_hotel'].isnull())] df_hotel_remaining
Out[851]:
street_number_2_hotel | address_hotel | longitude_hotel | latitude_hotel | category_hotel | district_name_hotel | street_number_1_hotel | district_code_hotel | neighbourhood_code_hotel | name_hotel | neighbourhood_name_hotel | rtc_hotel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
41 | 32 | Avinguda del Tibidabo | 2.1347889441149417 | 41.41358309122107 | Hotels 5 estr. | Sarrià-Sant Gervasi | 32 | 05 | 25 | Hotel Boutique Mirlo Barcelona | Sant Gervasi – la Bonanova | HB-004948 |
142 | None | C Nou de la Rambla | 2.1664787325847077 | 41.371682397631936 | Hotels 3 estr. | Sants-Montjuïc | 174 | 03 | 11 | Hotel Brumell | el Poble-sec | HB-004690 |
147 | 84 | Ronda de Sant Antoni | 2.163974947423502 | 41.38376584459594 | Hotels 4 estr. | Ciutat Vella | 84 | 01 | 01 | Hotel Antiga Casa Buenavista | el Raval | None |
220 | None | C Hospital | 2.169014849771575 | 41.38001981371473 | Hotels 3 estr. | Ciutat Vella | 101 | 01 | 01 | Hotel Raval House | el Raval | HB-001213 |
244 | None | Av Diagonal | 2.1088538068319096 | 41.38142529046899 | Hotels 5 estr. | Les Corts | 661 | 04 | 20 | Hotel Rey Juan Carlos I | la Maternitat i Sant Ramon | HB-003961 *Temporalment tancat |
379 | 13 | Carrer de Casp | 2.1699973675495325 | 41.38898611306411 | Hotels 5 estr. | Eixample | 1 | 02 | 07 | Hotel ME Barcelona | la Dreta de l’Eixample | HB-004955 |
In [852]:
#VERIFY RECORDS BEFORE COLUMN DROP df_n_places_m2_1[(df_n_places_m2_1['rtc']!=df_n_places_m2_1['rtc_hotel']) & (df_n_places_m2_1['rtc_hotel'].notnull())]
Out[852]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | longitude_hotel | latitude_hotel | category_hotel | district_name_hotel | street_number_1_hotel | district_code_hotel | neighbourhood_code_hotel | name_hotel | neighbourhood_name_hotel | rtc_hotel |
---|
0 rows × 35 columns
In [853]:
df_n_places_m2_1.drop(columns='rtc_hotel', inplace=True)
COMPARISON¶
DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME¶
In [854]:
df_n_places_m2_2 = df_n_places_m2_1.copy()
In [855]:
df_n_places_m2_2[['rtc','address','neighbourhood_code','neighbourhood_name','neighbourhood_code_hotel','neighbourhood_name_hotel']][(df_n_places_m2_2['neighbourhood_name']!= df_n_places_m2_2['neighbourhood_name_hotel']) & (df_n_places_m2_2['neighbourhood_name_hotel'].notnull())]
Out[855]:
rtc | address | neighbourhood_code | neighbourhood_name | neighbourhood_code_hotel | neighbourhood_name_hotel | |
---|---|---|---|---|---|---|
389 | HB-002726 | PLA PALAU 19 | 02 | el Barri Gòtic | 04 | Sant Pere, Santa Caterina i la Ribera |
MANUAL VERIFICATION FOR RECORDS ABOVE:
- THE RIGHT NEIGHBOURHOOD NAME IS IN neighbourhood_name_hotel FOR “9469”
SOURCE:
https://ajuntament.barcelona.cat/estadistica/catala/Territori/div84/convertidors/barris73.htm
In [856]:
#FILL IN MISSING VALUES df_n_places_m2_2.loc[df_n_places_m2_2['rtc']=='HB-002726','neighbourhood_code'] = "04" df_n_places_m2_2.loc[df_n_places_m2_2['rtc']=='HB-002726','neighbourhood_name'] = "Sant Pere, Santa Caterina i la Ribera" df_n_places_m2_2[['rtc','address','neighbourhood_code','neighbourhood_name','neighbourhood_code_hotel','neighbourhood_name_hotel']][df_n_places_m2_2['rtc']=='HB-002726']
Out[856]:
rtc | address | neighbourhood_code | neighbourhood_name | neighbourhood_code_hotel | neighbourhood_name_hotel | |
---|---|---|---|---|---|---|
389 | HB-002726 | PLA PALAU 19 | 04 | Sant Pere, Santa Caterina i la Ribera | 04 | Sant Pere, Santa Caterina i la Ribera |
In [857]:
df_n_places_m2_2.drop(columns=['neighbourhood_code_hotel','neighbourhood_name_hotel'], inplace=True)
In [858]:
df_n_places_m2_3 = df_n_places_m2_2.copy()
In [859]:
df_n_places_m2_3[['address','district_code','district_name','district_code_hotel','district_name_hotel']][(df_n_places_m2_3['district_name']!= df_n_places_m2_3['district_name_hotel']) & (df_n_places_m2_3['district_name_hotel'].notnull())]
Out[859]:
address | district_code | district_name | district_code_hotel | district_name_hotel |
---|
In [860]:
df_n_places_m2_3.drop(columns=['district_code_hotel','district_name_hotel'], inplace=True)
COLUMN FILLING¶
In [861]:
df_n_places_m2_4 = df_n_places_m2_3.copy()
In [862]:
df_n_places_m2_4.columns
Out[862]:
Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type', 'street', 'street_number_1', 'street_letter_1', 'street_number_2', 'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door', 'district_code', 'district_name', 'neighbourhood_code', 'neighbourhood_name', 'longitude', 'latitude', 'n_places', 'street_number_2_hotel', 'address_hotel', 'longitude_hotel', 'latitude_hotel', 'category_hotel', 'street_number_1_hotel', 'name_hotel'], dtype='object')
In [863]:
#longitude df_n_places_m2_4.loc[df_n_places_m2_4['longitude_hotel'].notnull(),'longitude'] = df_n_places_m2_4['longitude_hotel']
In [864]:
#latitude df_n_places_m2_4.loc[df_n_places_m2_4['latitude_hotel'].notnull(),'latitude'] = df_n_places_m2_4['latitude_hotel']
In [865]:
#name df_n_places_m2_4.loc[df_n_places_m2_4['name_hotel'].notnull(),'name'] = df_n_places_m2_4['name_hotel']
In [866]:
#DROP df_n_places_m2_4.drop(columns=['street_number_2_hotel', 'address_hotel', 'longitude_hotel', 'latitude_hotel', 'category_hotel', 'street_number_1_hotel', 'name_hotel'], inplace=True)
In [867]:
df_n_places_m2_4.columns
Out[867]:
Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type', 'street', 'street_number_1', 'street_letter_1', 'street_number_2', 'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door', 'district_code', 'district_name', 'neighbourhood_code', 'neighbourhood_name', 'longitude', 'latitude', 'n_places'], dtype='object')
In [868]:
df_n_places_m2_4.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 435 entries, 0 to 434 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 435 non-null object 1 rtc 435 non-null object 2 name 435 non-null object 3 category 435 non-null object 4 address 435 non-null object 5 street_type 435 non-null object 6 street 435 non-null object 7 street_number_1 435 non-null object 8 street_letter_1 1 non-null object 9 street_number_2 88 non-null object 10 street_letter_2 0 non-null object 11 block 0 non-null object 12 entrance 0 non-null object 13 stair 0 non-null object 14 floor 50 non-null object 15 door 12 non-null object 16 district_code 435 non-null object 17 district_name 435 non-null object 18 neighbourhood_code 435 non-null object 19 neighbourhood_name 435 non-null object 20 longitude 435 non-null object 21 latitude 435 non-null object 22 n_places 435 non-null float64 dtypes: float64(1), object(22) memory usage: 81.6+ KB
df_hotel_n_places_coordinates¶
In [869]:
df_hotel_n_places_coordinates = df_n_places_m2_4.copy()
MERGE 3 : df_n_places + df_hut + df_hotel + df_hostel : _m3¶
In [870]:
df_n_places_m3_0 = df_n_places.copy() df_n_places_m3_0.head(1)
Out[870]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01-90-A-128 | HB-003893 | None | Hotel 3 estrelles | GRAVINA 5 7 | nan | GRAVINA | 5 | None | 7 | … | None | None | None | 01 | Ciutat Vella | 01 | el Raval | None | None | 86.0 |
1 rows × 23 columns
In [871]:
df_hostel_m3_0 = df_hostel.copy() df_hostel_m3_0.head(1)
Out[871]:
latitude_hostel | longitude_hostel | category_hostel | district_code_hostel | district_name_hostel | neighbourhood_code_hostel | neighbourhood_name_hostel | address_hostel | street_number_1_hostel | street_number_2_hostel | rtc_hostel | name_hostel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41.3964776648101 | 2.175311353516649 | Pensions, hostals | 02 | Eixample | 07 | la Dreta de l’Eixample | C Diputació | 346 | None | HB-004497 | Hostal Hostalin Barcelona Diputació |
In [872]:
df_hostel_m3_0.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 243 entries, 0 to 243 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 latitude_hostel 243 non-null object 1 longitude_hostel 243 non-null object 2 category_hostel 243 non-null object 3 district_code_hostel 243 non-null object 4 district_name_hostel 243 non-null object 5 neighbourhood_code_hostel 243 non-null object 6 neighbourhood_name_hostel 243 non-null object 7 address_hostel 243 non-null object 8 street_number_1_hostel 243 non-null object 9 street_number_2_hostel 243 non-null object 10 rtc_hostel 241 non-null object 11 name_hostel 243 non-null object dtypes: object(12) memory usage: 24.7+ KB
In [873]:
df_n_places_m3_0 = df_n_places_m3_0.merge(df_hostel_m3_0, how='inner', left_on=['rtc'], right_on=['rtc_hostel'])
In [874]:
df_n_places_m3_0.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 233 entries, 0 to 232 Data columns (total 35 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 233 non-null object 1 rtc 233 non-null object 2 name 0 non-null object 3 category 233 non-null object 4 address 233 non-null object 5 street_type 233 non-null object 6 street 233 non-null object 7 street_number_1 233 non-null object 8 street_letter_1 3 non-null object 9 street_number_2 23 non-null object 10 street_letter_2 1 non-null object 11 block 3 non-null object 12 entrance 0 non-null object 13 stair 0 non-null object 14 floor 155 non-null object 15 door 99 non-null object 16 district_code 233 non-null object 17 district_name 233 non-null object 18 neighbourhood_code 233 non-null object 19 neighbourhood_name 233 non-null object 20 longitude 0 non-null object 21 latitude 0 non-null object 22 n_places 231 non-null float64 23 latitude_hostel 233 non-null object 24 longitude_hostel 233 non-null object 25 category_hostel 233 non-null object 26 district_code_hostel 233 non-null object 27 district_name_hostel 233 non-null object 28 neighbourhood_code_hostel 233 non-null object 29 neighbourhood_name_hostel 233 non-null object 30 address_hostel 233 non-null object 31 street_number_1_hostel 233 non-null object 32 street_number_2_hostel 233 non-null object 33 rtc_hostel 233 non-null object 34 name_hostel 233 non-null object dtypes: float64(1), object(34) memory usage: 65.5+ KB
In [875]:
df_n_places_m3_1 = df_n_places_m3_0.copy()
REMAINING RECORDS¶
RTC¶
In [876]:
#RECORDS NOT INCLUDED df_hostel_remaining = df_hostel_m3_0[(~df_hostel_m3_0['rtc_hostel'].isin(df_n_places_m3_1['rtc'])) | (df_hostel_m3_0['rtc_hostel'].isnull())] df_hostel_remaining
Out[876]:
latitude_hostel | longitude_hostel | category_hostel | district_code_hostel | district_name_hostel | neighbourhood_code_hostel | neighbourhood_name_hostel | address_hostel | street_number_1_hostel | street_number_2_hostel | rtc_hostel | name_hostel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 41.3942267039029 | 2.151298840171046 | Pensions, hostals | 02 | Eixample | 08 | l’Antiga Esquerra de l’Eixample | Av Diagonal | 433 | None | HB-004721 | Hostal Principal B&BCN |
43 | 41.409442466648215 | 2.1831572948146354 | Pensions, hostals | 10 | Sant Martí | 64 | el Camp de l’Arpa del Clot | C Mallorca | 537 | None | HB-003943 | Pensió Gimón |
45 | 41.397491747236074 | 2.1655695484910837 | Pensions, hostals | 02 | Eixample | 07 | la Dreta de l’Eixample | C Bruc | 150 | None | HB-004701 | Ally’s Guest House III |
62 | 41.4019276042057 | 2.1573692953635994 | Pensions, hostals | 06 | Gràcia | 31 | la Vila de Gràcia | C Torrent de l’Olla | 95 | None | HB-002608 | Pensió Alberdi |
128 | 41.38463275974521 | 2.1774675697341084 | Pensions, hostals | 01 | Ciutat Vella | 02 | el Barri Gòtic | Pl Ramon Berenguer el Gran | 2 | None | HB-004753 | Hostal The Moods Catedral |
133 | 41.42825999570545 | 2.1799199184547917 | Pensions, hostals | 08 | Nou Barris | 44 | Vilapicina i la Torre Llobeta | C Malgrat | 40 | None | HB-004741 | Hostal Lm Rooms Bcn |
171 | 41.388474283745325 | 2.1602212330703807 | Pensions, hostals | 02 | Eixample | 08 | l’Antiga Esquerra de l’Eixample | C Aragó | 222 | None | HB-004565 | Tripledos |
196 | 41.38524926219911 | 2.1699190393632026 | Pensions, hostals | 01 | Ciutat Vella | 01 | el Raval | C Rambla | 133 | None | HB-001137 | Pensió Barcelona City Ramblas |
220 | 41.37418832376817 | 2.165555511750057 | Pensions, hostals | 03 | Sants-Montjuïc | 11 | el Poble-sec | C Poeta Cabanyes | 18 | None | HB-004766 | Hostal Oliveta |
221 | 41.393888004415714 | 2.171481971236943 | Pensions, hostals | 02 | Eixample | 07 | la Dreta de l’Eixample | C Diputació | 327 | None | None | Hostal Bed & Break |
225 | 41.43256371802495 | 2.1585842656810272 | Pensions, hostals | 07 | Horta-Guinardó | 43 | Horta | C Chapà | 83 | 85 | HB-004758 | Hostal Barcelona Nice & Cozy |
227 | 41.37919578389479 | 2.174445287874625 | Pensions, hostals | 01 | Ciutat Vella | 01 | el Raval | C Nou de la Rambla | 1 | None | None | Hostal Mimi Las Ramblas |
In [877]:
#VERIFY RECORDS BEFORE COLUMN DROP df_n_places_m3_1[(df_n_places_m3_1['rtc']!=df_n_places_m3_1['rtc_hostel']) & (df_n_places_m3_1['rtc_hostel'].notnull())]
Out[877]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | category_hostel | district_code_hostel | district_name_hostel | neighbourhood_code_hostel | neighbourhood_name_hostel | address_hostel | street_number_1_hostel | street_number_2_hostel | rtc_hostel | name_hostel |
---|
0 rows × 35 columns
In [878]:
df_n_places_m3_1.drop(columns='rtc_hostel', inplace=True)
COMPARISON¶
DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME¶
In [879]:
df_n_places_m3_2 = df_n_places_m3_1.copy()
DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME
SOURCE:
https://ajuntament.barcelona.cat/estadistica/catala/Territori/div84/convertidors/barris73.htm
In [880]:
df_n_places_m3_2[['rtc','address','neighbourhood_code','neighbourhood_name','neighbourhood_code_hostel','neighbourhood_name_hostel']][(df_n_places_m3_2['neighbourhood_name']!= df_n_places_m3_2['neighbourhood_name_hostel']) & (df_n_places_m3_2['neighbourhood_name_hostel'].notnull())]
Out[880]:
rtc | address | neighbourhood_code | neighbourhood_name | neighbourhood_code_hostel | neighbourhood_name_hostel |
---|
In [881]:
df_n_places_m3_2[['rtc','address','district_code','district_name','district_code_hostel','district_name_hostel']][(df_n_places_m3_2['district_name']!= df_n_places_m3_2['district_name_hostel']) & (df_n_places_m3_2['district_name_hostel'].notnull())]
Out[881]:
rtc | address | district_code | district_name | district_code_hostel | district_name_hostel |
---|
In [882]:
df_n_places_m3_2.drop(columns=['neighbourhood_code_hostel','neighbourhood_name_hostel','district_code_hostel','district_name_hostel'], inplace=True)
In [883]:
df_n_places_m3_3 = df_n_places_m3_2.copy()
COLUMN FILLING¶
In [884]:
df_n_places_m3_4 = df_n_places_m3_3.copy()
In [885]:
df_n_places_m3_4.columns
Out[885]:
Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type', 'street', 'street_number_1', 'street_letter_1', 'street_number_2', 'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door', 'district_code', 'district_name', 'neighbourhood_code', 'neighbourhood_name', 'longitude', 'latitude', 'n_places', 'latitude_hostel', 'longitude_hostel', 'category_hostel', 'address_hostel', 'street_number_1_hostel', 'street_number_2_hostel', 'name_hostel'], dtype='object')
In [886]:
#longitude df_n_places_m3_4.loc[df_n_places_m3_4['longitude_hostel'].notnull(),'longitude'] = df_n_places_m3_4['longitude_hostel']
In [887]:
#latitude df_n_places_m3_4.loc[df_n_places_m3_4['latitude_hostel'].notnull(),'latitude'] = df_n_places_m3_4['latitude_hostel']
In [888]:
#name df_n_places_m3_4.loc[df_n_places_m3_4['name_hostel'].notnull(),'name'] = df_n_places_m3_4['name_hostel']
In [889]:
#DROP df_n_places_m3_4.drop(columns=['latitude_hostel', 'longitude_hostel', 'category_hostel', 'address_hostel', 'street_number_1_hostel', 'street_number_2_hostel', 'name_hostel'], inplace=True)
In [890]:
df_n_places_m3_4.columns
Out[890]:
Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type', 'street', 'street_number_1', 'street_letter_1', 'street_number_2', 'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door', 'district_code', 'district_name', 'neighbourhood_code', 'neighbourhood_name', 'longitude', 'latitude', 'n_places'], dtype='object')
In [891]:
df_n_places_m3_4.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 233 entries, 0 to 232 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 233 non-null object 1 rtc 233 non-null object 2 name 233 non-null object 3 category 233 non-null object 4 address 233 non-null object 5 street_type 233 non-null object 6 street 233 non-null object 7 street_number_1 233 non-null object 8 street_letter_1 3 non-null object 9 street_number_2 23 non-null object 10 street_letter_2 1 non-null object 11 block 3 non-null object 12 entrance 0 non-null object 13 stair 0 non-null object 14 floor 155 non-null object 15 door 99 non-null object 16 district_code 233 non-null object 17 district_name 233 non-null object 18 neighbourhood_code 233 non-null object 19 neighbourhood_name 233 non-null object 20 longitude 233 non-null object 21 latitude 233 non-null object 22 n_places 231 non-null float64 dtypes: float64(1), object(22) memory usage: 43.7+ KB
df_hostel_n_places_coordinates¶
In [892]:
df_hostel_n_places_coordinates = df_n_places_m3_4.copy()
MERGE 4 : df_n_places + df_hut + df_hotel + df_hostel + df_touristapartments : _m4¶
In [893]:
df_n_places_m4_0 = df_n_places.copy() df_n_places_m4_0.head(1)
Out[893]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01-90-A-128 | HB-003893 | None | Hotel 3 estrelles | GRAVINA 5 7 | nan | GRAVINA | 5 | None | 7 | … | None | None | None | 01 | Ciutat Vella | 01 | el Raval | None | None | 86.0 |
1 rows × 23 columns
In [894]:
df_at_m4_0 = df_touristapartment.copy() df_at_m4_0.head(1)
Out[894]:
latitude_at | longitude_at | category_at | district_code_at | district_name_at | neighbourhood_code_at | neighbourhood_name_at | address_name_at | street_number_1_at | street_number_2_at | name_at | rtc_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41.38096727220521 | 2.1532132712982266 | Apartaments turÃstics | 02 | Eixample | 09 | la Nova Esquerra de l’Eixample | C Calà bria | 129 | None | Apartament TurÃstic Atenea Calabria | ATB-000001 |
In [895]:
df_n_places_m4_0 = df_n_places_m4_0.merge(df_at_m4_0, how='inner', left_on=['rtc'], right_on=['rtc_at'])
In [896]:
df_n_places_m4_0.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 12 entries, 0 to 11 Data columns (total 35 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 12 non-null object 1 rtc 12 non-null object 2 name 0 non-null object 3 category 12 non-null object 4 address 12 non-null object 5 street_type 12 non-null object 6 street 12 non-null object 7 street_number_1 12 non-null object 8 street_letter_1 0 non-null object 9 street_number_2 2 non-null object 10 street_letter_2 0 non-null object 11 block 0 non-null object 12 entrance 0 non-null object 13 stair 0 non-null object 14 floor 1 non-null object 15 door 0 non-null object 16 district_code 12 non-null object 17 district_name 12 non-null object 18 neighbourhood_code 12 non-null object 19 neighbourhood_name 12 non-null object 20 longitude 0 non-null object 21 latitude 0 non-null object 22 n_places 12 non-null float64 23 latitude_at 12 non-null object 24 longitude_at 12 non-null object 25 category_at 12 non-null object 26 district_code_at 12 non-null object 27 district_name_at 12 non-null object 28 neighbourhood_code_at 12 non-null object 29 neighbourhood_name_at 12 non-null object 30 address_name_at 12 non-null object 31 street_number_1_at 12 non-null object 32 street_number_2_at 1 non-null object 33 name_at 12 non-null object 34 rtc_at 12 non-null object dtypes: float64(1), object(34) memory usage: 3.4+ KB
In [897]:
df_n_places_m4_1 = df_n_places_m4_0.copy()
REMAINING RECORDS¶
RTC¶
In [898]:
#RECORDS NOT INCLUDED df_touristapartment_remaining = df_at_m4_0[(~df_at_m4_0['rtc_at'].isin(df_n_places_m4_1['rtc_at'])) | (df_at_m4_0['rtc_at'].isnull())] df_touristapartment_remaining
Out[898]:
latitude_at | longitude_at | category_at | district_code_at | district_name_at | neighbourhood_code_at | neighbourhood_name_at | address_name_at | street_number_1_at | street_number_2_at | name_at | rtc_at |
---|
In [899]:
#VERIFY RECORDS BEFORE COLUMN DROP df_n_places_m4_1[(df_n_places_m4_1['rtc']!=df_n_places_m4_1['rtc_at']) & (df_n_places_m4_1['rtc_at'].notnull())]
Out[899]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | category_at | district_code_at | district_name_at | neighbourhood_code_at | neighbourhood_name_at | address_name_at | street_number_1_at | street_number_2_at | name_at | rtc_at |
---|
0 rows × 35 columns
In [900]:
df_n_places_m4_1.drop(columns='rtc_at', inplace=True)
COMPARISON¶
DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME¶
In [901]:
df_n_places_m4_2 = df_n_places_m4_1.copy()
DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME
SOURCE:
https://ajuntament.barcelona.cat/estadistica/catala/Territori/div84/convertidors/barris73.htm
In [902]:
df_n_places_m4_2[['address','neighbourhood_code','neighbourhood_name','neighbourhood_code_at','neighbourhood_name_at']][(df_n_places_m4_2['neighbourhood_name']!= df_n_places_m4_2['neighbourhood_name_at']) & (df_n_places_m4_2['neighbourhood_name_at'].notnull())]
Out[902]:
address | neighbourhood_code | neighbourhood_name | neighbourhood_code_at | neighbourhood_name_at |
---|
In [903]:
df_n_places_m4_2[['address','district_code','district_name','district_code_at','district_name_at']][(df_n_places_m4_2['district_name']!= df_n_places_m4_2['district_name_at']) & (df_n_places_m4_2['district_name_at'].notnull())]
Out[903]:
address | district_code | district_name | district_code_at | district_name_at |
---|
In [904]:
df_n_places_m4_2.drop(columns=['district_code_at','district_name_at','neighbourhood_code_at','neighbourhood_name_at'], inplace=True)
In [905]:
df_n_places_m4_3 = df_n_places_m4_2.copy()
COLUMN FILLING¶
In [906]:
df_n_places_m4_4 = df_n_places_m4_3.copy()
In [907]:
df_n_places_m4_4.columns
Out[907]:
Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type', 'street', 'street_number_1', 'street_letter_1', 'street_number_2', 'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door', 'district_code', 'district_name', 'neighbourhood_code', 'neighbourhood_name', 'longitude', 'latitude', 'n_places', 'latitude_at', 'longitude_at', 'category_at', 'address_name_at', 'street_number_1_at', 'street_number_2_at', 'name_at'], dtype='object')
In [908]:
#longitude df_n_places_m4_4.loc[df_n_places_m4_4['longitude_at'].notnull(),'longitude'] = df_n_places_m4_4['longitude_at']
In [909]:
#latitude df_n_places_m4_4.loc[df_n_places_m4_4['latitude_at'].notnull(),'latitude'] = df_n_places_m4_4['latitude_at']
In [910]:
#name df_n_places_m4_4.loc[df_n_places_m4_4['name_at'].notnull(),'name'] = df_n_places_m4_4['name_at']
In [911]:
#DROP df_n_places_m4_4.drop(columns=['latitude_at', 'longitude_at', 'category_at', 'address_name_at', 'street_number_1_at', 'street_number_2_at', 'name_at'], inplace=True)
In [912]:
df_n_places_m4_4.columns
Out[912]:
Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type', 'street', 'street_number_1', 'street_letter_1', 'street_number_2', 'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door', 'district_code', 'district_name', 'neighbourhood_code', 'neighbourhood_name', 'longitude', 'latitude', 'n_places'], dtype='object')
In [913]:
df_n_places_m4_4.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 12 entries, 0 to 11 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 12 non-null object 1 rtc 12 non-null object 2 name 12 non-null object 3 category 12 non-null object 4 address 12 non-null object 5 street_type 12 non-null object 6 street 12 non-null object 7 street_number_1 12 non-null object 8 street_letter_1 0 non-null object 9 street_number_2 2 non-null object 10 street_letter_2 0 non-null object 11 block 0 non-null object 12 entrance 0 non-null object 13 stair 0 non-null object 14 floor 1 non-null object 15 door 0 non-null object 16 district_code 12 non-null object 17 district_name 12 non-null object 18 neighbourhood_code 12 non-null object 19 neighbourhood_name 12 non-null object 20 longitude 12 non-null object 21 latitude 12 non-null object 22 n_places 12 non-null float64 dtypes: float64(1), object(22) memory usage: 2.2+ KB
df_touristapartment_n_places_coordinates¶
In [914]:
df_touristapartment_n_places_coordinates = df_n_places_m4_4.copy()
MERGE 5 : df_n_places + df_hut + df_hotel + df_hostel + df_touristapartments + df_albergs : _m5¶
In [915]:
df_n_places_m5_0 = df_n_places.copy() df_n_places_m5_0.head(1)
Out[915]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01-90-A-128 | HB-003893 | None | Hotel 3 estrelles | GRAVINA 5 7 | nan | GRAVINA | 5 | None | 7 | … | None | None | None | 01 | Ciutat Vella | 01 | el Raval | None | None | 86.0 |
1 rows × 23 columns
In [916]:
df_al_m5_0 = df_alberg.copy() df_al_m5_0.head(1)
Out[916]:
latitude_al | longitude_al | category_al | district_code_al | district_name_al | neighbourhood_code_al | neighbourhood_name_al | address_name_al | street_number_1_al | street_number_2_al | name_al | rtc_al | rtc_al_modified | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 41.39116166395002 | 2.184353207057929 | Albergs | 10 | Sant Martí | 66 | el Parc i la Llacuna del Poblenou | C Buenaventura Muñoz | 16 | None | Alberg Arc House | AJ000645 | ALB-645 |
In [917]:
df_n_places_m5_0 = df_n_places_m5_0.merge(df_al_m5_0, how='inner', left_on=['rtc'], right_on=['rtc_al_modified'])
In [918]:
df_n_places_m5_0.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 114 entries, 0 to 113 Data columns (total 36 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 114 non-null object 1 rtc 114 non-null object 2 name 0 non-null object 3 category 114 non-null object 4 address 114 non-null object 5 street_type 114 non-null object 6 street 114 non-null object 7 street_number_1 114 non-null object 8 street_letter_1 3 non-null object 9 street_number_2 20 non-null object 10 street_letter_2 0 non-null object 11 block 0 non-null object 12 entrance 0 non-null object 13 stair 1 non-null object 14 floor 70 non-null object 15 door 33 non-null object 16 district_code 114 non-null object 17 district_name 114 non-null object 18 neighbourhood_code 114 non-null object 19 neighbourhood_name 114 non-null object 20 longitude 0 non-null object 21 latitude 0 non-null object 22 n_places 114 non-null float64 23 latitude_al 114 non-null object 24 longitude_al 114 non-null object 25 category_al 114 non-null object 26 district_code_al 114 non-null object 27 district_name_al 114 non-null object 28 neighbourhood_code_al 114 non-null object 29 neighbourhood_name_al 114 non-null object 30 address_name_al 114 non-null object 31 street_number_1_al 114 non-null object 32 street_number_2_al 15 non-null object 33 name_al 114 non-null object 34 rtc_al 114 non-null object 35 rtc_al_modified 114 non-null object dtypes: float64(1), object(35) memory usage: 33.0+ KB
In [919]:
df_n_places_m5_1 = df_n_places_m5_0.copy()
REMAINING RECORDS¶
RTC¶
In [920]:
#RECORDS NOT INCLUDED df_albergs_remaining = df_al_m5_0[(~df_al_m5_0['rtc_al_modified'].isin(df_n_places_m5_1['rtc_al_modified'])) | (df_al_m5_0['rtc_al_modified'].isnull())] df_albergs_remaining
Out[920]:
latitude_al | longitude_al | category_al | district_code_al | district_name_al | neighbourhood_code_al | neighbourhood_name_al | address_name_al | street_number_1_al | street_number_2_al | name_al | rtc_al | rtc_al_modified | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
42 | 41.39140098320638 | 2.1578428777063454 | Albergs | 02 | Eixample | 08 | l’Antiga Esquerra de l’Eixample | C Enric Granados | 52 | None | After Hostel | AJ000543 | ALB-543 |
60 | 41.39979617959405 | 2.118829605855649 | Albergs | 05 | Sarrià-Sant Gervasi | 23 | Sarrià | C Duquessa d’Orleans | 56 | None | Alberg Studio Hostel | AJ000015 | ALB-015 |
97 | 41.3805185765298 | 2.1707195328025053 | Albergs | 01 | Ciutat Vella | 01 | el Raval | C Hospital | 63 | None | Alberg Center Rambles | AJ000398 | ALB-398 |
101 | 41.38863242056985 | 2.144502668709701 | Albergs | 02 | Eixample | 09 | la Nova Esquerra de l’Eixample | C Londres | 20 | None | Free Hostels Barcelona | AJ000614 | ALB-614 |
116 | 41.391545708361036 | 2.1615865918643533 | Albergs | 02 | Eixample | 07 | la Dreta de l’Eixample | C València | 233 | None | Tierra Azul Hostel | AJ000557 | ALB-557 |
141 | 41.416189496832175 | 2.1468663071393794 | Albergs | 06 | Gràcia | 28 | Vallcarca i els Penitents | Pg Mare de Déu del Coll | 41 | 51 | Casa Marsans | Alberg Mare de Déu de Montserrat – AJ000084 | Alberg Mare de Déu de Montserrat – ALB-084 |
177 | 41.38145164237975 | 2.175475468533943 | Albergs | 01 | Ciutat Vella | 02 | el Barri Gòtic | C Ferran | 31 | None | Alberg Fernando | AJ000419 | ALB-419 |
183 | 41.38811806097095 | 2.13384878071623 | Albergs | 04 | Les Corts | 19 | les Corts | C Numà ncia | 149 | 151 | Alberg Pere Tarrés | AJ000070 | ALB-070 |
In [921]:
#VERIFY RECORDS BEFORE COLUMN DROP df_n_places_m5_1[(df_n_places_m5_1['rtc']!=df_n_places_m5_1['rtc_al_modified']) & (df_n_places_m5_1['rtc_al_modified'].notnull())]
Out[921]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | district_code_al | district_name_al | neighbourhood_code_al | neighbourhood_name_al | address_name_al | street_number_1_al | street_number_2_al | name_al | rtc_al | rtc_al_modified |
---|
0 rows × 36 columns
In [922]:
df_n_places_m5_1.drop(columns=['rtc_al','rtc_al_modified'], inplace=True)
COMPARISON¶
DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME¶
In [923]:
df_n_places_m5_2 = df_n_places_m5_1.copy()
DISTRICT_CODE, DISTRICT_NAME, NEIGHBOURHOOD_CODE, NEIGHBOURHOOD_NAME
SOURCE:
https://ajuntament.barcelona.cat/estadistica/catala/Territori/div84/convertidors/barris73.htm
In [924]:
df_n_places_m5_2[['address','neighbourhood_code','neighbourhood_name','neighbourhood_code_al','neighbourhood_name_al']][(df_n_places_m5_2['neighbourhood_name']!= df_n_places_m5_2['neighbourhood_name_al']) & (df_n_places_m5_2['neighbourhood_name_al'].notnull())]
Out[924]:
address | neighbourhood_code | neighbourhood_name | neighbourhood_code_al | neighbourhood_name_al |
---|
In [925]:
df_n_places_m5_2[['address','district_code','district_name','district_code_al','district_name_al']][(df_n_places_m5_2['district_name']!= df_n_places_m5_2['district_name_al']) & (df_n_places_m5_2['district_name_al'].notnull())]
Out[925]:
address | district_code | district_name | district_code_al | district_name_al |
---|
In [926]:
df_n_places_m5_2.drop(columns=['district_code_al','district_name_al','neighbourhood_code_al','neighbourhood_name_al'], inplace=True)
In [927]:
df_n_places_m5_3 = df_n_places_m5_2.copy()
COLUMN FILLING¶
In [928]:
df_n_places_m5_3.columns
Out[928]:
Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type', 'street', 'street_number_1', 'street_letter_1', 'street_number_2', 'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door', 'district_code', 'district_name', 'neighbourhood_code', 'neighbourhood_name', 'longitude', 'latitude', 'n_places', 'latitude_al', 'longitude_al', 'category_al', 'address_name_al', 'street_number_1_al', 'street_number_2_al', 'name_al'], dtype='object')
In [929]:
#longitude df_n_places_m5_3.loc[df_n_places_m5_3['longitude_al'].notnull(),'longitude'] = df_n_places_m5_3['longitude_al']
In [930]:
#latitude df_n_places_m5_3.loc[df_n_places_m5_3['latitude_al'].notnull(),'latitude'] = df_n_places_m5_3['latitude_al']
In [931]:
#name df_n_places_m5_3.loc[df_n_places_m5_3['name_al'].notnull(),'name'] = df_n_places_m5_3['name_al']
In [932]:
#DROP df_n_places_m5_3.drop(columns=['latitude_al', 'longitude_al', 'category_al', 'address_name_al', 'street_number_1_al', 'street_number_2_al', 'name_al'], inplace=True)
In [933]:
df_n_places_m5_3.columns
Out[933]:
Index(['n_practice', 'rtc', 'name', 'category', 'address', 'street_type', 'street', 'street_number_1', 'street_letter_1', 'street_number_2', 'street_letter_2', 'block', 'entrance', 'stair', 'floor', 'door', 'district_code', 'district_name', 'neighbourhood_code', 'neighbourhood_name', 'longitude', 'latitude', 'n_places'], dtype='object')
In [934]:
df_n_places_m5_3.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 114 entries, 0 to 113 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 114 non-null object 1 rtc 114 non-null object 2 name 114 non-null object 3 category 114 non-null object 4 address 114 non-null object 5 street_type 114 non-null object 6 street 114 non-null object 7 street_number_1 114 non-null object 8 street_letter_1 3 non-null object 9 street_number_2 20 non-null object 10 street_letter_2 0 non-null object 11 block 0 non-null object 12 entrance 0 non-null object 13 stair 1 non-null object 14 floor 70 non-null object 15 door 33 non-null object 16 district_code 114 non-null object 17 district_name 114 non-null object 18 neighbourhood_code 114 non-null object 19 neighbourhood_name 114 non-null object 20 longitude 114 non-null object 21 latitude 114 non-null object 22 n_places 114 non-null float64 dtypes: float64(1), object(22) memory usage: 21.4+ KB
df_albergs_n_places_coordinates¶
In [935]:
df_albergs_n_places_coordinates = df_n_places_m5_3.copy()
FINAL DATAFRAME: TOURIST ESTABLISHMENTS COORDINATES AND N_PLACES¶
CONCATENATE DATAFRAMES¶
In [936]:
df_hut_n_places_coordinates.shape[0]
Out[936]:
9409
In [937]:
df_hotel_n_places_coordinates.shape[0]
Out[937]:
435
In [938]:
df_hostel_n_places_coordinates.shape[0]
Out[938]:
233
In [939]:
df_touristapartment_n_places_coordinates.shape[0]
Out[939]:
12
In [940]:
df_albergs_n_places_coordinates.shape[0]
Out[940]:
114
In [941]:
dataframe_list_concat = [df_hut_n_places_coordinates, df_hotel_n_places_coordinates, df_hostel_n_places_coordinates, df_touristapartment_n_places_coordinates, df_albergs_n_places_coordinates] df_final_cleaning = pd.DataFrame() for i in dataframe_list_concat: df_final_cleaning = pd.concat((df_final_cleaning,i), ignore_index=True)
In [942]:
df_final_cleaning1 = df_final_cleaning.copy() df_final_cleaning1.shape[0]
Out[942]:
10203
DUPLICATES¶
In [943]:
df_final_cleaning1.duplicated(keep=False).value_counts()
Out[943]:
False 10203 dtype: int64
In [944]:
#CHECK ON ID COLUMNS df_final_cleaning1[df_final_cleaning1.duplicated(subset=['n_practice','rtc'], keep=False)].sort_values('name')
Out[944]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places |
---|
0 rows × 23 columns
In [945]:
#CHECK ON SINGLE ID COLUMN df_final_cleaning1[df_final_cleaning1.duplicated(subset=['n_practice'], keep=False)].sort_values('name')
Out[945]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10123 | 06-2010-0423 | ALB-556 | Alberg Generator Barcelona | Albergs | C CORSEGA 373 375 | C | CORSEGA | 373 | None | 375 | … | None | None | None | 06 | Gràcia | 31 | la Vila de Gràcia | 2.162634129795992 | 41.39918207998624 | 646.0 |
10187 | 06-2017-0212 | ALB-565 | Casa Gracia Barcelona Hostel | Albergs | PG GRACIA 116 | PG | GRACIA | 116 | None | None | … | None | None | None | 06 | Gràcia | 31 | la Vila de Gràcia | 2.159127254347306 | 41.397281369714406 | 446.0 |
10054 | 06-2017-0212 | HB-004682 | Hostal Casa Grà cia | Hotel 1 estrella | PG GRACIA 116 | PG | GRACIA | 116 | None | None | … | None | None | None | 06 | Gràcia | 31 | la Vila de Gràcia | 2.159127254347306 | 41.397281369714406 | 23.0 |
9921 | 06-2010-0423 | HB-004525 | Hostal Generator Barcelona | Hotel 1 estrella | C CORSEGA 373 375 | C | CORSEGA | 373 | None | 375 | … | None | None | None | 06 | Gràcia | 31 | la Vila de Gràcia | 2.1626914384056675 | 41.399226200974915 | 81.0 |
9642 | 10-2001-0694 | HB-004358 | Hotel Melia Barcelona Sky | Hotel 4 estrelles superior | C PERE IV 272 | C | PERE IV | 272 | None | None | … | None | None | None | 10 | Sant Martí | 68 | el Poblenou | 2.200634797754356 | 41.406288190178934 | 430.0 |
9742 | 02-2015-0048 | HB-004629 | Hotel TOC Hostel Barcelona | Hotel 1 estrella | G.V. CORTS CATALANES 580 BJ | G.V. | CORTS CATALANES | 580 | None | None | … | None | BJ | None | 02 | Eixample | 10 | Sant Antoni | 2.162504549676391 | 41.38478270570681 | 42.0 |
9641 | 10-2001-0694 | HB-004532 | Hotel The Level at Melia Barcelona Sky | Hotel 5 estrelles | C PERE IV 272 | C | PERE IV | 272 | None | None | … | None | None | None | 10 | Sant Martí | 68 | el Poblenou | 2.2010195187104546 | 41.406247338733934 | 86.0 |
10184 | 02-2015-0048 | ALB-625 | Toc Hostel Barcelona | Albergs | G.V. CORTS CATALANES 580 BJ | G.V. | CORTS CATALANES | 580 | None | None | … | None | BJ | None | 02 | Eixample | 10 | Sant Antoni | 2.162504549676391 | 41.38478270570681 | 216.0 |
8 rows × 23 columns
The records above appear to be related to cases where establishments have sections belonging to different categories within the same establishment – therefore, these are not dropped
In [946]:
#CHECK ON SINGLE ID COLUMN - WITH 'PENDENT' (PENDING) VALUES EXCLUDED df_final_cleaning1[(df_final_cleaning1.duplicated(subset=['rtc'], keep=False)) & (df_final_cleaning1['rtc']!= 'Pendent')].sort_values('name')
Out[946]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9933 | 03-1999-0004 | HB-000957 | Pensió Iniesta | Pensió | C FONTRODONA 1 1 1 | C | FONTRODONA | 1 | None | None | … | None | 1 | 1 | 03 | Sants-Montjuïc | 11 | el Poble-sec | 2.167831865928651 | 41.374727303453646 | NaN |
9934 | 03-1998-0472 | HB-000957 | Pensió Iniesta | Pensió | C FONTRODONA 1 2 3 | C | FONTRODONA | 1 | None | None | … | None | 2 | 3 | 03 | Sants-Montjuïc | 11 | el Poble-sec | 2.167831865928651 | 41.374727303453646 | NaN |
9935 | 03-2002-0222 | HB-000957 | Pensió Iniesta | Pensió | C FONTRODONA 1 EN 3 | C | FONTRODONA | 1 | None | None | … | None | EN | 3 | 03 | Sants-Montjuïc | 11 | el Poble-sec | 2.167831865928651 | 41.374727303453646 | 23.0 |
10090 | 05-2013-0284 | ALB-562 | Wow Hostel Barcelona | Albergs | AV DIAGONAL 578 3 | AV | DIAGONAL | 578 | None | None | … | None | 3 | None | 05 | Sarrià-Sant Gervasi | 26 | Sant Gervasi – Galvany | 2.148213200212615 | 41.39393974173925 | 19.0 |
10091 | 05-2016-0268 | ALB-562 | Wow Hostel Barcelona | Albergs | AV DIAGONAL 578 5 | AV | DIAGONAL | 578 | None | None | … | None | 5 | None | 05 | Sarrià-Sant Gervasi | 26 | Sant Gervasi – Galvany | 2.148213200212615 | 41.39393974173925 | 19.0 |
5 rows × 23 columns
The records above refer to different floors within the same establishments.
However, only in one case the variable of interest – n_places – is indicated for all floors.
In the other cases, the varible of interest – n_places – is indicated only in one of the floors.
Records are not modified.
MISSING VALUES¶
In [947]:
df_final_cleaning2 = df_final_cleaning1.copy()
In [948]:
df_final_cleaning2.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10203 entries, 0 to 10202 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 10203 non-null object 1 rtc 10203 non-null object 2 name 10203 non-null object 3 category 10203 non-null object 4 address 10203 non-null object 5 street_type 10203 non-null object 6 street 10203 non-null object 7 street_number_1 10203 non-null object 8 street_letter_1 123 non-null object 9 street_number_2 880 non-null object 10 street_letter_2 4 non-null object 11 block 13 non-null object 12 entrance 3 non-null object 13 stair 690 non-null object 14 floor 9654 non-null object 15 door 8672 non-null object 16 district_code 10203 non-null object 17 district_name 10203 non-null object 18 neighbourhood_code 10203 non-null object 19 neighbourhood_name 10203 non-null object 20 longitude 10203 non-null object 21 latitude 10203 non-null object 22 n_places 10201 non-null float64 dtypes: float64(1), object(22) memory usage: 1.8+ MB
N_PLACES¶
In [949]:
df_final_cleaning2[(df_final_cleaning2['n_places'].isnull()) | (df_final_cleaning2['n_places']=='') | (df_final_cleaning2['n_places']=='nan') | (df_final_cleaning2['n_places']==None)]
Out[949]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9933 | 03-1999-0004 | HB-000957 | Pensió Iniesta | Pensió | C FONTRODONA 1 1 1 | C | FONTRODONA | 1 | None | None | … | None | 1 | 1 | 03 | Sants-Montjuïc | 11 | el Poble-sec | 2.167831865928651 | 41.374727303453646 | NaN |
9934 | 03-1998-0472 | HB-000957 | Pensió Iniesta | Pensió | C FONTRODONA 1 2 3 | C | FONTRODONA | 1 | None | None | … | None | 2 | 3 | 03 | Sants-Montjuïc | 11 | el Poble-sec | 2.167831865928651 | 41.374727303453646 | NaN |
2 rows × 23 columns
LONGITUDE¶
In [950]:
df_final_cleaning2[(df_final_cleaning2['longitude'].isnull()) | (df_final_cleaning2['longitude']=='') | (df_final_cleaning2['longitude']=='nan') | (df_final_cleaning2['longitude']==None)]
Out[950]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places |
---|
0 rows × 23 columns
LATITUDE¶
In [951]:
df_final_cleaning2[(df_final_cleaning2['latitude'].isnull()) | (df_final_cleaning2['latitude']=='') | (df_final_cleaning2['latitude']=='nan') | (df_final_cleaning2['latitude']==None)]
Out[951]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places |
---|
0 rows × 23 columns
DISTRICT – NEIGHBOURHOOD¶
In [952]:
df_final_cleaning2[(df_final_cleaning2['neighbourhood_code'].isnull()) | (df_final_cleaning2['neighbourhood_code']=='') | (df_final_cleaning2['neighbourhood_code']=='nan') | (df_final_cleaning2['neighbourhood_code']==None)]
Out[952]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places |
---|
0 rows × 23 columns
In [953]:
df_final_cleaning2[(df_final_cleaning2['district_code'].isnull()) | (df_final_cleaning2['district_code']=='') | (df_final_cleaning2['district_code']=='nan') | (df_final_cleaning2['district_code']==None)]
Out[953]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places |
---|
0 rows × 23 columns
In [954]:
df_final_cleaning2.head()
Out[954]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 07-2013-0168 | HUTB-007570 | Habitatges d’Ús Turístic | Habitatges d’Ús Turístic | AV CAN BARO 22 3 1 | AV | CAN BARO | 22 | None | None | … | None | 3 | 1 | 07 | Horta-Guinardó | 34 | Can Baró | 2.160719886 | 41.41405995 | 3.0 |
1 | 07-2014-0121 | HUTB-009724 | Habitatges d’Ús Turístic | Habitatges d’Ús Turístic | AV CAN BARO 3 1 2 | AV | CAN BARO | 3 | None | None | … | None | 1 | 2 | 07 | Horta-Guinardó | 34 | Can Baró | 2.160002648 | 41.41349484 | 4.0 |
2 | 07-2014-0161 | HUTB-010707 | Habitatges d’Ús Turístic | Habitatges d’Ús Turístic | AV CAN BARO 3 1 3 | AV | CAN BARO | 3 | None | None | … | None | 1 | 3 | 07 | Horta-Guinardó | 34 | Can Baró | 2.160002648 | 41.41349484 | 4.0 |
3 | 07-2014-0120 | HUTB-009725 | Habitatges d’Ús Turístic | Habitatges d’Ús Turístic | AV CAN BARO 3 1 4 | AV | CAN BARO | 3 | None | None | … | None | 1 | 4 | 07 | Horta-Guinardó | 34 | Can Baró | 2.160002648 | 41.41349484 | 4.0 |
4 | 07-2012-0231 | HUTB-002942 | Habitatges d’Ús Turístic | Habitatges d’Ús Turístic | AV CAN BARO 3 PR 2 | AV | CAN BARO | 3 | None | None | … | None | PR | 2 | 07 | Horta-Guinardó | 34 | Can Baró | 2.160002648 | 41.41349484 | 2.0 |
5 rows × 23 columns
In [955]:
df_final_cleaning3 = df_final_cleaning2.copy()
NORMALIZATION¶
In [956]:
df_final_cleaning3.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10203 entries, 0 to 10202 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 n_practice 10203 non-null object 1 rtc 10203 non-null object 2 name 10203 non-null object 3 category 10203 non-null object 4 address 10203 non-null object 5 street_type 10203 non-null object 6 street 10203 non-null object 7 street_number_1 10203 non-null object 8 street_letter_1 123 non-null object 9 street_number_2 880 non-null object 10 street_letter_2 4 non-null object 11 block 13 non-null object 12 entrance 3 non-null object 13 stair 690 non-null object 14 floor 9654 non-null object 15 door 8672 non-null object 16 district_code 10203 non-null object 17 district_name 10203 non-null object 18 neighbourhood_code 10203 non-null object 19 neighbourhood_name 10203 non-null object 20 longitude 10203 non-null object 21 latitude 10203 non-null object 22 n_places 10201 non-null float64 dtypes: float64(1), object(22) memory usage: 1.8+ MB
In [957]:
df_final = df_final_cleaning3.copy() df_final.head()
Out[957]:
n_practice | rtc | name | category | address | street_type | street | street_number_1 | street_letter_1 | street_number_2 | … | stair | floor | door | district_code | district_name | neighbourhood_code | neighbourhood_name | longitude | latitude | n_places | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 07-2013-0168 | HUTB-007570 | Habitatges d’Ús Turístic | Habitatges d’Ús Turístic | AV CAN BARO 22 3 1 | AV | CAN BARO | 22 | None | None | … | None | 3 | 1 | 07 | Horta-Guinardó | 34 | Can Baró | 2.160719886 | 41.41405995 | 3.0 |
1 | 07-2014-0121 | HUTB-009724 | Habitatges d’Ús Turístic | Habitatges d’Ús Turístic | AV CAN BARO 3 1 2 | AV | CAN BARO | 3 | None | None | … | None | 1 | 2 | 07 | Horta-Guinardó | 34 | Can Baró | 2.160002648 | 41.41349484 | 4.0 |
2 | 07-2014-0161 | HUTB-010707 | Habitatges d’Ús Turístic | Habitatges d’Ús Turístic | AV CAN BARO 3 1 3 | AV | CAN BARO | 3 | None | None | … | None | 1 | 3 | 07 | Horta-Guinardó | 34 | Can Baró | 2.160002648 | 41.41349484 | 4.0 |
3 | 07-2014-0120 | HUTB-009725 | Habitatges d’Ús Turístic | Habitatges d’Ús Turístic | AV CAN BARO 3 1 4 | AV | CAN BARO | 3 | None | None | … | None | 1 | 4 | 07 | Horta-Guinardó | 34 | Can Baró | 2.160002648 | 41.41349484 | 4.0 |
4 | 07-2012-0231 | HUTB-002942 | Habitatges d’Ús Turístic | Habitatges d’Ús Turístic | AV CAN BARO 3 PR 2 | AV | CAN BARO | 3 | None | None | … | None | PR | 2 | 07 | Horta-Guinardó | 34 | Can Baró | 2.160002648 | 41.41349484 | 2.0 |
5 rows × 23 columns
DATA EXPORT TO EXCEL FILE¶
In [958]:
#NAME EXCEL EXPORT FILE year = 2022 prefix = 'TOURIST_LODGINGS_PD_ONLY' excel_file_name = 'T_{}_YEAR_{}.xlsx'.format(prefix,year) excel_sheet_name = '{}'.format(prefix) excel_index_label = '{}_INDEX'.format(prefix) df_final.to_excel(excel_file_name, sheet_name= excel_sheet_name, index_label=excel_index_label)