INTRODUCTION

Named in honor of Dr. Martin Luther King, Jr., King County is the most populous county in Washington state, and Seattle, its county seat, is the largest city in the Pacific Northwest region of North America. The income-tax-free environment and a high level of economic development make the King County housing market a complex battlefield, with forces pushing and pulling in different directions.

2015 was an interesting period for the county. Six years out of the crisis, the King County economy had finally recovered from its steep downturn. With high job growth and an unemployment rate well below the national level, people at the time were eager to invest in real estate. Price sits at the center of any market. We therefore investigate the King County real estate market in 2015, and our primary question is: Which regressors are the most significant when predicting housing prices in King County, and are these regressors consistent across the county? Because King County is a broad market, we chose five representative sub-markets to compare and contrast, hoping to explain the price differences from a micro perspective. The five cities and the reasons why we chose them are listed as follows:

About two-thirds of King County’s population lives in the Seattle suburbs, and the Seattle area is the biggest labor market in the state. In 2015, the Seattle housing market was full of energy and high expectations amid Amazon’s headquarters move. When looking at the dataset, we found that Seattle differs from its peers in many respects. It is like the jewel in the crown, unique and stunning. Thus, our second question concerns the Seattle sub-market specifically: What makes the Seattle housing market unique compared to the rest of King County? As a market indicator, Seattle plays an important role in understanding the real estate market of King County.

DATA

We found this dataset on Kaggle. A user named “harlfoxem” uploaded it to Kaggle three years ago, but we could not determine who collected the data. Many data scientists have used the dataset, in Python or R, to explore housing prices in King County, but many aspects remain unexplored, and we chose to investigate some of them. There are 567 kernels based on this dataset. The dataset contains sale prices for homes sold in King County between May 2014 and May 2015. It has 21,613 observations and 19 features, plus the price and the id of each house. With this many observations and a numeric price to predict, regression models are an appropriate tool. We believe the dataset is thorough enough for us to dig into the King County housing market and answer our two questions.

The picture shows both the sample distribution and the price range. A more solid color corresponds to a higher concentration of sampled houses, and the darker the red, the higher the price per square foot.

We used almost every feature in the dataset except for id and date. id is just an identifier for each house, and date is the date on which the house was sold. We did not use id because we were not investigating individual houses, and we considered date irrelevant to our analysis.

Each variable has the following real-world meaning:

+ bedrooms: the number of bedrooms in the house.
+ bathrooms: the number of bathrooms in the house.
+ sqft_living: the square footage of the home.
+ sqft_lot: the square footage of the lot.
+ floors: the total number of floors (levels) in the house.
+ waterfront: a binary variable indicating whether the house has a view of a waterfront (0 for no, 1 for yes).
+ condition: a categorical variable indicating the overall condition of the house; the condition improves from 1 to 5.
+ grade: a categorical variable representing the overall grade given to the housing unit under King County’s grading system; it runs from 1 to 13, with higher values being better.
+ sqft_above: the square footage of the house excluding the basement.
+ sqft_basement: the square footage of the basement.
+ yr_built: the year the house was built; we transformed this variable into the age of the house by subtracting it from 2019.
+ yr_renovated: the year the house was renovated.

ifrenovated and yr_built_norm are variables we created in order to improve our regression analysis:

+ ifrenovated: a binary variable indicating whether or not the house has been renovated.
+ yr_built_norm: a normalized version of yr_built, obtained by subtracting 1899 from yr_built.
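A minimal sketch of how the derived variables described above could be computed, using only the Python standard library. It assumes that yr_renovated is 0 for a house that has never been renovated; the column name age and the sample record are hypothetical and serve only as an illustration.

```python
# Sketch: derive the engineered variables from one raw record.
# Assumption: yr_renovated == 0 means the house was never renovated.

REFERENCE_YEAR = 2019   # year used in the text to turn yr_built into an age
NORM_BASE_YEAR = 1899   # offset used in the text for yr_built_norm


def add_derived_features(record):
    """Return a copy of `record` with age, ifrenovated and yr_built_norm added."""
    out = dict(record)
    out["age"] = REFERENCE_YEAR - record["yr_built"]        # hypothetical column name
    out["ifrenovated"] = 1 if record["yr_renovated"] > 0 else 0
    out["yr_built_norm"] = record["yr_built"] - NORM_BASE_YEAR
    return out


# Hypothetical record, for illustration only (not taken from the dataset).
example = add_derived_features({"yr_built": 1955, "yr_renovated": 1991})
print(example)
```

Subtracting 1899 makes a house built in 1900 take the value 1, so the coefficient on yr_built_norm is read per year of construction rather than against a year-zero baseline.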

We added another column to the dataset, City, by joining a second dataset that maps zip codes to specific cities in King County. We thought analyzing houses across different cities would be more direct and interesting. After this modification, the final clean dataset that we used for further analysis contains 21 variables and 21,551 observations. A general profile of each city follows.

City            Houses  Avg_Price_USD  Avg_Sqft  Avg_Grade
Auburn             911       291494.1    1868.6        7.4
Bellevue          1397       895743.7    2512.8        8.4
Black Diamond      100       423666.0    2008.8        7.4
Bothell            195       490351.5    2248.1        7.8
Carnation          124       455617.1    1929.0        7.4
Duvall             190       424788.7    2103.7        7.5
Enumclaw           234       315709.3    1798.8        7.2
Fall City           81       580526.8    2128.2        7.5
Federal Way        779       289384.9    1932.3        7.6
Issaquah           731       614804.3    2304.1        8.3
Kenmore            283       462480.0    2091.3        7.6
Kent              1203       299549.9    1908.4        7.4
Kirkland           975       646497.9    2045.4        7.8
Maple Valley       590       366867.6    2118.4        7.6
Medina              49      2154700.6    3130.0        9.5
Mercer Island      280      1192635.6    2899.7        9.0
North Bend         221       439471.1    1913.8        7.6
Redmond            978       657197.2    2389.7        8.2
Renton            1593       403537.6    2036.5        7.5
Sammamish          799       732457.0    2761.7        8.8
Seattle           8939       533407.4    1681.3        7.3
Snoqualmie         310       527961.2    2518.3        7.9
Vashon             118       487479.6    1776.9        7.3
Woodinville        471       617384.5    2535.6        8.4
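The zip-code join and the per-city profile can be sketched as follows. This is a standard-library illustration under stated assumptions, not the procedure actually used: the lookup table ZIP_TO_CITY, the sample rows, and the output key names are all hypothetical, and the profile columns are assumed to be simple per-city means.

```python
from collections import defaultdict

# Hypothetical zip-to-city lookup; the real lookup is a separate dataset.
ZIP_TO_CITY = {"98103": "Seattle", "98004": "Bellevue", "98003": "Federal Way"}

# Hypothetical sales rows, for illustration only.
sales = [
    {"zipcode": "98103", "price": 500000.0, "sqft_living": 1500, "grade": 7},
    {"zipcode": "98103", "price": 600000.0, "sqft_living": 1700, "grade": 8},
    {"zipcode": "98004", "price": 900000.0, "sqft_living": 2500, "grade": 9},
    {"zipcode": "00000", "price": 400000.0, "sqft_living": 1800, "grade": 7},
]


def attach_city(rows, lookup):
    """Inner join on zipcode: rows without a matching city are dropped."""
    return [dict(r, City=lookup[r["zipcode"]]) for r in rows if r["zipcode"] in lookup]


def city_profile(rows):
    """Count houses and average price, living area and grade for each city."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["City"]].append(r)
    profile = {}
    for city, members in groups.items():
        n = len(members)
        profile[city] = {
            "houses": n,
            "mean_price": sum(m["price"] for m in members) / n,
            "mean_sqft": sum(m["sqft_living"] for m in members) / n,
            "mean_grade": sum(m["grade"] for m in members) / n,
        }
    return profile


profile = city_profile(attach_city(sales, ZIP_TO_CITY))
for city in sorted(profile):
    print(city, profile[city])
```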

RESULTS

Part I

We fit 15 different models in order to understand which regressors were the most significant in predicting the price of a given house in the King County housing market. Some of the models regressed housing price on a single regressor, while other models combined several regressors to give a better picture of how the regressors interact with each other. Moreover, the 15 models were run separately for each of the 5 cities to see whether the significant and non-significant regressors remained consistent across the county. The Federal Way, Mercer Island, Seattle, Bellevue, and Snoqualmie housing markets were chosen for the reasons listed in the Introduction. The models we chose were as follows:

We decided not to include grade or view as regressors in our models in order to avoid multicollinearity. In our exploratory data analysis (EDA), we determined that most variables contribute to grade, and that view was highly correlated with waterfront. For the same reason, we never used sqft_living together with bedrooms or bathrooms in the same model. The MSE and MAE were used to determine which model was the best at predicting housing prices. Models 7b, 7c, and 8 performed the best across the board, with average Adjusted R-squared values of 0.51, 0.52, and 0.58 respectively.
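The three criteria named above can be written out explicitly. The sketch below is a minimal, standard-library illustration of the usual definitions; the function names and the sample prices are hypothetical and are not taken from our models.

```python
def mse(actual, predicted):
    """Mean squared error: average of squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error: average of absolute prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def adjusted_r2(actual, predicted, n_regressors):
    """R-squared penalized for the number of regressors in the model."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_regressors - 1)


# Hypothetical sale prices and model predictions, for illustration only.
actual = [300000, 450000, 520000, 610000, 700000]
predicted = [320000, 430000, 500000, 640000, 690000]

print(mse(actual, predicted))
print(mae(actual, predicted))
print(adjusted_r2(actual, predicted, n_regressors=1))
```

MSE penalizes large misses more heavily than MAE, so the two can rank models differently when a few expensive houses are predicted badly. The adjustment in Adjusted R-squared makes models with different numbers of regressors more comparable.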

The individual significance levels of the regressors were taken into consideration when deciding whether each regressor was significant. As a result, we concluded that the significant regressors for predicting housing prices in each city are as follows:

+ Federal Way: sqft_lot, bathrooms, sqft_living, waterfront, yr_built_norm, condition
+ Mercer Island: sqft_lot, bathrooms, sqft_living, waterfront
+ Seattle: sqft_lot, bathrooms, sqft_living, waterfront, yr_built_norm, condition
+ Bellevue: sqft_lot, bathrooms, sqft_living, waterfront, yr_built_norm, ifrenovated
+ Snoqualmie: sqft_lot, bathrooms, sqft_living, yr_built_norm

The only regressors that were significant for all cities were sqft_lot, bathrooms, and sqft_living. waterfront was significant for every city except Snoqualmie, which makes sense because there were 0 waterfront houses in Snoqualmie in our dataset. Therefore, it is fair to say that the waterfront regressor was significant in all scenarios where it was applicable. Interestingly, the yr_built_norm regressor was significant for every city except Mercer Island. This could be because most houses there were built more recently, are in better condition, and are more expensive than houses in other cities, which suggests that condition and yr_built are not as important there as they are for other cities. Also, since Bellevue developed relatively early, the houses there are fairly old. Thus, ifrenovated was significant because the city was going through a stage in which many houses needed renovation. The process of redevelopment made a large difference and boosted prices considerably. In contrast, Seattle has already gone through its renovation trend while other cities have not yet started, so ifrenovated is not a significant factor in those markets. Finally, bedrooms was the only regressor that was not significant for any of the cities in our selection.
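To illustrate what an individual significance check involves, the sketch below computes the slope and its t-statistic for a one-regressor model, using only the standard library. It is a simplified stand-in for the multi-regressor models discussed above; the function name and the sample data are hypothetical. It also shows why a regressor with no variation, such as waterfront in Snoqualmie, cannot be tested at all.

```python
import math


def slope_t_statistic(x, y):
    """Fit y = a + b*x by least squares and return (b, t-statistic of b).

    Returns None when x has no variation, because the slope is then
    not identifiable (e.g. a waterfront column that is all zeros).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return None
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    return slope, slope / std_err


# Hypothetical living areas and prices, for illustration only.
sqft_living = [1000, 1500, 2000, 2500, 3000, 3500]
price = [310000, 405000, 520000, 590000, 720000, 790000]

result = slope_t_statistic(sqft_living, price)
print(result)

# A regressor that is constant within a city cannot be tested.
print(slope_t_statistic([0, 0, 0, 0, 0, 0], price))
```

In large samples, an absolute t-statistic above roughly 2 corresponds to significance at about the 5% level; a statistics package would report the exact p-value.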

Part II

After researching many different aspects of the King County housing market, we found that Seattle has one of the most distinctive markets among the surrounding cities. The characteristics that set the Seattle housing market apart from other cities in King County are its development over time, the percentage of houses that have been renovated, and the relatively low price of houses despite Seattle being a major city located on the water. We used a variety of bar graphs and other figures to illustrate these differences between Seattle and its neighboring cities and to show that Seattle is indeed a unique city.

We first discovered that Seattle was unique when we looked at how King County has grown from 1900 to 2015. For each city, we drew a line graph showing the number of houses built each year from 1900 to 2015. These graphs revealed that Seattle followed a very different path of development from the other cities. Seattle experienced increases in the number of houses built from 1900 to 1950 and from 2000 to 2015, and a large decrease from approximately 1950 to 2000. This trend is the opposite of what we see in almost every other city, where development increased from 1950 to 2000 and hardly any houses were built before 1950. This is most likely because the Northwest was still a relatively new part of the United States in the early 1900s, so most people settled in larger cities where there were more jobs and resources. As the area became more developed and the population increased, more people began to move away from Seattle and build houses in the surrounding cities. This is why we see the sudden decrease in the number of houses built in Seattle from 1950 to 2000 and more houses being built in the surrounding area.

Another discovery that we made about the Seattle housing market is that a significantly higher percentage of houses are renovated there than in any other city in King County. We realized this after we made a bar graph of the difference between the percentage of houses not renovated and the percentage of houses renovated. Not only was Seattle one of the few cities that had a higher percentage of houses renovated than not renovated, but the percentage of renovated houses in Seattle was significantly higher than in any other city. After making this discovery, we tried to explain why so many houses in Seattle were renovated by looking at the number of houses built in five different time periods for each city. We did this by making a bar graph for the five time intervals (1900-24, 1925-49, 1950-74, 1975-2000, after 2000), with the number of houses built on the x-axis and the cities on the y-axis. From this graph, we found that Seattle had a lot of houses built from 1900 to 1950, and that it was one of the only cities in which houses were built during that period. This means that there were a lot of older houses in Seattle that needed to be renovated, while most of the other cities had newer houses that did not. This is similar to what we saw in the line graph of housing development over time, and it explains why so many houses in Seattle were renovated.
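The two quantities behind these bar graphs, the renovated share per city and the number of houses built per period, can be sketched as follows. This is a standard-library illustration with hypothetical rows and function names; the period boundaries are assumed to be non-overlapping, with the second period starting in 1925.

```python
from collections import Counter

# Assumed non-overlapping construction periods (inclusive bounds).
PERIODS = [(1900, 1924), (1925, 1949), (1950, 1974), (1975, 2000)]


def build_period(yr_built):
    """Map a construction year to one of the five period labels."""
    for start, end in PERIODS:
        if start <= yr_built <= end:
            return f"{start}-{end}"
    return "after 2000" if yr_built > 2000 else "before 1900"


def renovation_share(rows):
    """Percentage of renovated houses in each city."""
    total = Counter(r["City"] for r in rows)
    renovated = Counter(r["City"] for r in rows if r["ifrenovated"] == 1)
    return {city: 100.0 * renovated[city] / total[city] for city in total}


def houses_per_period(rows):
    """Number of houses built in each (city, period) pair."""
    return Counter((r["City"], build_period(r["yr_built"])) for r in rows)


# Hypothetical rows, for illustration only.
rows = [
    {"City": "Seattle", "yr_built": 1910, "ifrenovated": 1},
    {"City": "Seattle", "yr_built": 1940, "ifrenovated": 0},
    {"City": "Snoqualmie", "yr_built": 2005, "ifrenovated": 0},
    {"City": "Snoqualmie", "yr_built": 1998, "ifrenovated": 0},
]

print(renovation_share(rows))
print(houses_per_period(rows))
```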

Seattle is also unique in that it has a relatively low average house price given its location and the fact that it is on the water. We made a bar graph of the average house price in each city and found that Seattle was right around the average for King County. This is surprising, since one would expect a large city with many resources that appeal to potential home owners to have higher-than-average housing prices. One reason the average price is so low might be that most houses in Seattle are smaller and more compact; another might be that most houses in Seattle are older and therefore do not have as many amenities.
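One rough way to probe the smaller-house explanation is to divide each city's average price by its average square footage, using the figures from the city profile table for the five cities studied in Part I. This is only a sketch: a ratio of averages is not the same as the average per-house price per square foot, and the variable names are hypothetical.

```python
# (average price in USD, average square footage), copied from the city profile table.
CITY_AVERAGES = {
    "Seattle": (533407.4, 1681.3),
    "Bellevue": (895743.7, 2512.8),
    "Federal Way": (289384.9, 1932.3),
    "Mercer Island": (1192635.6, 2899.7),
    "Snoqualmie": (527961.2, 2518.3),
}

# Ratio of averages: a rough proxy for price per square foot in each city.
price_per_sqft = {
    city: price / sqft for city, (price, sqft) in CITY_AVERAGES.items()
}

for city, value in sorted(price_per_sqft.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{city:15s} {value:8.1f}")
```

On these figures, Seattle and Snoqualmie have similar average prices, but Seattle's ratio is noticeably higher because its average house is much smaller, which is consistent with the compact-house explanation.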

CONCLUSION

Question one asks whether the significant regressors for predicting house prices in King County differ from city to city. After looking at the modeling results, it is not surprising to see that the five cities we chose all have sqft_lot, bathrooms, waterfront (Snoqualmie doesn’t have waterfront houses), and sqft_living as significant regressors. Based on those factors, we can estimate the price of an individual new listing at the time and compare it with its price level before and after 2015. The factors explain how the market valued houses, which reflects market behavior, the stage of the economy, and people’s beliefs. The differences among the models revealed some characteristics of specific cities and allowed us to compare and contrast prices at different locations.

Question two digs deeper into the Seattle housing market specifically, since it is such a unique city. It has a special development pattern, a high renovation percentage, and a lower average price than we had expected. Because Seattle started to develop relatively early, many old houses had been renovated by 2015. Although houses in Seattle tend to be renovated more often, buyers would still consider the age of a house when purchasing. Its relatively low average price is likely a result of the large population and highly compact floor plans in Seattle. To conclude, the results offer the industry several different angles from which to view the Seattle real estate market.

Both questions and their answers would be helpful in practice for potential buyers and housing agents alike. Buyers who would like to invest in a new home in King County will have a better understanding of the market in terms of housing prices. When prioritizing their preferred features of different houses, they can weigh their financial options and make wiser choices. Housing agents would be able to target their clients more precisely. When recommending homes, they can give advice based on the clients’ background and preferred characteristics, which should considerably increase the chance of closing a deal. Our modeling process could be improved by adding interaction terms and adjusting some variables to produce a better-fitting regression model. Squaring certain variables or taking their logs might help. The dataset that we chose to work with is thorough enough for our analysis, since it has over 20,000 observations. It would help to have more records for some cities, since the cities have very different numbers of houses, which introduced some bias into our evaluation.
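The suggested improvements could look like the sketch below, which builds a log-transformed response, a logged regressor, a squared term, and one interaction term for a single record. These particular transformations were not fitted or tested in this analysis; the function name, the output column names, and the sample record are hypothetical.

```python
import math


def transformed_features(record):
    """Build log, squared and interaction terms for one house record."""
    sqft = record["sqft_living"]
    return {
        # Log of price is a common choice for right-skewed prices.
        "log_price": math.log(record["price"]),
        "log_sqft_living": math.log(sqft),
        # Squared term lets the effect of construction year curve.
        "yr_built_norm_sq": record["yr_built_norm"] ** 2,
        # Interaction: lets the value of living area differ for waterfront homes.
        "sqft_living_x_waterfront": sqft * record["waterfront"],
    }


# Hypothetical record, for illustration only.
row = transformed_features(
    {"price": 500000.0, "sqft_living": 2000, "yr_built_norm": 56, "waterfront": 1}
)
print(row)
```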

This topic could be extended to different timeframes and locations. Although we have some time-related analysis, it would be interesting to look into changes in prices and house characteristics over time to give the research more dimensions. Also, the same analysis could be applied to other big cities to find out whether most of them follow the pattern of Seattle and its neighboring cities. More broadly, comparisons among different states, or between the East Coast and the West Coast, could be done in the future.