Different fields of science, economics, engineering, and marketing accumulate and store data primarily in electronic databases. Appropriate and well-established decisions should be made using the data collected. It is practically impossible to make sense of datasets containing more than a handful of data points without the help of computer programs. To be certain of the insights that the collected data provides and to make further decisions, data mining is performed where we go through distinctive analysis processes.
Exploratory data analysis is key, and usually the first exercise in data mining. It allows us to visualize data to understand it as well as to create hypotheses for further analysis. The exploratory analysis centers around creating a synopsis of data or insights for the next steps in a data mining project.
EDA in data science actually reveals ground truth about the content without making any underlying assumptions. This is the fact that data scientists use this process to actually understand what type of modeling and hypotheses can be created. Key components of exploratory data analysis include summarizing data, statistical analysis, and visualization of data. Python provides expert tools for exploratory analysis, with pandas for summarizing; scipy, along with others, for statistical analysis; and matplotlib and plotly for visualizations.
In this Free case study, Analytics Educator will show you how to use Machine Learning algorithm to predict the price of used cars (2nd hand cars) accurately. We will emphasize more on the EDA with Python part in this project and use only a single Machine learning algorithm to predict and determine the accuracy.
Craigslist is the world's largest collection of used vehicles for sale, this dataset which includes every used vehicle entry within the United States on Craigslist. We are going to predict the price of 2nd hand cars, so that the seller gets the maximum possible price.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
import os
#setting the path
os.chdir("C:\\Users\\ASUS\\Desktop")
data=pd.read_csv("vehicles.csv")
# create a back up copy
df = data.copy()
pd.set_option('display.max_columns', None)
df.head()
| id | url | region | region_url | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | vin | drive | size | type | paint_color | image_url | description | county | state | lat | long | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7184791621 | https://duluth.craigslist.org/ctd/d/duluth-200... | duluth / superior | https://duluth.craigslist.org | 6995 | 2000.0 | gmc | new sierra 1500 | excellent | 8 cylinders | gas | 167783.0 | clean | automatic | 1GTEK19T3YE366021 | 4wd | NaN | NaN | red | https://images.craigslist.org/00n0n_f06ykBMcdh... | 2000 *** GMC New Sierra 1500 Ext Cab 157.5 WB... | NaN | mn | 46.8433 | -92.2550 |
| 1 | 7184773187 | https://duluth.craigslist.org/cto/d/saginaw-20... | duluth / superior | https://duluth.craigslist.org | 8750 | 2013.0 | hyundai | sonata | excellent | 4 cylinders | gas | 90821.0 | clean | automatic | NaN | fwd | NaN | NaN | grey | https://images.craigslist.org/00d0d_kgZ6xoeRw2... | For Sale: 2013 Hyundai Sonata GLS - $8750. O... | NaN | mn | 46.9074 | -92.4638 |
| 2 | 7193375964 | https://newhaven.craigslist.org/cto/d/stratfor... | new haven | https://newhaven.craigslist.org | 10900 | 2013.0 | toyota | prius | good | 4 cylinders | hybrid | 92800.0 | clean | automatic | NaN | fwd | NaN | NaN | blue | https://images.craigslist.org/00d0d_3sHGxPbY2O... | 2013 Prius V Model Two. One owner—must sell my... | NaN | ct | 41.1770 | -73.1336 |
| 3 | 7195108810 | https://albuquerque.craigslist.org/cto/d/albuq... | albuquerque | https://albuquerque.craigslist.org | 12500 | 2003.0 | mitsubishi | lancer | good | 4 cylinders | gas | NaN | clean | manual | NaN | 4wd | mid-size | sedan | grey | https://images.craigslist.org/00m0m_4a8Pb6JbMG... | 2003 Mitsubishi Lancer Evolution, silver. Abo... | NaN | nm | 35.1868 | -106.6650 |
| 4 | 7184712241 | https://duluth.craigslist.org/ctd/d/rush-city-... | duluth / superior | https://duluth.craigslist.org | 16995 | 2007.0 | gmc | sierra classic 2500hd | good | 8 cylinders | diesel | 254217.0 | clean | automatic | 1GTHK23D57F102022 | 4wd | full-size | truck | white | https://images.craigslist.org/01414_g093aPtSMW... | **Bad Credit, No Credit... No Problem!**2007 G... | NaN | mn | 45.6836 | -92.9648 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 539759 entries, 0 to 539758
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 539759 non-null int64
1 url 539759 non-null object
2 region 539759 non-null object
3 region_url 539759 non-null object
4 price 539759 non-null int64
5 year 538772 non-null float64
6 manufacturer 516175 non-null object
7 model 531746 non-null object
8 condition 303707 non-null object
9 cylinders 321264 non-null object
10 fuel 536366 non-null object
11 odometer 440783 non-null float64
12 title_status 536819 non-null object
13 transmission 535786 non-null object
14 vin 315349 non-null object
15 drive 383987 non-null object
16 size 168550 non-null object
17 type 392290 non-null object
18 paint_color 365520 non-null object
19 image_url 539740 non-null object
20 description 539738 non-null object
21 county 0 non-null float64
22 state 539759 non-null object
23 lat 530785 non-null float64
24 long 530785 non-null float64
dtypes: float64(5), int64(2), object(18)
memory usage: 103.0+ MB
r=df.columns
for i in r:
print(df[i].value_counts())
7194886517 1
7189000001 1
7183550244 1
7196139303 1
7192970529 1
..
7190568893 1
7191969294 1
7195637696 1
7190396865 1
7190089730 1
Name: id, Length: 423857, dtype: int64
https://orangecounty.craigslist.org/ctd/d/santa-ana-2011-mini-countryman-cooper/7194000283.html 1
https://baltimore.craigslist.org/ctd/d/lutherville-timonium-2011-chevrolet/7188930949.html 1
https://washingtondc.craigslist.org/mld/cto/d/silver-spring-toyota-corolla-2006/7193998974.html 1
https://honolulu.craigslist.org/oah/ctd/d/waipahu-auto-deals2011-chevrolet-camaro/7196201050.html 1
https://fortmyers.craigslist.org/lee/ctd/d/vero-beach-2011-ford-f250-super-duty/7188503273.html 1
..
https://hartford.craigslist.org/ctd/d/westport-2007-avalanche-lt-package-4wd/7196084213.html 1
https://baltimore.craigslist.org/cto/d/baltimore-2010-ford-escape-limited/7184749194.html 1
https://masoncity.craigslist.org/ctd/d/faribault-2014-ford-edge-se-awd-4dr/7190678802.html 1
https://westslope.craigslist.org/ctd/d/evans-2013-ford-150-f-wd-reg-cab-126/7195780250.html 1
https://medford.craigslist.org/ctd/d/portland-2014-chevrolet-suburban-lt-x4/7192045084.html 1
Name: url, Length: 423857, dtype: int64
columbus 3296
springfield 3294
jacksonville 3284
rochester 3087
salem 2985
...
meridian 22
kansas city 15
west virginia (old) 14
st louis 5
fort smith, AR 1
Name: region, Length: 404, dtype: int64
https://salem.craigslist.org 2985
https://kpr.craigslist.org 2983
https://fresno.craigslist.org 2983
https://stlouis.craigslist.org 2978
https://cosprings.craigslist.org 2977
...
https://kirksville.craigslist.org 37
https://nesd.craigslist.org 30
https://swks.craigslist.org 29
https://meridian.craigslist.org 22
https://wv.craigslist.org 14
Name: region_url, Length: 413, dtype: int64
0 35025
3500 3865
8995 3589
7995 3537
4500 3465
...
28766 1
30815 1
20120 1
10581 1
40812 1
Name: price, Length: 16305, dtype: int64
2017.0 30828
2015.0 21901
2014.0 21773
2013.0 21480
2016.0 21430
...
1917.0 1
1915.0 1
1913.0 1
1912.0 1
1911.0 1
Name: year, Length: 111, dtype: int64
ford 59786
chevrolet 46603
toyota 26001
honda 18898
nissan 17536
jeep 14224
gmc 12782
dodge 12547
ram 11917
hyundai 8765
subaru 7971
bmw 7717
volkswagen 7121
kia 7069
mercedes-benz 6859
chrysler 5903
buick 4668
cadillac 4615
mazda 3964
lexus 3916
audi 3334
acura 2784
lincoln 2593
pontiac 2323
infiniti 2318
volvo 2097
mitsubishi 1945
mercury 1330
mini 1320
rover 1151
saturn 1083
jaguar 787
fiat 567
tesla 299
alfa-romeo 138
harley-davidson 125
datsun 69
ferrari 64
aston-martin 28
land rover 15
porche 6
morgan 3
hennessey 1
Name: manufacturer, dtype: int64
f-150 6073
silverado 1500 3878
escape 3208
camry 2906
1500 2898
...
tundra sr5 truck 1
compass sport 4x4 low miles 1
f650 super duty regular ca 1
Vandolkswagon Jett57800 1
New Beetle 1
Name: model, Length: 27042, dtype: int64
excellent 85254
good 65262
like new 17880
fair 6724
new 1069
salvage 530
Name: condition, dtype: int64
6 cylinders 70130
4 cylinders 69542
8 cylinders 54091
5 cylinders 1594
10 cylinders 884
other 851
3 cylinders 406
12 cylinders 181
Name: cylinders, dtype: int64
gas 286775
diesel 24414
other 11311
hybrid 3737
electric 977
Name: fuel, dtype: int64
0.0 1799
150000.0 694
130000.0 671
140000.0 667
160000.0 645
...
17925.0 1
194413.0 1
48603.0 1
25107.0 1
47171.0 1
Name: odometer, Length: 89554, dtype: int64
clean 315607
rebuilt 6017
salvage 3248
lien 1963
missing 711
parts only 213
Name: title_status, dtype: int64
automatic 292619
manual 23647
other 11799
Name: transmission, dtype: int64
1HTMRAAM19J040080 133
1FVDBSDC34HM88268 125
1FVDCXBS48HZ75358 118
77777777777777777 70
2NKHHJ7X2HM136714 60
...
3FADP4TJ5DM149893 1
2T2BK1BA3FC297838 1
1FMCU9GD8JUC87396 1
1FAHP2E84GG134039 1
1FTNF1CF7EKD17926 1
Name: vin, Length: 103588, dtype: int64
4wd 99347
fwd 89632
rwd 42140
Name: drive, dtype: int64
full-size 52820
mid-size 30772
compact 16371
sub-compact 2664
Name: size, dtype: int64
sedan 64842
SUV 62106
truck 26967
pickup 24276
coupe 12931
other 12866
hatchback 9574
wagon 8332
van 7287
convertible 5854
mini-van 5213
offroad 530
bus 379
Name: type, dtype: int64
white 57646
black 43843
silver 32687
blue 22988
red 22395
grey 21752
custom 6299
green 5983
brown 4618
yellow 1775
orange 1502
purple 715
Name: paint_color, dtype: int64
https://images.craigslist.org/00H0H_lwWjXSEWNa7_0x20oM_600x450.jpg 2106
https://images.craigslist.org/00l0l_fl36gaeOcIT_0cU09G_600x450.jpg 133
https://images.craigslist.org/00a0a_2RYSGJJExad_0bM08P_600x450.jpg 125
https://images.craigslist.org/00202_l590CdXzJSo_001001_600x450.jpg 122
https://images.craigslist.org/01010_d7IBA4PgcmC_09e05H_600x450.jpg 115
...
https://images.craigslist.org/00k0k_adpaDg8st1T_0x20oM_600x450.jpg 1
https://images.craigslist.org/00C0C_4mpMrhwWWPp_0kE0fu_600x450.jpg 1
https://images.craigslist.org/00909_fdlnr1QHlhn_0ak07K_600x450.jpg 1
https://images.craigslist.org/01212_eB1nL5MODmY_09Q07v_600x450.jpg 1
https://images.craigslist.org/00w0w_78zjb9qds2U_0cU09G_600x450.jpg 1
Name: image_url, Length: 222413, dtype: int64
VISIT OUR WEBSITE WWW.DIRECTDETAILOFOH.COM HTTPS://DIRECTDETAILOHIO.COM/ WE HAVE A SELECTION OF CARS,TRUCKS,SUV'S. GUARANTEED APPROVAL. BUY HERE PAY HERE. NO CREDIT CHECKS! NO INCOME VERIFICATION! Vehicle is AS-IS NO Warranty. Information deemed reliable, but not guaranteed. Interested parties should confirm all data before relying on it to make a purchase decision. All prices and specifications are subject to change without notice. Prices may not include additional fees such as government fees and taxes, title and registration fees, finance charges, dealer document preparation fees, processing fees, and emission testing and compliance charges. 144
Frontier Auto Sales wants you to know we're about more than just selling cars our mission is to build long-lasting relationships. We want to grow our Frontier Family and see you send us your family and friends. We accept trades! (tow it, drive it or push it in.) We purchase vehicles on the spot. We even cosign vehicles if you need to sell yours let us do the work for you. Let us help find the perfect financing for you and your family’s needs. Spend 3 minutes to fill out a hassle-free secure application with Frontier Auto Sales and let us work hard to get you the better APR & Terms you deserve. Don’t forget Frontier Auto Sales has Alaska’s Best referral reward $300 for every individual you send that completes a purchase from Frontier Auto Sales. Disclaimer: All vehicles subject to prior sale. We reserve the right to make changes without notice and are not responsible for errors or omissions. All prices exclude dealer service fees and DMV fees, any finance charges. **Warranties may be available on all models for an additional cost** For more information on this and other vehicles please see our website: www.frontierautosales.org Frontier Auto Sales Frontier Auto Sales 1300 E. Dowling Rd. Anchorage, AK 99518 Call: 907-561-9900 Text: 907-717-7457 Like us on Facebook! Store Hours: Monday - Friday 9 AM - 8 PM Saturday 10 AM - 7 PM Sunday CLOSED 140
HEC Leasing Call Russ or Gary at 888-366-1571 Reduced Down Payments! Factory Warranty Included With Each Truck! Detroit DD13 450HP Engine Fuller 10-Speed Ultrashift Plus Automatic Trans. 342 RAR 240" WB Air Slide 5th Wheel w/In Cab Release Dual 100 Gallon Fuel Tanks 22.5 LP Tires 6 Pol, Aluminum Wheels 72" Double Bunk Sleeper w/ Premium Interior Freightliner Parksmart Batt HVAC System Auxiliary Heater in Sleepers Dual High Back Air Ride Seats AUTOMATIC TRANSMISSIONS!!! We make financing decisions IN HOUSE! Low weekly rates! We make equipment acquisition easy & affordable. We offer flexible financing, and it is our mission to provide you with a transportation solution that meets your hauling needs and fits within your budget. Our lease to own arrangements are designed to making acquiring a truck as painless as possible. We are a small, family-oriented company that provides the personalized service you need to ensure you are acquiring the right truck under the right terms. Leasing is a great way to get your small business off the ground or quickly add capacity to your current operations. To learn more about our truck leasing and financing services, contact us today. Call one of our VP of Sales - Gary Carrigan, Russ Carrigan at 888-366-1571! Peterbilt, Kenworth, Freightliner, Western Star, International, Crew Cab, Mack,Freightliner, Kenworth, Peterbilt, Owner, Operator, Over the Road, Trucking, Drivers, Semi, Truck, International, Day Cab, Cascadia, Courrier, Over The Road, Leasing Semi, Volvo, Commercial, Used Trucks, 99
Pennington Ford, and our sister store, right across the street, Pennington Chevrolet Buick, have been working hard to provide you with a Premier selection of used cars and trucks. We are dedicated to you, our customer. Front to back, we will be here for everything. From helping you search to find the car or truck of your dreams, to delivery. Making your Car shopping experience easier is what we do! Be sure to check out the rest of our inventory, and call with any questions. 618-783-2385. As always here at The Pennington Auto Group, we have financing available and all trades are welcomed. Stop by Newton Illinois, Premier Car Dealership or give us a call, we look forward to doing business with you!!! Pennington Ford & Polaris 806 S Van Buren Newton, IL 62448 618-783-2385 OVER 200+ NEW/USED VEHICLES IN STOCK!! CHECK OUT OUR INVENTORY WWW.PENNINGTONFORD.COM WWW.PENNINGTONCHEVROLETBUICK.COM 79
www.TEXASDIESELSTORE.com CALL OR TEXT 254-247-2172 2016 GMC SIERRA 2500HD SLT - 4X4 6.6L DURAMAX TURBO DIESEL ALLISON AUTOMATIC TRANSMISSION 4 WHEEL DRIVE - z71 158K MILES LEATHER - LOADED! HEATED AND COOLED SEATS TOW HITCH FACTORY NAVIGATION 18" FUEL WHEELS 295/70R18 COOPER TIRES FIXED CENTER CONSOLE SLT PACKAGE W/ LEATHER, BACK UP CAMERA, NAVIGATION & MORE 100% RUST FREE TRADE INS? YES, WE TAKE THEM! CONTACT US WITH YOUR VEHICLE INFO FOR A FREE APPRAISAL! FREE SHIPPING IN THE STATE OF TEXAS!!! WE ALSO OFFER PICK UP FROM THE AUSTIN AIRPORT! FLY IN & DRIVE HOME WITH UP TO A $500 REIMBURSEMENT IN AIRFARE! WE SPECIALIZE IN HARD TO FIND DIESELS! CHECK OUT OUR WEBSITE! OUR INVENTORY IS CONSTANTLY CHANGING WITH NEW ARRIVALS DAILY! CALL OR TEXT 254-247-2172 You are looking at an All Terrain HD edition 2016 GMC SIERRA 2500HD crew cab SLT pick up with the popular 6.6L Duramax turbo diesel engine and Allison automatic transmission! This truck is BEAUTIFUL and super clean! The truck is all stock! Never been hitched! Non smoker truck! The interior looks GREAT as well with leather seats and fixed console with touch screen navigation on the dash! This truck is hard loaded with all important factory options including heated seats, navigation also is smoke free! Does have a few hail dings on the hood. Overall very clean truck! Dual climate control and heated seats for everybody to be comfortable! 100% free of rust! Recently serviced and ready for a new owner! Call or text us today for walk-around videos or pictures! Texas Diesel Store is a NATIONWIDE dealer! We service anyone, anywhere, anytime! We offer shipping options, we can also pick you up from the airport and bring you directly to the dealership! Call or Txt us for a shipping quote!! We also offer extremely competitive Financing options! There is a free credit application on our website. We work hard to have the best financing options available with the most competitive rates. Click the link to fill out a Credit Application: http://texasdieselstore.com/financing Contact Us Today @ 254-247-2172 Shipping and reimbursements are subject to the completion of the transaction at our designated list price. Airfare reimbursements will be refunded at the completion of the deal as long as an airfare receipt is provided for our records. 67
...
2006 Chrysler Sebring GTC Convertible Offered by: Affordable Auto Sales Albuquerque — (505) 431-4233 — $5,995 Power Steering,Power Brakes,Power Door Locks,Power Windows,Power Drivers Seat,Radial Tires,Gauge Cluster,Trip Odometer,Tachometer,Air Conditioning,Tilt Steering Wheel,Cruise Control,Tinted Glass,Driver Side Air Bag,Front Bucket Seats,Reclining Seats,Compact Disc Player,Anti-Lock Braking System,Alloy Wheels,Rear Spoiler,Clock,Interval Wipers,Rear Defroster,Remote Trunk Lid,Console,Carpeting,Vanity Mirror,Day/Night Lever,Dual Sport Mirrors,Driver Side Remote Mirror,Velour Upholstery,Accent Stripes,Body Side Moldings,Center Arm Rest,Map Lights,Inside Hood Release,Fold Down Rear Seat,Maintenance Free Battery,Halogen Headlights,Airbags - Front - Dual,Air Conditioning - Front,Emergency Interior Trunk Release,Reading Lights Front,Child Seat Anchors,Cruise Control,Multi-function Remote Trunk Release,Multi-function Remote Keyless Entry,Front Power Outlet,Center Console Front Console With Storage,Driver Seat Power Adjustments,Exterior Entry Lights,Warnings And Reminders Low Fuel Level,Headlights Auto Delay Off,Power Door Locks,Exterior Mirrors Power,Front Suspension Type: Double Wishbones,Front Suspension Classification: Independent,Convertible Roof Soft Top,Seats Vinyl Upholstery,Seatbelts Seatbelt Pretensioners,Rear Seats Bench,Tachometer,Convertible Rear Window Glass,Convertible Roof Power,Rear Spoiler,Seats Front Seat Type: Bucket,Front Headrests Adjustable,Windows Front Wipers: Intermittent,Power Windows,Suspension Stabilizer Bar(s): Front,Suspension Stabilizer Bar(s): Rear,Windows Rear Defogger,AM/FM STEREO RADIO,PASSENGER SIDE AIR BAG Affordable Auto Sales Albuquerque Year: 2006 Make: Chrysler Model: Sebring Series: GTC Convertible VIN: 1C3EL75R16N153709 Stock #: 153709 Condition: Used Mileage: 157,901 MPG: 21 city / 28 highway Exterior: Green Interior: Dark Slate Gray Body: Convertible Transmission: Automatic Engine: 6 Cylinder ⚡⚡️⚡⚡️⚡️⚡️⚡️⚡️⚡️⚡⚡️⚡️ CALL/TEXT (505) 431-4233⚡️⚡️⚡️⚡️⚡️⚡️⚡️⚡️⚡️⚡️⚡⚡️ 🚗 🚕 🚙 🚌 🚎 🏎 - AFFORDABLE AUTO SALES - QUALITY RELIABLE CARS - 🚒 🚐 🚚 🚛 🚜 🏍 ✅ DRIVE AWAY TODAY AT A GREAT CASH PRICE!!!!! ☮️ EASY FINANCING FOR ALL CREDIT SITUATIONS - APPROVED! ☮️ WE EVEN OFFER BUY HERE PAY HERE - WITH APPROVED DOWN PAYMENT! 💥💥💥 FOR FAST PRE-APPROVALS GO TO: 💥💥💥 https://ssl-wwwaffordableautosalesnmcom.dcs-cms.com/creditapp.aspx ☮️ TRADE-INS WELCOME, WE BUY USED CARS! ☮️ VEHICLE HISTORY REPORTS!! CARFAX! ☎️ SE HABLA ESPAÑOL! MAL CREDITO, SIN CREDITO -- NO PROBLEMAS! LLAME O TEXTO (702) 503-2373 ↪️ FOR MORE ADS BY THIS DEALER GO TO: https://albuquerque.craigslist.org/search/ctd?query=affordable+auto+sales More vehicle details: affordable-auto-sales-albuquerque.hammerwebsites.net/v/qTRRfScX Address: 4525 Lomas Blvd NE Albuquerque, NM 87110 Phone: (505) 431-4233 Website: www.affordableautosalesnm.com 📲 ☎️ Call or text (505) 431-4233 for quick answers to your questions about this Chrysler Sebring Your message will always be answered by a real human — never an automated system. Disclaimer: Affordable Auto Sales Albuquerque will never sell, share, or spam your mobile number. Standard text messaging rates may apply. All vehicles are subject to prior sale. We reserve the right to make changes without notice and are not responsible for error or omissions. All prices exclude government fees and taxes, any finance charges and any emissions test charge. It is responsibility of the buyer to call and verify all information in this ad.\r \r ALL PRICES POSTED ARE CASH PRICES ,,, FINANCING IS AVAILABLE,,, ADDITIONAL FEES MAY APPLY\r \r \r In house financing is only available on select vehicles. See dealer for details 2006 Chrysler Sebring GTC Convertible 6828bfd01ab24e88a067a9c2d09416c6 1
Very solid 2001 Ford expedition Eddie bower ed. Great tranny and powerhouse. Sun roof has issues. Duck tape works at pleez. Daily driver but need cash badly. Walk or downgrade . Gotta have 650. No bickering. 650 or keep. No title but not stolen. Have reg. Tag., insur. In name now. Title issues get troublesome. With title it's worth 2500 easy cash. show contact info 1
2009 Buick LaCrosse CXL 133k miles Leather Heated seats Dual power seats Still inspected through 2021 and will have zero issues passing again No engine light or issues Runs and drives smooth and quiet AC is ice cold Heat blows hot for the unfortunate upcoming winter Tires are good Brakes are good No pulsating in the pedal or noises No suspension clunks or noises This car runs and drives 100% Bring $3700 and drive it away 1
2014 Subaru XV Crosstrek Premium Huebner Chevrolet Subaru330-915-31751155 Canton RD NWCarrollton, OH 44615 Price: $15,000Mileage: 73,515Stock #: 64628A Type: UsedTrim: PremiumBodystyle: HatchbackTransmission: Continuously VariableDrive: AWDEngine: 2.0L 4-cylFuel Type: Regular GasolineExterior Color: Crystal Black SilicaInterior Color: IvorySeats: 5 --- LOW MILES!! --- AWD --- NAVIGATION!! --- BACKUP CAMERA --- BLUETOOTH --- CVT - 33 MPG!!! --- 17 ALLOY WHEELS --- FULL POWER ACCESSORIES --- STEERING WHEEL CONTROLS --- HEATED SEATS --- REAR SPOILER --- FOG LIGHTS --- ROOF RAILS --- UNIVERSAL HOME REMOTE ---We've just taken in an ADVENTUROUS, XV Crosstrek 2.0i Premium with just 73,542 miles. The beautiful, Crystal Black Silica exterior and the Ivory cloth interior are both in very nice shape.This 2.0i XV Crosstrek Premium comes with the 2.0L engine, paired with the Lineartronic CVT transmission, and AWD. When equipped with the CVT, the transmission shifts seamlessly through fixed ratios for optimum engine control, getting you improved fuel economy rates of 33 MPG! Plus, Subaru is a leader in All-Wheel-Drive technology so the Crosstrek boasts one of the best systems in it's class!You get some really great features on the XV Crosstrek, like heated seats, premium 17 charcoal and silver alloy wheels, fog lights, roof rails, heated side mirrors, power accessories, cruise control and Bluetooth connectivity for your phone and audio streaming. For sound, you gain a 6-Speaker Audio System that features a CD player, USB port and an iPod/MP3 audio input.If you want fun and versatility all rolled into one small package, then this sharp looking, adventurous 2014 XV Crosstrek Premium is just what you've been looking for!! Here at HUEBNER CHEVROLET SUBARU we make sure to put every used vehicle through an extensive 100 Point Safety Inspection so you can buy and drive with peace of mind. Come in for a no pressure test drive today! http://www.autofusion.com/AF2/vdp/47098108.html 1
2007 Nissan Maxima Very Strong Engine Still Running and Driving, Heated Seats, Touchscreen Radio, Bluetooth, Sunroof, Backup Camera, Blind Spot Camera, Leather Seats, Power Seats, Reclining Seats in the Back, Ice Cold Ac, Push to Start, Remote Key, Keyless Entry, Mini Fridge/cooler, Power Windows, Runs Fine No Issues Ac Blows Cold, Leather Seats Moonroof Tinted Windows Bluetooth & Aux Cord, Excellent Condition Second Owner, Original Low Miles, New Tires, All Power, 6 Cd Player, Sirius Radio, a/c and Heater Works Excellent and It's a Very Clean Truck Inside and Outside. 1
Name: description, Length: 275765, dtype: int64
Series([], Name: county, dtype: int64)
ca 45798
fl 27413
tx 23411
ny 18186
or 17312
oh 16920
mi 15995
nc 15179
wa 14260
wi 12744
pa 12376
co 11996
tn 10722
va 10580
nj 10159
il 10143
id 9482
ia 9005
mn 8965
az 8353
mt 7580
ma 7548
ok 6481
sc 6470
in 6028
ga 6012
ks 5875
al 5232
ct 5082
mo 4915
ky 4662
md 4468
ar 4320
nm 4151
ak 3973
nv 3081
hi 2969
nh 2958
me 2941
dc 2936
la 2935
vt 2588
ri 2530
ne 2087
sd 1756
ms 1399
ut 1039
wy 844
de 839
wv 693
nd 466
Name: state, dtype: int64
33.7792 4259
43.1824 2769
46.2348 2026
40.2043 1665
40.4688 1558
...
41.8691 1
29.6278 1
32.6157 1
58.3605 1
44.0055 1
Name: lat, Length: 41175, dtype: int64
-84.4118 4259
-84.1122 2767
-119.1280 2071
-74.1779 1665
-74.2817 1558
...
-77.4883 1
-94.7044 1
-84.3864 1
-72.5057 1
-88.2199 1
Name: long, Length: 41492, dtype: int64
df= df.drop(columns=['id','url', 'region_url', 'vin', 'image_url', 'description', 'lat', 'long','county','region'], axis=1)
df.head()
| price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | drive | size | type | paint_color | state | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6995 | 2000.0 | gmc | new sierra 1500 | excellent | 8 cylinders | gas | 167783.0 | clean | automatic | 4wd | NaN | NaN | red | mn |
| 1 | 8750 | 2013.0 | hyundai | sonata | excellent | 4 cylinders | gas | 90821.0 | clean | automatic | fwd | NaN | NaN | grey | mn |
| 2 | 10900 | 2013.0 | toyota | prius | good | 4 cylinders | hybrid | 92800.0 | clean | automatic | fwd | NaN | NaN | blue | ct |
| 3 | 12500 | 2003.0 | mitsubishi | lancer | good | 4 cylinders | gas | NaN | clean | manual | 4wd | mid-size | sedan | grey | nm |
| 4 | 16995 | 2007.0 | gmc | sierra classic 2500hd | good | 8 cylinders | diesel | 254217.0 | clean | automatic | 4wd | full-size | truck | white | mn |
df.describe()
| price | year | odometer | |
|---|---|---|---|
| count | 4.238570e+05 | 328743.000000 | 2.705850e+05 |
| mean | 5.779437e+04 | 2009.827358 | 9.942578e+04 |
| std | 1.002581e+07 | 10.135809 | 1.204833e+05 |
| min | 0.000000e+00 | 1900.000000 | 0.000000e+00 |
| 25% | 4.499000e+03 | 2007.000000 | 4.432400e+04 |
| 50% | 9.995000e+03 | 2012.000000 | 9.174900e+04 |
| 75% | 1.899500e+04 | 2016.000000 | 1.372370e+05 |
| max | 3.808256e+09 | 2021.000000 | 1.000000e+07 |
Lets see if any outliers in Target variable as we remove them the model can be more accurate because they cause a bring a lot of difference in values of mean and SD.
the difference between 75% value and max value is too large so lets leave 10% values at both ends of a distribution
rr=sorted(df["price"])
quantile1, quantile3= np.percentile(rr,[10,90])
print(quantile1,quantile3)
389.6000000000058 31500.0
df=df[(df.price < 27500) & (df.price >= 500 )]
df.shape
(322693, 15)
r=sorted(df["odometer"])
r
[137.0,
45431.0,
48932.0,
49858.0,
64515.0,
76202.0,
77200.0,
79160.0,
81900.0,
86942.0,
88000.0,
89156.0,
89875.0,
90000.0,
90112.0,
90821.0,
91840.0,
92800.0,
94800.0,
102000.0,
102462.0,
102825.0,
103000.0,
115000.0,
116711.0,
124995.0,
125000.0,
126207.0,
128000.0,
135000.0,
138588.0,
142835.0,
143528.0,
148334.0,
150959.0,
167783.0,
nan,
108124.0,
145955.0,
152415.0,
159000.0,
160600.0,
170259.0,
170684.0,
176386.0,
178054.0,
178108.0,
181986.0,
184817.0,
188406.0,
196913.0,
200000.0,
208016.0,
210865.0,
223470.0,
254217.0,
271000.0,
309621.0,
nan,
nan,
nan,
299000.0,
nan,
nan,
8000.0,
36000.0,
53300.0,
62000.0,
65000.0,
68559.0,
73000.0,
82000.0,
91806.0,
94448.0,
100460.0,
110000.0,
113000.0,
130000.0,
143700.0,
156559.0,
167828.0,
186200.0,
229937.0,
nan,
85000.0,
176000.0,
204496.0,
230500.0,
242613.0,
262000.0,
nan,
nan,
208453.0,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
196000.0,
260607.0,
nan,
500.0,
15414.0,
38000.0,
50000.0,
56289.0,
58764.0,
63000.0,
64596.0,
76164.0,
85122.0,
85556.0,
87000.0,
87742.0,
92000.0,
104897.0,
104898.0,
105000.0,
105880.0,
108869.0,
115000.0,
116751.0,
119592.0,
123000.0,
125147.0,
148212.0,
152000.0,
153000.0,
158000.0,
160025.0,
165000.0,
192420.0,
nan,
179804.0,
nan,
216800.0,
nan,
165000.0,
193000.0,
nan,
nan,
nan,
35224.0,
40629.0,
50273.0,
82740.0,
103000.0,
105000.0,
113400.0,
124000.0,
140000.0,
141118.0,
152000.0,
195413.0,
nan,
nan,
nan,
111563.0,
144000.0,
147893.0,
155000.0,
158000.0,
192000.0,
251000.0,
nan,
144263.0,
172000.0,
187000.0,
195756.0,
210000.0,
nan,
nan,
nan,
nan,
168000.0,
181000.0,
223000.0,
250000.0,
nan,
238000.0,
319000.0,
nan,
1699.0,
27000.0,
32428.0,
53646.0,
66782.0,
74408.0,
78507.0,
86643.0,
93559.0,
104033.0,
109527.0,
119598.0,
131164.0,
142055.0,
170623.0,
nan,
142763.0,
nan,
nan,
22289.0,
28269.0,
50776.0,
73766.0,
74273.0,
79091.0,
80242.0,
90822.0,
102783.0,
110346.0,
130348.0,
134177.0,
176580.0,
nan,
6988.0,
13479.0,
32276.0,
40234.0,
40533.0,
43400.0,
48900.0,
51636.0,
55268.0,
58112.0,
71416.0,
77000.0,
79198.0,
80566.0,
84411.0,
93060.0,
96000.0,
97540.0,
99898.0,
103369.0,
110000.0,
112103.0,
112447.0,
116022.0,
120000.0,
120940.0,
127411.0,
145996.0,
146660.0,
150778.0,
152000.0,
161201.0,
178318.0,
179000.0,
185423.0,
187867.0,
196611.0,
198117.0,
198334.0,
nan,
nan,
nan,
nan,
nan,
nan,
19734.0,
21449.0,
35677.0,
37885.0,
43961.0,
66850.0,
72530.0,
75771.0,
77945.0,
79639.0,
100761.0,
105000.0,
110400.0,
132000.0,
133000.0,
134041.0,
138000.0,
149273.0,
151000.0,
154025.0,
159147.0,
160000.0,
178000.0,
179856.0,
nan,
nan,
nan,
161776.0,
164030.0,
nan,
nan,
nan,
nan,
220000.0,
nan,
nan,
nan,
nan,
13848.0,
21800.0,
46145.0,
47055.0,
54062.0,
65748.0,
90261.0,
91001.0,
93988.0,
95000.0,
97426.0,
100694.0,
102000.0,
103240.0,
103600.0,
105311.0,
109210.0,
110380.0,
113000.0,
113000.0,
122269.0,
126000.0,
134000.0,
nan,
nan,
118587.0,
141690.0,
145155.0,
190000.0,
200000.0,
200245.0,
215000.0,
nan,
nan,
nan,
121365.0,
168000.0,
174599.0,
179942.0,
290000.0,
nan,
nan,
nan,
nan,
299.0,
17767.0,
23595.0,
24335.0,
26704.0,
30628.0,
35322.0,
35500.0,
35650.0,
45024.0,
67500.0,
68000.0,
68670.0,
69618.0,
75346.0,
80275.0,
84550.0,
85000.0,
94877.0,
95500.0,
99000.0,
102270.0,
102830.0,
106000.0,
110000.0,
111090.0,
112000.0,
114000.0,
116435.0,
121352.0,
128929.0,
131000.0,
136000.0,
140000.0,
143000.0,
155000.0,
156000.0,
157000.0,
158000.0,
159000.0,
161626.0,
253000.0,
nan,
nan,
nan,
nan,
140000.0,
183000.0,
nan,
nan,
nan,
nan,
131000.0,
134523.0,
170000.0,
174000.0,
nan,
22432.0,
26000.0,
27000.0,
28777.0,
35623.0,
36442.0,
38475.0,
40574.0,
43708.0,
44026.0,
45000.0,
49615.0,
69000.0,
78000.0,
82000.0,
84000.0,
94025.0,
95000.0,
105131.0,
106182.0,
106182.0,
107500.0,
113000.0,
120000.0,
125000.0,
138000.0,
nan,
nan,
nan,
nan,
nan,
nan,
139900.0,
147000.0,
165000.0,
nan,
nan,
38579.0,
96857.0,
nan,
8805.0,
21000.0,
38475.0,
46000.0,
69761.0,
78869.0,
80945.0,
80945.0,
92993.0,
96000.0,
96857.0,
101758.0,
115000.0,
120000.0,
123212.0,
127146.0,
128518.0,
128647.0,
130000.0,
133000.0,
133183.0,
134000.0,
135362.0,
141000.0,
144000.0,
149000.0,
156201.0,
156940.0,
164715.0,
170000.0,
182000.0,
182000.0,
192399.0,
197000.0,
nan,
nan,
nan,
15776.0,
16000.0,
22501.0,
60952.0,
73079.0,
79307.0,
82000.0,
97000.0,
100755.0,
105000.0,
114000.0,
116998.0,
123000.0,
125770.0,
127000.0,
128000.0,
128000.0,
129000.0,
130469.0,
133000.0,
135000.0,
135323.0,
138095.0,
143376.0,
145000.0,
149260.0,
160016.0,
162000.0,
170000.0,
172170.0,
177000.0,
189000.0,
189723.0,
190610.0,
203000.0,
nan,
nan,
197000.0,
1655000.0,
nan,
nan,
200500.0,
246000.0,
nan,
164000.0,
172000.0,
nan,
nan,
nan,
162000.0,
234000.0,
nan,
nan,
nan,
228000.0,
nan,
nan,
nan,
nan,
99800.0,
129000.0,
157000.0,
160000.0,
nan,
15822.0,
18680.0,
31322.0,
38109.0,
38287.0,
42303.0,
42703.0,
42819.0,
52247.0,
56634.0,
62922.0,
64262.0,
67132.0,
68559.0,
76301.0,
76867.0,
78000.0,
83000.0,
86865.0,
87875.0,
91806.0,
92483.0,
95195.0,
98000.0,
109000.0,
115048.0,
118369.0,
123000.0,
124000.0,
130055.0,
142618.0,
149553.0,
150000.0,
158500.0,
164000.0,
169700.0,
186000.0,
nan,
17831.0,
24000.0,
37403.0,
44000.0,
51900.0,
54040.0,
56289.0,
56544.0,
57900.0,
64669.0,
74875.0,
75100.0,
78099.0,
79900.0,
96000.0,
102903.0,
105567.0,
122000.0,
124000.0,
131500.0,
135000.0,
145500.0,
147318.0,
148500.0,
153741.0,
167882.0,
170000.0,
172817.0,
174000.0,
186709.0,
188250.0,
192223.0,
198895.0,
198900.0,
206000.0,
220000.0,
237600.0,
238000.0,
277000.0,
nan,
nan,
nan,
76373.0,
139836.0,
nan,
nan,
nan,
12783.0,
22909.0,
24586.0,
38437.0,
55802.0,
58098.0,
58613.0,
58893.0,
64809.0,
72095.0,
92367.0,
101270.0,
128000.0,
167160.0,
nan,
64982.0,
76074.0,
84085.0,
97846.0,
109926.0,
124000.0,
138154.0,
140123.0,
140290.0,
142786.0,
159219.0,
168000.0,
169072.0,
176617.0,
194000.0,
200000.0,
nan,
nan,
nan,
4556.0,
8473.0,
10824.0,
25277.0,
25739.0,
31908.0,
38018.0,
50000.0,
50670.0,
58828.0,
74662.0,
87000.0,
91601.0,
94000.0,
99762.0,
101827.0,
101856.0,
103000.0,
120000.0,
144000.0,
153342.0,
195230.0,
nan,
175325.0,
nan,
108260.0,
110000.0,
130000.0,
180000.0,
191206.0,
206000.0,
1000000.0,
nan,
nan,
nan,
nan,
4356.0,
10213.0,
13147.0,
18416.0,
25277.0,
26237.0,
29711.0,
37990.0,
48789.0,
49539.0,
60137.0,
70200.0,
70206.0,
73409.0,
73988.0,
75680.0,
84483.0,
86568.0,
89329.0,
96200.0,
100455.0,
105593.0,
110365.0,
114299.0,
117000.0,
120298.0,
121005.0,
138365.0,
154000.0,
155000.0,
156008.0,
159000.0,
163000.0,
177000.0,
188000.0,
192000.0,
197000.0,
224000.0,
248000.0,
nan,
nan,
237000.0,
nan,
270330.0,
nan,
32752.0,
41000.0,
69110.0,
71455.0,
72869.0,
74057.0,
nan,
49880.0,
77139.0,
83100.0,
88652.0,
90095.0,
94280.0,
128322.0,
129094.0,
161000.0,
nan,
87828.0,
160000.0,
167384.0,
nan,
nan,
nan,
nan,
nan,
10700.0,
34600.0,
131000.0,
144000.0,
182000.0,
nan,
221000.0,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
0.0,
14190.0,
15000.0,
21007.0,
31286.0,
36824.0,
36990.0,
37932.0,
51728.0,
54269.0,
57308.0,
59097.0,
64507.0,
67336.0,
69245.0,
71032.0,
73012.0,
75036.0,
78470.0,
80559.0,
92481.0,
99220.0,
115000.0,
125000.0,
136000.0,
139000.0,
142718.0,
145000.0,
147000.0,
178174.0,
184285.0,
500000.0,
nan,
nan,
177558.0,
178045.0,
180000.0,
233840.0,
nan,
nan,
nan,
15822.0,
18120.0,
18309.0,
20270.0,
25577.0,
26139.0,
27057.0,
27637.0,
32639.0,
32752.0,
33924.0,
34933.0,
38287.0,
39524.0,
40142.0,
42703.0,
58098.0,
66199.0,
82000.0,
131456.0,
nan,
nan,
nan,
44000.0,
82854.0,
87791.0,
119000.0,
120432.0,
125103.0,
nan,
nan,
130066.0,
133280.0,
143000.0,
145331.0,
178000.0,
217000.0,
228000.0,
nan,
42739.0,
56593.0,
110004.0,
nan,
nan,
nan,
28598.0,
29147.0,
29266.0,
32336.0,
41546.0,
49683.0,
50067.0,
52926.0,
52985.0,
58428.0,
63599.0,
70579.0,
72664.0,
98000.0,
99129.0,
100648.0,
102917.0,
109000.0,
119000.0,
128000.0,
134000.0,
148865.0,
152000.0,
158000.0,
165000.0,
179296.0,
nan,
150795.0,
182000.0,
183015.0,
192420.0,
nan,
172800.0,
nan,
nan,
23315.0,
26354.0,
29609.0,
32841.0,
34341.0,
34449.0,
44000.0,
49189.0,
53615.0,
54297.0,
55000.0,
56136.0,
60421.0,
69000.0,
89000.0,
94673.0,
97604.0,
102000.0,
109391.0,
109422.0,
116000.0,
126000.0,
133465.0,
135723.0,
140000.0,
141731.0,
142951.0,
145750.0,
158335.0,
159000.0,
182000.0,
199412.0,
205000.0,
nan,
249750.0,
nan,
163000.0,
177364.0,
nan,
1.0,
7700.0,
11675.0,
19775.0,
22332.0,
28418.0,
32061.0,
40000.0,
46400.0,
59902.0,
64000.0,
65000.0,
66098.0,
69000.0,
82814.0,
83586.0,
101635.0,
116154.0,
134117.0,
139415.0,
141773.0,
143157.0,
143267.0,
144370.0,
152946.0,
153796.0,
156741.0,
162919.0,
179200.0,
179854.0,
185326.0,
189600.0,
222000.0,
224139.0,
234000.0,
nan,
247181.0,
nan,
nan,
124.0,
126.0,
131.0,
138.0,
146.0,
4472.0,
8025.0,
8697.0,
18137.0,
29490.0,
38000.0,
60678.0,
61007.0,
64541.0,
70000.0,
74000.0,
77022.0,
90406.0,
124479.0,
...]
df["odometer"].isna().sum()
117593
ax = sns.scatterplot(x="odometer", y="price", data=df)
df["odometer"].max()
10000000.0
df.drop(df[df["odometer"]==64809218.0].index,inplace=True)
df.drop(df[df["odometer"]==0.0].index,inplace=True)
ax = sns.scatterplot(x="odometer", y="price", data=df)
ax.get_xaxis().get_major_formatter().set_scientific(False)
ax.get_yaxis().get_major_formatter().set_scientific(False)
df["odometer"].isna().sum()
0
df=df[(df.odometer < 3000000)]
ax = sns.scatterplot(x="odometer", y="price", data=df)
ax.get_xaxis().get_major_formatter().set_scientific(False)
Now lets see year column
df["odometer"].isna().sum()
0
df["year"].isna().sum()
471
the null values in a year column cannot be replaced so lets eliminate them
df["year"].min()
1919.0
we must alo drop this 0 values
df.drop(df[df["year"]==0.0].index,inplace=True)
df=df.dropna(subset=['year'])
bx = sns.scatterplot(x="year", y="price", data=df)
df=df[(df.year > 1940)]
we have removed outliers from all three numerical columns . now lets how we can remove with nan values if any.
I am taking out % of null values in each column
null_values=df.isnull().sum()
null_values=pd.DataFrame(null_values,columns=['null'])
j=1
sum_tot=len(df)
null_values['percent']=null_values['null']/sum_tot
round(null_values*100,3).sort_values('percent',ascending=False)
| null | percent | |
|---|---|---|
| size | 12960400 | 63.617 |
| condition | 7777500 | 38.177 |
| cylinders | 6546200 | 32.133 |
| paint_color | 4872100 | 23.915 |
| drive | 4708300 | 23.111 |
| type | 3891800 | 19.103 |
| manufacturer | 623500 | 3.061 |
| model | 157000 | 0.771 |
| fuel | 144800 | 0.711 |
| title_status | 112100 | 0.550 |
| transmission | 82700 | 0.406 |
| price | 0 | 0.000 |
| year | 0 | 0.000 |
| odometer | 0 | 0.000 |
| state | 0 | 0.000 |
df.condition.value_counts()
excellent 60679
good 48590
like new 11258
fair 4796
new 321
salvage 305
Name: condition, dtype: int64
excellent_odo_mean = df[df['condition'] == 'excellent']['odometer'].mean()
good_odo_mean = df[df['condition'] == 'good']['odometer'].mean()
like_new_odo_mean = df[df['condition'] == 'like new']['odometer'].mean()
salvage_odo_mean = df[df['condition'] == 'salvage']['odometer'].mean()
fair_odo_mean = df[df['condition'] == 'fair']['odometer'].mean()
print('Like new average odometer:', round( like_new_odo_mean,2))
print('Excellent average odometer:', round( excellent_odo_mean,2))
print('Good average odometer:', round( good_odo_mean,2))
print('Fair average odometer:', round( fair_odo_mean,2))
print('Salvage average odometer:', round( salvage_odo_mean,2))
Like new average odometer: 85098.04
Excellent average odometer: 112103.1
Good average odometer: 121570.61
Fair average odometer: 182986.37
Salvage average odometer: 168911.82
df.loc[df.year>=2019, 'condition'] = df.loc[df.year>=2019, 'condition'].fillna('new')
df.loc[df['odometer'] <= like_new_odo_mean, 'condition'] = df.loc[df['odometer'] <= like_new_odo_mean, 'condition'].fillna('like new')
df.loc[df['odometer'] >= fair_odo_mean, 'condition'] = df.loc[df['odometer'] >= fair_odo_mean, 'condition'].fillna('fair')
df.loc[((df['odometer'] > good_odo_mean) &
(df['odometer'] <= excellent_odo_mean)), 'condition'] = df.loc[((df['odometer'] > good_odo_mean) &
(df['odometer'] <= excellent_odo_mean)), 'condition'].fillna('excellent')
df.loc[((df['odometer'] > like_new_odo_mean) &
(df['odometer'] <= good_odo_mean)), 'condition'] = df.loc[((df['odometer'] > like_new_odo_mean) &
(df['odometer'] <= good_odo_mean)), 'condition'].fillna('good')
df.loc[((df['odometer'] > good_odo_mean) &
(df['odometer'] <= fair_odo_mean)), 'condition'] = df.loc[((df['odometer'] > good_odo_mean) &
(df['odometer'] <= fair_odo_mean)), 'condition'].fillna('salvage')
null_values=df.isnull().sum()
null_values=pd.DataFrame(null_values,columns=['null'])
j=1
sum_tot=len(df)
null_values['percent']=null_values['null']/sum_tot
round(null_values*100,3).sort_values('percent',ascending=False)
| null | percent | |
|---|---|---|
| size | 12960400 | 63.617 |
| cylinders | 6546200 | 32.133 |
| paint_color | 4872100 | 23.915 |
| drive | 4708300 | 23.111 |
| type | 3891800 | 19.103 |
| manufacturer | 623500 | 3.061 |
| model | 157000 | 0.771 |
| fuel | 144800 | 0.711 |
| title_status | 112100 | 0.550 |
| transmission | 82700 | 0.406 |
| price | 0 | 0.000 |
| year | 0 | 0.000 |
| condition | 0 | 0.000 |
| odometer | 0 | 0.000 |
| state | 0 | 0.000 |
df=df.dropna(subset=['title_status','fuel','transmission','model','manufacturer'])
df=df.drop(["size"],axis=1)
null_values=df.isnull().sum()
null_values=pd.DataFrame(null_values,columns=['null'])
j=1
sum_tot=len(df)
null_values['percent']=null_values['null']/sum_tot
round(null_values*100,3).sort_values('percent',ascending=False)
| null | percent | |
|---|---|---|
| cylinders | 6176000 | 31.960 |
| paint_color | 4617300 | 23.894 |
| drive | 4431900 | 22.934 |
| type | 3672300 | 19.004 |
| price | 0 | 0.000 |
| year | 0 | 0.000 |
| manufacturer | 0 | 0.000 |
| model | 0 | 0.000 |
| condition | 0 | 0.000 |
| fuel | 0 | 0.000 |
| odometer | 0 | 0.000 |
| title_status | 0 | 0.000 |
| transmission | 0 | 0.000 |
| state | 0 | 0.000 |
df['paint_color'] = df['paint_color'].fillna(method='ffill')
df['drive'] = df['drive'].fillna(method='ffill')
df['type'] = df['type'].fillna(method='ffill')
df['cylinders'] = df['cylinders'].fillna(method='ffill')
df.isnull().sum()
price 0
year 0
manufacturer 0
model 0
condition 0
cylinders 0
fuel 0
odometer 0
title_status 0
transmission 0
drive 0
type 3
paint_color 0
state 0
dtype: int64
df=df.dropna(subset=['cylinders','drive'])
df.isnull().sum()
price 0
year 0
manufacturer 0
model 0
condition 0
cylinders 0
fuel 0
odometer 0
title_status 0
transmission 0
drive 0
type 3
paint_color 0
state 0
dtype: int64
from scipy import stats
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0xc734d400>
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 193243 entries, 0 to 423856
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 193243 non-null int64
1 year 193243 non-null float64
2 manufacturer 193243 non-null object
3 model 193243 non-null object
4 condition 193243 non-null object
5 cylinders 193243 non-null object
6 fuel 193243 non-null object
7 odometer 193243 non-null float64
8 title_status 193243 non-null object
9 transmission 193243 non-null object
10 drive 193243 non-null object
11 type 193240 non-null object
12 paint_color 193243 non-null object
13 state 193243 non-null object
dtypes: float64(2), int64(1), object(11)
memory usage: 22.1+ MB
c=df.columns
for i in c:
print(df[i].value_counts())
7995 2595
8995 2523
6995 2386
9995 2372
5995 2241
...
26699 1
16462 1
10435 1
6341 1
18441 1
Name: price, Length: 8905, dtype: int64
2017.0 15378
2013.0 14735
2014.0 13842
2012.0 13140
2011.0 12678
...
1958.0 8
1954.0 7
1947.0 3
1952.0 3
1942.0 1
Name: year, Length: 78, dtype: int64
ford 33184
chevrolet 26703
toyota 16707
honda 13674
nissan 12246
jeep 8519
dodge 7617
gmc 6492
hyundai 6423
subaru 6093
volkswagen 5306
bmw 5183
kia 5113
ram 4481
mercedes-benz 4094
chrysler 4082
buick 3251
mazda 3008
cadillac 2760
lexus 2630
audi 2071
acura 1995
lincoln 1734
infiniti 1487
pontiac 1474
volvo 1374
mitsubishi 1340
mini 1030
mercury 938
saturn 779
rover 470
jaguar 415
fiat 410
harley-davidson 68
alfa-romeo 42
datsun 33
tesla 7
land rover 7
porche 2
morgan 1
Name: manufacturer, dtype: int64
f-150 2775
escape 2394
camry 2043
silverado 1500 1947
altima 1825
...
sportage lx awd suv 1
benz e 300 1
s350 1
romeo giulia turbo sport 1
sedona lx lx w/uvo, 1
Name: model, Length: 14357, dtype: int64
good 62848
excellent 58189
like new 37501
salvage 19682
fair 10574
new 4449
Name: condition, dtype: int64
4 cylinders 77762
6 cylinders 69569
8 cylinders 42229
5 cylinders 1934
10 cylinders 792
other 476
3 cylinders 366
12 cylinders 115
Name: cylinders, dtype: int64
gas 176996
diesel 7707
other 5240
hybrid 2827
electric 473
Name: fuel, dtype: int64
150000.0 647
140000.0 617
160000.0 601
130000.0 597
170000.0 563
...
232100.0 1
166563.0 1
83281.0 1
3764.0 1
113853.0 1
Name: odometer, Length: 71104, dtype: int64
clean 185464
rebuilt 4332
salvage 2131
lien 1130
missing 138
parts only 48
Name: title_status, dtype: int64
automatic 175068
manual 12131
other 6044
Name: transmission, dtype: int64
fwd 89763
4wd 71858
rwd 31622
Name: drive, dtype: int64
sedan 60245
SUV 53032
truck 14423
pickup 13544
coupe 9359
other 9192
hatchback 9128
wagon 7980
van 6410
mini-van 5156
convertible 4293
offroad 331
bus 147
Name: type, dtype: int64
white 45381
black 37263
silver 31001
blue 21752
grey 20148
red 19419
custom 5686
green 5277
brown 4388
yellow 1285
orange 1055
purple 588
Name: paint_color, dtype: int64
ca 16791
fl 12208
mi 9402
oh 9401
ny 8663
tx 8296
wi 7566
nc 6618
pa 6521
or 5775
va 5765
nj 5720
tn 5492
co 5438
ia 5064
wa 4382
id 4239
ma 4048
il 4039
mn 4020
mt 3971
in 3457
ct 3073
sc 2858
ok 2735
al 2688
ky 2667
ks 2646
md 2496
ak 2408
az 2401
mo 2292
ga 2206
ar 1729
nm 1556
nh 1502
nv 1481
ri 1440
vt 1365
me 1340
dc 1293
la 1195
hi 894
sd 875
ne 722
ms 697
ut 515
de 424
wy 326
nd 275
wv 268
Name: state, dtype: int64
ax = sns.barplot(x="condition", y="price", data=df)
Clearly vehicles with condition new has highest price as one expects.
sns.catplot(y="cylinders", x="price",kind = "violin", data=df)
<seaborn.axisgrid.FacetGrid at 0xc734de48>
sns.catplot(x="fuel", y="price", kind="boxen",
data=df)
<seaborn.axisgrid.FacetGrid at 0xc7ad9dd8>
sns.catplot(x="title_status", y="price",kind="violin", data=df)
<seaborn.axisgrid.FacetGrid at 0xc930efd0>
sns.catplot(x="transmission", y="price",kind="bar", palette="ch:.25", data=df)
<seaborn.axisgrid.FacetGrid at 0xd0bdc2e8>
The relation between price and transmission.
sns.violinplot(x=df.drive, y=df.price);
There doesnt seem too be much difference between the first 2 types of drives .
The third one is a bit different.
sns.catplot(y="type", x="price",kind="boxen", data=df);
sns.catplot(y="paint_color", x="price",kind="violin", data=df);
sns.catplot(y="manufacturer", x="price",kind="box", data=df);
from sklearn import preprocessing
import pandas as pd
le = preprocessing.LabelEncoder()
df.columns
Index(['price', 'year', 'manufacturer', 'model', 'condition', 'cylinders',
'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type',
'paint_color', 'state'],
dtype='object')
df[['manufacturer', 'model', 'condition',
'cylinders', 'fuel', 'title_status', 'transmission',
'drive', 'type', 'paint_color', 'state']]=df[['manufacturer', 'model', 'condition',
'cylinders', 'fuel', 'title_status', 'transmission',
'drive', 'type', 'paint_color', 'state']].apply(le.fit_transform)
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
y= df.price
X= df.drop('price',axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
(154594, 13) (154594,)
(38649, 13) (38649,)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
from sklearn import metrics
from sklearn.metrics import mean_squared_error as MSE
print('Mean Absolute Error:', round(metrics.mean_absolute_error(y_test, y_pred),2))
print('Mean Squared Error:', round(metrics.mean_squared_error(y_test, y_pred),2))
print('Root Mean Squared Error:', round(np.sqrt(metrics.mean_squared_error(y_test, y_pred)),2))
Mean Absolute Error: 1726.32
Mean Squared Error: 7435213.35
Root Mean Squared Error: 2726.76
d = pd.DataFrame()
d["test_y"] = y_test
d["y_pred"] = y_pred
#Mape with formula
d["mp"] = abs((d["test_y"]- d["y_pred"])/d["test_y"])
(d.mp.mean())*100#mape
29.507433494326758