ããŒã¿ãåæãããšãã¯ãæ£ããããŒã¿åã§ããŒã¿ãæ±ããªããšãäºæããªãçµæããšã©ãŒãåºãå¯èœæ§ããããŸãã
Pythonã§ããŒã¿åæããããšãã¯Pandasããã䜿ããŸãããPandasã§ã¯èªã¿èŸŒãã ããŒã¿ã®åãèªåã§æšå®ããŠãããŸãããã®ãšãã®ç²ŸåºŠã¯é«ããåæã®æç¹ã§ããããªããŒã¿åãšããŠèªã¿èŸŒãŸããããšã¯ããŸããªããšæããŸãã
ã§ãããåæããã»ã¹ã®éäžã§ããŒã¿åã®å€æãå¿ èŠã«ãªããšããã±ãŒã¹ã¯çšã§ã¯ãããŸããã
ä»åã¯ãPandasã§ã®åºæ¬çãªããŒã¿åãšããŒã¿åã®å€ææ¹æ³ãå ·äœçãªããŒã¿ã䜿ããªããèŠãŠããããšæããŸãã
Pandasã«ãããããŒã¿å
ããŒã¿åã¯ããã°ã©ãã³ã°èšèªããã®ããŒã¿ãã©ã®ããã«æ ŒçŽãæäœããããèå¥ããããã®å éšæ§é ã§ãã
äŸãã°ã1+1ã®çµæã¯2ãè¿ããŸããã”ããã«ã¡ã¯”+”ã¿ãªãã”ã®çµæ㯔ããã«ã¡ã¯ã¿ãªãã”ãšãªããŸãããã®ãããªåŠçã®éããèå¥ããããã«çšããã®ãããŒã¿åã§ãã
Pandasã§ã¯ä»¥äžã®ãããªããŒã¿åã®çš®é¡ããããŸãã
Pandasã®å | Pythonã®å | çšé |
---|---|---|
object | str or mixed | æååãæååãšæ°å€ã®æ··åå |
int64 | int | æŽæ° |
float64 | float | æµ®åå°æ°ç¹ |
bool | bool | ç/åœ |
datetime64 | NA | æ¥ä»ããã³æå» |
timedelta[ns] | NA | 2ã€ã®æ¥æã®å·® |
category | NA | æååã®ã«ããŽãª |
Pandasã«ãããObjectåã¯æåååãæããŸãããåæã«è€æ°ã®ããŒã¿åãå«ããããšãã§ããŸãã
ãã ãã1ã€ã®åã«è€æ°ã®ããŒã¿åã®å€ãä¿æããŠãããšåæã§å°ãããšã«ãªããŸãïŒåŸè¿°ããŸãïŒã
䜿çšããããŒã¿
ä»å䜿çšããããŒã¿ã¯ã以äžãªã³ã¯ããååŸããCSVãã¡ã€ã«ã䜿çšããŸãã
https://github.com/chris1610/pbpython/tree/master/data
import pandas as pd
df = pd.read_csv("input/sales_data_types.csv")
Customer Number | Customer Name | 2016 | 2017 | Percent Growth | Jan Units | Month | Day | Year | Active | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 10002 | Quest Industries | $125,000.00 | $162,500.00 | 30.00% | 500 | 1 | 10 | 2015 | Y |
1 | 552278 | Smith Plumbing | $920,000.00 | $1,012,000.00 | 10.00% | 700 | 6 | 15 | 2014 | Y |
2 | 23477 | ACME Industrial | $50,000.00 | $62,500.00 | 25.00% | 125 | 3 | 29 | 2016 | Y |
3 | 24900 | Brekke LTD | $350,000.00 | $490,000.00 | 4.00% | 75 | 10 | 27 | 2015 | Y |
4 | 651029 | Harbor Co | $15,000.00 | $12,750.00 | -15.00% | Closed | 2 | 2 | 2014 | N |
å€ã¯å šãŠåãŸã£ãŠãããåæã«ã¯äœ¿ããããªããŒã¿ã§ãã
Pandasã§ã®åŠç
ããŒã¿åã確èªãã
ããŒã¿ãèªã¿èŸŒãã ãšããã§ãããŒã¿ãã¬ãŒã ã®ããŒã¿åã確èªããŸãã
ããŒã¿åã確èªããã«ã¯ãdf.dtypesã䜿ããŸãã
df.dtypes
ããŠãåããŒã¿åã¯ãããããµããããããŒã¿åã«ãªã£ãŠããã®ã§ããããïŒ
- “Customer Number”ã¯intåã®æ¹ããããã
- “2016”ããã³”2017″ã¯æ°å€åãšããŠæ±ããã
- “Percent Growth”ãš”Jan Units”ãæ°å€ãšããŠæ±ããã
- “Month”, “Day”, “Year”ã¯æ¥ä»åã§æ±ããã
- “Active”ã¯YesãNoã®boolåãšããã
ãããªæãã§ãåæããããã®ããŒã¿åãšã¯ãªã£ãŠããªãããã§ãããªã®ã§ãããããç®çã®ããŒã¿åã«å€æããå¿ èŠããããŸãã
ããŒã¿åãå€æããïŒæååâæ°å€ïŒ
Pandasã§ããŒã¿åãå€æããã«ã¯df.astype()ãçšããŸãã
äŸãšããŠã”Customer Number”ãintåã«å€æããŸãã
df['Customer Number'] = df['Customer Number'].astype('int')
print(df.dtypes)
ããŒã¿åãå€æãããŠããããšãããããŸãã
ç®çã®åã«å€æã§ããªããšã
åæ§ã«ããŠãä»ã®åãç®çã®åã«å€æããŠãããŸãã次㯔2016″ãæ°å€åã«å€æããŠã¿ãŸãã
df['2016'] = df['2016'].astype('float')
ãšãããããšããã§ããããšã©ãŒãåºãŠããŸããŸããã
ã©ããããïŒãããïŒããåå ã¿ããã§ããæ°å€ãšããŠæ±ãã«ã¯ããããåãé€ãå¿ èŠãããããã§ãã
æååã®çœ®æã«ã¯replace()ã䜿ããŸãããŸããç°¡æœã«è¡šçŸããããã«lambdaé¢æ°ã䜿çšããŸãã
lambda x: x.replace('$', '').replace(',', '')
ãŸãã”2016″ã®åå šãŠã®é ç®ã«äžèšã®lanbdaé¢æ°ãé©çšããå¿ èŠããããŸããdf.apply()ã䜿ããšãåé ç®ã«åŒæ°ã«æž¡ããé¢æ°ã®åŠçãå®è¡ããããšãã§ããŸãã
ãããŠãæåŸã«floatåã«å€æããŸãã
df['2016'] = df['2016'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')
å€æããããšãã§ããŸãããã¡ãªã¿ã«”2016″ã®å€ã¯ã©ããªã£ãŠããã®ããšãããšã以äžã®ããã«ãªã£ãŠããŸãã
Customer Number | Customer Name | 2016 | 2017 | Percent Growth | Jan Units | Month | Day | Year | Active | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 10002 | Quest Industries | 125000 | $162,500.00 | 30.00% | 500 | 1 | 10 | 2015 | Y |
1 | 552278 | Smith Plumbing | 920000 | $1,012,000.00 | 10.00% | 700 | 6 | 15 | 2014 | Y |
2 | 23477 | ACME Industrial | 50000 | $62,500.00 | 25.00% | 125 | 3 | 29 | 2016 | Y |
3 | 24900 | Brekke LTD | 350000 | $490,000.00 | 4.00% | 75 | 10 | 27 | 2015 | Y |
4 | 651029 | Harbor Co | 15000 | $12,750.00 | -15.00% | Closed | 2 | 2 | 2014 | N |
ããããªæ°å€ã«ãªã£ãŠããããšãããããŸãã
åæ§ã«”2017″ã”Percent Growth”ãæ°å€ã«å€æããŠãããŸãã “Percent Growth” ã¯å ã®çŸåçããå°æ°ã®å€ã®æ¹ãæ±ããããã®ã§100ã§å²ã£ãŠãããŸãã
df['2017'] = df['2017'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')
df['Percent Growth'] = df['Percent Growth'].apply(lambda x: x.replace('%', '')).astype('float') / 100
“Jan Units”ãåæ§ã«æ°å€ã«å€æããŠããããšãããšããšã©ãŒãçºçããŠããŸããŸããæåŸã®é ç®ããClosedããšãªã£ãŠãããæ°å€å€æã§ããªãããã§ãã
ãããã£ãã±ãŒã¹ã§ã¯df.astype()ã§ã¯ãªãpd.to_numeric()ãçšããããšã§ãããæè»ã«å¯Ÿå¿ããããšãã§ããŸããåŒæ°ã«errors=”coerce”ãè¿œå ããããšã§ãå€æã§ããªãé ç®ã¯æ¬ æã«å€æããããšãã§ããŸãã
df['Jan Units'] = pd.to_numeric(df['Jan Units'], errors='coerce')
Customer Number | Customer Name | 2016 | 2017 | Percent Growth | Jan Units | Month | Day | Year | Active | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 10002 | Quest Industries | 125000 | 162500 | 0.3 | 500 | 1 | 10 | 2015 | Y |
1 | 552278 | Smith Plumbing | 920000 | 1012000 | 0.11 | 700 | 6 | 15 | 2014 | Y |
2 | 23477 | ACME Industrial | 50000 | 62500 | 0.25 | 125 | 3 | 29 | 2016 | Y |
3 | 24900 | Brekke LTD | 350000 | 490000 | 0.04 | 75 | 10 | 27 | 2015 | Y |
4 | 651029 | Harbor Co | 15000 | 12750 | -0.15 | nan | 2 | 2 | 2014 | N |
æ¬ æå€ãåæã«ãããŠã¯å¥œãŸãããªãã®ã§ãäœçãåŠçœ®ãããå¿ èŠããããŸããä»åã¯æ¬ æã0ã§çœ®ãæããåŠçãè¿œå ããŸãã
df["Jan Units"] = pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)
Customer Number | Customer Name | 2016 | 2017 | Percent Growth | Jan Units | Month | Day | Year | Active | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 10002 | Quest Industries | 125000 | 162500 | 0.3 | 500 | 1 | 10 | 2015 | Y |
1 | 552278 | Smith Plumbing | 920000 | 1012000 | 0.11 | 700 | 6 | 15 | 2014 | Y |
2 | 23477 | ACME Industrial | 50000 | 62500 | 0.25 | 125 | 3 | 29 | 2016 | Y |
3 | 24900 | Brekke LTD | 350000 | 490000 | 0.04 | 75 | 10 | 27 | 2015 | Y |
4 | 651029 | Harbor Co | 15000 | 12750 | -0.15 | 0 | 2 | 2 | 2014 | N |
ããŒã¿åãå€æããïŒæååâæ¥ä»ïŒ
次㯔Month”ã”Day”ã”Year”ãæ¥ä»åã«å€æããŸãã
æ¥ä»åã«å€æããããã«ã¯ã幎ææ¥ããŸãšããå¿ èŠããããŸãããªã®ã§ãå ã®ããŒã¿ãã¬ãŒã ã®åãæ¥ä»åã«ããã®ã§ã¯ãªããæ°èŠã«”Time”åãäœæããŸãã
æ¥ä»åãžã®å€æ㯠pd.to_datetime ()ãçšããŸãã
df["Time"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
Customer Number | Customer Name | 2016 | 2017 | Percent Growth | Jan Units | Month | Day | Year | Active | Time | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10002 | Quest Industries | 125000 | 162500 | 0.3 | 500 | 1 | 10 | 2015 | Y | 2015/1/10 |
1 | 552278 | Smith Plumbing | 920000 | 1012000 | 0.1 | 700 | 6 | 15 | 2014 | Y | 2014/6/15 |
2 | 23477 | ACME Industrial | 50000 | 62500 | 0.25 | 125 | 3 | 29 | 2016 | Y | 2016/3/29 |
3 | 24900 | Brekke LTD | 350000 | 490000 | 0.04 | 75 | 10 | 27 | 2015 | Y | 2015/10/27 |
4 | 651029 | Harbor Co | 15000 | 12750 | -0.15 | 0 | 2 | 2 | 2014 | N | 2014/2/2 |
“Time”åã¯datetimeåã§ãã
ããŒã¿åãå€æããïŒæååâboolåïŒ
次㯔Active”ãboolåã«å€æããŸããå€æããã«ããããnumpyïŒnpïŒã䜿çšããŸããnp.where()ã§ã©ã®æ¡ä»¶ãçïŒåœïŒãšããããšããã«ãŒã«ãæå®ããå€æããŸãã
以äžã§ã¯ã”Y”ãTrueããã以å€ãFalseãšããŸãã
import numpy as np
df["Active"] = np.where(df["Active"] == "Y", True, False)
Customer Number | Customer Name | 2016 | 2017 | Percent Growth | Jan Units | Month | Day | Year | Active | Time | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10002 | Quest Industries | 125000 | 162500 | 0.3 | 500 | 1 | 10 | 2015 | TRUE | 2015/1/10 |
1 | 552278 | Smith Plumbing | 920000 | 1012000 | 0.1 | 700 | 6 | 15 | 2014 | TRUE | 2014/6/15 |
2 | 23477 | ACME Industrial | 50000 | 62500 | 0.25 | 125 | 3 | 29 | 2016 | TRUE | 2016/3/29 |
3 | 24900 | Brekke LTD | 350000 | 490000 | 0.04 | 75 | 10 | 27 | 2015 | TRUE | 2015/10/27 |
4 | 651029 | Harbor Co | 15000 | 12750 | -0.15 | 0 | 2 | 2 | 2014 | FALSE | 2014/2/2 |
ããŒã¿åãèŠãŠã¿ããšãboolåã«å€æã§ããŠããããšãããããŸãã
å€æåŸã®ããŒã¿ãã¬ãŒã ã¯ä»¥äžã®ããã«ãªããŸããã
“Active”ã¯TRUEãšFALSEã«å€æãããŸããã
åŠçããŸãšãã
ãããŸã§ã®åŠçã¯ãPandasã«CSVãèªã¿èŸŒãéã«ãŸãšããŠèšè¿°ããããšãã§ããŸãã
åçŽã«åãå€æããã ãã®åã¯åŒæ°dtypeã«ãå å·¥åŠçãå ããéã¯åŒæ°converterã§åŠçãèšè¿°ããŸãã
import pandas as pd
import numpy as np
df = pd.read_csv("input/sales_data_types.csv",
dtype={'Customer Number':'int'},
converters={'2016': lambda x: float(x.replace('$', '').replace(',', '')),
'2017': lambda x: float(x.replace('$', '').replace(',', '')),
'Percent Growth': lambda x: float(x.replace('%', ''))/100,
'Jan Units': lambda x: pd.to_numeric(x, errors='coerce'),
'Active': lambda x: np.where(x == "Y", True, False)
})
ããããããšã§ãååã§ã©ã®ãããªå€æåŠçãããŠããã®ãããããããããªããŸããã
Pythonã§ããŒã¿ãµã€ãšã³ã¹ãããªã
Pythonã§ããŒã¿ãµã€ãšã³ã¹ããããªãã以äžã®æžç±ãããããã§ããPandasãmatplotlibãNumpyãscikit-learnãšãã£ãããŒã¿ãµã€ãšã³ã¹ã«å¿ èŠãªã©ã€ãã©ãªããäœç³»ç«ãŠãŠäžéãåŠã¶ããšãã§ããŸãã
ãããå€æ®µé«ãã§ããããã1åã§ååãšããå 容ã»ããªã¥ãŒã ãªã®ã§ãæã¯ããªããšæããŸã^^
ååŠè ã®æ¹ã«ã¯ãã¡ãããªã¹ã¹ã¡ã§ã^^
ãŸãšã
Pandasã®ããŒã¿ãã¬ãŒã ã®ããŒã¿ãã¬ãŒã ãå€æããæ¹æ³ã玹ä»ããŸãããäŸã«æããããŒã¿ã®ããã«ããã®ãŸãŸã§ã¯åæã«äœ¿ããªããšãã£ãã±ãŒã¹ã¯å°ãªããããŸããã
åæãå§ããŠæããããããªããšã©ãŒãåé¿ããããã«ããæåã«ããŒã¿åã®ç¢ºèªãå€æãå®æœããŠãã ããã
ã§ã¯ã§ã¯ð