Notice
Recent Posts
Recent Comments
Link
일 | 월 | 화 | 수 | 목 | 금 | 토 |
---|---|---|---|---|---|---|
1 | 2 | 3 | ||||
4 | 5 | 6 | 7 | 8 | 9 | 10 |
11 | 12 | 13 | 14 | 15 | 16 | 17 |
18 | 19 | 20 | 21 | 22 | 23 | 24 |
25 | 26 | 27 | 28 | 29 | 30 | 31 |
Tags
- 오류
- 에러
- safe mode 해제
- 전처리
- 살려줘
- D2E8DA72
- vlookup
- 이전날짜제거
- 태블로
- 맵차트
- TabPy
- 파이썬
- mysql
- 정제
- random.uniform
- 데이터전처리
- 함수
- 외부접속허용
- 북마크
- split
- 아나콘다
- putty
- Tableau
- trim
- concat
- 태블로퍼블릭
- Delete
- Def
- Join
- SQL
Archives
- Today
- Total
무던히 하다보면 느는
[[위키독스]] SQL 데이터 추출 (Python 데이터 분석 실무) 본문
SQL 데이터 추출
In [1]:
! pip install -U pandasql
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pandasql
Downloading pandasql-0.7.3.tar.gz (26 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from pandasql) (1.21.6)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from pandasql) (1.3.5)
Requirement already satisfied: sqlalchemy in /usr/local/lib/python3.7/dist-packages (from pandasql) (1.4.39)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pandasql) (2022.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pandasql) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->pandasql) (1.15.0)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from sqlalchemy->pandasql) (4.12.0)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.7/dist-packages (from sqlalchemy->pandasql) (1.1.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->sqlalchemy->pandasql) (3.8.1)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->sqlalchemy->pandasql) (4.1.1)
Building wheels for collected packages: pandasql
Building wheel for pandasql (setup.py) ... done
Created wheel for pandasql: filename=pandasql-0.7.3-py3-none-any.whl size=26784 sha256=48b4437a2f6724e8f045ecf983f8c6c4075a926b495e92f21af7e33f077e800f
Stored in directory: /root/.cache/pip/wheels/5c/4b/ec/41f4e116c8053c3654e2c2a47c62b4fca34cc67ef7b55deb7f
Successfully built pandasql
Installing collected packages: pandasql
Successfully installed pandasql-0.7.3
In [3]:
import pandas as pd
In [61]:
df = pd.read_csv('./doc_use_log.csv')\
.sample(frac=0.01,replace=False) # 샘플링 전체 데이터에서의 0.01만 추출
In [50]:
# 쿼리!
from pandasql import *
q = "SELECT * FROM df WHERE ext = 'PDF' AND ismydoc = '0' LIMIT 10"
In [7]:
print (sqldf(q, locals()).to_string())
actiontype ismydoc ext sessionid documentposition datetime
0 OPEN 0 PDF dc5f730fcd9c635e92979f3730d6e694 OTHERAPP 2016.7.22
1 OPEN 0 PDF 0380197dc2f17dbc7ae9b4ea65e440de OTHERAPP 2016.7.28
2 OPEN 0 PDF c811bcc12ce9cfb6035f5fbd0de06460 OTHERAPP 2016.7.29
3 CLOSE 0 PDF 0fe495a1fd7237f2599a428bf1d27152 OTHERAPP 2016.7.3
4 OPEN 0 PDF 8409348eace67f3ffadb5070992f28c3 OTHERAPP 2016.7.20
5 OPEN 0 PDF 3df2da7ae48f30d92f6938423b1687eb OTHERAPP 2016.7.3
6 CLOSE 0 PDF 3a1f91a5cc1ad334d5466d21c00742f1 OTHERAPP 2016.7.16
7 OPEN 0 PDF 47ef392ce0e68549a02bc7852ddd4515 OTHERAPP 2016.7.26
8 CLOSE 0 PDF 3dda9f6a5e275d24ecfa9596702aaa8b OTHERAPP 2016.7.2
9 CLOSE 0 PDF 0654e1566976e8842e7ac2d9ea688a13 OTHERAPP 2016.7.7
In [8]:
q = """
SELECT
ext, count(ext) as count, count(distinct sessionid) as unq_sess
FROM df
GROUP BY ext
ORDER BY count DESC
"""
In [9]:
print (sqldf(q,locals()).to_string())
ext count unq_sess
0 PDF 798 794
1 DOCX 592 585
2 XLSX 531 523
3 DOC 249 248
4 XLS 248 247
5 HWP 246 246
6 PPTX 150 148
7 TXT 111 110
8 PPT 59 59
9 PPSX 27 27
10 ODT 6 6
11 PPS 2 2
In [62]:
# 테이블 조인
ios = pd.read_csv('./ios.csv')
q = """
SELECT A.*,B.flag
FROM df A
LEFT JOIN
(
SELECT sessionid, flag
FROM ios
) B
ON A.sessionid = B.sessionid
WHERE B.flag='iOS'
"""
In [19]:
print (sqldf(q,locals()).to_string())
actiontype ismydoc ext sessionid documentposition datetime flag
0 OPEN 0 XLSX 18bd4e98e9a46eccbaa01082adeca66b OTHERAPP 2016.7.30 iOS
1 OPEN 0 XLSX c4441d6635e2f8452f45245d111402d3 OTHERAPP 2016.7.7 iOS
연습문제
Question 1
- df 테이블의 Action Type 값(항목)별 유니크한 세션수는?
- 유니크 세션수 기준으로 내림차순 정렬하기
In [21]:
# distinct..
q = """
SELECT count(sessionid) as count
FROM df
GROUP BY actiontype
ORDER BY count DESC
"""
In [22]:
print(sqldf(q,locals()).to_string())
count
0 1470
1 1319
2 107
3 89
4 32
5 2
distinct 위치는 count 안, 기준 앞
정답
In [54]:
q = """
SELECT actiontype, count(distinct sessionid) as session_cnt_uniq
FROM df
GROUP BY actiontype
ORDER BY session_cnt_uniq DESC
"""
print(sqldf(q,locals()).to_string())
actiontype session_cnt_uniq
0 OPEN 1503
1 CLOSE 1315
2 SAVE 79
3 SAVEAS 57
4 RESET 55
5 EXPORT_SAME 1
In [55]:
# 판다스로 나타내기
df.groupby("actiontype")['sessionid'].nunique().sort_values(ascending=False).reset_index()
Out[55]:
actiontypesessionid012345
OPEN | 1503 |
CLOSE | 1315 |
SAVE | 79 |
SAVEAS | 57 |
RESET | 55 |
EXPORT_SAME | 1 |
Question 2
- ismydoc이 1(True)인 경우에 한해, 날짜별 세션수의 유니크 빈도 구하기
- 유니크 빈도가 가장 큰 top 5 날짜 확인하기
In [71]:
q = """
SELECT datetime, count(distinct sessionid) as session_cnt_uniq
FROM df
WHERE ismydoc = '1'
GROUP BY datetime
ORDER BY session_cnt_uniq DESC
LIMIT 5
"""
LIMIT는 = 없이
오잉 결과값이 왜 다르지?
In [73]:
print(sqldf(q,locals()).to_string())
datetime session_cnt_uniq
0 2016.7.12 58
1 2016.7.19 58
2 2016.7.14 56
3 2016.7.20 54
4 2016.7.13 53
In [64]:
df
Out[64]:
actiontypeismydocextsessioniddocumentpositiondatetime2497914814621032349459186895...607872812261345663034817717
CLOSE | True | c94aab1f65add34cd46eb7ac57fbab50 | MYPOLARISDRIVE | 2016.7.6 | |
OPEN | True | TXT | aaa1b9cee6214472c8b4ee1895e62c34 | MYPOLARISDRIVE | 2016.7.9 |
OPEN | True | XLSX | f99b8ecdde178a2ac8e47fe64dd54be6 | OTHERAPP | 2016.7.15 |
CLOSE | False | DOCX | 81380669b71c9d2841e3ba263c48ff14 | OTHERAPP | 2016.7.25 |
OPEN | True | PPT | 72250ea3ac5f1b7f7b6183f2d39662da | OTHERAPP | 2016.7.4 |
... | ... | ... | ... | ... | ... |
OPEN | True | XLSX | 34def873d6fe285a2c58066353b9e7e6 | MYPOLARISDRIVE | 2016.7.26 |
OPEN | False | DOCX | 7e21f84d2a30e27192e10516487d7f75 | OTHERAPP | 2016.7.27 |
OPEN | False | bc52910bc81455b0afcc667c795f3818 | OTHERAPP | 2016.7.14 | |
OPEN | True | DOCX | afe6e4eaa5e047e18a629dd1e6b0931b | OTHERAPP | 2016.7.11 |
OPEN | False | fa05e6b496098aa0de080137b08da378 | OTHERAPP | 2016.7.30 |
3019 rows × 6 columns
In [74]:
df.query("ismydoc == 1").groupby("datetime")['sessionid'].nunique().sort_values(ascending=False).head(5).reset_index()
Out[74]:
datetimesessionid01234
2016.7.19 | 58 |
2016.7.12 | 58 |
2016.7.14 | 56 |
2016.7.20 | 54 |
2016.7.13 | 53 |
정답이랑 따라서 구문을 쳤는데도 결과값이 다르게 나온다. 왤까?
그리고 "ismydoc =='1'" --> "ismydoc ==1"
'🌵Python 데이터 분석 실무🌵' 카테고리의 다른 글
[[위키독스]] Spark & 로그 정의 및 파싱 (Python 데이터 분석 실무) (0) | 2022.07.29 |
---|