무던히 하다보면 느는

[[위키독스]] SQL 데이터 추출 (Python 데이터 분석 실무) 본문

🌵Python 데이터 분석 실무🌵

[[위키독스]] SQL 데이터 추출 (Python 데이터 분석 실무)

무던히 하다보면 느는 2022. 7. 29. 21:40

 

SQL 데이터 추출

In [1]:
! pip install -U pandasql
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pandasql
  Downloading pandasql-0.7.3.tar.gz (26 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from pandasql) (1.21.6)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from pandasql) (1.3.5)
Requirement already satisfied: sqlalchemy in /usr/local/lib/python3.7/dist-packages (from pandasql) (1.4.39)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pandasql) (2022.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->pandasql) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->pandasql) (1.15.0)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from sqlalchemy->pandasql) (4.12.0)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.7/dist-packages (from sqlalchemy->pandasql) (1.1.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->sqlalchemy->pandasql) (3.8.1)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->sqlalchemy->pandasql) (4.1.1)
Building wheels for collected packages: pandasql
  Building wheel for pandasql (setup.py) ... done
  Created wheel for pandasql: filename=pandasql-0.7.3-py3-none-any.whl size=26784 sha256=48b4437a2f6724e8f045ecf983f8c6c4075a926b495e92f21af7e33f077e800f
  Stored in directory: /root/.cache/pip/wheels/5c/4b/ec/41f4e116c8053c3654e2c2a47c62b4fca34cc67ef7b55deb7f
Successfully built pandasql
Installing collected packages: pandasql
Successfully installed pandasql-0.7.3
In [3]:
import pandas as pd
In [61]:
df = pd.read_csv('./doc_use_log.csv')\
                .sample(frac=0.01,replace=False) # 샘플링 전체 데이터에서의 0.01만 추출
In [50]:
# 쿼리!
from pandasql import *

q = "SELECT * FROM df WHERE ext = 'PDF' AND ismydoc = '0' LIMIT 10"
In [7]:
print (sqldf(q, locals()).to_string())
  actiontype  ismydoc  ext                         sessionid documentposition   datetime
0       OPEN        0  PDF  dc5f730fcd9c635e92979f3730d6e694         OTHERAPP  2016.7.22
1       OPEN        0  PDF  0380197dc2f17dbc7ae9b4ea65e440de         OTHERAPP  2016.7.28
2       OPEN        0  PDF  c811bcc12ce9cfb6035f5fbd0de06460         OTHERAPP  2016.7.29
3      CLOSE        0  PDF  0fe495a1fd7237f2599a428bf1d27152         OTHERAPP   2016.7.3
4       OPEN        0  PDF  8409348eace67f3ffadb5070992f28c3         OTHERAPP  2016.7.20
5       OPEN        0  PDF  3df2da7ae48f30d92f6938423b1687eb         OTHERAPP   2016.7.3
6      CLOSE        0  PDF  3a1f91a5cc1ad334d5466d21c00742f1         OTHERAPP  2016.7.16
7       OPEN        0  PDF  47ef392ce0e68549a02bc7852ddd4515         OTHERAPP  2016.7.26
8      CLOSE        0  PDF  3dda9f6a5e275d24ecfa9596702aaa8b         OTHERAPP   2016.7.2
9      CLOSE        0  PDF  0654e1566976e8842e7ac2d9ea688a13         OTHERAPP   2016.7.7
In [8]:
q = """
  SELECT
  ext, count(ext) as count, count(distinct sessionid) as unq_sess
  FROM df
  GROUP BY ext
  ORDER BY count DESC
  """
In [9]:
print (sqldf(q,locals()).to_string())
     ext  count  unq_sess
0    PDF    798       794
1   DOCX    592       585
2   XLSX    531       523
3    DOC    249       248
4    XLS    248       247
5    HWP    246       246
6   PPTX    150       148
7    TXT    111       110
8    PPT     59        59
9   PPSX     27        27
10   ODT      6         6
11   PPS      2         2
In [62]:
# 테이블 조인
ios = pd.read_csv('./ios.csv') 

q = """
  SELECT A.*,B.flag
  FROM df A
  LEFT JOIN
  (
    SELECT sessionid, flag
    FROM ios
  ) B
  ON A.sessionid = B.sessionid
  WHERE B.flag='iOS'
  """
In [19]:
print (sqldf(q,locals()).to_string())
  actiontype  ismydoc   ext                         sessionid documentposition   datetime flag
0       OPEN        0  XLSX  18bd4e98e9a46eccbaa01082adeca66b         OTHERAPP  2016.7.30  iOS
1       OPEN        0  XLSX  c4441d6635e2f8452f45245d111402d3         OTHERAPP   2016.7.7  iOS

연습문제

Question 1

  • df 테이블의 Action Type 값(항목)별 유니크한 세션수는?
  • 유니크 세션수 기준으로 내림차순 정렬하기
In [21]:
# distinct..
q = """
  SELECT count(sessionid) as count
  FROM df
  GROUP BY actiontype
  ORDER BY count DESC
  """
In [22]:
print(sqldf(q,locals()).to_string())
   count
0   1470
1   1319
2    107
3     89
4     32
5      2

distinct 위치는 count 안, 기준 앞

정답

In [54]:
q = """
  SELECT actiontype, count(distinct sessionid) as session_cnt_uniq
  FROM df
  GROUP BY actiontype
  ORDER BY session_cnt_uniq DESC
  """

print(sqldf(q,locals()).to_string())
    actiontype  session_cnt_uniq
0         OPEN              1503
1        CLOSE              1315
2         SAVE                79
3       SAVEAS                57
4        RESET                55
5  EXPORT_SAME                 1
In [55]:
# 판다스로 나타내기
df.groupby("actiontype")['sessionid'].nunique().sort_values(ascending=False).reset_index()
Out[55]:
actiontypesessionid012345
OPEN 1503
CLOSE 1315
SAVE 79
SAVEAS 57
RESET 55
EXPORT_SAME 1

Question 2

  • ismydoc이 1(True)인 경우에 한해, 날짜별 세션수의 유니크 빈도 구하기
  • 유니크 빈도가 가장 큰 top 5 날짜 확인하기
In [71]:
q = """
  SELECT datetime, count(distinct sessionid) as session_cnt_uniq
  FROM df
  WHERE ismydoc = '1'
  GROUP BY datetime
  ORDER BY session_cnt_uniq DESC
  LIMIT 5 
  """

LIMIT는 = 없이

오잉 결과값이 왜 다르지?

In [73]:
print(sqldf(q,locals()).to_string())
    datetime  session_cnt_uniq
0  2016.7.12                58
1  2016.7.19                58
2  2016.7.14                56
3  2016.7.20                54
4  2016.7.13                53
In [64]:
df
Out[64]:
actiontypeismydocextsessioniddocumentpositiondatetime2497914814621032349459186895...607872812261345663034817717
CLOSE True PDF c94aab1f65add34cd46eb7ac57fbab50 MYPOLARISDRIVE 2016.7.6
OPEN True TXT aaa1b9cee6214472c8b4ee1895e62c34 MYPOLARISDRIVE 2016.7.9
OPEN True XLSX f99b8ecdde178a2ac8e47fe64dd54be6 OTHERAPP 2016.7.15
CLOSE False DOCX 81380669b71c9d2841e3ba263c48ff14 OTHERAPP 2016.7.25
OPEN True PPT 72250ea3ac5f1b7f7b6183f2d39662da OTHERAPP 2016.7.4
... ... ... ... ... ...
OPEN True XLSX 34def873d6fe285a2c58066353b9e7e6 MYPOLARISDRIVE 2016.7.26
OPEN False DOCX 7e21f84d2a30e27192e10516487d7f75 OTHERAPP 2016.7.27
OPEN False PDF bc52910bc81455b0afcc667c795f3818 OTHERAPP 2016.7.14
OPEN True DOCX afe6e4eaa5e047e18a629dd1e6b0931b OTHERAPP 2016.7.11
OPEN False PDF fa05e6b496098aa0de080137b08da378 OTHERAPP 2016.7.30

3019 rows × 6 columns

In [74]:
df.query("ismydoc == 1").groupby("datetime")['sessionid'].nunique().sort_values(ascending=False).head(5).reset_index()
Out[74]:
datetimesessionid01234
2016.7.19 58
2016.7.12 58
2016.7.14 56
2016.7.20 54
2016.7.13 53

 

정답이랑 따라서 구문을 쳤는데도 결과값이 다르게 나온다. 왤까?

그리고 "ismydoc =='1'" --> "ismydoc ==1"