6/30/2021

Joining data using Pandas

Merge 2 tables

$ New_dataframe = dataframe1.merge(dataframe2, on = 'same_column_name', suffixes = ('_dataframe1', '_dataframe2'))

 Merge 3 tables

$ New = df1.merge(df2, on = ['col1', 'col2'])   \

      .merge (df3, on = 'col3', suffixes = ('_df1', '_df2'))  \

      .merge(df4, on = 'col4')

 Left join

$ New_dataframe = dataframe1.merge(dataframe2, on = 'same_column_name', how = ‘left’)


Right join

$ New_dataframe = dataframe1.merge(dataframe2, on = 'same_column_name', how = ‘right’, left_on = ‘id’, right_on = ‘tv_id’)


Outer join

$ New_dataframe = dataframe1.merge(dataframe2, on = 'same_column_name', how = ‘outer’)


$ pd = pd.read_csv(“csv_file.csv”, index =[“idx1”, “idx2”])



Join table vertically 

$ pd. concat([t1,t2,t3 ], ignore_index = True)


$ pd. concat([t1,t2,t3 ], ignore_index = False, keys= [‘k1’,’k2’,’k3’]) : ignore_index = True: index 0 ~ n-1

W/diff col names

$ pd. concat([t1,t2], sort = True)

$ pd. concat([t1,t2], join = ‘inner’)

.append(): support ignore_index, sort , do not support join: always outer


Average by group

$ Avg_by_month = inv.groupby(level = 0).agg({‘total’:‘mean’})


Integrity validation:

.merge(...., validate = ‘one_to_one’)

‘One_to_many’  ‘Many_to_one’  ‘many_to_many’

.concat(verify_integrity = False)  :default is false


Merge_ordered:



 






Pd.merge_ordered(df1,df2, on = ‘sothing’, suffixes =(‘_df1’, ‘_df2’), filll_method = ‘ffill’) : ffill : forward fill

$ df.corr() : returns correlation matrix


Merge_asof()

$ pd.merge_asof(df1,df2, on = ‘date_time’, suffixes =(‘_df1’, ‘_df2’), direction= ‘forward’) 

Get closest value in the right table

Tool in Python

In Functions: 

$ def func()

$ Try:

      Xxxx

$ Except:

      Yunyun

$ Raise TypeError (‘xxx must be int’)

 

Word = ‘data’

It = iter(word)

Print (*it) : print all it data

Print (next(it)) : 下一项

file  = open(‘1.txt’)

It = iter(file)

Print (next(it))


with open('datacamp.csv') as datacamp


6/29/2021

Data types

 list

List [start(inclusive):end(exclusive)]
List[start:]
List[:end]

list_name.append('XXX') method. However, if you want to combine a list with another array type (list, set, tuple), you can use the list_name.extend('XXX') method on the list.

You can also use the list_name.index('XXX') method to find the position of an item in a list. You can then use that position to remove the item with the list_name.pop(position) method.

sorted(list_name)

Tuple

zip(list1, list2) 

name1, name2 = Zipped_Tuple[0]

for name1, name2 in zipped_tuple

zip() function does just that. It will return a list of tuples containing one element from each

When looping over a list, you can also track your position in the list by using the enumerate() function. The function returns the index of the list item you are currently on in the list and the list item itself.

$ letters = ['A', 'B', 'C']

$ names =['a', 'b', 'c']

$ for index, letter in enumerate(letters, start = 1):

      print(index, letter)

$ zip_list = list(zip(letters, names))

$ for z1, z2 in zip_list:

     print(z1, z2)

$ z = zip(letters, names)

$ print(*z)

Type()

Set

Set1.union(set2): combine all

Set1.intersection.(set2): overlapping

Check items: len(set_name)

Set1.add(something)

Set1.difference(set2)

diff = set1 - set2

Dictionary

$ dictionary.get(Key_name): returns “none” if key does not exist 

$ dictionary.get(Key_name, “Not Found”): returns “Not Found” if key does not exist 

$ dict.keys()

$ for x,y in dictionary.items()

      Dictionary[x] = y

$ sorted(dictionary, reverse=True): sort descend

$ dict1.update(dict2/tuple/list)

$ del dict[key_names]

$dict.pop(dict[key_names])

$ dict-removed = dict.pop(key-name)

To be safe $ dict-removed = dict.pop(key-name, {})

In: “2012” in dict: returns true or false

CSV

$ import csv

$ csvfile = open (‘xxxx.csv’,’r’)

$ for row in csv.reader(csvfile):

      Print(row)

 

Counter

$ from collections import Counter

$ XXX=Counter(xxx)

$ XXX.most_common(3)



defaultdict

$ from collections import defaultdict

$ XXX=defaultdict(default_type): works just like a dict

# Create an empty dictionary: ridership
ridership = {}

# Iterate over the entries
for date, stop, riders in entries:
    # Check to see if date is already in the ridership dictionary
    if date not in ridership:
        # Create an empty list for any missing date
        ridership[date] = []
    # Append the stop and riders as a tuple to the date keys list
    ridership[date].append((stop,riders))
    
# Print the ridership for '03/09/2016'
print(ridership['03/09/2016'])

 


# Import defaultdict
from collections import defaultdict

# Create a defaultdict with a default type of list: ridership
ridership = defaultdict(list)

# Iterate over the entries
for data, stop, riders in entries:
    # Use the stop as the key of ridership and append the riders to its value
    ridership[stop].append(riders)
    
# Print the first 10 items of the ridership dictionary
print(list(ridership.items())[:10])


OrderedDict

$ from collections import OrderedDict

$ XXX= OrderedDict()

$ for item in dict:

   XXX[itme['Key_name']]=item

$ print(list(XXX.items())[:3])
$ print(XXX.popitem())   print in reverse insertion order

$ print(XXX.popitem(last = False)) : return items in insertion order

 

namedtuple

$ from collection import namedtuple

$ name = namedtuple('letters', ['A', 'B' ,'C'])

$ letters = [ ]

$ for letter in dict:

       detail = name(letter['A'], letter['B'], letter['C'])

       letters.append(detail)

$ print (letters[0])

# Import namedtuple from collections
from collections import namedtuple

# Create the namedtuple: DateDetails
DateDetails = namedtuple('DateDetails', ['date', 'stop', 'riders'])

# Create the empty list: labeled_entries
labeled_entries = []

# Iterate over the entries list
for date, stop, riders in entries:
    # Append a new DateDetails namedtuple instance for each entry to labeled_entries

    labeled_entries.append(DateDetails(date,stop,riders))
    
# Print the first 5 items in labeled_entries
print(labeled_entries[:5])
# Iterate over the first twenty items in labeled_entries
for item in labeled_entries[:20]:
    # Print each item's stop
    print(item.stop)

    # Print each item's date
    print(item.date)

    # Print each item's riders
    print(item.riders)
 
 datetime
$ from datetime import datetime
$ print(XX_date)
$ dt = datetime.strptime(XXX_date, '%m/%d/%Y' )
$ dt.strftime( '%m/%d/%Y' )
$ dt.isoformat(): ISO format

$ daily = defaultdict(int)

$ for item in certain_dict:

       item_dt = datetime.strptime(item[4], '%m/%d/%Y')

       daily[item_dt.day] += 1

$ local_dt = datetime.now() :returns time on the machine

$ local_dt = datetime.utcnow() : UTC time

 

$ from pytz import timezone 

$ ny_tz = timezone('US/Estern')

$ ny_dt = dt_file_name.replace(tzinfo=ny_tz) : you can make a datetime object "aware" by passing a timezone as the tzinfo keyword argument to the .replace() method on a datetime instance.

$ la_dt = ny_dt.astimezone(la_tz) :convert ny_dt to la timezone


$ from datetime import timedelta

$ flashback = timedelta(days=90)

$ print(record_dt)

$ print (record_dt + flashback)

$ print (record_dt - flashback)

$ time_diff = record_dt - record2_dt

$ type (time_diff)  : datetime.timedelta


pendulum

$ import pendulum

$ occur = voilation[4]  + ' ' + violation[5] + 'M' 

$occur_dt = pendulum.parse(occur, tz = 'US/Eastern')

$ occur_dt = pendulum.parse(occur, strict = False)

$ for day_dt in day_dts:

       print(day_dt.in_timezone('Asian/Tokyo'))

$print (pendulun.now('Asian/Tokyo'))

$ print(diff.in_words())

$ print(diff.in_days())

$ print(diff.in_hours())


csv

$ for row in csv.DictReader(csvfile)

    print(row)

$ deleted_item = dict.pop(key_name_deleted)


6/22/2021

Unix

$ cp -p seasonal/summer.csv seasonal/spring.csv backup/

$ mv seasonal/spring.csv seasonal/summer.csv backup/
$ mv xxx.csv yyy.csv
$ mv can rename directory too, same as rename a file
$ mkdir yearly$ mkdir -p yearly/2017
$ mv ~/people/agarwal.txt /tmp/scratch
~: home
/XXX: relative directory
$ cat course.txt
$ less 
$ more
$head -n 5 seasonal/winter.csv (select first 5 rows)
$ ls -R -F /home/repl
(R means "recursive", -F that prints a / after the name of every directory and a * after the name of every runnable program.)
$ tail -n +7 seasonal/spring.csv 
(display all but the first six rows of seasonal/spring.csv) 
$man less (less manual)
$ cut -f 2-5,8 -d , values.csv 
means"select columns 2 through 5 and columns 8,
using comma as the separator".
cut uses -f (meaning "fields") to specify columns
and -d (meaning "delimiter") to specify the separator.
$ !head (rerun head command in history)
$ history
$ !3  runs the 3rd command in history

grep selects lines according to what they contain,
$grep bicuspid seasonal/winter.csv
prints lines from winter.csv that contain "bicuspid"
  • -c: print a count of matching lines rather than the lines themselves
  • -h: do not print the names of files when searching multiple files
  • -i: ignore case (e.g., treat "Regression" and "regression" as matches)
  • -l: print the names of files that contain matches, not the matches
  • -n: print line numbers for matching lines
  • -v: invert the match, i.e., only show lines that don't match

grep -c incisor seasonal/autumn.csv seasonal/winter.csv

paste xxx.csv yyy.csv 最后几行开头多了一列 “,”

head -n 5 seasonal/summer.csv > top.csv
 
pipe:
head -n 5 seasonal/summer.csv | tail -n 3
The pipe symbol tells the shell to use the output of the command on the left
as the input to the command on the right.
wc (short for "word count") prints the number of characters, words, and lines in a file.
You can make it print only one of these using -c, -w, or -l respectively  
$ grep 2017-07 seasonal/spring.csv|wc -l 把带2017-07的行都打出来
wildcards:*
cut -d , -f 1 seasonal/*
  • ? matches a single character, so 201?.txt will match 2017.txt or 2018.txt, but not 2017-01.txt.
  • [...] matches any one of the characters inside the square brackets, so 201[78].txt matches 2017.txt or 2018.txt, but not 2016.txt.
  • {...} matches any of the comma-separated patterns inside the curly brackets, so {*.txt, *.csv} matches any file whose name ends with .txt or .csv, but not files whose names end with .pdf.
  

 $ cut -d , -f 2 seasonal/winter.csv | grep -v Tooth | sort -r

flags -n and -r can be used to sort numerically and reverse the order of its output, while -b tells it to ignore leading blanks and -f tells it to fold case (i.e., be case-insensitive).

$ grep -v tooth|sort | uniq -c > ttt.csv

stop running: ctrl+C

 

$ set | grep XXX
$ echo $USER: print the value of USER
$ echo USER: print USER on the screen
$ testing=seasonal/winter.csv : create a variable
$ head -n 1 $testing: remember $ to print
$ for filetype in gif jpg png; do echo $filetype; done
$ for filename in seasonal/*.csv; do echo $filename; done
$ datasets=seasonal/*.csv  : record datasets
$ for filename in $datasets; do echo $filename; done 
$ for file in seasonal/*.csv; do grep 2017-07 $file | tail -n 1; done
move 带空格的东西:$ mv 'July 2017.csv' '2017 July data.csv' 

$ nano filename: will open filename for editing
(or create it if it doesn't already exist).
  • Ctrl + K: delete a line.
  • Ctrl + U: un-delete a line.
  • Ctrl + O: save the file ('O' stands for 'output'). You will also need to press Enter to confirm the filename!
  • Ctrl + X: exit the editor.
$ nano teeth.sh
$ bash teeth.sh > teeth.out
$ cat teeth.out 
 
$@ (dollar sign immediately followed by at-sign) to mean "all of the command-line parameters given to the script")
$ nano count-records.sh
in count-records.sh:
tail -q -n +2 $@| wc -l 
$ bash count-records.sh seasonal/*.csv > num-records.out 

process a single argument?

column.sh:
cut -d , -f $2 $1
$ bash column.sh seasonal/autumn.csv 1 

CTRL + K to cut the line, then CTRL + U twice to paste two copies of it.
 
# Print the first and last data records of each file.
for filename in $@
do
    head -n 2 $filename | tail -n 1
    tail -n 1 $filename
done 
 
 

6/19/2021

Git and Git Hub

Git

$ git diff -r HEAD. The -r flag means "compare to a particular revision", and HEAD is a shortcut meaning "the most recent commit".

$ git diff -r HEAD path/to/file

$ git add data/mine.csv
$ git commit -m "Program appears to have become self-aware.“
$ git commit --amend - m "new message"
$ git log (path/file.csv)
$ git show (first characters from a commit)

 

Used in git hub!

 - git status: shows status of tracked and untracked files

- git add [filename]: adds file to current commit - git commit -m '[message]': add commit locally - git push origin master: push commit to your github repository

$ git diff data/northern.csv  

diff --git a/report.txt b/report.txt

index e713b17..4c0742a 100644
--- a/report.txt
+++ b/report.txt
@@ -1,4 +1,5 @@
-# Seasonal Dental Surgeries 2017-18
+# Seasonal Dental Surgeries (2017) 2017-18
+# TODO: write new summary 
  • The command used to produce the output (in this case, diff --git). In it, a and b are placeholders meaning "the first version" and "the second version".
  • An index line showing keys into Git's internal database of changes. We will explore these in the next chapter.
  • --- a/report.txt and +++ b/report.txt, wherein lines being removed are prefixed with - and lines being added are prefixed with +.
  • A line starting with @@ that tells where the changes are being made. The pairs of numbers are start line and number of lines (in that section of the file where changes occurred). This diff output indicates changes starting at line 1, with 5 lines where there were once 4.
  • A line-by-line listing of the changes with - showing deletions and + showing additions (we have also configured Git to show deletions in red and additions in green). Lines that haven't changed are sometimes shown before and after the ones that have in order to give context; when they appear, they don't have either + or - in front of them.

Git's equivalent of a relative path?

The special label HEAD, which we saw in the previous chapter, always refers to the most recent commit. The label HEAD~1 then refers to the commit before it, while HEAD~2 refers to the commit before that,

Show who changed what in what file

$ git log displays the overall history of a project or file. 

$ git annotate report.txt :
shows who made the last change to each line of a file and when. 

 

$git diff abc123..def456:shows the differences between the commits abc123 and def456

$git diff HEAD~1..HEAD~3: shows the differences between the state of the repository one commit in the past and its state three commits in the past.

 How to Add New File

$ git status

$ git add XXX.csv

$ git commit -m "Test" 

Ignore

.gitignore

if it contains

pdf
*.pyc
backup
All directory or file are ignored
git clean -n will show you a list of files that are in the repository,
but whose history Git is not currently tracking.
A similar command git clean -f will then delete those files.

Use this command carefully: git clean only works on untracked files, so by definition, their history has not been saved. If you delete them with git clean -f, they're gone for good.

$ git config

  • --system: settings for every user on this computer.
  • --global: settings for every one of your projects.
  • --local: settings for one specific project.

$ git config --list --local

$ git config --global user.email rep.loop@datacamp.com

 

$ git checkout -- filename: undo changes
reset staged files 
$ git reset HEAD path/to/file
$ git checkout -- path/to/file 


$ git log -3 report.txt: shows the last 3 commits

$ git checkout 2242bd report.txt: replace the current version of report.txt with the version that was committed on October 16. Notice that this is the same syntax that you used to undo the unstaged changes, except -- has been replaced by a hash.
$ git reset HEAD directory: reset all files in the directory
$ git checkout -- directory: restore to the previous version
for current directory : 
$ git checkout -- .
Branch: 
$ git branch: Check branch
$ git diff branch-1..branch-2: see diff
$ git checkout summary-statistics : to switch branch
$ git rm report.txt : to delete file 
$ git commit -m "Removing report" 
and $ls: just to make sure the file is gone
to Create a branch: $ git checkout -b branch-name 
$ git merge source-branch-name destination-branch-name
$ git init project-name: create a brand new repostory

$ git init /path/to/project : Turning an existing project into a Git repository?

$ git clone /existing/project new-project-name : copy existing repostory and name it

 

Find out where a cloned repository originated?

$ git remote

Git automatically creates a remote called origin that points to the original repository.

$ git remote add remote-name URL(directory): add more remotes

$ git remote rm remote-name : remove remote

Pull in changes from a remote repository?

$ git pull remote-name branch-name
gets everything in branch in the remote repository identified by remote
and merges it into the current branch of your local repository. 

Push changes to a remote repository?

$ git push remote-name branch-name 


 







 
 
 
 
 
 
 

Python Numpy

 # Numpy is imported, seed is set

 import numpy as np

# average

np.mean()

# Specify array of percentiles: percentiles
percentiles = np.array([2.5, 25, 50, 75, 97.5])

# Compute percentiles: ptiles_vers
ptiles_vers = np.percentile(some_array)
 
# Array of differences to mean: differences
differences = versicolor_petal_length - np.mean(some_array)

# Square the differences: diff_sq
diff_sq = differences ** 2

# Compute the mean square difference: variance_explicit
variance_explicit = np.mean(diff_sq)

# Compute the variance using NumPy: variance_np
variance_np = np.var(some_array)
 
# Compute the variance: variance
variance = np.var(some_array)

# Print the square root of the variance
print(np.sqrt(variance))

# Print the standard deviation
print(np.std(some_array))
 
# Compute the covariance matrix: covariance_matrix
covariance_matrix = np.cov(versicolor_petal_length, versicolor_petal_width)

# Extract covariance of length and width of petals: petal_cov
petal_cov = covariance_matrix[0,1]

def pearson_r(x, y):
"""Compute Pearson correlation coefficient between two arrays."""
# Compute correlation matrix: corr_mat
corr_mat = np.corrcoef(x,y)

# Return entry [0,1]
return corr_mat[0,1]
 
random_numbers = np.empty(100000)

# Set the seed
np.random.seed(123)

# Generate and print random float

print(np.random.random(size = 4)) 
print(np.random.rand())
print(np.random.randint(1,7))

# Take 10,000 samples out of the binomial distribution (n=100,p=0.05): n_defaults
n_defaults = np.random.binomial(100, 0.05, size=10000)
 
# Draw 10,000 samples out of Poisson distribution:
samples_poisson = np.random.poisson(10, size=10000) 
 
# Take 10,000 samples out of the normal distribution of mean=100, std=0.5
n_defaults = np.random.binomial(100, 0.5, size=10000)
 
# Draw samples out of an exponential distribution
sample = np.random.exponential(mean, size=1)

np.percentile(bs_slope_reps, [2.5, 97.5]))
 

# Initialize random_walk

random_walk = [0] 

for x in range(100) :

    step = random_walk[-1]

    dice = np.random.randint(1,7)

    if dice <= 2:

        # use max to make sure step can't go below 0

        step = max(0step -1)

    elif dice <= 5:

        step = step + 1

    else:

        step = step + np.random.randint(1,7)

    random_walk.append(step)

print(random_walk)


def draw_bs_reps(data, func, size=1):
"""Draw bootstrap replicates."""
# Initialize array of replicates: bs_replicates
bs_replicates = np.empty(size)
# Generate replicates
for i in range(size):
bs_replicates[i] = bootstrap_replicate_1d(data,func)
return bs_replicates
 
 
def draw_bs_pairs(x, y, func, size=1):
"""Perform pairs bootstrap for a single statistic."""
# Set up array of indices to sample from: inds
inds = np.arange(len(x))
# Initialize replicates: bs_replicates
bs_replicates = np.empty(size)
# Generate replicates
for i in range(size):
bs_inds = np.random.choice(inds,len(inds))
bs_x, bs_y = x[i], y[i]
bs_replicates[i] = func(bs_x, bs_y)
return bs_replicates

def permutation_sample(data1, data2):
"""Generate a permutation sample from two data sets."""
# Concatenate the data sets: data
data = np.concatenate((data1,data2))
# Permute the concatenated array: permuted_data
permuted_data = np.random.permutation(data)
# Split the permuted array into two: perm_sample_1, perm_sample_2
perm_sample_1 = permuted_data[:len(data1)]
perm_sample_2 = permuted_data[len(data1):]
return perm_sample_1, perm_sample_2

def draw_perm_reps(data_1, data_2, func, size=1):
"""Generate multiple permutation replicates."""
# Initialize array of replicates: perm_replicates
perm_replicates = np.empty(size)
for i in range(size):
# Generate permutation sample
perm_sample_1, perm_sample_2 = permutation_sample(data_1,data_2)
# Compute the test statistic
perm_replicates[i] = func(perm_sample_1, perm_sample_2)
return perm_replicates

# Compute mean of all forces: mean_force
mean_force = np.mean(forces_concat)
# Generate shifted arrays
force_a_shifted = force_a - np.mean(force_a) + mean_force
force_b_shifted = force_b - np.mean(force_b) + mean_force
# Compute 10,000 bootstrap replicates from shifted arrays
bs_replicates_a = draw_bs_reps(force_a_shifted,np.mean, 10000)
bs_replicates_b = draw_bs_reps(force_b_shifted, np.mean, 10000)
# Get replicates of difference of means: bs_replicates
bs_replicates = bs_replicates_a-bs_replicates_b
# Compute and print p-value: p
p = np.sum(bs_replicates>=(np.mean(force_a-force_b))) / 10000
print('p-value =', p)

# Construct arrays of data: dems, reps
dems = np.array([True] * 153 + [False] * 91)
reps = np.array([True] * 136 + [False] * 35)
def frac_yea_dems(dems, reps):
"""Compute fraction of Democrat yea votes."""
frac = np.sum(dems) / len(dems)
return frac

# Acquire permutation samples: perm_replicates
perm_replicates = draw_perm_reps(dems,reps, frac_yea_dems, 10000)
# Compute and print p-value: p
p = np.sum(perm_replicates <= 153/244) / len(perm_replicates)
print('p-value =', p)

 # Compute the observed difference in mean inter-no-hitter times: nht_diff_obs

nht_diff_obs = diff_of_means(nht_dead,nht_live)
# Acquire 10,000 permutation replicates of difference in mean no-hitter time: perm_replicates
perm_replicates = draw_perm_reps(nht_dead, nht_live, diff_of_means, size=10000)
# Compute and print the p-value: p
p = np.sum(perm_replicates <= nht_diff_obs)/10000
print('p-val =', p)