Practical Data Science Tutorials, Advice and News with Interesting Data


Loading Data into Pandas: 5 Tips and Tricks You May or May Not Know

Datasets tips-and-tricks pandas data HTTP URL Excel table Wikipedia website scrape JSON clipboard CSV data-frame



If you’ve ever worked with Pandas in Python before, you’ll know that there are a bunch of different ways in which you can import tabulated data.

Pandas is perhaps one of the most widely used Python packages for loading, manipulating and exporting data. We’ve even explored this in a separate blog post.

While you may be familiar with the many different ways to read and write data using pandas, you may not be aware of a few shortcuts/hacks which may not be as obvious as you may think. This can make importing data much easier and faster.

This blog post explores just five of these techniques

1. Import CSV from URL

You’re probably very familiar with the pd.read_csv() method for importing Comma Separated Value files. But you may not know that it can be used to read data from a URL over HTTP.

For example, reading in a CSV file from a URL acts the same as a file stored locally.

>>> url = 'https://raw.githubusercontent.com/scikit-learn/scikit-learn/main/sklearn/datasets/data/boston_house_prices.csv'
>>> df = pd.read_csv(url)
>>> df
         506  13 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13
0       CRIM  ZN      INDUS       CHAS        NOX         RM        AGE        DIS        RAD        TAX     PTRATIO           B       LSTAT        MEDV
1    0.00632  18       2.31          0      0.538      6.575       65.2       4.09          1        296        15.3       396.9        4.98          24
2    0.02731   0       7.07          0      0.469      6.421       78.9     4.9671          2        242        17.8       396.9        9.14        21.6
3    0.02729   0       7.07          0      0.469      7.185       61.1     4.9671          2        242        17.8      392.83        4.03        34.7
4    0.03237   0       2.18          0      0.458      6.998       45.8     6.0622          3        222        18.7      394.63        2.94        33.4
..       ...  ..        ...        ...        ...        ...        ...        ...        ...        ...         ...         ...         ...         ...
502  0.06263   0      11.93          0      0.573      6.593       69.1     2.4786          1        273          21      391.99        9.67        22.4
503  0.04527   0      11.93          0      0.573       6.12       76.7     2.2875          1        273          21       396.9        9.08        20.6
504  0.06076   0      11.93          0      0.573      6.976         91     2.1675          1        273          21       396.9        5.64        23.9
505  0.10959   0      11.93          0      0.573      6.794       89.3     2.3889          1        273          21      393.45        6.48          22
506  0.04741   0      11.93          0      0.573       6.03       80.8      2.505          1        273          21       396.9        7.88        11.9

Rather than storing all your files locally, you can access CSV files from a remote location. This saves you the hassle of having to download things manually.

2. Import HTML Table from a Website

If you’ve ever read the pandas documentation, you probably would’ve come across the pd.read_html() method. Much like, pd.read_csv() this too also has the additional feature of exporting data from a URL. What makes this even cooler is that you can provide a URL to a website with many tables and it will scrape them all.

For example, consider the Wikipedia article “Economy of the United Kingdom”. This article can be scraped as follows.

>>> url = 'https://en.wikipedia.org/wiki/Economy_of_the_United_Kingdom'
>>> df_tables = pd.read_html(url)

This article has a total of 33 tables.

>>> len(df_tables)
33

One of the tables contains historical GDP figures.

>>> df_tables[9]
    Year  GDP (in Bil. US$PPP)  GDP per capita (in US$ PPP)  GDP (in Bil. US$nominal)  GDP per capita (in US$ nominal) GDP growth (real) Inflation rate (in Percent) Unemployment (in Percent) Government debt (in % of GDP)
0   1980                 494.4                       8777.0                     604.0                          10722.2             -2.1%                       16.8%                      7.1%                         42.6%
1   1981                 537.4                       9536.1                     587.7                          10427.1             -0.7%                       12.2%                      9.7%                         44.8%
2   1982                 581.9                      10337.3                     558.7                           9925.6              2.0%                        8.5%                     10.7%                         43.1%
3   1983                 630.1                      11187.9                     532.5                           9455.1              4.2%                        5.2%                     11.5%                         41.9%
4   1984                 667.4                      11832.2                     504.6                           8944.9              2.2%                        4.4%                     11.8%                         42.2%
5   1985                 716.9                      12677.0                     536.9                           9494.3              4.1%                        5.2%                     11.4%                         41.3%
6   1986                 754.4                      13309.1                     655.1                          11556.8              3.2%                        3.6%                     11.3%                         41.3%
7   1987                 814.8                      14343.4                     813.0                          14312.5              5.4%                        4.1%                     10.4%                         39.3%
8   1988                 891.8                      15669.5                     989.6                          17386.5              5.7%                        4.6%                      8.6%                         37.1%
9   1989                 950.7                      16656.6                    1007.4                          17650.7              2.6%                        5.2%                      7.2%                         32.5%
10  1990                 993.5                      17357.8                    1193.7                          20854.9              0.7%                        7.0%                      7.1%                         28.6%
11  1991                1015.8                      17684.8                    1247.9                          21725.3             -1.1%                        7.5%                      8.9%                         28.5%
12  1992                1043.1                      18114.3                    1289.1                          22385.4              0.4%                        4.2%                     10.0%                         33.2%
13  1993                1094.4                      18962.9                    1154.1                          19997.2              2.5%                        2.5%                     10.4%                         38.0%
14  1994                1160.8                      20061.4                    1239.7                          21425.4              3.8%                        2.0%                      9.5%                         40.8%
15  1995                1215.1                      20941.6                    1341.9                          23125.9              2.5%                        2.6%                      8.6%                         43.7%
16  1996                1268.2                      21804.1                    1416.8                          24359.4              2.5%                        2.4%                      8.1%                         43.8%
17  1997                1354.3                      23223.7                    1558.5                          26726.5              5.0%                        1.8%                      7.0%                         43.3%
18  1998                1420.3                      24288.9                    1651.8                          28247.5              3.7%                        1.6%                      6.3%                         40.9%
19  1999                1487.8                      25352.1                    1682.6                          28672.0              3.3%                        1.3%                      6.0%                         39.4%
20  2000                1574.8                      26743.5                    1661.3                          28212.7              3.5%                        0.8%                      5.5%                         36.7%
21  2001                1654.3                      27984.7                    1639.1                          27727.9              2.7%                        1.2%                      5.1%                         34.0%
22  2002                1716.7                      28916.4                    1782.9                          30032.2              2.2%                        1.3%                      5.2%                         34.2%
23  2003                1808.7                      30328.2                    2052.8                          34421.4              3.3%                        1.4%                      5.0%                         35.4%
24  2004                1899.7                      31688.0                    2413.1                          40252.4              2.3%                        1.3%                      4.8%                         38.4%
25  2005                2017.2                      33389.8                    2535.6                          41971.8              3.0%                        2.1%                      4.8%                         39.6%
26  2006                2135.4                      35106.7                    2709.8                          44549.7              2.7%                        2.3%                      5.4%                         40.5%
27  2007                2244.9                      36610.5                    3094.6                          50467.1              2.4%                        2.3%                      5.4%                         41.5%
28  2008                2281.6                      36904.6                    2952.3                          47753.7             -0.3%                        3.6%                      5.7%                         49.3%
29  2009                2201.7                      35363.2                    2421.0                          38885.1             -4.1%                        2.2%                      7.6%                         63.2%
30  2010                2274.4                      36240.1                    2484.0                          39579.6              2.1%                        3.3%                      7.9%                         74.3%
31  2011                2351.3                      37154.2                    2660.8                          42044.6              1.3%                        4.5%                      8.1%                         80.0%
32  2012                2439.7                      38296.7                    2704.5                          42453.5              1.4%                        2.8%                      8.0%                         83.2%
33  2013                2560.7                      39945.1                    2785.1                          43444.6              2.2%                        2.6%                      7.6%                         84.2%
34  2014                2667.4                      41292.5                    3067.1                          47480.8              2.9%                        1.5%                      6.2%                         86.1%
35  2015                2772.6                      42583.1                    2933.4                          45053.5              2.4%                        0.0%                      5.4%                         86.7%
36  2016                2896.5                      44121.1                    2703.2                          41177.8              1.7%                        0.7%                      4.9%                         86.8%
37  2017                3032.8                      45923.4                    2664.7                          40349.9              1.7%                        2.7%                      4.4%                         86.3%
38  2018                3144.1                      47325.6                    2861.0                          43063.7              1.3%                        2.5%                      4.1%                         85.8%
39  2019                3246.3                      48599.0                    2833.3                          42416.6              1.4%                        1.8%                      3.8%                         85.2%
40  2020                2961.9                      44153.9                    2709.7                          40394.1             -9.8%                        0.9%                      4.5%                        104.5%

3. JSON Normalised

Sometimes when you are working with JSON data, you’ll find that it’s quite common for a JSON object to feature many nested objects. For a JSON object to be converted into tabular data, they need to be flatted (has a nested depth of one).

Conveniently, pandas provide a way to do this using the pd.json_normalize() method.

For example, consider the following JSON object (as used as an example in the pandas documentation)

>>> data = [
    {
        "id": 1,
        "name": "Cole Volk",
        "fitness": {"height": 130, "weight": 60},
    },
    {"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
    {
        "id": 2,
        "name": "Faye Raker",
        "fitness": {"height": 130, "weight": 60},
    },
]

It features attributes which contain nested objects (“fitness” in this case). Using the pd.json_normalize(), this can be collapsed into table. Note: an optional parameter of max_level can be added to specify the maximum number of nested levels to collapse. By default, it will normalise all levels.

>>> df = pd.json_normalize(data)
>>> df
    id        name  fitness.height  fitness.weight
0  1.0   Cole Volk             130              60
1  NaN    Mark Reg             130              60
2  2.0  Faye Raker             130              60

4. Import from Clipboard

This is a rather simple one but let’s suppose you are copying and pasting data from a website into a Word document. The pd.read_clipboard() method can be used to capture whatever data is stored on your clipboard.

By default, this method accepts a regular expression of \\s+ as a delimiter for separating values. Alternatively, you can use your own regular expression.

>>> df = pd.read_clipboard(sep='\\s+')

This is especially handy as it saves you from writing the data by hand meaning that you’re less prone to making mistakes if it was done manually.

5. Import from Excel

Let’s be honest, everyone has used Microsoft Excel (other spreadsheet software is available) at some point in their lives. If I’m honest, I still use it a lot and it remains one of the most popular ways of sending tabulated data to others.

Let’s suppose that someone has sent some data to you that is stored in an excel spreadsheet, the pd.read_excel() method can be used to read the data as if it were a simple CSV file.

One thing to bear in mind is that spreadsheet software allows users to segment multiple spreadsheets using sheets within a single file. Accessing data from a specific sheet can be achieved using the sheet_name parameter.

>>> df = pd.read_excel('my_spreadsheet.xlsx', sheet_name='Sheet 1')

Conclusions

In summary, this blog post features some of the most useful panders shortcuts I use daily. Let me know if I’m missing any obvious ones as I’m sure there are many more than the ones I’ve listed here. Overall, panders is an incredibly useful Python package and it has so many different features for manipulating, reading and writing data.

comments powered by Disqus