Want to learn more? Take the full course at your own pace. More than a video, you’ll learn hands-on coding & quickly apply skills to your daily work.
We’ve seen that available memory & storage restrict the datasets that can be analyzed. A common strategy is to subdivide datasets into smaller parts.
We’ll use a 200,000-line file summarizing New York City cab rides from the first two weeks of 2013. When read_csv() is called with the parameter chunksize=50000, it returns an object we can iterate over. The loop variable `chunk` takes on the values of four DataFrames in succession, each having 50,000 rows except the last, which is one row short because the first line in the file is the header line.
The loop variable chunk has standard DataFrame attributes like shape. So the last chunk has almost 50,000 rows & 14 columns. Calling the info() method shows the column names like trip_time_in_secs & trip_distance.
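The chunked read described above can be sketched on a small scale. This is a minimal illustration, not the lesson's own code: the in-memory CSV text and the chunk size of 3 are assumptions standing in for the taxi file and chunksize=50000; only the two column names come from the narration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the taxi file: a 12-line CSV
# (1 header line + 11 data rows), mirroring the 200,000-line file.
csv_text = "trip_time_in_secs,trip_distance\n" + "".join(
    f"{600 + 100 * i},{1.0 + 0.5 * i}\n" for i in range(11)
)

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of a single large DataFrame.
shapes = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    shapes.append(chunk.shape)

print(shapes)  # [(3, 2), (3, 2), (3, 2), (2, 2)]

# After the loop, `chunk` still refers to the last (shorter) DataFrame.
print(chunk.shape)          # (2, 2)
print(list(chunk.columns))  # ['trip_time_in_secs', 'trip_distance']
```

As with the real file, the last chunk is one row short because one of the 12 lines is the header.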
We can construct a logical Series is_long_trip that is True wherever the trip time exceeds 1200 seconds (or 20 minutes). Recall we can use the .loc accessor with the boolean Series is_long_trip to filter rows where this condition holds. The shape attribute reveals about 5,500 taxi rides longer than 20 minutes in this chunk of 50,000 trips.
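A small sketch of that filtering step follows. The five-row DataFrame is made up for illustration; the column names and the 1200-second threshold are the ones mentioned above.

```python
import pandas as pd

# Hypothetical chunk; the values are invented for illustration.
chunk = pd.DataFrame({
    "trip_time_in_secs": [300, 1500, 1200, 2400, 900],
    "trip_distance": [1.1, 6.3, 4.0, 10.2, 2.5],
})

# Boolean Series: True where the trip lasted more than 1200 s (20 minutes).
is_long_trip = chunk.trip_time_in_secs > 1200

# .loc with a boolean Series keeps only the rows marked True.
long_trips = chunk.loc[is_long_trip]
print(long_trips.shape)  # (2, 2)
```

Note that the comparison is strict, so the trip of exactly 1200 seconds is excluded.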
Let’s embed this filtering logic within a function filter_is_long_trip that accepts a DataFrame as input & returns a DataFrame whose rows correspond to trips over 20 minutes. Next, we make a list of DataFrames called chunks by iterating over the output of read_csv, this time using chunks of 1,000 lines. Rather than initializing an empty list chunks and appending elements within a loop, we can also use a list comprehension to build the list. Remember, this list comprehension is equivalent to the preceding for loop. In both cases, each chunk is filtered as it is read from disk.
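The function and the two ways of building the list might look like the sketch below. The function name filter_is_long_trip and the list name chunks follow the narration; the in-memory CSV and the chunk size of 2 are assumptions used to keep the example self-contained.

```python
import io
import pandas as pd

def filter_is_long_trip(data):
    """Return only the rows of `data` whose trip lasted over 20 minutes."""
    is_long_trip = data.trip_time_in_secs > 1200
    return data.loc[is_long_trip]

# Hypothetical stand-in for the taxi file (header + 6 data rows).
csv_text = (
    "trip_time_in_secs,trip_distance\n"
    "300,1.1\n1500,6.3\n1300,4.0\n2400,10.2\n900,2.5\n1800,7.7\n"
)

# Explicit loop: filter each chunk as it is read, then append it.
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunks.append(filter_is_long_trip(chunk))

# Equivalent list comprehension.
chunks_lc = [filter_is_long_trip(chunk)
             for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)]

print([len(c) for c in chunks])  # [1, 2, 1]
```

Either way, only the filtered rows of each chunk are kept, so the unfiltered data never accumulates in the list.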
We can use another list comprehension to build a list called lengths, which shows that the DataFrames in the list chunks each have around 100 to 200 rows (rather than the 1,000 rows of the unfiltered chunks). The pandas function pd.concat() accepts this list of DataFrames with common column labels and stacks them vertically. The resulting DataFrame long_trips_df has almost 22,000 rows (far fewer than the original 200,000).
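As a rough sketch of this step, assuming a short hand-built list in place of the real filtered chunks (the names chunks, lengths, and long_trips_df follow the narration; the data is invented):

```python
import pandas as pd

# Hypothetical filtered chunks sharing the same column labels.
chunks = [
    pd.DataFrame({"trip_time_in_secs": [1500], "trip_distance": [6.3]}),
    pd.DataFrame({"trip_time_in_secs": [1300, 2400],
                  "trip_distance": [4.0, 10.2]}),
    pd.DataFrame({"trip_time_in_secs": [1800], "trip_distance": [7.7]}),
]

# Number of rows in each filtered chunk.
lengths = [len(chunk) for chunk in chunks]
print(lengths)  # [1, 2, 1]

# Stack the DataFrames vertically into one DataFrame.
long_trips_df = pd.concat(chunks)
print(long_trips_df.shape)  # (4, 2)
```

By default pd.concat() keeps each chunk's original row labels; passing ignore_index=True would renumber the rows from 0 if that is preferred.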
Finally, we can visualize these trips; the result looks something like this.
We generate the last plot with this code. We start by importing matplotlib.pyplot & constructing a scatter plot using plot.scatter. We apply labels and display the plot with pyplot's show() function. Remember, we used filtering or logical indexing to extract the small subset of relevant data in manageable chunks; the entire dataset was never in memory at one time.
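The plotting code itself is not reproduced in this transcript, so the following is only a guess at its general shape. It assumes that "plot.scatter" refers to the DataFrame's plot.scatter method, that trip time is plotted against trip distance, and that the label text is as shown; the data is invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for long_trips_df.
long_trips_df = pd.DataFrame({
    "trip_time_in_secs": [1500, 1300, 2400, 1800],
    "trip_distance": [6.3, 4.0, 10.2, 7.7],
})

# Assumed choice of axes: duration on x, distance on y.
ax = long_trips_df.plot.scatter(x="trip_time_in_secs", y="trip_distance")

# Assumed label text; the narration only says that labels are applied.
ax.set_xlabel("Trip duration [seconds]")
ax.set_ylabel("Trip distance")

plt.show()
```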
Take some time now to practice reading & filtering files in chunks in the exercises.