The popular Python library for data analysis and data science

Pandas is a popular Python library. This article describes some features and functions available in this library and encourages readers to use it for practical business problems.

Pandas provides fundamental, high-level building blocks for practical real-world data analysis in Python. It is one of the most powerful and flexible open source tools for data analysis and manipulation, and provides data structures for modeling and manipulating tabular data (data in rows and columns).

Pandas has two main data structures. The first is the Series, a one-dimensional labeled structure that can be created from a Python array or dictionary. Data can be retrieved by position or by specifying the index name. The second is the DataFrame, which stores data in rows and columns. Columns have column names and rows are accessible using indexes. Columns can have different data types, and a DataFrame can be built from lists, dictionaries, Pandas Series, other DataFrames, NumPy arrays, etc.
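
As a minimal sketch of the two structures described above, the snippet below builds a Series from a dictionary and a DataFrame from a dictionary of columns. The product names, prices and index labels are made-up sample data, not taken from any real dataset.

```python
import pandas as pd

# A Series built from a dictionary: the keys become the index labels.
prices = pd.Series({"apple": 1.20, "banana": 0.50, "cherry": 3.00})

by_label = prices["banana"]      # retrieve by index name
by_position = prices.iloc[0]     # retrieve by position

# A DataFrame built from a dictionary of columns with different data types.
df = pd.DataFrame(
    {
        "product": ["apple", "banana", "cherry"],
        "price": [1.20, 0.50, 3.00],
        "in_stock": [True, False, True],
    },
    index=["a", "b", "c"],
)

row_b = df.loc["b"]              # a row selected by its index label
price_column = df["price"]       # a column selected by its name

print(by_label, by_position)
print(df.dtypes)
```

Here iloc is used for positional access and loc for label-based access, which keeps the two kinds of lookup explicit.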

Processing various file types

Data is often available in various formats, so a data analysis tool must provide a wide range of methods to process them.

With Pandas, you can read and write different formats such as CSV, JSON, XML, Parquet and SQL (see Table 1).

Format    To write     To read
CSV       to_csv       read_csv
JSON      to_json      read_json
Parquet   to_parquet   read_parquet
SQL       to_sql       read_sql, read_sql_query, read_sql_table
XML       to_xml       read_xml
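
The read and write methods in Table 1 come in pairs. The sketch below shows one possible round trip through CSV and JSON, using in-memory text instead of files so that it runs anywhere. The sample names and ages are invented for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [34, 27]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# The same round trip through JSON, one record per row.
json_text = df.to_json(orient="records")
df_from_json = pd.read_json(io.StringIO(json_text))

print(df_from_csv)
print(df_from_json)
```

In practice you would normally pass a file path, for example df.to_csv("people.csv") followed by pd.read_csv("people.csv"), where the file name is only a placeholder.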

Data cleaning using Pandas

In real-world scenarios, data is often incomplete and includes bad data. It is sometimes duplicated. Additionally, the data may include sensitive and confidential information, which should be masked. Pandas offers ways to handle bad data, such as deleting, replacing and masking values.

a. Rows with empty values can be deleted with the df.dropna(inplace=True) operation.

b. Empty values can be replaced with df.fillna(value, inplace=True), where value is the replacement. We can also specify a column name to fill values in a particular column only.

c. You can hide sensitive and non-public data: my_list.where(<condition>) hides the values of all items NOT satisfying the condition, and my_list.mask(<condition>) hides the values satisfying the condition.

d. Duplicates can be removed from a data frame using:

df.drop_duplicates('column', keep=False)
df.drop_duplicates('column', keep='first')
df.drop_duplicates('column', keep='last')

Here keep=False drops every duplicated row, while 'first' and 'last' keep only the first or the last occurrence.
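
The four cleaning steps above can be sketched on a small made-up dataset as follows. The customer names, amounts and the threshold of 50 are assumptions chosen for illustration, and the calls return new objects rather than using inplace=True.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "customer": ["Ann", "Bob", "Bob", "Cid", None],
        "amount": [120.0, 80.0, 80.0, np.nan, 40.0],
    }
)

# a. Drop rows that contain any missing value.
no_missing = df.dropna()

# b. Replace missing values in the "amount" column only.
filled = df.fillna({"amount": 0.0})

# c. where() keeps values that satisfy the condition and hides the rest;
#    mask() does the opposite. Hidden values become NaN.
amounts = pd.Series([120.0, 80.0, 40.0])
kept_large = amounts.where(amounts > 50)
hidden_large = amounts.mask(amounts > 50)

# d. Remove duplicate rows, keeping the first occurrence.
deduplicated = df.drop_duplicates(keep="first")

print(no_missing)
print(filled)
print(kept_large)
print(hidden_large)
print(deduplicated)
```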

Data analysis using Pandas

Table 2 lists the various Pandas functions that perform data analysis along with the syntax for their use. (Note: df stands for dataframe.)

Function      Description                                                             Syntax
head          Returns the first five rows by default                                  df.head(x)
tail          Returns the last five rows by default                                   df.tail(x)
loc           Returns a particular row; slicing is also possible                      df.loc[x:y]
groupby       Groups data on a particular column                                      df.groupby('column')
sum           Sum of values in a particular column                                    df['column'].sum()
mean          Average of values in a particular column                                df['column'].mean()
min           Minimum value in a particular column                                    df['column'].min()
max           Maximum value in a particular column                                    df['column'].max()
sort_values   Sorts the dataframe by a column                                         df.sort_values(['column'])
size          Number of elements (rows x columns)                                     df.size
describe      Summary statistics for the data frame                                   df.describe()
crosstab      Creates a row and column frequency tabulation                           pd.crosstab(df['column1'], df['column2'], margins=True)
duplicated    Returns True or False based on duplicate values in column1 and column2  df.duplicated(['column1', 'column2'])
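
To see several of the functions from Table 2 working together, here is a short sketch on an invented sales dataset. The column names region, product and sales are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["North", "South", "North", "South", "North", "East"],
        "product": ["A", "A", "B", "B", "A", "A"],
        "sales": [100, 150, 200, 50, 100, 300],
    }
)

first_two = df.head(2)
last_two = df.tail(2)
rows_1_to_3 = df.loc[1:3]            # label-based slice; the end label is included

total_sales = df["sales"].sum()
mean_sales = df["sales"].mean()
sales_by_region = df.groupby("region")["sales"].sum()

sorted_df = df.sort_values(["sales"])
n_elements = df.size                 # rows multiplied by columns
summary = df.describe()              # statistics for the numeric columns

# Frequency table of region against product, with row and column totals.
counts = pd.crosstab(df["region"], df["product"], margins=True)

# True for rows whose (region, product) pair appeared in an earlier row.
is_repeat = df.duplicated(["region", "product"])

print(sales_by_region)
print(counts)
```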

Advantages of Pandas

  • It supports multi-index (hierarchical index) used for easy analysis of data having a large number of dimensions.
  • It supports the creation of pivot tables, stacking and unstacking operations.
  • Categorical data containing a finite set of values can be processed with Pandas.
  • It supports grouping and aggregations.
  • Sorting of group keys can be explicitly disabled (for example, by passing sort=False to groupby).
  • It supports both row-level (gets rows that satisfy the filter condition) and column-level (selects only required columns) filtering.
  • It helps reshape datasets. You can also transpose array values and convert them to a list. When processing data using Python, you can convert the Pandas dataframe into a multidimensional NumPy array; the values attribute is used for this.
  • Supports label-oriented data slicing.
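
Several of the capabilities listed above can be sketched in a few lines. The example below uses invented yearly sales figures; the column names and the threshold of 110 are assumptions made for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "year": [2022, 2022, 2023, 2023],
        "region": ["North", "South", "North", "South"],
        "sales": [100, 150, 120, 180],
    }
)

# Multi-index: each row is addressed by a (year, region) pair.
indexed = df.set_index(["year", "region"])
north_2023 = indexed.loc[(2023, "North"), "sales"]

# Pivot table: one row per year, one column per region.
pivot = df.pivot_table(index="year", columns="region", values="sales", aggfunc="sum")

# unstack() moves the inner index level into columns; stack() reverses it.
wide = indexed["sales"].unstack("region")
long_again = wide.stack()

# Categorical data: a column limited to a finite set of values.
df["region"] = df["region"].astype("category")

# Row-level filtering (rows matching a condition) and
# column-level filtering (only the required columns).
large_sales = df[df["sales"] > 110]
only_numbers = df[["year", "sales"]]

# Convert to a NumPy array, then transpose it and turn it into nested lists.
array = only_numbers.values
as_lists = array.T.tolist()

print(pivot)
print(as_lists)
```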

Disadvantages of Pandas

Pandas syntax differs from plain Python syntax, which results in a steep learning curve for some users. Also, some concepts such as three-dimensional data are better handled by other libraries like NumPy.

Pandas makes the data analysis process more efficient. Its compatibility with other libraries makes it well suited to a wide range of scenarios.
