Sum duplicated rows on a multi-index pandas dataframe

Hello, I'm having trouble with pandas. I'm trying to sum duplicated rows on a MultiIndex DataFrame.
I tried df.groupby(level=[0,1]).sum(), and also df.stack().reset_index().groupby(['year', 'product']).sum() and some other variants, but I cannot get it to work.
I'd also like every unique product to appear under each year, with a value of 0 if it wasn't listed.

Example: dataframe with multi-index and 3 different products (A,B,C):

                  volume1    volume2
year   product
2010   A          10         12
       A          7          3
       B          7          7
2011   A          10         10
       B          7          6
       C          5          5

Expected output: if there are duplicated products for a given year, we sum them.
If one of the products isn't listed for a year, we create a new row filled with 0.

                  volume1     volume2
year   product
2010   A          17          15
       B          7           7
       C          0           0
2011   A          10          10
       B          7           6
       C          5           5
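
For reference, a minimal sketch of one possible approach (assuming the index levels are named year and product, as in the example): sum the duplicates, then reindex onto the full year/product grid with a fill value of 0.

import pandas as pd

# Sketch only: sum duplicated (year, product) rows, then reindex onto the
# full cartesian product of years and products, filling the gaps with 0.
summed = df.groupby(level=['year', 'product']).sum()
full_index = pd.MultiIndex.from_product(
    [summed.index.get_level_values('year').unique(),
     summed.index.get_level_values('product').unique()],
    names=['year', 'product'])
result = summed.reindex(full_index, fill_value=0)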

Any ideas? Thanks

Bokeh cannot change datetime axis format

I'm following various examples of how to display different datetime formats on the x-axis, but for some reason the axis always stays in the mmmyy format, like Jan17, no matter what I put into DatetimeTickFormatter. How can I change the format to, for example, Jan 15, 2017?

    from bokeh.models import DatetimeTickFormatter
    from bokeh.plotting import figure

    p = figure(plot_width=800, plot_height=500)
    p.line(x="ENTRYDATE", y="Transactions", color='LightSlateGrey', source=sourceDay)
    p.xaxis.major_label_orientation = 1.5
    p.xaxis.formatter = DatetimeTickFormatter(days=["%a\n%d %b"])

The ColumnDataSource is in the form:

ENTRYDATE  | Transactions
2017-01-15 | 29
2017-01-20 | 30
...
2018-01-03 | 1

Where ENTRYDATE is a datetime object. Thanks in advance.
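
For completeness, the target format "Jan 15, 2017" corresponds to the strftime pattern %b %d, %Y, so this is the kind of formatter I would expect to need (sketch only, assuming a Bokeh version that, as in the code above, takes lists of format strings):

    from bokeh.models import DatetimeTickFormatter

    # Sketch only: spell out the same format for the tick scales a daily
    # time series is likely to hit, e.g. "Jan 15, 2017".
    p.xaxis.formatter = DatetimeTickFormatter(
        days=["%b %d, %Y"],
        months=["%b %d, %Y"],
        years=["%b %d, %Y"],
    )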

How to open a large CSV file with different number of columns in pandas?

I'm trying to open a large CSV file with pandas, but the rows have different numbers of columns. The whole file is comma-separated, but on the empty rows even the commas are missing.

Example:

A,B,C
A,B,C,D,E,F,G
A,B,C
A,B,C
A,B,C,D,E

What is the best practice for opening this file?
I’ve had no luck opening more than the first 7M lines.
This is how I open it at the moment:

pd.read_csv('file.csv', encoding="ISO-8859-1", error_bad_lines=False)

I've also tried naming the columns, but I don't know how many columns exist in the file.
I've tried chunking as well, since I got memory warnings in the beginning, but I highly doubt that is the problem. I've opened much larger files before without any problems. The whole file is around 4GB, though I don't know how many lines it has in total.
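
A sketch of one possible direction (the two-pass scan is my own idea, not something I have tested at this scale): find the widest row first, then pass explicit column names so pandas pads short rows with NaN instead of dropping long lines via error_bad_lines=False.

import csv
import pandas as pd

# Sketch only: scan once to find the maximum number of columns, then give
# read_csv that many column names so every line is kept and short rows
# are padded with NaN.
with open('file.csv', encoding='ISO-8859-1', newline='') as f:
    max_cols = max(len(row) for row in csv.reader(f))

df = pd.read_csv('file.csv', encoding='ISO-8859-1',
                 header=None, names=range(max_cols))

The extra pass over a 4GB file is not free, but it avoids having to guess an upper bound for the column count.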

Find first row with condition after each row satisfying another condition

In pandas I have the following DataFrame:

a b
0 0
1 1
2 1
0 0
1 0
2 1

Now I want to do the following:
Create a new column c, and for each row where a = 0, fill c with 1. Then c should stay 1 up to and including the first subsequent row where b = 1 (and this is where I'm stuck); after that row, c should be 0 until the next row where a = 0. The output should look like this:

a b c
0 0 1
1 1 1
2 1 0
0 0 1
1 0 1
2 1 1
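
A sketch of one way this could be expressed (assuming df holds the a and b columns shown above; I have only checked it against this small example):

import pandas as pd

# Sketch only: a new block starts at every row where a == 0; within each
# block, c stays 1 up to and including the first row where b == 1.
block = df['a'].eq(0).cumsum()
hits = df['b'].eq(1).astype(int).groupby(block).cumsum()  # running count of b == 1 per block
first_hit = hits.eq(1) & df['b'].eq(1)                    # the first b == 1 row of each block
df['c'] = (hits.eq(0) | first_hit).astype(int)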

Thanks!

Groupby rows, join and selection between a NaN value and a number

I would like to transform this dataframe:

A   B   C   D
xx  nan 2   3
xx  1   nan 2
yy  4   nan nan
yy  nan 5   nan
yy  nan 4   6
zz  nan nan nan
zz  nan 8   nan
zz  nan 9   nan
zz  7   10  nan

into this final one (by grouping on the 'A' column):

A   B   C   D
xx  1   2   3
yy  4   5   6
zz  7   8   nan

I tried to use groupby, fillna and apply, but with no results so far.

For each unique value in the 'A' column, I want the final dataframe to show, wherever possible, the first non-NaN value of each column.

How could I do this?
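
A sketch of the direction that seems closest to this (assuming "first non-NaN value per column" is really the rule): GroupBy.first() skips NaN values, so it should pick the first available number in each column.

import pandas as pd

# Sketch only: first() returns, per column, the first non-NaN value of
# each group, which matches the desired output above.
result = df.groupby('A', as_index=False).first()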

How to efficiently read rows from Google BigTable into a pandas DataFrame

Use case:

I am using Google BigTable to store counts like this:

| rowkey   |    columnfamily    |
|          | col1 | col2 | col3 |
|----------|------|------|------|
| row1     | 1    | 2    | 3    |
| row2     | 2    | 4    | 8    |
| row3     | 3    | 3    | 3    |

I want to read all rows for a given range of row keys (let’s assume all in this case) and aggregate the values per column.

A naive implementation queries the rows and iterates over them while aggregating the counts, like this:

from google.cloud.bigtable import Client

instance = Client(project='project').instance('my-instance')
table = instance.table('mytable')

col1_sum = 0
col2_sum = 0
col3_max = 0

# read the rows and make sure the stream is fully consumed
row_data = table.read_rows()
row_data.consume_all()

for row in row_data.rows.values():
    col1_sum += int.from_bytes(row['columnfamily']['col1'.encode('utf-8')][0].value(), byteorder='big')
    col2_sum += int.from_bytes(row['columnfamily']['col2'.encode('utf-8')][0].value(), byteorder='big')
    col3_value = int.from_bytes(row['columnfamily']['col3'.encode('utf-8')][0].value(), byteorder='big')
    col3_max = col3_value if col3_value > col3_max else col3_max

Question:

Is there a way to efficiently load the resulting rows in a pandas DataFrame and leverage pandas performance to do the aggregation?

I would like to avoid computing the aggregates in a for loop, as that is known to be very inefficient.

I am aware of the Apache Arrow project and its Python bindings, and although HBase is mentioned as a backing project (and Google BigTable is advertised as being very similar to HBase), I can't seem to find a way to use it for the use case described here.
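
For reference, a sketch that only moves the aggregation into pandas, reusing the same cell-access pattern as the loop above (so the row-by-row decoding in Python is still there):

import pandas as pd

# Sketch only: collect the decoded counters into a list of dicts, build a
# DataFrame once, and let pandas compute the aggregates.
records = []
for row in row_data.rows.values():
    records.append({
        col: int.from_bytes(row['columnfamily'][col.encode('utf-8')][0].value(),
                            byteorder='big')
        for col in ('col1', 'col2', 'col3')
    })

df = pd.DataFrame(records)
col1_sum, col2_sum = df['col1'].sum(), df['col2'].sum()
col3_max = df['col3'].max()

This does not remove the Python-level iteration over the rows, which is exactly the part I am hoping to avoid.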

Padding pandas groupby with zeros for inconsistent date ranges

I have car rental booking data, of this form:

location  |  day_of_drive |  day_of_reservation  |  number_of_bookings
-------------------------------------------------------------------
foo       |  01-01-2015   |  24-12-2014          |  1
foo       |  01-01-2015   |  26-12-2014          |  1
foo       |  01-01-2015   |  29-12-2014          |  3
foo       |  01-01-2015   |  30-12-2014          |  2
foo       |  01-01-2015   |  31-12-2014          |  1
foo       |  02-01-2015   |  29-12-2014          |  2
foo       |  02-01-2015   |  31-12-2014          |  1
foo       |  02-01-2015   |  01-01-2015          |  1
bar       |  25-06-2016   |  03-07-2016          |  1
.
.
.

What I want is to pad this dataset to include dates where 0 bookings were made.

The data frame is already sorted, first by location, then by day_of_drive, then by day_of_reservation.

What I want is an efficient way to pad the day_of_reservation variable with zero values between the first observed value of that variable (which corresponds to the first booking for that day_of_drive/location pair) and the day_of_drive itself, for each pair of location and day_of_drive. I have come across solutions that unstack, then use fillna(0), then stack back up, but I don't think I can use those here, since each location/day_of_drive pair has a different min and max date to pad between.

Desired output:

location  |  day_of_drive |  day_of_reservation  |  number_of_bookings
-------------------------------------------------------------------
foo       |  01-01-2015   |  24-12-2014          |  1
foo       |  01-01-2015   |  25-12-2014          |  0
foo       |  01-01-2015   |  26-12-2014          |  1
foo       |  01-01-2015   |  27-12-2014          |  0
foo       |  01-01-2015   |  28-12-2014          |  0
foo       |  01-01-2015   |  29-12-2014          |  3
foo       |  01-01-2015   |  30-12-2014          |  2
foo       |  01-01-2015   |  31-12-2014          |  1
foo       |  02-01-2015   |  29-12-2014          |  2
foo       |  02-01-2015   |  30-12-2014          |  0
foo       |  02-01-2015   |  31-12-2014          |  1
foo       |  02-01-2015   |  01-01-2015          |  1
bar       |  25-06-2016   |  03-07-2016          |  1
bar       |  25-06-2016   |  04-07-2016          |  0
.
.
.

I have a feeling the solution involves a groupby.
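
For reference, a sketch of the groupby direction I have in mind (assuming both date columns are already datetime dtype, and taking the first and last observed reservation dates of each group as the ends of the range to pad):

import pandas as pd

# Sketch only: for each (location, day_of_drive) pair, reindex the
# reservation dates onto a full daily range between the first and last
# observed reservation, filling missing days with 0 bookings.
def pad_group(g):
    full_range = pd.date_range(g['day_of_reservation'].min(),
                               g['day_of_reservation'].max(), freq='D')
    return (g.set_index('day_of_reservation')['number_of_bookings']
             .reindex(full_range, fill_value=0)
             .rename_axis('day_of_reservation')
             .reset_index())

padded = (df.groupby(['location', 'day_of_drive'], sort=False)
            .apply(pad_group)
            .reset_index(level=[0, 1])
            .reset_index(drop=True))

If the range really has to run all the way up to the day_of_drive itself, the end of the date_range can be swapped for g['day_of_drive'].iloc[0].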

Using .loc in pandas slows down calculation

I have the following dataframe where I want to assign the bottom 1% value (the 1st percentile) to a new column. When I do this assignment with the .loc notation, it takes around 10 seconds, whereas the direct column assignment takes only about 2 seconds.

import numpy as np
import pandas as pd

df_temp = pd.DataFrame(np.random.randn(100000000, 1), columns=list('A'))
%time df_temp["q"] = df_temp["A"].quantile(0.01)
%time df_temp.loc[:, "q1_loc"] = df_temp["A"].quantile(0.01)

Why is the .loc version slower? I understand that using .loc is safer, but if I want to assign data to all indices in the column, what can go wrong with the direct assignment?

Converting key-values of python dictionary into pandas dataframe

I have a Python dictionary whose values are one or more numeric values stored as strings, like:

d = {'a': ['1.20', '1', '1.10'], 'b': ['5.800', '1', '2.000'], 'c': ['9.5000', '0.9000'], 'h': ['1.90000', '6.100000'], 'l': ['1.0000', '8.00000'], 'o': '5.0000', 'p': ['3.00', '1.1000'],  'v': ['1.8', '0.0000']}

How can I convert it into a pandas DataFrame without going through a pandas Series?

Expected Output :

            col1  col2  col3
       a    1.2   1     1.1
       b    5.8   1     2
       c    9.5   0.9   NaN
       h    1.9   6.1   NaN
       l    1     8     NaN
       o    5     NaN   NaN
       p    3     1.1   NaN
       v    1.8   0     NaN
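
A sketch of one way this might be done (the col1/col2/col3 names are taken from the expected output; wrapping lone strings in a list is my own workaround for the scalar value of 'o'):

import pandas as pd

# Sketch only: make every value list-like, build the frame row-wise
# (shorter lists get padded with NaN), then convert the strings to floats.
normalized = {k: v if isinstance(v, list) else [v] for k, v in d.items()}
df = pd.DataFrame.from_dict(normalized, orient='index').astype(float)
df.columns = ['col{}'.format(i + 1) for i in range(df.shape[1])]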