Deep Learning is severely overrated!

If I were not working in this field, and worked as a general tech guy in a tech company, I would have been overwhelmed by this trend as well, seriously. While the world (specifically, tech companies) is promoting AI, few people really understand the techniques at the center of it.

Machine learning, AKA modeling, has been around for much longer, and its grounding in mathematics and statistics has made it a very powerful tool for statisticians and engineers who train computers to help their business. Deep learning, made practical by the growth in computing power available to neural network models, has become the hottest topic in recent years, following the success story of AlphaGo, which beat one of the best Go players in human history.

If you look closely, though, or if you work as a data scientist like I do in one of those “big” tech corporations, you soon realize that deep learning can quite often give you worse results in reality. In other words, deep learning is not for everyone in every situation. Neural networks have been great in three specific fields. Firstly, they are an excellent tool for computer vision. New network structures such as the convolutional neural network have transformed the way machines see pictures, giving them pretty decent accuracy in tasks like object detection and image recognition. Furthermore, text analysis, such as machine translation and word prediction, has been enhanced by recurrent neural networks, whose structure can remember earlier items in a sequence. Lastly, reinforcement learning (which is basically a machine learning new things by exploration or exploitation) has seen its biggest gains with the help of deep learning. AlphaGo uses a more complex version of this setup to defeat top-notch human Go players.

However, if you are in a traditional field such as anti-fraud, and you have about 20 features with slightly over 100,000 observations, you would be amazed that a model as simple as logistic regression can serve you better. In theory, deep learning has the power to imitate any linear or non-linear model, but setting the hyperparameters just right is an art rather than a science. Quite often, at least in my experience, tree models (gradient boosting machines, random forests) or linear models (logistic regression, elastic net regression) have better predictive power and are easier to interpret. Does that mean I made mistakes in my deep learning experiments? I used to think so, until I realized the inherent drawback of deep learning: it can’t replace math and statistics in modeling! Especially when you are dealing with a highly imbalanced dataset, a deep learning model can easily overfit or end up less predictive than the math-based models, and I have seen this happen in my own daily work.
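As a rough illustration of how small such a baseline can be, here is a minimal sketch of logistic regression trained by plain gradient descent, using only the Python standard library. Everything in it is an assumption made up for illustration: the synthetic two-feature dataset with roughly 10% positives, the function names (`fit_logistic`, `make_data`), the learning rate and the number of epochs. It does not reproduce any real anti-fraud data and it proves nothing about deep learning; it only shows what a simple, interpretable baseline looks like.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    # Probability of the positive ("fraud") class for one observation.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=300):
    # Batch gradient descent on the mean log-loss (hypothetical settings).
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_data(n=400, positive_rate=0.1, seed=0):
    # Made-up imbalanced data: two Gaussian clusters, about 10% positives.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < positive_rate else 0
        center = 2.0 if label else -2.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


X, y = make_data()
w, b = fit_logistic(X, y)

# Scored on the training data for brevity; real work needs a held-out set.
preds = [1 if predict_proba(w, b, x) > 0.5 else 0 for x in X]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
positives = sum(y)
recall = sum(p == 1 and t == 1 for p, t in zip(preds, y)) / positives
print(f"weights={w} bias={b:.3f}")
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

The fitted weights are directly readable (one number per feature), which is the interpretability point made above. On an imbalanced dataset it is worth looking at recall on the rare class as well as accuracy, since a model that predicts “not fraud” for everyone would already score about 90% accuracy here.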

So don’t be fooled by any crazy promotion of AI. It has changed some fundamental ways for machines to learn new things, but it can’t guarantee you good results when it comes to modeling. A lot of companies are using it as a trick to attract new funding, just like what Bitcoin has been to our world. You wouldn’t assume there’s a Swiss Army knife for modeling, would you? LOL.

Chinese Internet Companies Are Starving For Money

Not long ago, the world’s fourth biggest phone maker, Xiaomi, successfully held its IPO in Hong Kong, a new milestone in the journey of this newly risen Chinese giant. It is hard to categorize Xiaomi as an Internet company in the way Google is, when its primary profits still come from its low-end smartphones.

I work for Xiaomi as of now, and I am proud to see its rapid and huge development in different areas, although I am not convinced by its decision to go public. But I also understand that the timing was probably not perfect for the board, either. China has tightened its control over the financial sector recently, and a visionary CEO should realize that missing the window now could mean missing the future. Therefore, it is not really a multiple-choice question. Instead, everyone tends to do it as soon as possible, before things go wrong in the bigger context.

We all know America has recently been impacting the world economy aggressively, the latest example being Turkey getting crushed in its financial sector. When people rush to Istanbul to buy luxury goods, they need to realize that the same thing could happen anywhere around the globe anytime soon. Chinese Internet giants and startups could be witnessing an approaching storm in funding very soon. The giants could just hang in there for several years and make it to the next spring; for startups and unicorns it would be a different story, since they still need tons of funding to support their great ideas and visions.

How does the storm come anyway? I guess it can’t be separated from the real estate market, which is worth trillions of dollars today. It has become the centerpiece of the underlying problems: the ever-growing real estate market has taken away investment that should have gone into other industries, creating probably the biggest bubble in financial history. On top of that, it has also brought a lot of corruption, because land is controlled by the state, which means local governments can boost the income on their books by selling more land. There are some other problems as well, but none of them is comparable to this one, which has taken years to form. Internet companies, which have enjoyed rapid investment and funding, would inevitably be impacted by the real estate market. On the other hand, employees of those Internet companies, many of them highly talented, are starting to feel huge pressure from apartment rents. Rents in the Beijing area have been rising at a growing rate, and many of my colleagues are under pressure given their limited pay from the company. That, in turn, drives the companies to seek more funding from any possible source.

Although some of the Chinese tech giants originally copied their business models from foreign companies, I have to admit that in terms of localization they are far better than the original idea-holders. The CEO of Baidu just mentioned that if Google wanted to come back to China, he is confident he would win again. That is not completely bullshit, because the vast majority of Chinese netizens are not what we believe they are.

Anyways, the economic conditions are still not clear, but they have reminded me to be careful with the decisions I make, especially financial ones. Obviously more Chinese Internet companies will go for an IPO soon, but we all know that is not a signal that the economy is going well.

Value something, not somebody

People who have gone through a lot of ups and downs like to pass this idea on to the young: value something, not somebody. When I was young, I used to hang all my goals and plans on the girl I “loved”. For instance, I would choose to go wherever she was, and I would start to learn the things she liked. But that is not all. When we are in a relationship, we tend to be over-caring because we are afraid of losing it, so some of us keep giving love without getting anything real in return, only to find out that by putting everything we want, like and need on somebody, we have already risked losing all we had.

People change, and that is the truth. But it is not a bad thing. As intelligent creatures, we human beings have adapted well to the rule of nature: competition. Whether we are cooks, servers, coders, athletes or anything else, we compete against others to show we are valuable in this world. Some people have already found what they will love for a lifetime, but some people haven’t. I happened to be one of the latter when I was young, so I used to leave my decisions to my girlfriend’s wishes: whichever you choose, I will follow your path. While some stories of this type end in good romance, most of them don’t. When your decision maker leaves, your entire system breaks down. You feel like a zombie with no actual purpose in living, crawling around searching for food.

Life is not like that, for sure. Some things are always true, and those are the things that we should be after, that we should value. By finding those valuable things, such as good habits, good manners or morals, you will find your life unbreakable by anyone else, because those things won’t break. Some of us want to be travel journalists because they want to show the world wonders that are undiscovered; some of us want to be splendid cooks because they like the smile when customers take a bite of well-made food; some of us want to be coders because they love that a simple idea can change the whole world.

Those things don’t break, and we should value them instead of some individuals. Think about the time when we fall in love with somebody: we love them as a whole, not just their body. The abstract part of love is what we often ignore when the hormones hit. If you love doing something, keep the motivation, and if you still haven’t found such a thing, go find it. Just don’t put your life’s blueprint on any individual, and if there has to be one, it has got to be yourself.

Be your own god, and love your own world first before you care about others’.

NBA > NFL

Let’s face it, the NFL is getting more and more boring nowadays. I am not saying football itself is getting there, but the NFL, specifically. Two years ago I started to get really into college football, and now I can say with conviction: college football is the most fun sport in the world, with no exception.

The NFL has been embarrassing itself recently, by announcing that it would give any team whose players kneel during the national anthem a stunning 15-yard penalty, before the game even starts. This sounds very familiar to me, because I have been through quite a lot of similar penalties since I was young here in China. Yes, I am talking about coxxxnism. I wouldn’t say that this is not effective. On the contrary, it is probably the most impactful policy ever if you want to punish an individual: if you don’t listen to me, fine, because I am going to threaten your teammates or families.

I don’t understand why the NFL would choose this type of penalty. Beyond that, I don’t understand why there would be a penalty before the game starts. I’d rather attribute this joke to the NFL’s incompetence over these years, because clearly basketball is gaining more and more ground.

This season, at this time, both the Eastern and Western Conference finals are tied 2-2, making this the most exciting year for basketball fans since the start of the Warriors dynasty. I just hope the NBA can get better and more balanced between teams, and keep treating players as its partners instead of its tools.

Before the opening of next season’s college football, let us enjoy the May/June Madness!

Python is reigning!

When I started to learn Spark back in 2016, Scala was the best option for writing Spark programs due to its simplicity and speed. Back then PySpark was also available to Python users, but in order to use it, people had to initiate a SparkContext at the very beginning, and then go through some tedious steps to set things up.

Now in 2018, I find that the Spark community has officially created a new way for Pythonistas to interact with Spark core. The savior in this context is called SparkSession, which bundles up a lot of things that normal users may not care about at all. With SparkSession, Python users can dive right into data exploration, just as they like to do in machine learning. Though Scala still owns the crown for writing Spark programs, PySpark is now catching up.

Similarly, when I reviewed my knowledge of TensorFlow, I also found that Google has provided two new high-level APIs on top of the original low-level APIs. This has greatly lowered the cost in time and effort for Python users to get their hands on this great open-source deep learning framework.

Java and other compiled, lower-level languages are probably still the best overall for those who care about performance. I doubt Python would have gained such huge popularity if its community were not so active in machine learning, thanks to third-party packages such as scikit-learn, pandas, NumPy and others. There’s a good saying about this: when one rides with the wind, even a pig can fly.

Next Big Area for Data Analysis

Artificial Intelligence? No, since it is already very popular and everybody wants a piece of it.

Blockchain technology? Nope, since it is more of a security thing, and data analysts do not play a main role in its development.

Then what are we talking about here? I want to define the “Big Area” in the title as something that we could easily tap into with our current techniques, and that has not gone viral globally yet. The area I’d like to share my insights on is electronic gaming.

Yep, people might say that sports analytics is already maturing, especially in developed countries, and that top players have dedicated data analysts to improve their performance. However, they can afford them partly because of the global recognition of their sport, and esport is not at that stage yet: it is still struggling to get noticed by most people. That is to say, the future of data analytics in esport is tied closely to the industry itself, and I would say it has a prosperous future.

Additionally, esport is so well suited to data analysis that I can’t think of another sport that makes a better case for it. We already rely on high-speed computers to do complex calculations for us, and esport is itself played on computers. In other words, we can analyze anything you care about in a game in order to improve performance, because in theory all the data is available.

Esport data analysis has inherited some traits from traditional sports: for example, you can’t be a good analyst without experience or knowledge of the specific game. It also has some merits that traditional sports can’t match, such as how easy it is to get observable data. As the esport industry continues to grow, with more competitive games such as Dota 2, League of Legends and Overwatch emerging, I believe there is quite a bright future for those brave pioneers.

Raspberry Pi 3, OMG

I never imagined that such a revolutionary product, a tiny little computer even smaller than a floppy disk, would come into the world while I am only in my 20s. I learned of its existence probably 3 or 4 years ago, but I had no interest in buying one until recently, when I realized I might need a 24/7 computer that can run Linux.

For $35 in 2017, you could buy a Raspberry Pi 3 with WiFi and Bluetooth capabilities, 4 USB ports, 1 RJ45 port, 1 HDMI port and an old-fashioned 3.5mm headphone jack. Its I/O is so well-rounded. It doesn’t have a good CPU or GPU, and its RAM is limited to 1 GB, but who needs that much, really?

I bought one because, like I mentioned at the beginning, I need a Linux machine that can run 24/7 to serve as a gateway for my local network. I learned this technique by accident when I was looking for ways to improve the connection of game consoles. Yeah, you are right, my purpose in doing all this is simply to be able to play online games, LOL. I have owned a Nintendo Switch for quite a long time now, and since I came back to China the Internet conditions here have been quite annoying, with strict NAT types everywhere and usually no public IP. All those factors and complexities have made me unable to connect to other players in my Switch games.

I then turned my research to those game proxies that claim to be able to speed up my connection to the outside world. Basically they are just proxies or VPNs that can redirect UDP traffic so as to give me a public IP. On PCs you can easily do this by installing some software, but it is hard to do for consoles since you can’t install software on them. Some people decide to pay extra money for a smart router, which is basically a router with a smarter system on which you can install some “software”, but in essence these are Linux-like systems. This made me wonder if I could just use a local machine to serve as a router, and I soon found out there’s a technique called a transparent gateway: another machine in the local network that serves as a gateway and redirects all Internet traffic to an outside server. All I need to do then is change some IP settings in my Switch’s Internet settings.

I have some experience with Linux, partially on my Mac, but mainly on my VPS. However, this time I am dealing with Debian on the Raspberry Pi instead of CentOS on my VPS, and it has some different properties, though not too many. The process was hard, and due to “fear” I don’t want to go into too much detail in this post. I simply installed a piece of “software” that redirects my traffic to an outside server with a public IP. Firstly, the Raspberry Pi should be able to redirect both TCP and UDP packets, which can be done by setting up some rules in iptables. Secondly, the server in the outside world should be able to handle UDP relay as well, and this was where I made a mistake during the process. So in the end it is very simple, one step locally and one remotely, but the trial and error along the way could be daunting to a lot of people. Luckily that was not the case for me, because I wanted to solve this problem so badly that I spent hours after work until I couldn’t keep my eyes open. (shameful problem-solver personality)

Now that there’s a 24/7 smart “router” in my house, I feel like there are more opportunities for a smarter house. I remember I brought an Apple TV 4 back from the States, but couldn’t make it work because back then I didn’t have a machine that could do traffic relay, so my Apple TV was terribly crippled by the Internet in China. I am proud to say I am going to get it working again when I am home for the upcoming Spring Festival, with my cutie Raspberry Pi, and believe it or not, it only cost me $35.

Please Hold On To Net Neutrality, America

It might sound weird at first to hear a Chinese guy shouting about American issues, but if you understand the current state of the Chinese Internet, or if you have ever lived here, you’ll realize right away what I am trying to say. IT WILL BE A STEP BACK, I PROMISE YOU.

I seriously believe this is a hot tech topic in the U.S. now, but as you can imagine, there’s nearly no coverage of this piece of news here, partly because net neutrality is already long gone here. ISPs should never be granted the right to discriminate among their customers, and I’ll use examples to tell you what is going to happen.

To start with, what you are worried about is going to happen: charging heavier users higher fees, bundling websites together to segment the market, and so on. In China there are as many types of packages as you can imagine for company networks; a foreign company that needs a more open Internet, in particular, gets charged more for a customized service. This would be totally unnecessary if net neutrality were still in play. The Internet was invented to connect people across the globe. Although ISPs could allocate resources more efficiently by shutting down net neutrality, doing so violates a basic American ethic: everyone is born equal.

Secondly, if the government has the power to abolish any policy without consulting the majority of the tech community, what could happen in the future? ISPs might not only be able to bundle whatever services they like, but also be able to censor your data as they like. Moreover, the government might also step in and say: hey, now we are in charge, so your data will be sent to the U.S. government before it leaves U.S. territory. So basically this has the potential to grant the government too much authority on this topic, which might make a lot of us feel violated.

It reminds me of the death of Aaron Swartz, who challenged the copyright world with his programming skills and sincere motivations. It also reminds me of the people behind WikiLeaks, The Pirate Bay, Anonymous… The values and goals they promote to the world are strikingly similar: the Internet and knowledge should not be only for the rich, for celebrities, or for people in authority.

One final question to all of you: how could Donald Trump ever have become POTUS if the polls only served those who are “louder”? Please don’t leave this core value behind, even if you are planning to turn your back on net neutrality.

A Visualization For Score-Cutoff-Like Strategies

It’s been three straight months since I started working at HomeCredit China, and I have to admit that at a company this big, it really takes new employees a lot of time to understand the business model. But thank god, time and intelligence together have helped me overcome the difficulties, so now I am starting to get the hang of it, and I am going to share some techniques I found in my work that can potentially help others.

Working in the decision-making department, which controls the underwriting strategies for loan applications, we quite often use “scores”. They could be scores calculated by the company itself, scores from other companies like the popular “Zhima Score” from Alipay in China, or a mix of both. For example, if we were Alipay, we might say that in order for us to lend you some money, you need a Zhima Score higher than 590. If one day we found out that this rule was rejecting more customers than we had hoped, we would naturally want to adjust the cutoff value, right? So we would start to simulate a new cutoff value, for example 580, to see if the new strategy does any better. Simulation is also important, but I’m not gonna talk about it here.

However, what I just described is only the simplest case, where you have just one score cutoff, so you can rerun the simulation again and again to draw a graph where the x-axis shows the cutoff value and the y-axis shows, for instance, the approval rate. What if you have a lot of score cutoff strategies? You can still do the analysis one by one by rerunning the simulation with different cutoff combinations, but when the simulation data is large this process becomes crazy slow, and you won’t be able to see the interactions between cutoffs at a glance.
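To make the single-cutoff case concrete, here is a minimal sketch in Python of the points you would plot. It assumes a deliberately simplified “simulation” in which an applicant is approved whenever their score is higher than the cutoff; the function names and the sample scores are made up for illustration, and a real underwriting simulation would involve many more rules.

```python
def approval_rate(scores, cutoff):
    # Share of applicants whose score is strictly higher than the cutoff.
    return sum(s > cutoff for s in scores) / len(scores)


def approval_curve(scores, cutoffs):
    # One (cutoff, approval rate) point per candidate cutoff:
    # x-axis = cutoff value, y-axis = approval rate.
    return [(c, approval_rate(scores, c)) for c in cutoffs]


# Hypothetical Zhima-like scores for five applicants.
sample_scores = [560, 575, 585, 595, 610]
for cutoff, rate in approval_curve(sample_scores, [570, 580, 590]):
    print(cutoff, rate)
```

Each point costs one full pass over the data, which is fine for one cutoff but multiplies quickly once several cutoffs have to be tried in combination.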

What I am proposing here is to use an “incremental” method for visualization instead of the “all-over-again” method. Let’s get the simulated data once, and then make modifications on top of it. For instance, when you want to lower the Zhima Score cutoff, find out which customers in the dataset could newly get an offer, and then assign them a probability of being approved, such as 80% or any other calculated value. You can apply the same kind of modification to all the other cutoffs. Because the strategies in the decision-making system usually run in order, and are coded in a table with each line referring to one strategy, you can break out the specific segment of customers who might get a new result and weight them by that probability. It does not look accurate at first glance, but when your data is large, it can be more precise than you would have thought.
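The incremental idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code behind the project: the `Applicant` record, the `estimate_approval_rate` function, the score names and the flat 80% pass probability are all hypothetical. The baseline simulation is assumed to have been run once, so each applicant already carries an approved or rejected outcome.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    scores: dict    # score name -> value, e.g. {"zhima": 585}
    approved: bool  # outcome of the one-off baseline simulation


def clears(scores, cutoffs):
    # True if every score is strictly higher than its cutoff.
    return all(scores[name] > cut for name, cut in cutoffs.items())


def estimate_approval_rate(applicants, baseline, changes, pass_prob=0.8):
    # Estimate the approval rate under changed cutoffs without rerunning
    # the simulation. `baseline` and `changes` map score name -> cutoff.
    effective = {**baseline, **changes}
    expected = 0.0
    for a in applicants:
        if a.approved:
            # Already approved: stays approved only if it still clears
            # every (possibly raised) cutoff.
            expected += 1.0 if clears(a.scores, effective) else 0.0
        elif not clears(a.scores, baseline) and clears(a.scores, effective):
            # Newly eligible: a score cutoff blocked it before and no
            # longer does. Count it with an assumed probability of
            # passing the remaining rules in the strategy table.
            expected += pass_prob
    return expected / len(applicants)


# Made-up data: four applicants and two score cutoffs.
baseline = {"zhima": 590, "internal": 500}
applicants = [
    Applicant({"zhima": 600, "internal": 700}, True),
    Applicant({"zhima": 585, "internal": 650}, False),
    Applicant({"zhima": 570, "internal": 720}, False),
    Applicant({"zhima": 610, "internal": 480}, False),
]
print(estimate_approval_rate(applicants, baseline, {"zhima": 580}))
```

Because the function only rescans the stored scores, it is cheap enough to call on every slider move, and several cutoffs can be changed in one call through `changes`. The estimate is only as good as the assumed `pass_prob`, which is why rerunning the full simulation on the final combination remains the check.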

And the huge upside of this method is that you can visualize the interactions among several score cutoff categories AT THE SAME TIME! You can play with the modifications and see their overall impact on the variable you’re interested in. In my example here you can play with the sliders for different score cutoffs and see what they do, and how large the impact is on the approval rate and risk performance. At the very end, when you have found the combination you want, you can simply rerun the simulation to confirm it.

And trust me, the result won’t be too far away from the visualization, because the visualization has already shown you roughly what is going to happen.

Project link here: Github Link

Signing Up for Dataquest!

Since I started working, my job has mainly become getting data from the database using SQL and then throwing it into Excel for analysis. To be honest, Excel is enough for most analysis at most companies. Although my supervisor has never restricted which tools his employees use, most of my colleagues have settled on Oracle SQL + Excel.

This has never been the case for me. As one of the only two people on my team who can code, I started to automate some processes (especially the annoying and repetitive ones) using Python (thank god I taught myself Python) together with another colleague who has a computer science background. It was pretty funny when he learned that I could even code in HTML and CSS, and I believe he will never underestimate a business student’s coding skills again. Another interesting thing about my job is that one lousy colleague keeps asking me questions about the R programming language. One day when I tried to ask her some very basic operational questions, she got pissed off for no reason and shouted “why don’t you look it up yourself”. After that I have never talked with this idiot who thinks she is the center of the universe, and I have rejected every possible chance for conversation. Please, Chinese girls, be respectful to others, will you?

Alright, back to the topic of continued learning after graduation. Apart from learning Japanese every day, I sincerely believe that I need to keep learning professional skills to stay competitive for the next couple of years. TensorFlow is what some others are learning, and I took a look at it. It is a very promising package from Google, but it is too cumbersome to learn and code in. Not long after, Keras came into my view, and I think it will be a perfect package for deep learning beginners. After doing some research online, I happened to find two amazing websites for people who want to learn more about data science: Datacamp and Dataquest.

I really want to try both at the same time, but they both require a subscription to access all available courses. Datacamp is more R-focused with growing Python content, while Dataquest is more Python-focused with clear paths to becoming an expert in data analysis in Python. A no-brainer, isn’t it? As a guy who has learned R extensively over the past two years, I am more than happy to have a website that specializes in teaching people to do data analysis in Python, so I ended up starting my learning on Dataquest. I might sign up for Datacamp later as well, but it depends on how fast I can go through the content on Dataquest (I am flying through it because I want to save some money).

No matter which service I choose, my ultimate goal is to get my hands on deep learning after gaining some machine learning knowledge at USC. I believe that being able to both understand it and do it will greatly expand the boundary of who I can become. I just want to get better by learning things that interest me, and I know I will.
