This article is the full version of the interview with analyticsindiamag.com.
Q: Tell us a bit about your educational background
From 2002 to 2004, I served in the Russian Spetsnaz. I have a medal for military valor. My service was an excellent opportunity to master leadership, ownership, communication, and more applied skills like using an AK-47 or a sniper rifle, or throwing knives.
In 2010, I got a Master's in theoretical physics at Saint Petersburg State University. The president of Russia, Vladimir Putin, graduated from the same university 35 years earlier.
In September 2010, I moved to California and started my studies at UC Davis.
In the summer of 2015, I got a Ph.D. in physics and moved to Silicon Valley.
Q: Tell us about your role at your current company?
I work at a company called Lyft. Lyft is a ride-sharing company that operates in the United States and Canada. In addition to the core business, we have a division that works on autonomous vehicles. It is called Lyft Level 5, and that is where I work.
My title is Sr. Software Engineer. My job is to develop fast and robust machine learning models and make sure they make it to production.
In addition to my main projects, I help with other initiatives where my skill set can bring value.
Last year, my experience in machine learning competitions became useful. Lyft wanted to organize a challenge at Kaggle, and I helped make this happen. This year we have another one, and, again, I am one of the organizers.
Q: How did your journey in machine learning begin? Your fascination with algorithms. How did it all start?
I was excited by natural sciences like physics, biology, and chemistry. My curiosity about the world, and not money or career opportunities, was why I studied physics and not computer science or sociology.
In the winter of 2015, I was deciding what I would do after my Ph.D. I looked into two options:
- Staying in academia as a postdoc.
- Moving to industry as a software engineer.
I did not like either of them, but a friend told me that there is a third option: Data Science.
I attended a lecture where the presenter talked about Data Science as the fourth paradigm of scientific discovery.
From birth, we learn to understand the world through trial and error. University taught me two more paradigms: theoretical science and numerical simulation.
Mastering a new paradigm felt like seeing the world in more dimensions, and it was the turning point.
I was excited, but that was only the first step. Data Science is a buzzword that covers countless different things, and the information on the internet was not helpful.
To get some structure, I started taking Data Science courses at Coursera. In one of them, the lecturer mentioned Kaggle as an excellent place to practice machine learning skills. I joined my first competition, and that is how my long journey started.
Q: What books and other resources have you used in your journey?
The only book about machine learning that I have read is Deep Learning by Ian Goodfellow. I have read numerous blog posts and papers about ML. I love to watch presentations by the winners of machine learning competitions and read their solutions; there is a lot of value there. My primary resource for improving my data science knowledge is the Russian-speaking community ODS.AI: thousands of data scientists from all over the world, industry experts, researchers, and Kagglers. For every data science-related question that I have, there is someone who knows the topic and can educate me.
Q: What were the initial challenges? How did you address them?
The first problem was that I never had a formal machine learning education. I never had a mentor or a group of friends interested in the topic.
I was studying machine learning by myself, absorbing information from the internet. My physics department, and my first two jobs, did not have any ML people except me. It was concerning, but after I joined ODS.AI, the problem was solved.
The second challenge was that I did not have lines in my resume showing ML expertise. I had neither Data Science industry experience nor relevant papers. Recruiters ignored my resume.
After I got my first job, the situation changed. The instant I wrote on LinkedIn that I had gotten a position, recruiters wanted to talk to me about other opportunities. And I addressed the lack of articles by writing a few in my free time.
The third challenge comes from the fact that there is a gap between machine learning in academia, in competitions, and in industry.
In industry, you need both machine learning and strong software engineering skills. I did not have the latter at the beginning of my journey. I wrote code in graduate school, and it worked: I published papers, wrote a thesis, and graduated. But the code was inefficient, hard to read, and hard to maintain.
One of the things that I like about technology companies is that tech is the company's core. People know that to build a competitive, scalable business, you need a good codebase and employees who write high-quality code. High-quality code does not guarantee that your company will do well, but if the code is bad, technical debt will most likely kill the company.
Hence there are processes like code review, continuous integration, and mentorship from more skilled colleagues. Besides, I believe that most experienced programmers and Data Scientists have read Clean Code, The Clean Coder, The Pragmatic Programmer, Refactoring: Improving the Design of Existing Code, and Designing Data-Intensive Applications.
Q: What has drawn you towards Kaggle?
Initially, I joined Kaggle by accident: the speaker in an online class mentioned the platform, and I gave it a try. I liked the idea of the forum and the real-time leaderboard. Kaggle is an excellent platform with exciting tasks and a friendly community.
Probably the main factor was that I was able to perform well. It took me some time to get to the top, but the fact that I was doing better in every new competition was extremely motivating.
Q: What does it take to be at the top? (your Kaggle journey)
As a side note, I would like to recommend the book Mastery by Robert Greene, which talks about the similarities in the paths to the top of Leonardo da Vinci, Albert Einstein, and other people with well-known achievements. I liked it a lot.
There are many parallels between competitive machine learning and professional sports. Kaggle is an excellent place to develop your machine learning muscles. Think of top Kagglers as the "powerlifters of machine learning."
To be successful in sports, you need:
- A background in something relevant to the sport.
- A lot of hard work.
- A good mentor and teammates.
Kaggle is similar.
- The majority of Kaggle Masters and Grandmasters studied something technical at university: math, physics, computer science. The only exception that I can remember is Evgeny Patekha, who has a degree in economics.
- The world of competitive machine learning is, unsurprisingly, competitive. The number of participants is huge, while the number of top places is limited. You compete with people who are skilled, have more hardware, have studied the topic of the competition for years at university, or make a living from it. To be better than others, you need to change the way you think, the way you study, how you write your code, and how you deal with failures. Your final placing is directly related to the number of ideas that you check during the competition. The main difference between a top Kaggler and a new one is that the more experience you have, the better the ideas you pick for the next experiment.
- It is possible to progress on your own. I got my first gold medal at Kaggle and the title of Master by myself. But it is not the most effective way. The earlier you find a group of people who are passionate about machine learning and do a lot of ML, the better.
Q: How do you tackle a competition or any data science problem? Your routine/tips/tricks, etc.
The most important thing you need to do at the beginning, whether it is an ML competition or an ML problem at work, is to build an end-to-end pipeline that maps the data into a cross-validation score.
It can be hackish, and the code can be of bad quality, but such a pipeline will unveil issues with the data, hardware, or models that you would never guess otherwise.
A similar approach is described in the classic book The Lean Startup by Eric Ries.
After you have a pipeline that maps the data to a cross-validation score and a leaderboard score, you need to improve it.
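A minimal version of such an end-to-end pipeline can be sketched with scikit-learn; the synthetic dataset here is a stand-in for real competition data, and the model choice is just an illustrative assumption:

```python
# Minimal end-to-end pipeline: raw data -> cross-validation score.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# A simple baseline model; swap in anything stronger later.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The single number you will iterate on for the rest of the competition.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Once this skeleton produces a score, every new idea becomes a small change to one stage, validated against the same cross-validation number.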
The next step is new ideas. You can get them from the literature, the Kaggle forum, your rich imagination, or any other source.
It was fine in the previous step to have low-quality code. At this step, you fix it: it is the time for massive refactoring. Your code should become more modular with every idea you implement.
The good thing is that many competitions are similar to each other, and the pipeline that you build for one will be reused in later ones.
You will have plenty of ideas to check for some time, but there will be a moment when you feel empty. At this point, you understand the problem really well. It could be time to look for teammates. The standard method is to look at the leaderboard for people with a similar standing.
I do not recommend forming a team too early. If you start a team with a person who has not made a submission themselves, most likely they have overestimated their excitement and skill set. They will stop being engaged with the problem in the middle of the competition, and there is no way to kick such a person off the team once it is formed.
Q: What fascinates you about Kaggle and its community in general?
I like the scale at which Kaggle operates. If you look at any competition held as part of an academic conference, you will see only about 10 teams on the leaderboard. At Kaggle, you will have hundreds and most likely thousands of participants in each challenge.
Another thing that I like is the atmosphere of collaboration. People share code in kernels and information about their approaches on the forum. The tradition of describing your winning solution on the Kaggle forum in detail was never enforced; it was born within the community. The next step is to get this information from Kaggle forums to a broader audience, for example, in blog posts or scientific articles. Some people already do it, but it could be a much bigger thing.
Q: What machine learning tools/frameworks/libraries do you use frequently? (language/algorithms/cloud service etc). Please feel free to elaborate.
My main deep learning framework is PyTorch; I use Catalyst and PyTorch Lightning for training. For image augmentations, I use Albumentations.
For the hardware, I have a good desktop at home:
- GPU: 4x 2080 Ti
- CPU: AMD Ryzen Threadripper 3970X
- RAM: 128 GB
- Storage: 20+ TB of various SSD and HDD drives
It is good enough for prototyping, and when I need something beefier, I use AWS or GCP. Both work well. Sometimes I try smaller hosting providers; for example, I had a positive experience with Hostkey.
Recently, I had a conversation with the CEO of Q blocks. They have an initiative to give free compute to active Kagglers. Feel free to reach out to them.
Q: There is a lot of hype around machine learning. So, when the dust settles down, what will stand the test of time?
Machine learning is maturing. Five years ago, it was all about research advancements and what is possible in principle. It was a phase of active exploration, the era of hype. Companies were interested in researchers. Math, statistics, and the theory behind ML algorithms were the things that recruiters were looking for.
Today the hype is fading away, and the field is becoming mature. It is the beginning of the era of exploitation. The question is not what ML can do, but how to use it to make an impact. Companies are much less interested in researchers than in machine learning engineers: strong software engineers who know some machine learning. They may not understand all the theory, but they can build solutions that bring value to the business and to the customers.
Hence, I would recommend focusing on software engineering skills. They will not go out of fashion for many decades.
Q: Which domain of AI, do you think, will come out on top in the next 10 years?
I believe the most widely used algorithm in 10 years will be the same as today, and the same as 10 years ago. And this algorithm is logistic regression. :)
I hope that reinforcement learning will advance and finally find applications in industry. If that happens, it will be huge.
Q: What do outsiders get wrong about this field?
People who are new to ML often do not understand that it is not deterministic but probabilistic.
In software engineering, we implement A, and it will be able to do B. It is a commitment, and partner teams can plan their work based on this information.
In machine learning, we will implement A, and it may work with accuracy B, but we are not 100% sure. It may not work at all. We will try approaches C and D; they may work.
This is vague, but it is often the best that we can do. It takes some time to communicate this to others.
Q: What does it take to make a machine learning engineer from GOOD to GREAT? (best practice, workflow management, etc)
To write good code, you first need to write many thousands of lines of bad code. ML is similar. The most skilled people I have seen have experience with at least some of computer vision, time series, natural language processing, recommender systems, and other areas. And they train models 24/7.
Machine learning competitions are an excellent place to develop your machine learning muscles. At the beginning of my ML journey, I jumped on every competition that came along. It was a brilliant decision.
It is also essential to be a good programmer. The better your code is, the more productive you are. The best way is to join a company with high code standards and learn from your colleagues. But if you are still in academia, you may follow the advice from my article Nine Simple Steps For A Better-Looking Python Code. It could be an excellent first step.
Q: Any additional tips for the beginners? Book/blog recommendations etc
My advice for beginners is to write blog posts. If you have learned something new, even something very basic, write a blog post about it.
Explaining material to others is a great way to solidify your knowledge. Even now, I remember the material I taught to university students better than the scientific work I published. If no one reads your post, it will still help you. But most likely, others will benefit from your explanation, which is good for your karma :)