First ML Model
Hello everyone. First of all, happy new year! We hope it's been a good one so far (and continues like that)!
So, we're excited to share the progress of our project. Since our last lab note in early December, we've made significant strides in the handling and preparation of the "Movement Sensor Dataset for Dog Behavior Classification" (Vehkaoja et al., 2021).
On that occasion we mentioned "the idea of reproducing their (Kumpulainen et al., 2021) analysis in order to better understand the dataset structure, before moving towards an edge approach. And this will most likely be the topic of our next lab note." Well, this is the next lab note, so let's start with that.
Reproduction Attempt
We will cut it short by saying that we have not been able to reproduce the full data-processing pipeline as originally done by the authors. This was mainly due to our limited time to become proficient with MATLAB, the framework they used for the analysis and for sharing the results. However, the very process of studying the available documentation and code was already very fruitful and informative for our data handling and for the prototyping of our first ML model in the Edge Impulse platform (the main framework we will be using so that the ML model can be easily deployed on an edge device).
Data Optimization
The first noticeable feature of the "Movement Sensor Dataset for Dog Behavior Classification" (Vehkaoja et al., 2021) is its size: 2 GB. This presented the first challenge in making it suitable for the Edge Impulse platform, which has an upload limit of 100 MB under its free plan.

Fortunately, constraints are important drivers of creation, so this setback made us consider which aspects of the original dataset are most relevant to our project, in order to reduce it. Many insights followed:
First of all, we filtered out non-essential entries. One curious feature of the dataset is that it has three columns specifying dog behaviors; in fact, four if we count a column entitled "Task". Looking at the data, it became clear that "Behavior_1" held the most relevant information: the other columns often contained the same entry as Behavior_1 or something similar, and our interpretation is that they add layers of specificity to the main behavior in Behavior_1. So we started by filtering out the other columns, which already significantly reduced the dataset size. Then, within Behavior_1, we noticed that many rows had an "<undefined>" entry, so we filtered those out too. We were still left with the following categories: Bowing, Carrying object, Drinking, Eating, Galloping, Jumping, Lying chest, Pacing, Panting, Playing, Shaking, Sitting, Sniffing, Standing, Synchronization, Trotting, Tugging, and Walking. This large variety indicates that the dataset could be made even smaller, but for now we've decided to leave it as it is, partly to better understand which predictions will work best in practice.
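As a minimal sketch of this filtering step, here is the kind of pandas snippet involved. A tiny made-up frame stands in for the real 2 GB CSV: the column names "Behavior_1" through "Behavior_3" and "Task" come from the dataset, but the row values are invented for illustration.

```python
import pandas as pd

# Tiny stand-in for the real 2 GB CSV; column names follow the dataset,
# the row values are invented for illustration.
df = pd.DataFrame({
    "ANeck_x": [0.1, 0.2, 0.3, 0.4],
    "Behavior_1": ["Walking", "<undefined>", "Sitting", "Trotting"],
    "Behavior_2": ["Walking", "<undefined>", "Sitting", "<undefined>"],
    "Behavior_3": ["<undefined>"] * 4,
    "Task": ["task A", "task A", "task B", "task B"],
})

# Drop the secondary annotation columns, keeping only Behavior_1 ...
df = df.drop(columns=["Behavior_2", "Behavior_3", "Task"])

# ... and filter out rows whose primary behavior is undefined.
df = df[df["Behavior_1"] != "<undefined>"].reset_index(drop=True)

print(df["Behavior_1"].tolist())  # → ['Walking', 'Sitting', 'Trotting']
```

Applied to the full CSV instead of this toy frame, the same two operations account for most of the size reduction described above.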
A side note worth mentioning: this was all done via Python scripts, designed with the help of ChatGPT. All snippets have been made available in the project's GitHub repository.

After this was done, we turned to another very relevant topic, which we had mentioned in the previous lab note: the different kinds of dogs (breeds, sizes, etc.) that were part of the original data collection. 44 different dogs participated, and while that gave rise to a rich dataset, some of these dogs had significantly different characteristics compared to Maniçoba, our companion dog for the testing phase of this research. So we decided to reduce the dataset even further by selecting the data from the smallest dogs in terms of weight. This narrowed our workspace to data gathered from Border Collies and crossbreeds (like Maniçoba). And this is a very important consideration for anyone interested in replicating our methods: the dataset used to train the ML model should be adequate to the participating dog's size.

Finally, in order to reduce the dataset even further and meet Edge Impulse's limit - but also to optimize our own workflow - we did one more data manipulation: splitting the already reduced dataset in two. This was because the original data gathering used sensors located on two different parts of the dogs' bodies: the neck and the back. So it was actually a necessary step to separate these two streams, in order to train separate ML models dedicated to predictions for each body part. And that's what we did. We ended up with two final datasets, "(neck)DogMoveData_dogids.csv" and "(back)DogMoveData_dogids.csv", each roughly 90 MB in size. We were then ready to move to Edge Impulse Studio, and to get started we decided to focus on the neck sensing approach.
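The split itself amounts to selecting columns by sensor location. A sketch of the idea follows; note that the "Neck"/"Back" substrings in the column names are our assumption, extrapolated from the ANeck_*/GNeck_* names used later in the impulse design, and the values are invented.

```python
import pandas as pd

# Illustrative frame; "Neck"/"Back" column prefixes are an assumption
# based on the ANeck_*/GNeck_* inputs mentioned later in this note.
df = pd.DataFrame({
    "ANeck_x": [0.1, 0.2],
    "GNeck_x": [1.0, 1.1],
    "ABack_x": [0.3, 0.4],
    "GBack_x": [2.0, 2.1],
    "Behavior_1": ["Walking", "Sitting"],
})

label = ["Behavior_1"]

# Keep the behavior label in both halves, since each model trains on it.
neck_df = df[[c for c in df.columns if "Neck" in c] + label]
back_df = df[[c for c in df.columns if "Back" in c] + label]
```

Each half can then be written out with `to_csv`, landing under Edge Impulse's 100 MB upload limit.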
Machine Learning - Edge Impulse
"Edge Impulse empowers you to bring intelligence to your embedded projects by enabling devices to understand and respond to their environment. Whether you want to recognize sounds, identify objects, or detect motion, Edge Impulse makes it accessible and straightforward."
As planned since the project's conception, Edge Impulse is the framework we've chosen for training edge-compatible ML models, mainly for its user-friendliness - characterized by things like a built-in feature to split an uploaded dataset into training and testing subsets. Unfortunately, though, this didn't work out of the box in our case, and we had to split the dataset manually.

In our case we used Python's pandas and scikit-learn libraries to achieve an 80/20 split, employing random sampling (with a fixed random seed of 42) to ensure an unbiased and representative division. We also used stratified sampling to maintain class balance, addressing the challenge of potential dataset imbalance.
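The core of that step can be sketched in a few lines with scikit-learn's `train_test_split`. A small synthetic frame stands in for the reduced neck dataset here; the class names are real, the sensor values are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic frame standing in for the reduced neck dataset.
df = pd.DataFrame({
    "ANeck_x": range(10),
    "Behavior_1": ["Walking"] * 5 + ["Sitting"] * 5,
})

train_df, test_df = train_test_split(
    df,
    test_size=0.2,               # the 80/20 split
    random_state=42,             # fixed seed, as mentioned above
    stratify=df["Behavior_1"],   # keep class proportions equal in both subsets
)
```

The `stratify` argument is what guards against a rare behavior ending up entirely in one subset.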
This produced quite satisfactory results, which could be well interpreted by the Edge Impulse platform (with the help of its CSV Wizard feature). It took time, though: more than 5 hours of cloud processing.

After this, the next step was to actually train a first model - or design an impulse, in Edge Impulse's terminology. Impulse design is facilitated by an intuitive block-based interface, as we will describe.
The first block concerns the specification of the training dataset in terms of its inputs and other characteristics. This was the moment in which the MATLAB documentation provided by Kumpulainen et al. (2021) proved valuable. In a nutshell, we were able to use the same parameters as in their original analysis (e.g. WLen = 200, i.e. a 2-second window with 50% overlap) for six different data inputs from the dataset: ANeck_x, ANeck_y, ANeck_z, GNeck_x, GNeck_y, GNeck_z. Translating: these are the triaxial accelerometer and gyroscope readings.
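For concreteness, here is what those windowing parameters mean in code. The 100 Hz sampling rate is implied by WLen = 200 covering 2 seconds; the signal is fake, single-axis data.

```python
import numpy as np

FS = 100           # sampling rate in Hz, implied by WLen = 200 over 2 seconds
WLEN = 200         # window length in samples (2 seconds)
STEP = WLEN // 2   # 50% overlap -> hop of 100 samples

def sliding_windows(signal, wlen=WLEN, step=STEP):
    """Split a 1-D signal into overlapping windows (ragged tail dropped)."""
    n = (len(signal) - wlen) // step + 1
    return np.stack([signal[i * step : i * step + wlen] for i in range(n)])

sig = np.arange(500)         # 5 seconds of fake single-axis data
wins = sliding_windows(sig)
# wins.shape == (4, 200): four 2-second windows, each starting 1 second apart
```

Each such window (across all six axes) becomes one training example for the blocks downstream.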
The main difference in our approach concerns the feature extraction method, which has another dedicated block. Kumpulainen et al. (2021) relied on an Empirical Cumulative Distribution Function (ECDF), while Edge Impulse offers a spectral features approach. Below, ChatGPT gives its thoughts on their differences:
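To give a rough, code-level feel for the contrast between the two feature styles, here are two simplified illustrations. Neither is the real implementation: Edge Impulse's spectral block also performs filtering, peak detection, and power-band statistics, and the ECDF papers have their own parameterization.

```python
import numpy as np

def ecdf_features(window, n_points=20):
    """ECDF-style features: sample the sorted window at n_points evenly
    spaced positions (i.e. fixed quantiles), plus the window mean."""
    sorted_w = np.sort(window)
    idx = np.linspace(0, len(sorted_w) - 1, n_points).astype(int)
    return np.concatenate([sorted_w[idx], [window.mean()]])

def spectral_features(window):
    """Very rough analogue of a spectral block: FFT magnitudes of the
    mean-removed window."""
    return np.abs(np.fft.rfft(window - window.mean()))

w = np.random.default_rng(0).normal(size=200)
ecdf = ecdf_features(w)      # 21 values describing the amplitude distribution
spec = spectral_features(w)  # 101 frequency-bin magnitudes
```

In short, ECDF features describe the distribution of amplitudes within a window, while spectral features describe its frequency content.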

After feature extraction, the next block in the Edge Impulse platform concerns the classification of the data. We chose a Classification learning block, which is a neural network classifier based on the Keras library. It learns patterns from data and applies them to new data - great for categorizing movement or recognizing audio.
"The basic idea is that a neural network classifier will take some input data, and output a probability score that indicates how likely it is that the input data belongs to a particular class."
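As a minimal illustration of that last point: a classifier's raw output scores are typically turned into per-class probabilities via a softmax, sketched below (the class scores here are hypothetical).

```python
import numpy as np

def softmax(logits):
    """Map raw network outputs (logits) to probabilities that sum to 1."""
    e = np.exp(logits - logits.max())   # shift the max to 0 for numerical stability
    return e / e.sum()

# Hypothetical raw scores for three behavior classes.
logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)
# The class with the largest raw score gets the highest probability,
# and all probabilities sum to 1.
```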
And then we just needed to choose which categories we want the model to be able to predict, in the Output block. In our case, all the different entries of "Behavior_1".

This was the overall structure of our model, with which we were able to do a first training run. But before we reveal the results, it's worth highlighting that the Spectral Analysis and Classification blocks have further parameters that can be tweaked and which affect both the model's training time and its efficiency.


The issue we encountered, however, is the compute time limit associated with the free account at Edge Impulse.

So for our first training attempt, we chose the least resource-intensive parameters in the Classifier block (e.g. going from the default 100 training epochs down to 10). This allowed us to train a first model with relatively satisfactory results:

Unfortunately, though, the quantized (int8) version of this same model (which would be the ideal version for edge deployment, due to its reduced model size and faster inference times) performs really badly:
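For intuition on why an int8 model can lose accuracy, here is a minimal sketch of symmetric int8 quantization (a simplification, not Edge Impulse's actual scheme, which comes from TensorFlow Lite): every weight is rounded to one of only 255 integer levels, and small weights suffer the most in relative terms.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric int8 quantization: map floats onto integers in [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Made-up weights spanning three orders of magnitude.
w = np.array([0.001, 0.5, -1.2, 3.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Rounding error is bounded by half a quantization step (scale / 2),
# which is large relative to small weights like 0.001.
err = np.abs(w - w_hat)
```

Quantization-aware training, or simply a better-trained float model, usually narrows this gap, which is part of what we will be exploring next.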

So now we stand at a juncture, weighing the best approach moving forward, which will involve either signing up for a paid plan on the Edge Impulse platform (something we hadn't budgeted for) or carefully choosing parameters for a higher-accuracy model, especially in its quantized version.
We will be delivering an answer in the next lab note. Until then, if you have read this far, we're grateful for your support and interest in our journey to bridge interspecies communication gaps. Thank you very much!